Sunday, November 13, 2016

HGP Counterfactuals, Part 3: BAC Sequencing Strategies

I introduced this seven-part series with an exploration of the values and challenges of counterfactual histories.  Yesterday, I looked at the "forgotten maps," which laid out the genome ready-to-sequence as a minimum tiling set of BACs, ensured that those BACs faithfully represented the genome, and tied that set at regular intervals to the genetic markers and cytogenetic locations which were the lingua franca of human geneticists.  Today, I'll look at the strategies that were considered for sequencing all those BACs.
So you have a lot of BACs to sequence.  How do you go about sequencing them?  Good BACs are typically in excess of 100Kb, sometimes much more.  The YACs that were originally planned as sequencing targets, but proved to be too prone to scrambling via recombination between repeats, could be around half a megabase in size.  When I first attended AGBT, in 1992, very few contiguous sequences of this scale had been completed.  Sometime around 1994/95 I compiled a list of all the 100Kb sequence accessions, which I called "The 100kb Club," and put it out on this crazy new World Wide Web.  I didn't keep copies, and sometime before I defended in 1997 I gave up, but I think that original list had perhaps a dozen entries.  Efficiently sequencing BACs on a grand scale had never been attempted; even the number of fully sequenced cosmids (which are in the 40-50Kb range) was relatively modest.

Sequencing projects were generally viewed as having two phases: an expensive early phase which covered most of the targeted region, followed by an even more expensive finishing phase.  Cost per base was high with all existing technologies (which I'll cover in the next installment).  Cost per base was even higher in the finishing phase, as this required generating specific oligonucleotide primers to cover the targeted region.  Also driving up costs were the coverage requirements.  It was generally accepted that having at least one read on each strand was a requirement, as many electrophoretic artifacts differ depending on which strand is being read (and as we will see in the sequencing technologies installment, effectively all of the sequencing technologies in play used electrophoresis).  To break ties for non-strand-specific errors, a third read was desired.  Sometimes this third read would also be sufficient to resolve strand issues, as these could be more pronounced in the late parts of the read, where the bands were closer together.
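As a toy illustration of that coverage rule (a sketch of my own, not any center's actual finishing software), the check for a single position might look something like this: require at least one read on each strand, and if the strands disagree, require a third read to break the tie.

```python
from collections import Counter

def base_finished(calls):
    """Toy finishing rule for one position.
    calls: list of (strand, base) tuples, e.g. [('+', 'A'), ('-', 'A')]."""
    strands = {s for s, _ in calls}
    if strands != {'+', '-'}:          # need at least one read on each strand
        return False
    bases = Counter(b for _, b in calls)
    if len(bases) == 1:                # all reads agree: two reads suffice
        return True
    top_count = bases.most_common(1)[0][1]
    return len(calls) >= 3 and top_count >= 2   # disagreement: third read breaks the tie

print(base_finished([('+', 'A'), ('-', 'A')]))              # True
print(base_finished([('+', 'A'), ('-', 'G')]))              # False: unresolved tie
print(base_finished([('+', 'A'), ('-', 'G'), ('+', 'A')]))  # True
```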

Now, BACs are large.  One question was how to break them up for sequencing.  A number of vectors were available, each with a characteristic insert size.  M13 clones behaved very well, but were mostly useful for sequencing in a single direction.  pUC-type plasmid vectors could reliably hold inserts of several kilobases, with reads from each end potentially contributing a distance constraint.  Phage lambda clones could be generated with inserts of perhaps ten to twenty kilobases, again with the option of sequencing each end.  With various tricks, one could imagine scaling inserts to almost any size.  For example, the total length of lambda DNA is constrained by the size of the phage head.  Depending on the design of the lambda vector, one could leave varying amounts of space for the insert.  The packaging machinery also conveniently sets a lower limit; unless the phage head is reasonably full, it isn't viable.
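To make the idea of a distance constraint concrete, here is a minimal sketch (the insert-size ranges are rough assumptions for illustration, not vendor specifications): if both ends of a subclone have been read, their placements in an assembly should be separated by roughly the vector's insert size, and a layout that violates that can be rejected.

```python
# Rough, assumed insert-size ranges (bp), for illustration only.
INSERT_RANGE_BP = {
    'pUC':    (2_000, 5_000),      # "several kilobases"
    'lambda': (10_000, 20_000),    # "perhaps ten to twenty kilobases"
}

def pair_consistent(vector, left_pos, right_pos):
    """Do two end-read placements satisfy the vector's distance constraint?"""
    lo, hi = INSERT_RANGE_BP[vector]
    return lo <= abs(right_pos - left_pos) <= hi

# End reads landing 4.2 kb apart fit a pUC-sized insert but not a lambda one.
print(pair_consistent('pUC', 10_000, 14_200))     # True
print(pair_consistent('lambda', 10_000, 14_200))  # False
```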

For the initial phase, there were two major concepts on the table: shotgun and directed.  

Shotgun approaches would shatter an entire clone and sequence random pieces of it.  Not only could the original BAC be shotgunned, but, if desired, clones made from it in other vector types could be shotgunned as well.  Fragments are effectively drawn randomly from the input, so some regions will end up with excess coverage and others with too little or none.  The latter will require finishing.  Also, shotgun approaches will waste reads sequencing the vector, as well as background reads originating from contaminating E. coli host DNA.  Indeed, the genome of E. coli DH10B was generated from the host contamination in the bovine sequencing project (a concept that was floated at an AGBT as the "schmutz genome project").  Several different strategies for shotgunning were developed, such as physical shearing and transposon-based methods.
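The over- and under-sampling is easy to see with a quick simulation (a toy of my own, with made-up parameters): drawing reads uniformly at random from a 100 kb clone at modest depth leaves a gap fraction close to the classic Lander-Waterman expectation of e^-c.

```python
import math
import random

def shotgun_uncovered_fraction(target_bp=100_000, read_bp=500, depth=3.0, seed=0):
    """Simulate random shotgun reads and return the fraction of bases never covered."""
    rng = random.Random(seed)
    n_reads = int(depth * target_bp / read_bp)
    coverage = [0] * target_bp
    for _ in range(n_reads):
        start = rng.randrange(target_bp - read_bp + 1)   # uniform random read start
        for i in range(start, start + read_bp):
            coverage[i] += 1
    return sum(1 for c in coverage if c == 0) / target_bp

print("simulated gap fraction at 3x  :", shotgun_uncovered_fraction(depth=3.0))
print("Lander-Waterman estimate e^-3 :", math.exp(-3.0))
```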

Directed approaches attempted to "walk" across the clone.  Some sequencing approaches were inherently directed, but most were not.

Given that sequencing was seen as expensive and that shotgun sequencing would both over-sample some regions and under-sample others, some groups proposed converting shotgun libraries into something more resembling directed sequencing.  After making the libraries, the clones would be mapped using the same sorts of methods being used to map BACs. Then a tiling set of sequencing clones could be chosen to satisfy the final sequencing coverage requirements.  If successful, the amount of "wasteful" sequencing in excess of the coverage goals could be minimized and few if any finishing reactions would be required.
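The clone-selection step is essentially a minimum interval cover, which a greedy pass solves; the sketch below (my own illustration with made-up coordinates, not any center's actual software) picks, at each step, the mapped subclone that extends coverage furthest to the right.

```python
def minimum_tiling_set(clones, target_len):
    """Greedy minimum tiling path.
    clones: list of (name, start, end) positions on the parent clone."""
    chosen, covered_to = [], 0
    while covered_to < target_len:
        # consider subclones that overlap the current coverage frontier
        candidates = [c for c in clones if c[1] <= covered_to < c[2]]
        if not candidates:
            raise ValueError("gap in the map at position %d" % covered_to)
        best = max(candidates, key=lambda c: c[2])   # reaches furthest right
        chosen.append(best[0])
        covered_to = best[2]
    return chosen

# Hypothetical mapped subclones covering a 120 kb parent clone.
print(minimum_tiling_set(
    [("A", 0, 40_000), ("B", 30_000, 80_000),
     ("C", 35_000, 70_000), ("D", 75_000, 120_000)],
    120_000))                                        # ['A', 'B', 'D']
```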

Now some labs took this logic further, proposing multiple levels of this fragment-and-map strategy.  BACs could be broken into lambda clones and the lambda clones mapped into a minimum tiling set.  The lambda clones could then be broken into pUC clones and mapped into a minimum tiling set.  pUC clones would then be fragmented into M13 clones and those organized into an efficient covering for the final sequencing.  It's been a while; I can't remember if any of these groups proposed more levels.   Coming from a lab that disdained mapping, I quickly started thinking of this strategy as "map into the ground".

I stated before that sequencing was expensive.  However, these strategies replaced expensive sequencing with lots of library generation and cloning steps.  Furthermore, each clone here had to be picked by a robot, grown, and prepped for DNA.  Plus, all of those clones and libraries required tracking to make sure the recursive logic could be used in assembly, not to mention actually retrieving that final tiling set of clones.  This also illustrates an expense of finishing operations; tracking is key so that the correct primers are applied to the correct templates.

Each major sequencing lab initially chose a strategy and built for it.  So if a lab was focused on shotgunning the original BACs, it invested in that much simpler task.  Mapping-focused labs, which unsurprisingly tended to be the labs involved in the physical mapping efforts, spent much of their effort on getting all that recursive mapping working.

Note that only one of these strategies is truly focused on the goal of sequencing the genome.  The map-into-the-ground strategies spend a lot of effort trying to save on sequencing by doing a lot of non-sequencing activities.  With the original physical map generation, the tiling set of BACs was seen as a valuable asset, enabling further functional experiments.  In contrast, any subclones generated to support sequencing weren't seen as useful; most would be smaller than a gene, and in any case the overhead of tracking and storing all of them would be enormous.

It would require a proper historian of the genome project to analyze how quickly mapping-heavy strategies died, but die they did.  As more and more sequencing was performed, cost improvements in sequencing accrued.  Similar economies for the mapping did not.  The whole notion that sequence generation must be ultra-efficient diminished each time the cost of sequence generation dropped.  The coffin of map-into-the-ground was certainly nailed shut when Craig Venter's team at TIGR showed that an entire multi-megabase bacterium could be assembled by the shotgun approach.  Venter would then go on at Celera to shotgun and assemble the human genome.

It is worth noting, though, that even after Venter's success many model organism projects continued using the strategy of first generating a tiling set of BACs or cosmids and then sequencing those clones, well into at least the middle of the 2000s.  It would be interesting to go interview the leaders of those efforts to see why they did so.  Inertia?  Fear that their organism would not work well?  Desire for the BACs/cosmids as components of future experimental strategies?

Today, shotgunning has almost utterly taken over.  Modern sequencers generate enormous amounts of data at very low cost per base, whereas library construction has dropped in cost very slowly.  Long-range constraints on genomes are largely compiled via methods suitable for these sequencers, with mate-pair libraries on the wane but long-read sequencing, linked-read sequencing and Dovetail-type strategies dominating.  As noted yesterday, for most genomes physical maps are limited to restriction maps generated with BioNano Genomics or similar technologies.  Ordering of fragments is now entirely done in software, and clone pickers no longer fill sequencing laboratories.  But the story of sequencing technology belongs to the next installment, so I'll save any more discussion for then.

