Thursday, June 16, 2016

Writing Big

For the past few decades, since the early stirrings that led to the Human Genome Project, technology for reading DNA has received immense investment and attention.  Particularly noted, perhaps ad nauseam, is the rapid decline in sequencing costs, which has outpaced Moore's Law for microprocessors.  The converse activity of writing DNA has generally played second fiddle, but these past few weeks have seen a flurry of headlines on that topic.

One bit of news is that Boston synthetic biology startup Ginkgo Bioworks raised an additional $100 million in a Series C financing.  Ginkgo has been using synthetic biology to generate industrial compounds, particularly in the area of flavors and fragrances.

If you ever get a chance to visit their digs, do so -- they throw great parties coinciding with synthetic biology conferences hosted locally. They're in a still very industrial portion of Boston's booming Seaport District, overlooking the cruise port on one side and an active drydock on the other. I once saw the Queen Elizabeth II in that dry dock (she had run aground off Martha's Vineyard), and have since seen her successor the Queen Mary II, which makes for nice symmetry.  It's convenient also to Logan Airport, via the Ted Williams Tunnel -- sections of which I also once saw docked in the same channel as the cruise port.

That sort of industrial vibe fits well with Ginkgo, as they have been serious about industrializing synthetic biology.  Not only do they have an impressive fleet of liquid handling robots, but they have the LIMS to track every movement of every liquid.  So when things go wrong, troubleshooters have a wealth of information to sleuth out bad reagents or processes.

Undoubtedly some of Ginkgo's new investment haul will go to new machines and personnel, but a large fraction was immediately committed to DNA.  Ginkgo announced new arrangements with both of the major players in gene synthesis, Gen9 and Twist Bioscience, each for 300 megabases of DNA at around 3 cents a base.  That's right: Ginkgo has just committed nearly a fifth of that fundraising to buying over half a gigabase of DNA.  More than a few genomes of macroscopic eukaryotes are smaller than that!  This is serious construction!

What is Ginkgo going to do with all that?  Well, given that Codon was engaged in similar thinking in its terminal phases, I have a decent general idea -- and it is a lot of fun!  Take all the known biochemical reactions in the world, and then try to draw an energetically favorable biochemical path from typical central metabolic molecules to the molecule of interest.  You'll start with available biochemical pathway databases, but they are guaranteed to be incomplete so you'll also do some serious PubMed trawling to find interesting enzymes.
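That path-drawing exercise is, at heart, a graph search. As a minimal sketch (the reaction table and metabolite names below are purely illustrative placeholders, not real database entries), a breadth-first search over a substrate-to-product map finds a shortest enzymatic route from a central metabolite to a target, or reports that none exists:

```python
from collections import deque

# Toy reaction "database": each substrate maps to reachable products.
# A real search would draw on KEGG/MetaCyc and also check cofactor and
# energy balance; these entries are illustrative placeholders only.
REACTIONS = {
    "acetyl-CoA": ["malonyl-CoA", "acetoacetyl-CoA"],
    "malonyl-CoA": ["malonate semialdehyde"],
    "acetoacetyl-CoA": ["HMG-CoA"],
    "HMG-CoA": ["mevalonate"],
}

def find_path(start, target):
    """Breadth-first search from a central metabolite to the target."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for product in REACTIONS.get(path[-1], []):
            if product not in seen:
                seen.add(product)
                queue.append(path + [product])
    return None  # no plausible route -- give up

print(find_path("acetyl-CoA", "mevalonate"))
# → ['acetyl-CoA', 'acetoacetyl-CoA', 'HMG-CoA', 'mevalonate']
```

In practice the hard part is not the search but curating the reaction set, which is why the PubMed trawling matters.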

Now, at this stage you are probably in one of three states.  The worst would be no plausible reaction path, in which case you give up.  For example, as useful as it would be to enzymatically synthesize diamonds, there is probably not a route (no, I haven't rigorously explored this!).  At the other extreme is the case where there is clearly a plausible pathway that arrives at your destination.  The trivial example is a pathway that exists in nature and has been elucidated, but there are many more interesting cases in which mixing and matching enzymes from different pathways -- probably from different organisms -- generates the appropriate path.  Of course, you'll need to check for things like compatibility with redox states, aerobic vs. anaerobic growth, and such.

The third, messy possibility is that there isn't a complete path, but there are clearly near misses.  Some key pieces of your pathway -- most likely the final steps, but they could be anywhere -- are missing, but enzymes that catalyze almost the right reaction are known.  So, given the vast number of likely orthologs of those enzymes available in GenBank, you can envision synthesizing and screening huge numbers of them.  Now, that does assume that you have a reliable biochemical or genetic screen for activity -- if you don't, you should give up (one cool project at Codon foundered because the assay was utterly unreliable).

So in Case 3 you're going to want to synthesize a lot of homologous enzymes.  Actually, you'll probably want to do that in Case 2 as well -- you never know up front which enzyme is going to perform best.  Once you get a pathway roughed out and demonstrated to work, you're going to do a lot of searching for the correct balancing of the expression levels of the genes, and you may also want to explore mutagenesis of your co-opted enzymes to increase efficiency by reducing side product formation.  You may also prove out the system initially in a very facile host such as E. coli, but later move to a preferred production vehicle.  Clearly there are abundant opportunities to design and create a lot of DNA.
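That expression-balancing search is typically combinatorial: pick one regulatory part per gene, then build and screen every combination. A toy sketch (the gene and promoter names are hypothetical, not a real parts toolkit):

```python
from itertools import product

# Hypothetical parts list: three promoter strengths, three pathway genes.
promoters = ["weak", "medium", "strong"]
genes = ["enzA", "enzB", "enzC"]

# One design = a choice of promoter for each gene; enumerate them all.
designs = [dict(zip(genes, choice))
           for choice in product(promoters, repeat=len(genes))]

print(len(designs))  # → 27 constructs to build and screen
print(designs[0])    # → {'enzA': 'weak', 'enzB': 'weak', 'enzC': 'weak'}
```

Even this tiny example demands 27 constructs; add RBS variants and enzyme orthologs and the DNA order grows very quickly, which is partly where half a gigabase goes.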

Companies such as Ginkgo fit well into current gene synthesis capabilities.  The synthesis companies are getting pretty good at making typical gene-sized fragments, within certain bounds of %GC and particularly if you can codon optimize (as I've probably said before, codon optimization is often far more about making genes easy to make than about making them produce well).  There are numerous techniques for stitching those gene-sized (1-2 kb) pieces together into larger constructs, especially if you don't mind some "seams" in between.  Many techniques for building combinatorial arrays of pathways need such seams, so it all works well.  Pick an organism like E. coli with a nicely balanced G+C content, or perhaps Saccharomyces, which is G+C-poor but into which a lot of effort has been invested, and you are generally golden.  The fact that Ginkgo is using both suppliers is a wise caution, though, since the grapevine says both have had periodic hiccups in their production lines (a similar hiccup at Codon was definitely the beginning of the end of "retail" gene synthesis there).

So what is hard in gene synthesis?  High G+C.  Really, really big constructs without seams.  So only a fool would pick such targets to work on -- unless nature decreed it.  If you want to build polyketide antibiotic clusters from actinomycetes such as Streptomyces, things are hard.  Many of your constructs are rejected simply for high (~73%) G+C, and codon optimizing away from that is essentially impossible -- the codon usage table nearly always says "C or G in third position, please".  Even slightly A+T-rich codons like TTG for leucine are shunned in favor of CTG or CTC.  So even small genes -- even promoter blocks -- are often hard to get past the synthesis companies' filters.  Now try to make a 40 kb cluster with a large number of multi-kilobase repeats.  No seams allowed there, and a lot of fun building.
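Those vendor rejections usually come from automated acceptance filters. A crude stand-in for such a filter (the 65% cutoff is my assumption for illustration; real vendors apply more elaborate, windowed rules) shows why a Streptomyces-style coding sequence never gets through:

```python
def gc_content(seq):
    """Fraction of G+C in a DNA sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def passes_synthesis_filter(seq, max_gc=0.65):
    """Crude stand-in for a vendor acceptance filter (assumed cutoff)."""
    return gc_content(seq) <= max_gc

# A made-up fragment using GC-ending codons of the sort Streptomyces
# codon usage forces (illustrative sequence, not a real gene):
streptomyces_like = "ATGGCCGGCCTGCTCGACGGCGCCCGCACCGGC"
print(passes_synthesis_filter(streptomyces_like))  # → False
```

Codon optimization can't rescue this: the GC is baked into the organism's codon preferences, not just wobble positions.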

Worse, no matter what you're building, you need to verify the results -- and Illumina short reads are the dominant method.  Sanger has just gotten too expensive, and long-read technologies are not yet attractive for similar cost reasons.  So "don't paint yourself into a corner" is a watchword: at Codon we belatedly discovered one construct we could make but not easily prove correct, as it had a tandem repeat larger than the span of two Sanger reads.
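The Sanger limitation can be stated simply: a tandem repeat longer than what two reads can span -- one anchored in unique flanking sequence on each side -- cannot be uniquely verified. A sketch of that constraint (the 800 bp usable read length is a rough assumption):

```python
SANGER_READ_BP = 800  # rough usable read length; an assumption

def verifiable_by_sanger(repeat_unit_bp, copies):
    """True if a tandem repeat array can be spanned by two reads,
    one anchored in unique sequence on each flank."""
    return repeat_unit_bp * copies <= 2 * SANGER_READ_BP

print(verifiable_by_sanger(400, 3))  # 1200 bp array → True
print(verifiable_by_sanger(900, 4))  # 3600 bp array → False
```

Reads starting inside the array can't be placed unambiguously, so the middle copies are effectively unprovable without a longer-range method.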

There are also special classes of sequence which can be difficult to synthesize.  Active origins of replication are a major class, which means getting completely redesigned vectors built can be a challenge.  Perhaps a solution has now appeared, with the Church lab promoting a Vibrio as a cloning host; since many E. coli vectors apparently don't function there (at least that is implied), it could be a useful system for building such vectors.  But there are also problems with unexpectedly toxic sequences.  One fragment for an internal project at Codon kept failing at the cloning step.  We farmed it out to another gene synthesis company, which similarly failed (and you only pay on success!).  I can't remember if we inflicted it on a third company, but I did have this vision of this diabolical sequence marching across the landscape of synthesis companies, bankrupting each one in turn.

Now, even if you can buy DNA as inexpensively as Ginkgo is getting it -- and you need to order large to do so -- it is still not cheap.  For example, at Infinity (devastating recent news there, but that's for another post) I once wanted to run a cDNA screen using either a human or mouse cDNA library (I forget which; it was an important point then but not for this example) -- but with some clever barcoding to enable a high-throughput sequencing-based readout.  Commercially available ORFeome libraries lack the necessary barcoding, so let's go build the library!  Let's ballpark it: 25K ORFs times 2 kb per ORF times 0.03 dollars per base -- only $1.5M!  So I needed at least two orders of magnitude improvement in synthesis costs to make this crazy scheme remotely financially plausible.
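The ballpark arithmetic is worth writing out, using the figures from the post (the library size and mean ORF length are round-number assumptions):

```python
# Back-of-the-envelope cost of a barcoded ORFeome library at bulk
# synthesis pricing.  Library size and mean ORF length are the post's
# round-figure assumptions, not measured values.
n_orfs = 25_000
mean_orf_bp = 2_000
cost_per_base = 0.03  # dollars; Ginkgo's reported bulk rate

total = n_orfs * mean_orf_bp * cost_per_base
print(f"${total:,.0f}")        # → $1,500,000

# Two orders of magnitude cheaper would bring it into screening budgets:
print(f"${total / 100:,.0f}")  # → $15,000
```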

What makes synthesis so costly?  I'm not an expert, but all synthesis schemes today ultimately trace back to phosphoramidite oligonucleotide synthesis chemistry.  These reagents are exquisitely sensitive to traces of water, so all the reagents in the synthesis must be extremely pure.  This is a typical solid phase synthesis scheme, in which synthesis cycles (add, cap any ends that failed to extend, remove blocking groups -- with washes in-between key steps) build up the molecule of interest.  Your ultimate yield of product is the efficiency of that cycle raised to the power of the length of the oligo, which really makes you appreciate the genius that has gone into driving these chemistries to generate 200 mers in any quantity.
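That yield-versus-length relationship is brutal for long oligos; even tiny per-cycle losses compound over 200 couplings:

```python
def full_length_yield(coupling_efficiency, length):
    """Fraction of chains reaching full length after `length` couplings,
    assuming independent, equal per-cycle efficiency."""
    return coupling_efficiency ** length

# Even superb per-cycle chemistry leaves modest full-length 200-mer:
for eff in (0.990, 0.995, 0.999):
    print(f"{eff:.3f} per cycle -> "
          f"{full_length_yield(eff, 200):.1%} full length")
# roughly 13%, 37%, and 82% respectively
```

That exponential is why 200-mers are near the practical ceiling for phosphoramidite chemistry, and why longer constructs are assembled from oligo-sized pieces rather than synthesized outright.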

Both Gen9 and Twist have miniaturized and parallelized the synthesis steps so that small amounts of the precious phosphoramidites can be used to make huge numbers of oligos (Gen9, I believe, is using inkjet printing technology via their partner Agilent; Twist uses microengineered silicon wells, but I don't know how they direct reagents to them).  Only tiny amounts of the correct oligos are made, and they are in a sea of other oligos made on the same device, so each company has perfected means for retrieving and amplifying the correct oligos for a given project.  That's all amazing stuff, but one can't help but wonder how much further this particular technology can be driven.  Could the two further logs I wanted be plausible?

What alternatives are there?  Bigger building blocks would be one route.  Trinucleotide phosphoramidites can be found on the Internet, but they complicate the plumbing of any synthesis machinery, particularly if you want to support all 64 possibilities -- and they only triple the length of sequence that can be initially built.  These are really most valuable for amino acid randomization: with mononucleotide phosphoramidites the coupling efficiencies are not equal, so it can be difficult to achieve balanced codon frequencies.  Enzymatic synthesis could be an interesting route -- especially by getting rid of the need for phosphoramidites and the nasty and expensive solvents they require (a Millennium office was once flooded with oligonucleotide synthesizer waste from the lab upstairs, due to a missing collection vessel; almost everything in that office -- furniture, computers, etc. -- ended up in a waste dump somewhere).

Clive Brown of Oxford Nanopore announced such an initiative, and that could prove interesting, given that the sorts of microfluidics and electronic DNA manipulation and sensing ONT has been developing could be very valuable for directing gene fragments about and verifying them.  But certainly this is a hard problem, and ONT took a long time to get nanopore sequencing off the ground and broadly working, so one can't be impatient.  I've heard of others exploring enzymatic approaches as well.  One can hope, but these could be very distant dreams.

I haven't digested it yet, but the other huge splash (and no small controversy) recently in the synthetic biology space is a proposal to work towards synthesizing a human genome.  Certainly this fits into a grand scientific tradition that says you don't understand something until you can build it from scratch, but I'm not convinced yet that this is really a great goal.  While the proponents, I believe, are saying they would not try to actually generate a fully realized human from such a genome, it is hard to believe that this won't be attempted at least by some renegade, and the ethical implications are decidedly non-trivial.  Personally, I can think of a more biologically interesting mammalian genome to build if you are going to generate the organism, but I've already gone very long, so I won't let that tail of this idea wag the rest of the post.  In any case, even a haploid human genome is still five times bigger than what Ginkgo is ordering, and will certainly contain many of the problems I list above as well as worse ones (telomeres and centromeres are utter nightmares to contemplate building -- heck, we don't actually know their complete sequences yet!).


Marco Mena said...

Ah, the Codon memory lane! Thanks for a great post. It's interesting to see the field continue to evolve.

Jonathan Jacobs said...

FWIW - there's an industry day being hosted next week by IARPA that is somewhat directly related to this: FUN GCAT