When I started this space, I vowed not to become just a mouthpiece for my employer - but when my employer does something interesting, of course I'm not going to reflexively avoid writing about it. On the autonomous labs front, there's an interesting preprint showing how an intelligent agent interacting with an autonomous laboratory system can generate new knowledge. I'd like to think I would have covered it anyway, but since it is the autonomous laboratory at my workplace, I'm especially eager to do so. In a collaboration between Ginkgo and OpenAI, an agent was used to optimize a cell-free protein synthesis reaction using our autonomous laboratory based on Reconfigurable Automation Cart (RAC) technology. This isn't the first time such a thing has been done, but I think it has a number of interesting elements. Plus, optimizing complex biochemical reactions is a broadly interesting area to explore. And thinking up new services running on our autonomous laboratory is exactly my bailiwick, and I certainly reserve the right to use this space to be a mouthpiece for myself!
Cell-free protein synthesis has been around for many years, offering a number of useful advantages over expressing proteins in cells. The most obvious is that since it is cell-free, many modes of toxicity are avoided. It also doesn't require lysing cells, and the preparation of the cell-free translation mix has already removed some materials that would otherwise need to be purified away. Nor does it require transforming cells - you just put your DNA constructs into the soup. So it's very automation friendly and potentially very fast. Downsides include higher cost, lower yields, and perhaps losing important post-translational modification processes.
Cell-free translation mixes are mostly made from cells through a series of extraction processes. You can make them from any sort of cell, so there are E. coli, wheat germ, HeLa - you name it - cell-free translation systems. A Streptomyces cell-free translation system was just published - Warp Drive has permanently left virtual spores of those taxa in my brain. Or there is the PURE system, which uses a defined mixture of recombinant proteins - which is in itself a remarkable feat.
The Ginkgo-OpenAI team used an E. coli-based cell-free translation system which is a product of ours. To the extract you can add all sorts of different compounds at all sorts of concentrations - buffers, salts, nucleotides, nucleosides, energy sources, etc. Just thinking about the size of that parameter hyperspace rapidly brings on a headache. And what are you optimizing for? Cost or produced protein titer? There is also the question of scale - this preprint explored the parameter space at a 20 microliter reaction scale in 384-well plates and then executed some testing at larger scales.
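To make that headache concrete, here is a purely back-of-envelope illustration - the component count and concentration levels are numbers I picked out of the air, not anything from the preprint:

```python
# Back-of-envelope only: the component count and number of concentration
# levels below are invented for illustration, not taken from the preprint.
n_components = 20          # buffers, salts, nucleotides, energy sources...
levels_per_component = 5   # e.g. 0x, 0.5x, 1x, 2x, 4x of a reference level

combinations = levels_per_component ** n_components
print(f"{combinations:.2e} possible recipes")  # ~9.5e13 - no robot screens that exhaustively
```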
A time-tested approach to exploring such a complex parameter space is formal Design of Experiments, aka DoE. Roughly (I'm not a skilled practitioner), it means treating all your parameters as linear variables (perhaps after log-transforming), creating a matrix of initial conditions, measuring the output variable after that first run, and then using multi-variable linear modeling techniques to estimate the contribution of each factor and design a new matrix to hone the understanding of effects. I'm sure there are ways to deal with interactions between variables, but as I said I'm no expert in DoE - which is probably a sufficiently rich domain that one would need to study and practice it for many years to honestly call oneself one.
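To give a flavor of what I mean (with completely made-up factors and numbers - this is my own sketch, not anything from the preprint), a classic two-level factorial plus a linear fit looks roughly like this:

```python
import numpy as np
from itertools import product

# Toy illustration of the classic DoE flavor, not the paper's method: code each
# factor at a low (-1) and high (+1) level, run the full 2^3 factorial, then
# fit a linear model to estimate each factor's main effect. The factor names
# and the simulated "fluorescence" response are invented for illustration.
factors = ["Mg_acetate", "K_glutamate", "PEG_8000"]
design = np.array(list(product([-1, 1], repeat=len(factors))))  # 8 runs x 3 factors

rng = np.random.default_rng(0)
# Pretend the lab measured sfGFP fluorescence for each run (synthetic data).
response = 100 + 12 * design[:, 0] - 5 * design[:, 1] + 2 * design[:, 2] \
           + rng.normal(scale=3, size=len(design))

# Ordinary least squares: an intercept plus one coefficient per factor.
X = np.column_stack([np.ones(len(design)), design])
coefs, *_ = np.linalg.lstsq(X, response, rcond=None)

for name, beta in zip(["intercept"] + factors, coefs):
    print(f"{name:12s} effect estimate: {beta:+.1f}")
```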
OpenAI of course has other ideas about how to approach this. A machine learning model was used to design the initial matrix, and then the results of each round were fed back to the model - a lab-in-a-loop. The goal of the experiment was to minimize the reaction cost while producing at least a minimum amount of superfolder GFP (sfGFP). The team then tackled two key challenges of using machine learning models.
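The structure of such a lab-in-a-loop is simple to sketch, even though all of the interesting behavior hides inside the model. Everything below - component names, costs, the stand-in "agent" and "lab" functions - is my own toy illustration, not the actual Ginkgo/OpenAI code:

```python
import random

# A toy, self-contained lab-in-a-loop skeleton. The "agent" proposes random
# recipes and the "lab" returns a fake fluorescence readout; both are
# stand-ins I invented to show the loop structure, not the real components.
COMPONENTS = ["HEPES", "Mg_acetate", "NTP_mix", "amino_acids"]
UNIT_COST = {"HEPES": 0.01, "Mg_acetate": 0.02, "NTP_mix": 0.50, "amino_acids": 0.10}

def propose_recipes(history, n=8):
    # Placeholder for the agent: sample concentrations (mM) at random,
    # ignoring history (a real model would condition on it).
    return [{c: round(random.uniform(0, 10), 1) for c in COMPONENTS} for _ in range(n)]

def run_plate(recipes):
    # Placeholder for the autonomous lab: fake sfGFP fluorescence per recipe.
    return [(r, sum(r.values()) + random.gauss(0, 2)) for r in recipes]

def reagent_cost(recipe):
    return sum(UNIT_COST[c] * conc for c, conc in recipe.items())

def lab_in_a_loop(n_rounds=3, min_sfgfp=15.0):
    history = []
    for _ in range(n_rounds):
        results = run_plate(propose_recipes(history))
        history.extend((r, reagent_cost(r), titer) for r, titer in results)
    # Objective as framed above: the cheapest recipe clearing the titer floor.
    feasible = [h for h in history if h[2] >= min_sfgfp]
    return min(feasible, key=lambda h: h[1]) if feasible else None

print(lab_in_a_loop())
```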
First, the problem of hallucinations. Ideally the model has flexibility to explore, but not problematic flexibility. No ordering reactions with unobtanium, plutonium or antimatter. No requesting a TARDIS to enable more volume in a real-world physical space of 20 microliters. So the model was given a constrained list of molecules it could use, all of which could actually be sourced. It was also given, via a Python library called Pydantic, constraints on how much it could add of each. After three rounds of experimentation, the model was sent out into the literature with the instruction to identify further compounds to be added to its list of those available. After confirming these could be sourced, they were added to the experimental scope.
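I haven't seen the actual schema, but the sort of Pydantic guardrail described might look something like this - the component names and limits are mine, purely for illustration:

```python
from pydantic import BaseModel, Field

# A minimal sketch of the kind of guardrail described: every additive the
# agent may request gets hard bounds. Component names and limits here are
# invented for illustration, not the paper's actual schema.
class ReactionRecipe(BaseModel):
    hepes_mM: float = Field(ge=0, le=100)      # buffer
    mg_acetate_mM: float = Field(ge=0, le=20)  # divalent cation
    ntp_mix_mM: float = Field(ge=0, le=5)      # nucleotides / energy
    peg_8000_pct: float = Field(ge=0, le=4)    # crowding agent

# Anything outside the bounds raises a ValidationError before reaching a robot.
recipe = ReactionRecipe(hepes_mM=50, mg_acetate_mM=8, ntp_mix_mM=1.2, peg_8000_pct=2)
print(recipe)
```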
Second, the problem of interpretability. Machine learning models tend towards being extreme black boxes. So this agent was required to write human-readable lab notebook entries describing its choices.
So, best laid plans. Luckily not much gang agley - but enough to be a reminder that these scientist-like agents are much like a fresh first-year graduate student or undergraduate researcher - eager, thinking they know it all, and unburdened by actual experience. Twice the agent designed useless plates. Once due to that all-too-common trap of completely botching a unit conversion. Another time it found weaknesses in the Pydantic constraints and blitzed right through them - trying to overfill the wells.
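The overfilling escape suggests an obvious patch: validate the recipe as a whole, not just each component. Again, this is my own sketch of the idea (assuming Pydantic v2), not the team's actual fix:

```python
from pydantic import BaseModel, Field, model_validator

# My own sketch of a patch for the overfilling escape: validate the recipe as
# a whole so total dispensed volume cannot exceed the 20 microliter reaction.
# Component names and volumes are illustrative, not from the paper.
class PlateWell(BaseModel):
    extract_uL: float = Field(ge=0, le=20)
    energy_mix_uL: float = Field(ge=0, le=20)
    dna_uL: float = Field(ge=0, le=20)
    water_uL: float = Field(ge=0, le=20)

    @model_validator(mode="after")
    def total_volume_fits(self):
        total = self.extract_uL + self.energy_mix_uL + self.dna_uL + self.water_uL
        if total > 20:
            raise ValueError(f"{total} uL requested but the well holds only 20 uL")
        return self
```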
The end result was a 40% reduction in reaction cost. Interestingly, the model also succeeded in improving the protein titer by 27%. Sometimes the model found some truly basic truths - cheap HEPES buffer should be used in heaps. Others were more subtle - nucleotide monophosphates are more economical than nucleotide triphosphates (NTPs) - but some NTPs must still be included.
There are many immediate directions one could expand this work. The final conditions were tested on a limited number of additional proteins, and gave good results for many but not all. So the conditions found are not as universal as one might desire - though perhaps that is just part-and-parcel of cell-free protein synthesis. It would also be interesting to run the same experiment with other cell-free protein synthesis systems, both to optimize individual ones but also to see which of the principles from this experiment are shared across different systems.
More broadly, this approach could be applied to any complex biochemical reaction. So any enzymatic reaction used in a high throughput screen or a diagnostic assay or for anything else. While high throughput sequencing library preparation reactions aren't nearly as complex as even the PURE cell-free system, there are many components in these reactions - crowding agents, monovalent ions, divalent ions, nonionic detergents, oligos, enzymes, nucleotides, glycerol, etc. Or even who makes key components - there's an AGBT poster I'm looking forward to that explores the effects of different manufacturers' SPRI bead products on sequencing library preparation. There's also the whole aspect of incubation times. Library preparation reactions have a complex set of figures of merit - cost of goods, possible biases, required input DNA or RNA, speed, variability, etc.
The cell-free protein synthesis experiment had the advantage of always having clean input DNA. Sequencing library protocols don't exist in such a refined space. Contaminants from upstream DNA or RNA extraction are likely - ionic detergents, chaotropic salts, phenol, ethanol, heme, humic acids. Understanding the tolerance of a library reaction to these can be very valuable intelligence for a reagent provider - particularly if some of these interact in some unforeseen way. The worst outcome would be if the reaction worked with contaminant A or contaminant B - which the provider trumpets - but not if both are present.
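That worry is just a two-factor interaction, and it's the reason contaminants have to be tested in combination rather than one at a time. A toy example with made-up yields:

```python
from itertools import product

# Toy illustration (made-up yields) of why contaminants must be tested in
# combination: either contaminant alone looks tolerable, but the pair tanks
# the library.
def library_yield(ethanol_pct, heme_uM):
    yield_ = 100 - 10 * (ethanol_pct > 0) - 15 * (heme_uM > 0)
    if ethanol_pct > 0 and heme_uM > 0:
        yield_ -= 50  # the interaction term is the unforeseen part
    return yield_

for ethanol, heme in product([0, 2], [0, 5]):
    print(f"ethanol={ethanol}%  heme={heme} uM  ->  yield {library_yield(ethanol, heme)}")
```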
On an autonomous laboratory one could have a similar model-driven lab-in-a-loop exploring this complex space, able to access a wide variety of reaction components and model contaminants. If we had the n6tec iconPCR in a RAC, then one could explore (at 96-well scale) a wide variety of incubation regimes - the n6 folks let me know at SLAS that their software now supports operating every Peltier/well as an independent unit. If a 384-well plate could be fit into the system (requiring an adapter, perhaps?), then one could test four different reaction mixtures per thermal profile, which isn't as exciting as 384 different profiles but could still enable exploring a much larger number of thermal x chemistry combinations than being limited to only 96 per plate. Alas, n6 has decided to skip AGBT and focus on other conferences, so this late-occurring thought can't be discussed until a future date.
Sequencing offers a much richer output than the simple fluorescence intensity readout used in the OpenAI experiment, so one could look for biases in library fragment termini, GC content, or fragment length. And if I could get a sequencer integrated into the autonomous laboratory, just think how quickly the agent could run design-build-test-learn cycles!
Does this intrigue you? Get in touch with me - I certainly don't see the cell-free protein synthesis optimization paper as a one-off!