Genetic Code Expansion’s ML Edge Lives in Lab Data, Not Models

Machine learning for genetic code expansion is bottlenecked by proprietary lab data, not algorithms. Three platforms are racing to generate that data.

Machine learning for genetic code expansion has moved from academic curiosity to commercial platform in 2026. Genetic code expansion, or GCE, lets living cells build proteins from amino acids beyond the standard 20, and that chemistry already powers GLP-1 drugs such as semaglutide and a new generation of site-specific antibody-drug conjugates for cancer. The bottleneck for the machine learning is not algorithms. It is proprietary laboratory data that only a handful of companies can produce, generated inside experimental platforms that take years to build.

What Genetic Code Expansion Actually Is

Every living cell translates DNA into protein using essentially the same rulebook: 64 three-letter sequences called codons, mapped to 20 amino acids and a stop signal. GCE reprograms how a cell reads its DNA so that its ribosome incorporates non-canonical amino acids, or ncAAs, at positions defined in the DNA sequence. The technique has been implemented across microbes, mammalian cells, and animals for more than 200 distinct ncAAs.
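The rulebook is small enough to write out. The sketch below builds the standard codon table from its conventional TCAG ordering and then reassigns one codon. The choice of the amber stop codon TAG as the blank codon and the placeholder label `ncAA` are illustrative assumptions, not details from any specific platform.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)

# product() varies the last base fastest, matching the string above.
codons = ["".join(c) for c in product(BASES, repeat=3)]
standard = dict(zip(codons, AMINO_ACIDS))

stop_codons = sorted(c for c, aa in standard.items() if aa == "*")
canonical = set(standard.values()) - {"*"}

# Expanded code: copy the table and hand one stop codon to a new amino acid.
expanded = dict(standard)
expanded["TAG"] = "ncAA"  # hypothetical label for the non-canonical residue

print(len(standard), "codons,", len(canonical), "amino acids, stops:", stop_codons)
```

Everything except the one reassigned entry is untouched, which is the point of the orthogonal pair described next: the rest of the table has to keep working.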

Mechanistically, GCE works by adding a synthetic pair of biomolecules to the cell: an engineered aminoacyl-tRNA synthetase and an engineered tRNA that reads a blank codon the cell does not already use. The synthetase recognizes only the new amino acid and only its partner tRNA, so the rest of the cell’s machinery keeps functioning normally. Every additional amino acid the cell is asked to handle requires a new bespoke pair, and the engineering work multiplies with each one. The result is a system that can in principle incorporate any amino acid that a cell can be coaxed to charge onto a tRNA.

The work multiplies because each pair must integrate with a system that is already running a tightly choreographed process. A cell’s protein production machinery is highly interconnected, with little room for error but a large design space, and the same pair that performs well in E. coli can fail in yeast or mammalian cells. Adding one new amino acid is solvable. Adding twelve at once, in a cell-free system, is the kind of problem that resists intuition. A 2026 paper in Nucleic Acids Research from Takayuki Katoh and colleagues reached 32 distinct amino acids simultaneously in a reconstituted in vitro translation system, the 20 canonical plus 11 non-canonical elongators and one initiator, by carefully tuning tRNA modifications and translation conditions.

That cell-free system is one of three platforms racing to push GCE beyond academic demonstrations and into commercial products. The platforms differ in organism, throughput, and the kind of ncAA library they can produce, and those differences are reflected in the strategic bets the companies behind them are making.

The Industrial Pull: From GLP-1 Drugs to Site-Specific ADCs

GCE has stopped being a research project. Commercial products already depend on it, and the next round of investment is being shaped by what works today rather than what might work in five years. Semaglutide, the active ingredient in Ozempic and Wegovy, is a peptide that requires a non-canonical amino acid at its N-terminus, and Novo Nordisk manufactures it via solid-phase synthesis because no scalable microbial route yet exists. That single supply chain has been a recurring bottleneck for GLP-1 demand.

The therapeutic classes that already depend on ncAA chemistry include four broad categories. Each one is in commercial production today, and each one faces a different manufacturing and scale problem.

  • GLP-1 receptor agonists such as semaglutide, where an ncAA at the N-terminus drives the molecule’s half-life and receptor binding.
  • Site-specific antibody-drug conjugates, where an ncAA installed at a defined position gives chemists a single chemical handle to attach a cytotoxic payload without disrupting the antibody’s behavior.
  • Engineered industrial enzymes, where halogenated ncAAs such as trifluoroleucine have delivered a 27-fold improvement in half-life for chloramphenicol acetyltransferase and a 22°C rise in melting temperature for a fluorinated leucine zipper.
  • Macrocyclic peptide therapeutics, which use ncAA chemistry to constrain the peptide ring and tune its drug-like properties.

And the same chemical logic is what Unnatural Products, a California biotech developing orally delivered macrocyclic peptides against targets that small molecules cannot reach, is commercializing. The company closed a US$45 million Series B in March 2026, led by The Venture Collective, with argenx, Droia Ventures, and existing investors Merck Global Health Innovation Fund, Artis Ventures, and First Spark Ventures participating. The capital funds the company’s drug discovery platform, not a single molecule, and that is a signal that the platform itself, the ability to design and screen ncAA-bearing peptides, is now what investors are underwriting. A wider bet on peptide discovery is being run by PeptiDream, listed on the Tokyo Stock Exchange Prime Market, whose peptide discovery platform has anchored collaborations with multiple large pharmaceutical companies. Together with Unnatural Products, PeptiDream’s peptide work shows that ncAA chemistry has crossed from academic curiosity into therapeutic product pipelines.

The Engineering Bottleneck That Makes GCE an AI Problem

The complexity of a multi-ncAA system is combinatorial: every additional ncAA demands a new synthetase/tRNA pair, every pair must function orthogonally to every other pair, and every orthogonal pair must still cooperate with the cell’s native translation apparatus. Orthogonality means that the engineered pair does not interact with the cell’s existing machinery except in the way it is designed to, which is the property that makes GCE possible at all.
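A rough count shows how fast the orthogonality requirement grows. The sketch below uses a deliberately simplified model: it counts only the ordered combinations of one engineered synthetase with another pair's tRNA, and it ignores cross-talk with the host's native synthetases and tRNAs, which would add further checks. The function name is a hypothetical one chosen for illustration.

```python
def cross_reactions(n_pairs: int) -> int:
    """Ordered (synthetase, non-cognate tRNA) combinations that must stay inactive."""
    return n_pairs * (n_pairs - 1)

for n in (1, 2, 4, 12):
    print(f"{n:>2} pairs -> {cross_reactions(n):>3} cross-reactions to rule out")
```

Under this model one pair has nothing to check against, while twelve pairs imply 132 combinations that each have to be shown inactive, before any interaction with the host is considered.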

The traditional approach to designing synthetase/tRNA pairs starts from a known orthogonal scaffold, most often the pyrrolysyl-tRNA synthetase from Methanomethylophilus alvus, and mutates its amino acid binding pocket until it recognizes the new ncAA. The cycle takes months of benchwork per pair, and a single round of positive and negative selection can produce a small library where most variants do not pass muster. The number of failed variants outnumbers the working ones by orders of magnitude, and the cycle is hard to parallelize.

And the bottleneck grows nonlinearly with ambition. A cell-free system targeting 32 amino acids needs 12 orthogonal pairs that all coexist, plus engineered translation conditions that keep the native ribosome happy. A genome-recoded organism that frees up multiple codons for reassignment needs every protein in the cell to be rewritten so it no longer uses the freed codons, a multi-year effort that requires sustained infrastructure. Constructive Bio’s Syn61 E. coli strain replaced 18,214 instances of three codons (TAG, TCG, and TCA) across the entire 4-million-base-pair genome before any new amino acid could be assigned to them.
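At the level of a single gene, recoding is a synonymous find-and-replace done in reading frame. The sketch below is a toy version. The specific replacement codons (TCG to AGC, TCA to AGT, TAG to TAA) are given here from memory of the published Syn61 design and should be treated as an assumption to check against the primary paper; the function name and the example sequence are hypothetical.

```python
# Synonymous swaps assumed for illustration: two serine codons and the amber stop.
RECODING = {"TCG": "AGC", "TCA": "AGT", "TAG": "TAA"}

def recode_orf(orf):
    """Replace target codons in-frame; return (new sequence, number of swaps)."""
    if len(orf) % 3 != 0:
        raise ValueError("ORF length must be a multiple of 3")
    out, swaps = [], 0
    for i in range(0, len(orf), 3):
        codon = orf[i:i + 3].upper()
        if codon in RECODING:
            codon = RECODING[codon]
            swaps += 1
        out.append(codon)
    return "".join(out), swaps

toy_orf = "ATGTCGGCATCATAG"  # hypothetical ORF: Met-Ser-Ala-Ser-stop
recoded, n_swaps = recode_orf(toy_orf)
print(recoded, n_swaps)
```

The real effort is harder than this loop suggests, because genes overlap, regulatory sequences sit inside coding regions, and every one of the 18,214 edits has to leave the cell viable.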

This is exactly the regime where machine learning performs best: many decisions, tight coupling between prediction and physical reality, large amounts of structured data on what worked and what did not. The reason AlphaFold succeeded is not that its neural network was clever in the abstract. It succeeded because two public databases held decades of experimental data when its training started.

The In-House Data Flywheel: Why ML Compounds Here

Here is the piece that does not show up in the academic literature. The data that trains a GCE machine learning model is proprietary, in a literal sense. Unlike the sequence data that populate public databases such as GenBank or UniProt, the screening data that comes off a GCE platform cannot be downloaded from outside sources, and it cannot be replicated without building the same physical experimental infrastructure. It can only be produced through specific laboratory systems that take years and significant resources to develop.

This is what makes these platforms so strategically valuable. AI learns from what worked and what did not, and the tight coupling between prediction and physical reality in GCE is exactly the environment where machine learning compounds most effectively. Proteins either function under manufacturing conditions or they don’t, enzymes either maintain stability at 70°C or they degrade, and the screening data records those outcomes directly, measurement noise included. Each cycle of design, synthesis, screening, and measurement produces new training data. Each round of training improves the next round of design, and the speed advantage compounds with every cycle.
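The loop can be sketched as a toy simulation. Everything here is an assumption made for illustration: the four-letter alphabet, the additive hidden fitness function standing in for a wet-lab assay, the noise level, and the very simple residue-averaging model. It is not any company's pipeline, and it is purely greedy, with no exploration step.

```python
import random

random.seed(0)
ALPHABET = "ACDE"  # toy residue alphabet (hypothetical)
LENGTH = 4         # toy binding-pocket size (hypothetical)

# Hidden ground truth standing in for the laboratory assay; the model never sees it.
TRUE_WEIGHTS = {(i, a): random.random() for i in range(LENGTH) for a in ALPHABET}

def assay(variant):
    """Simulated screen: additive fitness plus measurement noise."""
    signal = sum(TRUE_WEIGHTS[(i, a)] for i, a in enumerate(variant))
    return signal + random.gauss(0, 0.1)

def predict(variant, data):
    """Score a variant by the mean measured fitness of variants sharing each residue."""
    score = 0.0
    for i, a in enumerate(variant):
        hits = [y for v, y in data if v[i] == a]
        score += sum(hits) / len(hits) if hits else 0.0
    return score

def run_cycles(n_cycles=5, batch=8, pool=200):
    data, best_per_cycle = [], []
    for _ in range(n_cycles):
        candidates = ["".join(random.choice(ALPHABET) for _ in range(LENGTH))
                      for _ in range(pool)]
        # Design: rank candidates with the current model (all scores tie on cycle one).
        candidates.sort(key=lambda v: predict(v, data), reverse=True)
        # Test and learn: measure the top of the ranking and grow the corpus.
        data.extend((v, assay(v)) for v in candidates[:batch])
        best_per_cycle.append(max(y for _, y in data))
    return data, best_per_cycle

data, best = run_cycles()
print(len(data), "measurements; best fitness per cycle:", [round(b, 2) for b in best])
```

The `data` list is the asset. The model in `predict` is trivial and could be swapped for anything, but it can only ever be as good as the measurements accumulated by running `assay`.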

Take the antibody-drug conjugate example. The current design process still relies heavily on trial and error to optimize the chemical attachment of a therapeutic payload to an antibody. A platform pairing high-throughput GCE with machine learning analysis could screen thousands of ncAA variants in parallel, test their performance against real biological targets, and refine its predictions based on the results. A separate project from UC San Diego has already shown, in a different domain, that a single foundation model predicting cancer treatment response can deliver actionable signals from genomic data.

Three Platforms Racing to Generate the Ground Truth

The companies building GCE platforms in 2026 are organized around different bets on where the data flywheel spins fastest. Three of the most prominent differ in organism, in throughput, and in the kind of ncAA library they can produce. The first, OrthoRep, runs in yeast; the second, Syn61, runs in E. coli; the third, the cell-free system from the Katoh group, runs in a reconstituted in vitro translation mix with no living cell at all. The differences are summarized in the table below, with the underlying numbers detailed in the prose that follows.

| Platform | Organism | Approach | Key feature |
| --- | --- | --- | --- |
| OrthoRep-driven aaRS evolution | Yeast (S. cerevisiae) | Hypermutating orthogonal DNA replication | Continuous diversification of synthetase libraries |
| Constructive Bio’s Syn61 | E. coli | Whole-genome codon replacement | Three freed codons for multi-ncAA assignment |
| Katoh et al. cell-free system | In vitro translation | Codon box division with engineered tRNAs | Multiple ncAAs in a single peptide at the same time |

The OrthoRep approach uses an error-prone orthogonal DNA polymerase in the yeast S. cerevisiae to mutate a synthetase gene at a rate of 10⁻⁵ substitutions per base while sparing the host genome. Across eight independent evolution campaigns, the platform yielded synthetase variants that incorporate 13 distinct ncAAs, with some variants matching the cell’s natural codon-anticodon pairing efficiency. As described in the paper on OrthoRep-driven evolution of synthetases, the expression of one evolved variant is itself regulated by the ncAA, an emergent autoregulatory mechanism the authors did not design for. The implication is that orthogonal evolution can produce regulatory behavior that would take years of rational design to match.
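To get a feel for what a rate of 10⁻⁵ substitutions per base means for one gene, the arithmetic can be sketched as follows. Two inputs are assumptions, not figures from the source: the rate is treated as per generation, and the 1,300 bp gene length is a hypothetical round number for a synthetase-sized gene.

```python
RATE_PER_BASE = 1e-5   # substitutions per base, from the text; per-generation unit assumed
GENE_LENGTH_BP = 1300  # hypothetical synthetase gene length

def expected_substitutions(rate, length_bp, generations=1):
    """Mean number of substitutions accumulated in the gene."""
    return rate * length_bp * generations

def prob_at_least_one(rate, length_bp, generations=1):
    """Chance that a given gene copy carries one or more new substitutions."""
    return 1 - (1 - rate) ** (length_bp * generations)

per_generation = expected_substitutions(RATE_PER_BASE, GENE_LENGTH_BP)
after_100 = expected_substitutions(RATE_PER_BASE, GENE_LENGTH_BP, generations=100)
p_hit = prob_at_least_one(RATE_PER_BASE, GENE_LENGTH_BP)

print(f"per generation: {per_generation:.3f} substitutions, P(>=1) = {p_hit:.4f}")
print(f"after 100 generations: {after_100:.1f} substitutions on average")
```

Under those assumptions roughly one gene copy in eighty picks up a new substitution each generation, and because the diversification runs continuously inside a growing culture, the library refreshes itself without a researcher rebuilding it between rounds.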

Constructive Bio is the commercial vehicle for Jason Chin’s Syn61 work at the MRC Laboratory of Molecular Biology. The strain replaces 18,214 instances of three codons across the entire 4-million-base-pair E. coli genome, freeing them for ncAA assignment and producing a host that is resistant to bacteriophage contamination and genetically isolated from wild bacteria. Constructive Bio’s Syn61 platform also aims to manufacture ncAA-containing peptides through bacterial synthesis rather than solid-phase chemistry.

The third platform, the cell-free system, is framed by its authors as offering vast potential for generating diverse macrocyclic peptide libraries with unique chemical entities for drug discovery.

The framing is from Takayuki Katoh and colleagues’ 2026 paper in Nucleic Acids Research, which describes the cell-free system in detail. The Katoh group used engineered tRNAs lacking native nucleotide modifications to split four-codon boxes in two, assigning each half to a different amino acid in a reconstituted in vitro translation mix. With translation conditions optimized, they incorporated 11 non-canonical elongators and one initiator, N-chloroacetyl-D-tyrosine, which forms a thioether bond with a downstream cysteine to drive peptide macrocyclization. The cell-free 32-amino-acid system from Katoh et al. is presented as a route to diverse macrocyclic peptide libraries for drug discovery, and it is the first platform to assemble this many amino acids simultaneously without a living cell.
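The idea of splitting a four-codon box can be illustrated with the standard codon table. The sketch below finds the boxes in which all four third-position bases encode the same amino acid, then halves one of them. The particular split used here, pyrimidine-ending versus purine-ending codons, is an assumption chosen for illustration; the paper's actual codon assignments are not reproduced.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
standard = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def four_codon_boxes(table):
    """Return {two-base prefix: amino acid} for boxes where all four codons agree."""
    boxes = {}
    for b1, b2 in product(BASES, repeat=2):
        prefix = b1 + b2
        encoded = {table[prefix + b3] for b3 in BASES}
        if len(encoded) == 1:
            boxes[prefix] = encoded.pop()
    return boxes

def split_box(prefix):
    """Halve a box by third base: pyrimidine-ending (T/C) vs purine-ending (A/G)."""
    return [prefix + b for b in "TC"], [prefix + b for b in "AG"]

boxes = four_codon_boxes(standard)
keep, reassign = split_box("GT")  # the valine box, as an example

print(len(boxes), "four-codon boxes:", boxes)
print("keep for Val:", keep, "| free for a new amino acid:", reassign)
print("total amino acids in the reported system:", 20 + 11 + 1)
```

Each split leaves the canonical amino acid with two codons and frees two for a newcomer, which is why the approach needs tRNAs that discriminate at the third base instead of reading the whole box.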

The AlphaFold Parallel: Why the Database Does Not Exist Yet

AlphaFold’s success depended on two databases that took decades to assemble. The Protein Data Bank holds the experimentally determined three-dimensional structures of proteins. UniProt holds the sequences and functional annotations. AlphaFold found patterns in data that already existed.

For proteins that incorporate non-canonical amino acids, the equivalent foundation does not yet exist. There is no public archive of ncAA-bearing protein structures, no equivalent of UniProt’s sequence archive or of the 214 million predicted structures that AlphaFold DB held as of September 2023, and no canonical benchmark for how an ncAA affects folding, stability, or binding. The data that would feed such a foundation can only be produced by organizations running GCE platforms, and those organizations have no incentive to release it. The result is a structural gap that the field is aware of and that is unlikely to be closed by any one player, because every organization producing ncAA-bearing data is using it to train its own models. Constructive Bio’s white paper frames Syn61 as a platform that generates proprietary data on ncAA-containing proteins, and the strategic implication is that the data is the product.

GRO Biosciences, a Cambridge, Massachusetts startup commercializing George Church’s genomic recoding work, is building its own organism with seven codons reassigned, enough to install four distinct non-standard amino acids in the same protein at the same time. Each of these platforms will produce, over the next decade, a private training corpus that no public mirror will match. And AlphaFold DB’s 214 million predicted structures remain, for the moment, entirely a product of the standard 20 amino acids. Until that data exists at scale for ncAA chemistry, the AlphaFold approach cannot be transferred to the GCE field without rebuilding the underlying data, which is exactly the work the proprietary platforms are now doing.

What Would Change the Equation

The trajectory of GCE and machine learning over the next five years depends on a small number of concrete decisions. Each is being made by named players, and each has a measurable signal that investors and researchers can track. The first signal is whether any GCE platform opens even a narrow band of its screening data to the public, which would let the broader field build models on shared ground. The second is whether the cell-free 32-amino-acid system moves out of the academic lab and into a startup, which would test whether the manufacturing economics differ enough from the Syn61 and OrthoRep playbooks to support a new company. The third is whether a peptide currently made by Novo Nordisk’s solid-phase process is instead made at fermentation scale by a recoded organism, which would be the inflection point at which GCE stops being a tool and becomes a manufacturing substrate.

  1. Public archive of ncAA structures. No public consortium has funded a Protein Data Bank equivalent for ncAA chemistry. If a coalition of academic groups assembled a public archive of even 10,000 ncAA-bearing protein structures with their measured stabilities, the AlphaFold-style models that emerged would not belong to any one company. The data does not yet exist at that scale, and the platforms producing it are not motivated to release it.
  2. Commercial bet on cell-free ncAA manufacturing. The Katoh platform produces macrocyclic peptides directly, without living cells, and the manufacturing economics differ from anything in the Syn61 or OrthoRep playbooks. If a company is founded around the platform in 2026 or 2027, it will be the first commercial bet on cell-free ncAA manufacturing at this scale, and it will not need to free codons from a genome first. The capital intensity is lower than for a recoded organism, and the product mix is narrower but more controllable.
  3. GLP-1 peptide made by fermentation. Constructive Bio has stated that its goal is to manufacture ncAA-containing peptides through bacterial synthesis rather than solid-phase chemistry. If a peptide currently made by Novo Nordisk’s existing process is made at fermentation scale by a recoded organism, that is the inflection point at which GCE becomes a manufacturing substrate. The patterns behind why most lab AI pilots never reach production will determine which of these bets actually ships.

And the hidden variable is the same one that defined AlphaFold’s path. The organization that produces the most useful data first will define the model everyone else trains on. The race is being run in physical laboratories, and the participants are not the ones publishing the largest language models.

Frequently Asked Questions

What is genetic code expansion?

Genetic code expansion is a set of techniques that reprogram a living cell or a cell-free system so its ribosome incorporates non-canonical amino acids, amino acids beyond the standard 20, at positions defined in the DNA. The standard tools are an engineered aminoacyl-tRNA synthetase and its partner tRNA, which together recognize a blank codon the cell does not use. The technique has been demonstrated for more than 200 distinct non-canonical amino acids in organisms ranging from E. coli to mammalian cells.

Which commercial drugs already use non-canonical amino acids?

GLP-1 receptor agonists such as semaglutide require a non-canonical amino acid at their N-terminus and are manufactured today by solid-phase synthesis rather than fermentation. Site-specific antibody-drug conjugates use genetically encoded non-canonical amino acids as chemical handles for cytotoxic payloads. Macrocyclic peptide therapeutics, the focus of companies such as Unnatural Products and PeptiDream, are an active therapeutic class where non-canonical amino acids constrain the peptide ring.

How does machine learning help with genetic code expansion?

Machine learning models can predict which aminoacyl-tRNA synthetase mutations will recognize a new non-canonical amino acid, which orthogonal pairs will coexist inside one cell, and which ncAA substitutions will stabilize a protein under industrial conditions. The predictions are then tested experimentally, and the results feed the next round of training. The performance of the loop depends on the volume and quality of proprietary screening data, which is why platforms with the most extensive laboratory infrastructure tend to produce the best models.

What is the bottleneck for scaling GCE in industry?

The bottleneck is the engineering of orthogonal synthetase/tRNA pairs and the integration of multiple such pairs inside a single production host. Each pair is bespoke, each pair must not interact with the cell’s native translation machinery except in the way it is designed to, and each pair must function at the yield required for biomanufacturing. The traditional approach, directed evolution by sequential selection rounds, takes months per pair. Machine learning can compress that cycle, but only if the platform that generates the training data is already running.

Why can’t public AI models like AlphaFold be applied to GCE?

AlphaFold was trained on the Protein Data Bank and UniProt, two public databases built over decades of experimental work on proteins made from the standard 20 amino acids. There is no equivalent public archive of structures for proteins that contain non-canonical amino acids, no UniProt equivalent for ncAA chemistry, and no canonical benchmark for ncAA effects on folding or binding. AlphaFold-style models do not work for GCE because the training data does not yet exist at scale.

Which companies are leading GCE platform development?

Constructive Bio in Cambridge, UK, commercializes the Syn61 E. coli strain from Jason Chin’s laboratory at the MRC Laboratory of Molecular Biology. GRO Biosciences in Cambridge, Massachusetts, commercializes George Church’s genomic recoding work. Unnatural Products in California focuses on macrocyclic peptides and closed a US$45 million Series B in March 2026. PeptiDream, listed on the Tokyo Stock Exchange Prime Market, has built a substantial peptide discovery platform on related chemistry.

Disclaimer: This article is for informational purposes only and does not constitute investment advice. Biotech and AI investments carry substantial risk, including the possibility of total loss of capital. Figures and company statuses are accurate as of publication.

Logan Pierce is a writer and web publisher with over seven years of experience covering consumer technology. He has published work on independent tech blogs and freelance bylines covering Android devices, privacy focused software, and budget gadgets. Logan founded Oton Technology to publish clear, no nonsense tech news and reviews based on real hands on testing. He has personally tested and reviewed dozens of mid range and budget Android phones, written extensively about app privacy, and built and managed multiple WordPress publications over the past decade. Logan holds a bachelor's degree in English and studied digital marketing at a certificate level.
