Keeping up with the protein designers

May 11, 2023

As machine learning methods leapfrog each other, protein design companies must navigate the benefits and challenges of an ever-evolving toolbox

BY KAREN TKACH TUZMAN, DIRECTOR OF BIOPHARMA INTELLIGENCE

The rapid evolution of de novo protein design methods means companies must remain poised to “bake-off” new approaches as they come out, while weighing the value of keeping innovations proprietary versus staying as current and compatible as possible by being open.

The past few years have seen big steps forward in several types of computational technologies related to protein design, including those that generate desired scaffolds, those that identify amino acid sequences capable of forming those scaffolds, and those that predict the final conformations of amino acid sequences (see Figure).

David Baker’s Institute for Protein Design (IPD) lab at the University of Washington, which is among the biggest drivers of these innovations, essentially replaced some of its own scaffold generation technology in a matter of months.

Baker told BioCentury his lab’s RF Diffusion method, whose code has gotten “many thousands of downloads” since it became available online March 30, has already partially displaced the “in-painting” and “constrained hallucination” approaches his team published in Science just last July.

“This field is very cruel, it moves very fast,” he said. “They’re no longer the favorite children.”

Part of the reason the field is progressing so quickly is that in shifting its underlying technology from physical models to machine learning, it can borrow techniques from a wider range of industries. “Solutions to problems are now coming from efforts in completely different disciplines,” Baker said.

For example, IPD’s RF Diffusion method, and another diffusion model approach from Generate Biomedicines Inc. named Chroma, both use a “de-noising” technique similar to the one used by the DALL-E image generation software. The Baker lab’s latest paper, published in Science on April 20, applies a reinforcement learning approach similar to those used to solve games such as chess and Go.

IPD spinouts Outpace Bio Inc. and Monod Bio Inc. routinely test new protein design innovations as they emerge, remaining open to methods that out-perform their original technologies. “The software itself is not what’s differentiating, it’s how we use it,” said Outpace co-founder and CEO Marc Lajoie.

Biotechs have always had to deal with a tension between investing in technology innovation and bringing a product to market as efficiently as possible, but AI’s rapid evolution has made the issue particularly acute for companies where these methods are fundamental to the earliest stages of pipeline development.

For de novo biologics company Generate, how to balance platform innovation and product development was a key question the company considered early on, co-founder and CTO Gevorg Grigoryan told BioCentury.

“What we found was that making those two things asynchronous was a good recipe,” he said. “The innovative engine can move at its erratic pace, and the portfolio programs proceed at the pace dictated by the biological and medical questions.”

Another question companies face is whether and how to share their methods with the protein design community.

While one of the IPD spinouts, biosensor company Monod, has not disclosed its strategy for balancing internal technology development with the open innovation ecosystem, Outpace, which uses protein design to engineer modular controls for T cell therapies, has leaned into the open exchange of new methods as a way to ensure access to the most advanced technologies.

Outpace is a co-founder and member of the OpenFold Consortium, an AI ecosystem for biology tools with a permissive software license model that allows commercial and non-commercial use. Other drug developers with disclosed membership are Bayer AG (Xetra:BAYN), Charm Therapeutics Ltd. and Cyrus Biotechnology Inc.

According to Lajoie, companies that “fork” their code away from the open innovation ecosystem can end up with methods that are no longer backward-compatible. “You can’t access the innovation others are making, that’s the price you pay,” he said.

Grigoryan said Generate plans to publish the code behind its Chroma method when its peer-reviewed paper comes out.

“We’re excited to maximize the impact of the capability,” he said, noting the company believes it will retain a competitive advantage via its use of specialized, non-public data for training models. “What’s going to become the differentiating aspect is the ability to condition the generative process based on things that are bespoke.”

Model progress

Diffusion models from IPD and Generate, described in separate preprints released Dec. 1, represent a leap forward in the capacity to design protein backbones with desired shapes, without requiring any examples in nature.

Grigoryan believes diffusion models are creating a “seismic shift” in protein design by enabling exploration beyond the roughly 10,000-20,000 protein folds and sequence classes used by evolution. “We can wade through productive protein space more efficiently and effectively now,” he said.

The approach has caught on quickly among practitioners. “The diffusion-based methods make the best scaffolds,” Lajoie said.

The strategy starts with a random, noisy distribution of amino acids, then successively removes noise in a manner conditioned on all available protein structure and protein structure prediction data, ultimately producing a desired scaffold shape.
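The iterative de-noising idea can be illustrated with a minimal Python sketch. This is a toy under stated assumptions, not the RF Diffusion or Chroma implementation: the "denoiser" is a hypothetical stand-in that simply returns a known target (an idealized helix), whereas a real model is a trained network conditioned on structure data. All function names and parameters here are illustrative.

```python
import math
import random


def ideal_helix(n_residues, radius=2.3, rise=1.5, turn_deg=100.0):
    """Toy 'desired scaffold': points spaced along an idealized helix."""
    points = []
    for i in range(n_residues):
        angle = math.radians(turn_deg * i)
        points.append((radius * math.cos(angle), radius * math.sin(angle), rise * i))
    return points


def toy_denoiser(noisy_coords, target):
    """Hypothetical stand-in for a trained network.

    A real model would predict the clean structure from the noisy input
    and its conditioning data; this toy simply returns the known target.
    """
    return target


def rmsd(a, b):
    """Root-mean-square deviation between two equal-length point lists."""
    total = sum((x - y) ** 2 for p, q in zip(a, b) for x, y in zip(p, q))
    return math.sqrt(total / len(a))


def reverse_diffusion(n_residues=20, steps=50, seed=0):
    rng = random.Random(seed)
    target = ideal_helix(n_residues)
    # Start from pure Gaussian noise: no structure at all.
    coords = [tuple(rng.gauss(0.0, 10.0) for _ in range(3)) for _ in range(n_residues)]
    start = list(coords)
    for t in range(steps, 0, -1):
        guess = toy_denoiser(coords, target)
        blend = 1.0 / t                # move a growing fraction toward the guess
        noise = 0.5 * (t - 1) / steps  # re-injected noise shrinks to zero at t = 1
        coords = [
            tuple(c + blend * (g - c) + rng.gauss(0.0, noise) for c, g in zip(p, q))
            for p, q in zip(coords, guess)
        ]
    return start, coords, target
```

Calling `reverse_diffusion()` returns the noisy starting points, the final points and the target; comparing `rmsd(start, target)` with `rmsd(final, target)` shows the noise being removed step by step until the points settle onto the scaffold.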

In contrast, prior approaches have been based on Monte Carlo simulations, which rely on repeated random sampling of all the possible ways of building a protein; these approaches are slower and more computationally intensive than diffusion modeling, and produce a more limited set of outputs.
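For comparison, repeated random sampling can be sketched as a small Metropolis-style Monte Carlo search in Python. This is an illustrative toy, not any published protein design protocol: the "energy" is a hypothetical score counting mismatches against a target pattern, standing in for a physics-based energy function, and the state is a short list of integers rather than a protein.

```python
import math
import random


def toy_energy(state, target):
    """Hypothetical energy: number of positions that differ from a target pattern."""
    return sum(1 for s, t in zip(state, target) if s != t)


def metropolis_search(target, n_states=3, steps=2000, temperature=0.5, seed=0):
    rng = random.Random(seed)
    state = [rng.randrange(n_states) for _ in target]
    energy = toy_energy(state, target)
    start_energy = energy
    best_state, best_energy = list(state), energy
    for _ in range(steps):
        # Propose a random change at one position.
        pos = rng.randrange(len(state))
        old = state[pos]
        state[pos] = rng.randrange(n_states)
        new_energy = toy_energy(state, target)
        delta = new_energy - energy
        # Metropolis rule: always accept improvements, sometimes accept worse moves.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            energy = new_energy
            if energy < best_energy:
                best_state, best_energy = list(state), energy
        else:
            state[pos] = old  # reject: restore the previous value
    return start_energy, best_state, best_energy
```

Each step changes one position at random and re-scores the whole state, which hints at why this style of search can need many iterations as the number of positions and options grows.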

IPD’s preprint describing RoseTTAFold (RF) Diffusion experimentally characterized hundreds of designs, including a picomolar-affinity binder to parathyroid hormone, and novel symmetric assemblies experimentally confirmed by electron microscopy.

Generate’s preprint on Chroma highlighted the method’s capacity for generating protein structures with desired symmetry, substructures and shapes. Key features include the efficiency with which the method scales with protein size, and its ability to assign each design a likelihood of actually producing a protein, said Grigoryan. He added that users can tune the approach for greater certainty, which reduces structure diversity, or vice versa.

While diffusion modeling is becoming a go-to approach for many scaffold design problems, other methods are still providing optimized solutions for specific types of challenges, with the Baker lab’s two latest papers providing examples.

The team’s reinforcement learning method, published in Science, uses Monte Carlo simulations for top-down generation of scaffolds for multi-subunit protein assemblies, such as icosahedral nanoparticles for vaccine or therapeutic delivery.

This is a “step-change” from prior bottom-up strategies that designed individual subunits before putting them together, said Baker, noting those designs tended to be “spindly,” with undesired holes.

“Now what we can do is design the individual subunits with the constraint that they fit together perfectly,” he said.

The approach builds the desired scaffold in a series of “moves,” and the weights assigned to those moves are adjusted by the extent to which they give rise to the desired shape.
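The idea of building a shape move by move, and re-weighting moves by how well the result matches the goal, can be sketched in Python. This is a deliberately crude reward-weighted sampler, not the Baker lab's published method: the moves (go straight, turn left, turn right on a 2D grid), the reward (how close the path comes to closing into a loop) and the update rule are all assumptions chosen for illustration.

```python
import math
import random

MOVES = ("straight", "left", "right")
TURN = {"straight": 0, "left": 1, "right": -1}
HEADINGS = [(1, 0), (0, 1), (-1, 0), (0, -1)]


def build_path(moves):
    """Walk the grid from the origin, applying one move per step."""
    x, y, heading = 0, 0, 0
    points = [(x, y)]
    for move in moves:
        heading = (heading + TURN[move]) % 4
        dx, dy = HEADINGS[heading]
        x, y = x + dx, y + dy
        points.append((x, y))
    return points


def reward(moves):
    """Hypothetical shape score in (0, 1]: 1.0 when the path closes into a loop."""
    end_x, end_y = build_path(moves)[-1]
    return 1.0 / (1.0 + math.hypot(end_x, end_y))


def train(n_steps=8, episodes=300, learning_rate=0.2, seed=0):
    rng = random.Random(seed)
    # One weight per (step, move); all moves start out equally likely.
    weights = [{m: 1.0 for m in MOVES} for _ in range(n_steps)]
    best_moves, best_reward = None, -1.0
    for _ in range(episodes):
        moves = [rng.choices(MOVES, weights=[w[m] for m in MOVES])[0] for w in weights]
        r = reward(moves)
        # Boost the weights of the moves just used, in proportion to the reward.
        for w, move in zip(weights, moves):
            w[move] *= 1.0 + learning_rate * r
        if r > best_reward:
            best_moves, best_reward = moves, r
    return weights, best_moves, best_reward
```

Because every reward is positive, all sampled moves are reinforced and better outcomes are simply reinforced more; practical reinforcement learning methods use more careful updates and search strategies than this sketch.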

A separate IPD study published April 10 in Nature solves a different problem — designing protein binders for peptides and intrinsically disordered proteins — largely using Rosetta, a computational modeling approach based on physics calculations that doesn’t involve machine learning at all.

The team designed proteins with “pockets” that bound the side chains of specific amino acids, linking them up such that they would bind peptides and disordered proteins in their linear conformations.

“It’s not recognizing protein structure, it’s more like the way that DNA and RNA would hybridize,” Baker said.

Chroma both generates scaffolds with desired functions and identifies the amino acids that would give rise to those scaffolds. In contrast, the IPD team designs scaffolds via specialized methods such as RF Diffusion or reinforcement learning, and then uses its ProteinMPNN approach, published in Science last September, to determine the corresponding amino acid sequence. Protein structure prediction algorithms such as AlphaFold and RoseTTAFold then provide a picture of what the protein generated by that string of amino acids would look like.

At the pace the innovation is going, any of these tools could soon be supplanted, making it critical that companies stay abreast of the latest advances and build systems that keep them compatible.

Efficiency and governance

For companies, recent innovations in protein design methods have reduced barriers to entry, said Monod co-founder, President and CEO Daniel-Adriano Silva. “In the past, it was only the very expert who were able to do it, and now it’s becoming more accessible,” he said.

“Things that used to take hundreds of lines of code just take a few,” Lajoie said, adding that the need for expertise is still acute downstream. “You still need design expertise to be able to know, how good is this design, is it believable, what are its pathologies, and how to overcome them.”

The shift to machine learning approaches from physical models has also increased efficiency by “orders of magnitude,” said Silva.

Previously, it took about a year to develop a functional protein, natural protein domains were required as templates, and groups would have to test thousands of designs experimentally to find one or two that worked. Now, simulations can be completed in minutes, experimental screens involve tens or hundreds of proteins, and the success rate among those tested is 10-30%, with the overall process taking a month or two, he said.
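The scale of that shift can be made concrete with a back-of-the-envelope Python calculation. It rests on assumptions not stated in the article: each design succeeds independently, and "one or two in thousands" is read as a success rate of roughly 0.1%. The function names are illustrative.

```python
import math


def designs_needed(success_rate, confidence=0.95):
    """Smallest n with P(at least one success in n independent designs) >= confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - success_rate))


def expected_hits(n_designs, success_rate):
    """Expected number of working designs in a screen of n_designs."""
    return n_designs * success_rate


old_rate = 0.001            # assumed reading of "one or two in thousands"
new_rates = (0.10, 0.30)    # the 10-30% range quoted above

old_screen = designs_needed(old_rate)
new_screens = [designs_needed(rate) for rate in new_rates]
```

Under these assumptions, a 95% chance of at least one working design takes a screen of about 2,995 designs at a 0.1% success rate, versus 29 designs at 10% and 9 at 30%. That is consistent with the move from thousands of tested designs to tens or hundreds.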

Monod is developing diagnostics that incorporate de novo designed biomarker-binding proteins and luciferase reporters that use synthetic substrates, licensed from the Baker lab.

The latter were generated via “older” Monte Carlo simulation-based protein hallucination methods, described in a February 2023 Nature study, which preceded the recent diffusion models, said Silva. “By ‘older methods’ we mean like a year old.”

He said the company is planning to use the newer methods to explore additional luciferase designs and other de novo enzymes.

Silva thinks the next part of the protein design process that is ripe for innovation is the experimental methods used to test and optimize designs. “In the future, we’re going to take machine learning and automation tools and start learning more about that process,” he said.

Lajoie said Outpace has “heavily invested” in the ability to rapidly iterate through cloning, protein expression and cell therapy prototyping, which enables the company to compare new methods such as diffusion modeling head-to-head with its prior approaches.

Having an expanded toolbox makes it all the more important to have well-defined target product profiles and a clear understanding of the bar that needs to be hit, said Outpace CBO Eric Soller. “As exciting methods come in, you generate data 10 times faster, which makes the importance of good governance really stand out,” he said.

At Generate, the pipeline portfolio team doesn’t start using new technologies until the platform innovation team has “productionized” those capabilities into automated protocols, at which time programs that are still in the design stage can benefit from the expanded toolbox, Grigoryan said.

The challenge of keeping up with constant technological advances is also key for the T cell therapy field, which is seeing a steady stream of publications on cell optimization strategies, and a shift toward allogeneic and in vivo platforms among next-generation companies. Outpace is developing technologies such as conditionally active cytokines and genetically encoded protein degraders in a way it believes will work with any immune cell type and delivery vector.

“It’s a different example of the same problem, the tech advances, and now you’re out of date,” Lajoie said. “We think a lot about, how do we make sure the technologies we’re developing are going to be forward-compatible with the technologies of tomorrow.”
