DOI: 10.65398/GLIO2666
Prof. Frances Hamilton Arnold (PAS), Professor of Chemical Engineering, Bioengineering and Biochemistry at the California Institute of Technology (Caltech), External co-chair of President Joe Biden’s Council of Advisors on Science and Technology
Evolution and AI: Bringing New Chemistry to Life
Directed enzyme evolution for sustainable chemistry
Evolution is the most powerful design process ever invented. A simple algorithm of mutation and natural selection has given rise to all the diversity of life, including the stunningly beautiful chemistry that is constantly building and breaking down the biological world. Life renews itself using abundant and renewable resources, with little waste, recycling its materials, repairing and even reinventing itself as it goes – and all this innovation comes from evolution. When artificial intelligence enables us to read, understand, and even compose the DNA language of life, will comparable innovations be within our reach?
I have been thinking about a sub-problem of this big question: how might AI lead to innovations that will enable humans to live more sustainably on our fragile planet, for example using biology’s remarkable chemistry as a model? Nearly all of biology’s chemistry is catalyzed by protein catalysts, the enzymes, that can work together inside a little bag of reagents called a cell. Enzymes assemble new cells with the ability to develop into a whole organism, repair a wound, or generate wonderfully complex and functional materials such as wood or a shell, all the while deriving energy and starting materials from their environments. I have long been fascinated by enzymes, particularly for how we might use these amazing inventions to replace dirty human chemistry in making what we need for our daily lives. But to use enzymes for our purposes – from better laundry to manufacturing pharmaceuticals to creating fuels and materials from sunlight and CO2 – we have to learn how to compose new ones. I would like to expand the DNA-encoded chemistry of the biological world to encompass useful chemistry, including the best of that invented by human chemists.
By freeing enzymes from the constraints of supporting current life, we can explore the universe of life’s chemical possibilities. Since the beginning of life, nature has explored the tiniest fraction of this space of possibilities. Out beyond where nature has gone lie solutions to the climate crisis, cures for cancer, or how to feed our growing population without destroying our beautiful natural world. The challenge is to find those rare, useful new proteins, because most of the staggering number of possible proteins don’t do anything at all, much less solve human problems.
And it really is a universe of possibilities. In fact, the number of possible proteins is many orders of magnitude greater than the number of particles in the universe. The philosopher Daniel Dennett (Dennett, 1995) made the wonderful analogy of all the possible biologies (genomes) to Jorge Luis Borges’ Library of Babel, the library of all possible books. There are unfortunately many, many more ways to be not-alive in the library of possible genomes, just as the vast majority of Babel’s books contain pure gibberish (Dennett, 1995). The collection of all possible proteins is also vast, and the density of ‘meaningful’ sequences minuscule. But, unlike the librarians of Borges’ vast random book collection, we, the librarians of the protein collection, have an easier time finding the meaningful sequences, because we have evolution to guide us.
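To make the scale concrete, here is a back-of-the-envelope calculation. It is only a sketch: the protein length of 300 residues and the figure of roughly 10^80 particles in the observable universe are illustrative assumptions, not values taken from the text.

```python
import math

AMINO_ACIDS = 20        # size of the standard amino-acid alphabet
LENGTH = 300            # assumed length of a typical protein (illustrative)
LOG10_PARTICLES = 80    # assumed rough estimate for the observable universe

def log10_sequence_count(length: int, alphabet: int = AMINO_ACIDS) -> float:
    """Return log10 of the number of distinct sequences of the given length."""
    return length * math.log10(alphabet)

# Work in log10 units to keep the numbers readable.
log10_proteins = log10_sequence_count(LENGTH)
excess_orders = log10_proteins - LOG10_PARTICLES

print(f"roughly 10^{log10_proteins:.0f} possible sequences of length {LENGTH}")
print(f"roughly {excess_orders:.0f} orders of magnitude beyond 10^{LOG10_PARTICLES}")
```

Under these assumptions the count of possible sequences exceeds the particle estimate by hundreds of orders of magnitude, and the gap widens further for longer proteins.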
John Maynard Smith (1970) described a conceptual space in which each protein sequence is surrounded by all its single-mutant neighbors. For evolution to occur, he argued, a functional protein must have at least one functional protein among those neighbors. Evolution could then pass through the network of functional protein sequences to explore new possibilities, one mutation at a time. This process has given rise to the billions upon billions of functional proteins that surround us today, those rare meaningful sequences that we can literally scrape from the bottom of our shoes, and it can give rise to more.
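The neighborhood structure of this sequence space can be illustrated with a few lines of code. This is a minimal sketch: the function name and the five-residue toy sequence are hypothetical, chosen only for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes, 20 standard amino acids

def single_mutant_neighbors(sequence: str) -> list[str]:
    """Return every sequence that differs from `sequence` at exactly one position."""
    neighbors = []
    for position, original in enumerate(sequence):
        for residue in AMINO_ACIDS:
            if residue != original:
                mutant = sequence[:position] + residue + sequence[position + 1:]
                neighbors.append(mutant)
    return neighbors

toy_sequence = "MKTAY"  # hypothetical sequence, for illustration only
neighbors = single_mutant_neighbors(toy_sequence)
print(len(neighbors))  # 19 alternatives at each of the 5 positions
```

A sequence of length L has 19 × L single-mutant neighbors, so each step of an evolutionary walk considers a small, searchable neighborhood even though the space as a whole is astronomically large.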
This led to the proposition, formulated in the late 1980s, that such an exploration, one (or a few) mutations at a time, could discover not just new enzymes, but improved ones – including enzymes that would be useful to humans (Arnold, 2019). Since then, such ‘directed evolution’ experiments have led to many thousands of enzymes used in products from laundry detergents to disease therapies.
With technologies that allow us to make and manipulate DNA in the test tube and encode enzymes in recombinant organisms, we can direct enzyme evolution towards new functions using artificial selection. We think of this as an optimization problem on a so-called fitness landscape in Maynard Smith’s sequence space, where fitness is now defined by the enzyme engineer rather than natural selection. It could be the ability to catalyze useful chemistry in a completely non-natural environment like an organic solvent, to catalyze a reaction on whole new substrates, or even to catalyze a whole new chemical reaction.
Directed evolution allows us to explore the future, including chemistry nature has never performed, at least as far as we know. Already in 2016, we published the first enzymes that made carbon-silicon bonds in a living system, chemistry not known in the biological world but well known to human chemistry (Kan et al., 2016). And just this year, we made the first enzymes shown to break carbon-silicon bonds (Sarai et al., 2024). (Materials with carbon-silicon bonds make up a huge industry, all made using human chemistry.) I dream of the day when all of the chemistry that creates our products can be encoded in DNA and performed cleanly, as nature does. Enzymes can also ensure that the products we make are recyclable or biodegradable.
The role of artificial intelligence and machine learning in enzyme design
And where does artificial intelligence enter this picture? Artificial intelligence and machine learning will greatly enhance our ability to discover new enzymes and improve them for our purposes. Directed evolution is a simple iterative process of making (a small number of) random mutations and screening the mutated enzymes for their ability to exhibit the desired features. The screening process generates functional data, and it is not hard to label the functions with the sequences that encode them. Screening samples the fitness landscape for enzyme evolution and optimization. With directed evolution, we would simply take the best of those sampled sequences and repeat the mutation and artificial selection, or perhaps recombine beneficial mutations for the next generation. However, the sampled sequence-function data could also be used to build a statistical model of the fitness landscape with which we could make predictions about where to search next for improved functions. In fact, we have used such machine learning methods to improve directed evolution for a dozen years (Romero et al., 2012; Yang et al., 2019). ML is becoming widely adopted to improve directed evolution and reduce the burden of having to perform many generations of mutagenesis and high-throughput screening (Yang, Li et al., 2024).
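The basic loop described above (mutate, screen, keep the best, repeat) can be sketched in code. This is a toy illustration, not a laboratory protocol: the `screen` function, which counts matches to a hidden target sequence, is a hypothetical stand-in for an experimental assay, and every name and parameter value is an assumption made for the example.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
HIDDEN_OPTIMUM = "MKTAYIAKQR"  # hypothetical; stands in for the unknown best sequence

def screen(sequence: str) -> int:
    """Toy assay: count positions that match the hidden optimum."""
    return sum(a == b for a, b in zip(sequence, HIDDEN_OPTIMUM))

def mutate(sequence: str, rng: random.Random) -> str:
    """Introduce one random point substitution."""
    position = rng.randrange(len(sequence))
    choices = [r for r in AMINO_ACIDS if r != sequence[position]]
    return sequence[:position] + rng.choice(choices) + sequence[position + 1:]

def directed_evolution(parent: str, generations: int, library_size: int,
                       rng: random.Random) -> tuple[str, list[int]]:
    """Repeat mutation and screening, keeping the best variant found so far."""
    history = [screen(parent)]
    for _ in range(generations):
        library = [mutate(parent, rng) for _ in range(library_size)]
        best = max(library, key=screen)
        if screen(best) > screen(parent):  # advance only on improvement
            parent = best
        history.append(screen(parent))
    return parent, history

rng = random.Random(0)  # fixed seed so the run is repeatable
final, history = directed_evolution("AAAAAAAAAA", generations=20,
                                    library_size=30, rng=rng)
print(final, history)
```

Because a new parent is accepted only when it screens better than the current one, the recorded fitness never decreases from one generation to the next. A real campaign would replace `screen` with measured activity and would typically be limited by how many variants can be tested per round.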
Besides learning directly from labeled data collected through experiments, the models constructed from the screening data can additionally include what we already know of proteins (e.g. in the form of evolutionary conservation or protein structures and physics) (Yang, Li et al., 2024). The models can also be updated in each generation, and future exploration can be balanced with exploitation of previous knowledge (Yang, Lal et al., 2024).
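As a rough illustration of updating a model each generation and balancing exploration with exploitation, the sketch below uses a deliberately simple additive site-wise model with a count-based exploration bonus. This is a swapped-in toy technique, not the Gaussian process or active-learning methods of the cited papers; the landscape in `SITE_EFFECTS`, the reduced alphabet, and all function names are hypothetical.

```python
import itertools
import math
import random

RESIDUES = "ACDE"  # hypothetical reduced alphabet at each of three sites
SITE_EFFECTS = [   # hypothetical hidden landscape, unknown to the learner
    {"A": 0.1, "C": 0.9, "D": 0.3, "E": 0.5},
    {"A": 0.7, "C": 0.2, "D": 0.4, "E": 0.1},
    {"A": 0.2, "C": 0.3, "D": 0.8, "E": 0.6},
]

def measure(variant: str) -> float:
    """Toy assay standing in for a laboratory screen."""
    return sum(SITE_EFFECTS[i][r] for i, r in enumerate(variant))

def fit_site_model(data):
    """Per (position, residue): mean measured fitness and number of observations."""
    totals, counts = {}, {}
    for variant, fitness in data.items():
        for site in enumerate(variant):
            totals[site] = totals.get(site, 0.0) + fitness
            counts[site] = counts.get(site, 0) + 1
    means = {site: totals[site] / counts[site] for site in totals}
    return means, counts

def acquisition(variant, means, counts, prior_mean, beta):
    """Predicted fitness (exploitation) plus a bonus for rarely seen residues."""
    score = 0.0
    for site in enumerate(variant):
        score += means.get(site, prior_mean)
        score += beta / math.sqrt(1 + counts.get(site, 0))
    return score

def active_learning(rounds, batch_size, beta, rng):
    candidates = ["".join(v) for v in itertools.product(RESIDUES, repeat=3)]
    data = {v: measure(v) for v in rng.sample(candidates, batch_size)}
    for _ in range(rounds):
        means, counts = fit_site_model(data)  # refit the model every round
        prior_mean = sum(data.values()) / len(data)
        untested = [v for v in candidates if v not in data]
        untested.sort(key=lambda v: acquisition(v, means, counts, prior_mean, beta),
                      reverse=True)
        for variant in untested[:batch_size]:  # test the top-ranked batch
            data[variant] = measure(variant)
    return data

data = active_learning(rounds=3, batch_size=8, beta=0.5, rng=random.Random(0))
best = max(data, key=data.get)
print(best, round(data[best], 2))
```

Raising `beta` shifts each batch toward residues the model has seen least; setting it to zero makes the selection purely greedy with respect to the current model.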
Furthermore, the starting points for such optimization can come from nature, the repository of evolution’s 4 billion years of work, or they can be generated by human design, intuition or, as is starting to happen, generative artificial intelligence. Generative AI learns from the proteins nature has already made and humans have deposited in rapidly growing databases: it learns to compose from the language of nature’s proteins and infer functional proteins that nature has not explored, but could have (Madani et al., 2023). As of today, generative AI is still in its infancy when it comes to generating useful catalytic function. But this may not be true tomorrow.
The wonderful thing about protein optimization, however, is that it is extremely robust – we can start with poor designs (which our generated designs are, and are likely to remain for a while as we learn the details of catalysis) and reliably make them far more effective. In other words, evolution empowered by ML can often take a mediocre AI-generated composition and turn it into an acceptable piece of music, if not a Beethoven symphony. As far as I know, this robust and reliable optimization process is unique to proteins, which themselves have been designed by evolution. There is no guarantee that other forms of matter – chemicals, materials, etc. – will be so readily improved, at least until we have the required data and understand how to formulate the ‘sequence space’ so that optimization is easy.
Automated research workflows (ARWs)
Now it is getting really interesting. All of the steps of directed protein evolution can be automated. Not only that, the steps can be assembled, optimized, and controlled using AI. In fact, there are now fully automated ‘cloud’ labs, or autonomous laboratories, where enzyme engineering experiments can be carried out. An AI agent can even design the experiment, start it, acquire the data, and then improve its model of the system in iterative cycles of learning, just as I described for ML-assisted directed evolution. The ML model can be updated in each cycle, based on the observations from testing, so that the system learns as it goes and can design the next data collection for the next generation of evolutionary improvement. This is not science fiction – Prof. Phil Romero recently published an early example of such a process (Rapp et al., 2024).
What I have described is an example of an Automated Research Workflow or ARW. ARWs integrate computation, laboratory automation, and tools from AI into the research process, from designing experiments to analyzing data and learning from the results to inform further experiments (NASEM, 2022). An automated laboratory can perform experiments 24 hours a day, 7 days a week, and it can be controlled from a distance, allowing access to users without deep disciplinary knowledge. Such automated exploration and evolution would enable us to create new genetically-encoded chemistry much more efficiently than a graduate student or postdoc can do, and it would free them from highly repetitive experimentation to focus on other tasks, such as gaining a biochemical understanding of the results. A bit over two years ago, as an April Fool’s joke, on Twitter (now X) I announced that students could just push a button to create their dream enzymes (https://x.com/francesarnold/status/1510017434090020864). This was met with honest excitement, and already is close to reality. I predict that the arduous experimental optimization will become streamlined and fully automated and, once generative AI can propose new enzymes, will make the dream of expanding the chemistry of biology as easy as pressing a button (with money, of course). Such ARWs for protein engineering may offer additional benefits that include enhanced capture of provenance, integrity, and reproducibility (NASEM, 2022).
What are the downsides? We must address the role of humans in the discovery loop, privacy of data, and impact on current incentive systems for researchers. As discussed in a recent report from the National Academies of Sciences, Engineering, and Medicine on ARWs, additional important questions include:
“What unforeseen technical and ethical issues may arise? Who “owns” the data and discoveries that are produced by automated and distributed systems? How should researchers evolve their practices to reap the benefits of automation while not losing the serendipity of human inspiration and creativity? What goals are best achieved by human scientists (such as invention of new techniques) and which are better left to automation (such as driving data collection to optimize models)?” (NASEM, 2022).
An important issue that must be addressed is biosecurity. How will we make sure that bad actors do not use such powerful biological design capabilities to unleash the next pandemic or an agent that targets a selected population for destruction? How do we make sure that simple mistakes or unforeseen consequences of such biological novelty do not become catastrophic? The ability to generate new toxins, resistance mechanisms, or infectivities is not terribly far from the ability to create new chemistry. There will need to be monitoring of such tasks, which is easier if the experiments are done in centralized laboratories.
Conclusions
AI and machine learning are super-charging many areas of science (PCAST, 2024), promising everything from better weather predictions to discovery of new materials and drugs. What is the role of academic research in an era of big models and ARWs? I have outlined a vision for creating new enzymes, but this vision is valid for a much wider range of biological possibilities, from developing novel therapies to discovering the secrets of aging (and perhaps how to intervene). We are on the cusp of an exciting era in science, where AI and ML could empower scientists to do much more with the limited resources we have. My hope is that we will focus on the right problems, train the right people, and use this power for the benefit of the planet rather than find new ways to exploit it.
Acknowledgment
Research in the Arnold laboratory on ML-guided enzyme engineering is supported by the Army Research Laboratory (W911NF-19-2-0026).
References
Maynard Smith, J., “Natural Selection and the Concept of a Protein Space.” Nature 225, 563-564 (1970).
Arnold, F.H. “The Library of Maynard Smith: My Search for Meaning in the Protein Universe.” Microbe, ASM News 6, 316-318 (2011).
Arnold, F.H. “Innovation by Evolution: Bringing New Chemistry to Life.” Angewandte Chemie International Edition 58, 14420-14426 (2019).
Kan, S.B.J., Lewis, R.D., Chen, K., Arnold, F.H., “Directed Evolution of Cytochrome C for Carbon-Silicon Bond Formation: Bringing Silicon to Life.” Science 354, 1048-1051 (2016).
Sarai, N.S., Fulton, T.J., O’Meara, R.L., Johnston, K.E., Brinkmann-Chen, S., Maar, R.R., Tecklenburg, R.E., Roberts, J.M., Reddel, J.C.T., Katsoulis, D.E., Arnold, F.H., “Directed Evolution of Enzymatic Silicon-Carbon Bond Cleavage in Siloxanes.” Science 383(6681), 438-443 (2024).
Romero, P.A., Krause, A., Arnold, F.H., “Navigating the Protein Fitness Landscape with Gaussian Processes.” Proceedings of the National Academy of Sciences USA 110(3), E193-E201 (2012).
Yang, K.K., Wu, Z., Arnold, F.H., “Machine-Learning-Guided Directed Evolution for Protein Engineering.” Nature Methods 16, 687-694 (2019).
Yang, J., Li, F.-Z., Arnold, F.H., “Opportunities and Challenges for Machine Learning-Assisted Enzyme Engineering.” ACS Central Science 10(2), 226-241 (2024).
Yang, J., Lal, R.G., Bowden, J.C., Astudillo, R., Hameedi, M.A., Kaur, S., Hill, M., Yue, Y., Arnold, F.H., “Active Learning-Assisted Directed Evolution.” bioRxiv: https://www.biorxiv.org/content/10.1101/2024.07.27.605457v1
Madani, A., Krause, B., Greene, E.R., Subramanian, S., Mohr, B.P., Holton, J.M., Olmos Jr., J.L., Xiong, C., Sun, Z.Z., Socher, R., Fraser, J.S., Naik, M., “Large Language Models Generate Functional Protein Sequences Across Diverse Families.” Nature Biotechnology 41, 1099-1106 (2023).
“Automated Research Workflows for Accelerated Discovery: Closing the Knowledge Discovery Loop.” Consensus Study Report (2022). National Academies of Sciences, Engineering, and Medicine, Washington, DC. https://doi.org/10.17226/26532
Rapp, J.T., Bremer, B.J., Romero, P.A. “Self-Driving Laboratories to Autonomously Navigate the Protein Fitness Landscape.” Nature Chemical Engineering 1, 97-107 (2024).
Report to the President, “Supercharging Research: Harnessing Artificial Intelligence to Meet Global Challenges.” President’s Council of Advisors on Science and Technology (PCAST), April 2024. Executive Office of the President, Washington D.C. https://www.whitehouse.gov/wp-content/uploads/2024/04/AI-Report_Upload_29APRIL2024_SEND-2.pdf