To engineer a protein with a useful function, researchers typically start with a natural protein that has a desirable trait, such as emitting fluorescent light, and then put it through several rounds of random mutation to eventually produce an optimized version of the protein.
This process has resulted in optimized versions of many important proteins, including green fluorescent protein (GFP). However, for other proteins, generating optimized versions has proven difficult. MIT researchers have developed a computational approach that makes it easier to predict which mutations will produce better proteins based on relatively small amounts of data.
The researchers used this model to generate proteins with mutations expected to lead to improved versions of GFP and proteins from adeno-associated viruses (AAV), which are used to deliver DNA for gene therapy. They hope it can also be used to develop additional tools for neuroscience research and medical applications.
“Protein design is a difficult problem because the mapping from DNA sequence to protein structure and function is very complex. A much-improved protein may lie only a few changes away in sequence, but each intermediate change can correspond to a completely non-functional protein. It’s like trying to find your way from a mountain range to a river valley when there are rugged peaks along the way that block your view. The current work tries to make the riverbeds easier to find,” says Ila Fiete, professor of brain and cognitive sciences at MIT, a member of the MIT McGovern Institute for Brain Research, director of the K. Lisa Yang Center for Integrative Computational Neuroscience, and one of the senior authors of the study.
Regina Barzilay, the School of Engineering Distinguished Professor for AI and Health at MIT, and Tommi Jaakkola, the Thomas Siebel Professor of Electrical Engineering and Computer Science at MIT, are also senior authors of an open-access paper on the study, which was presented at the International Conference on Learning Representations last May. MIT graduate students Andrew Kirjner and Jason Yim are the lead authors. Other authors include Shahar Bracha, a postdoc at MIT, and Raman Samusevich, a graduate student at Czech Technical University.
Protein optimization
Many naturally occurring proteins have functions that could make them useful for research or medical applications, but they need some additional engineering to optimize them. In this study, the researchers were originally interested in developing proteins that could be used as voltage indicators in living cells. These proteins, produced by some bacteria and algae, emit fluorescence when they detect a change in voltage. When engineered for use in mammalian cells, these proteins allow researchers to measure neuron activity without using electrodes.
Decades of research have gone into engineering these proteins to produce a stronger fluorescent signal, but they have not yet become effective enough for widespread use at the fast time scales needed to track neural activity. Bracha, who works in Edward Boyden’s lab at the McGovern Institute, reached out to Fiete’s lab to see if they could work together on a computational approach that might speed up the protein optimization process.
“This work exemplifies the human serendipity that characterizes many scientific discoveries,” says Fiete. “It grew out of the Yang Tan Collective retreat, a scientific meeting of researchers from several centers at MIT that have distinct missions but are united by the collaborative support of K. Lisa Yang. We learned that some of our interests and tools in modeling how the brain learns and optimizes could be applied to the completely different domain of protein design, as practiced in the Boyden lab.”
For a particular protein that researchers want to optimize, the number of possible sequences that can be created by swapping different amino acids at each point in the sequence is almost infinite. Because there are so many possible variations, it is impossible to test all of them experimentally. So the researchers turned to computer modeling to predict which variants would work best.
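A back-of-the-envelope calculation shows why exhaustive testing is out of the question. The sketch below is purely illustrative (the function names and the example protein length are assumptions, not figures from the study):

```python
import math

# Each position in a protein can hold any of the 20 standard amino acids,
# so a protein of length L has 20**L possible sequences.
def num_sequences(length: int) -> int:
    return 20 ** length

# Even restricting the search to variants within k mutations of a starting
# sequence, the count grows combinatorially: choose which positions to
# mutate, then pick one of the 19 alternative amino acids at each.
def num_variants_within(length: int, k: int) -> int:
    return sum(math.comb(length, m) * 19 ** m for m in range(k + 1))
```

For a GFP-sized protein of roughly 238 residues (a commonly cited length, used here only for illustration), `num_sequences(238)` exceeds 10**300, and even the set of variants within just seven mutations of the starting sequence numbers above 10**20 — far beyond what any lab could screen.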
In this study, the researchers set out to overcome these challenges by using data from GFP to develop and test computational models that can predict better versions of the protein.
They started by training a type of model called a convolutional neural network (CNN) on experimental data consisting of GFP sequences and their brightness (the feature they wanted to optimize).
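A CNN cannot consume amino-acid letters directly; sequences are typically one-hot encoded first. Here is a minimal sketch of that standard preprocessing step (an assumption about the pipeline, not code from the paper; the example sequence is arbitrary):

```python
# The 20 standard amino acids, indexed by one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot(seq: str) -> list[list[int]]:
    """Encode a protein sequence as an L x 20 binary matrix, the usual
    input format for a 1-D convolutional network over sequences."""
    return [
        [1 if j == AA_INDEX[aa] else 0 for j in range(len(AMINO_ACIDS))]
        for aa in seq
    ]

# "MSKGE" is used here purely as a short example sequence.
matrix = one_hot("MSKGE")
```

The network then regresses from this matrix to the measured brightness, learning which sequence features predict the property being optimized.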
Based on a relatively small amount of experimental data (from about 1,000 variants), the model was able to generate a “fitness landscape”: a three-dimensional map depicting the fitness of a given protein and how much it differs from the original sequence (in this case, GFP).
These landscapes contain peaks representing fitter proteins and valleys representing less fit ones. Predicting the path a protein should take to reach a fitness peak can be difficult, because a protein often must undergo mutations that lower its fitness before it can reach a nearby peak of higher fitness. To overcome this problem, the researchers used an existing computational technique to “smooth” the fitness landscape.
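The study's smoothing operates over networks of related sequences, but the core idea can be shown with a one-dimensional toy analogue: averaging each point with its neighbors damps small local bumps while preserving the broad shape of the landscape. This is an illustrative sketch only, not the paper's method:

```python
def smooth(landscape: list[float], passes: int = 1) -> list[float]:
    """Each pass replaces every value with the mean of itself and its
    immediate neighbors, flattening small bumps in the landscape."""
    for _ in range(passes):
        new = []
        for i in range(len(landscape)):
            window = landscape[max(0, i - 1): i + 2]  # self + neighbors
            new.append(sum(window) / len(window))
        landscape = new
    return landscape

# A toy rugged landscape: small local peaks at indices 1 and 3
# obstruct the path toward the global peak at the end.
rugged = [0.0, 0.5, 0.2, 0.7, 0.3, 1.0]
smoothed = smooth(rugged, passes=3)
```

After a few passes the intermediate dips shrink, making the overall uphill trend easier for a model to follow.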
Once these small bumps in the landscape were smoothed out, the researchers found that they could retrain the CNN model to more easily reach higher fitness peaks. The model was able to predict optimized GFP sequences that differ from the starting sequence by as many as seven amino acids, and the best of these proteins was estimated to be about 2.5 times fitter than the original.
“If we have a landscape that represents what the model thinks is nearby, we smooth it out and then retrain the model on the smoother version of the landscape,” says Kirjner. “Now there is a smooth path from the starting point to the summit, which the model can reach through iterative small improvements. The same is often impossible in a non-smooth landscape.”
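This intuition can be demonstrated with a toy greedy search (again an illustrative sketch, not the study's algorithm): on a rugged landscape the search stalls at the first local bump, while on a smoothed version with the same endpoints it reaches the global peak through small iterative steps.

```python
def hill_climb(landscape: list[float], start: int) -> int:
    """Greedy local search: repeatedly move to the better neighboring
    index, stopping when no neighbor improves fitness."""
    i = start
    while True:
        neighbors = [j for j in (i - 1, i + 1) if 0 <= j < len(landscape)]
        best = max(neighbors, key=lambda j: landscape[j])
        if landscape[best] <= landscape[i]:
            return i  # local optimum: no improving move remains
        i = best

rugged   = [0.0, 0.5, 0.2, 0.7, 0.3, 1.0]  # small bumps block the path
smoothed = [0.1, 0.2, 0.4, 0.6, 0.8, 1.0]  # same endpoints, bumps removed

stuck_at = hill_climb(rugged, start=0)    # stalls on the first local bump
peak     = hill_climb(smoothed, start=0)  # climbs all the way to the top
```

Greedy search on `rugged` stops at index 1 (fitness 0.5), while on `smoothed` it walks step by step to the global maximum at the end.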
Proof of concept
The researchers also showed that this approach was effective in identifying new sequences for the viral capsid of adeno-associated virus (AAV), a viral vector commonly used to deliver DNA. In this case, they optimized the capsid for its ability to package a DNA payload.
“We used GFP and AAV as proofs of concept to show that this method works on very well-characterized data sets, and because of that, it should be applicable to other protein engineering problems,” Bracha says.
The researchers now plan to apply this computational technique to the data Bracha has generated for voltage-indicating proteins.
“Dozens of labs have been working on this problem for 20 years, and there’s still nothing better,” she says. “The hope is that with a much smaller data set, we can actually train the model and make better predictions than the last 20 years of manual testing have.”
This research was supported, in part, by the National Science Foundation, the Machine Learning for Pharmaceutical Discovery and Synthesis consortium, the Abdul Latif Jameel Clinic for Machine Learning in Health, the DTRA Discovery of Medical Countermeasures Against New and Emerging Threats program, the DARPA Accelerated Molecular Discovery program, a Sanofi Computational Antibody Design grant, the US Naval Research Laboratory, the Howard Hughes Medical Institute, the National Institutes of Health, the K. Lisa Yang ICoN Center, and the K. Lisa Yang and Hock E. Tan Center for Molecular Therapeutics at MIT.