New AI tool classifies the impact of 71 million ‘missense’ mutations.
Uncovering the root cause of disease is one of the greatest challenges in human genetics. Millions of mutations are possible and experimental data is limited, so for most of them it is still unknown whether they can cause disease. This knowledge is critical for faster diagnosis and the development of life-saving treatments.
Today we are releasing a catalog of ‘missense’ mutations that will allow researchers to learn more about what impact they may have. Missense variants are genetic mutations that can affect the function of human proteins. In some cases, they can lead to diseases such as cystic fibrosis, sickle cell anemia, or cancer.
The AlphaMissense catalog was developed using AlphaMissense, a new AI model for classifying missense variants. In a paper published in Science, we show that out of 71 million possible missense variants, 89% were classified as likely pathogenic or likely benign. In contrast, only 0.1% have been confirmed by human experts.
AI tools that can accurately predict the effects of mutations have the power to accelerate research in a variety of fields, from molecular biology to clinical and statistical genetics. Experiments to identify disease-causing mutations are expensive and difficult. Every protein is unique and each experiment must be designed separately and can take months. Using AI predictions, researchers can preview results for thousands of proteins at once, which can help them prioritize resources and accelerate more complex studies.
We are making all of our predictions freely available to the research community and releasing AlphaMissense’s model code as open source.
What is a missense variant?
Missense variants are single-letter substitutions in DNA that result in a different amino acid within a protein. If you think of DNA as a language, changing one letter can change a word and completely change the meaning of a sentence. In this case, the substitution may change the amino acid being translated, affecting the function of the protein.
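To make the definition concrete, here is a minimal sketch that classifies a single-letter change within one codon. The helper `classify_substitution` is hypothetical and written only for illustration; the codon table is the standard genetic code, and the example uses the well-known substitution behind sickle cell anemia (GAG to GTG in the beta-globin gene).

```python
# Standard genetic code, packed in TCAG order (first, second, third base).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_substitution(codon, position, new_base):
    """Classify a single-base change within one codon (hypothetical helper)."""
    mutated = codon[:position] + new_base + codon[position + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutated]
    if before == after:
        kind = "synonymous"  # same amino acid, protein sequence unchanged
    elif after == "*":
        kind = "nonsense"    # the change introduces a stop codon
    elif before == "*":
        kind = "stop-lost"   # a stop codon becomes an amino acid
    else:
        kind = "missense"    # a different amino acid is incorporated
    return mutated, before, after, kind


# Sickle cell anemia: glutamate (E) is replaced by valine (V).
print(classify_substitution("GAG", 1, "T"))  # ('GTG', 'E', 'V', 'missense')
```

Only the last category is a missense variant. The sketch looks at one codon in isolation and ignores everything else about the gene, so it says nothing about whether a given change is harmful.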
The average person has more than 9,000 missense variants. Most are benign and have little or no effect, but others are pathogenic and can seriously disrupt protein function. Missense variants can be used in the diagnosis of rare genetic diseases, where a few or even a single missense variant can directly cause the disease. They are also important for studying complex diseases such as type 2 diabetes, which can be caused by a combination of different types of genetic changes.
Classifying missense variants is an important step in understanding which of these protein changes may cause disease. Of the more than 4 million missense variants already discovered in humans, only 2% have been classified by experts as pathogenic or benign. That is about 0.1% of all 71 million possible missense variants. The remainder are considered ‘variants of unknown significance’ due to a lack of experimental or clinical data on their effects. With AlphaMissense, we now have the clearest picture to date: the model classifies 89% of all possible missense variants, using a threshold that yields 90% precision on a database of known disease variants.
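The proportions quoted above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: it takes "more than 4 million" as exactly 4 million, and the variable names are illustrative.

```python
# Rounded figures from the text; "more than 4 million" is taken as 4 million.
observed_variants = 4_000_000
possible_variants = 71_000_000
expert_classified_share = 0.02

expert_classified = observed_variants * expert_classified_share
share_of_all_possible = expert_classified / possible_variants

print(f"{expert_classified:,.0f} expert-classified variants")
print(f"{share_of_all_possible:.2%} of all possible missense variants")
```

With these rounded inputs, roughly 80,000 variants carry an expert classification, which is about 0.11% of the 71 million possible variants and matches the "about 0.1%" figure.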
Pathogenic or Benign: How AlphaMissense Classifies Variants
AlphaMissense is based on AlphaFold, a groundbreaking model that predicts the structure of virtually every protein known to science from its amino acid sequence. Our adapted model can predict the pathogenicity of missense variants that change individual amino acids in proteins.
To train AlphaMissense, we fine-tuned AlphaFold on labels that distinguish variants seen in human and closely related primate populations. Variants that are commonly seen are treated as benign, while variants that are never seen are treated as pathogenic. AlphaMissense does not predict changes in protein structure due to mutations or other effects on protein stability. Instead, it leverages a database of related protein sequences and the structural context of the variant to generate a score between 0 and 1 that approximately rates the likelihood that the variant is pathogenic. Because the score is continuous, users can choose the threshold for classifying a variant as pathogenic or benign that matches their accuracy requirements.
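Choosing a cutoff on a continuous score to meet an accuracy requirement can be sketched as follows. This is a minimal illustration and not AlphaMissense's actual calibration procedure: the function names, scores and labels are all invented, and only the pathogenic side is shown (a similar cutoff could be chosen at the low end of the scale for benign calls).

```python
def precision_at(threshold, scores, labels):
    """Fraction of variants scored >= threshold that are truly pathogenic."""
    called = [is_path for s, is_path in zip(scores, labels) if s >= threshold]
    return sum(called) / len(called) if called else None


def lowest_threshold_for_precision(scores, labels, target):
    """Smallest observed score whose calls reach the target precision."""
    for threshold in sorted(set(scores)):
        p = precision_at(threshold, scores, labels)
        if p is not None and p >= target:
            return threshold
    return None  # no cutoff reaches the target on this data


# Made-up scores and labels (True = pathogenic), for illustration only.
scores = [0.05, 0.20, 0.35, 0.50, 0.60, 0.75, 0.90, 0.95]
labels = [False, False, False, True, False, True, True, True]

threshold = lowest_threshold_for_precision(scores, labels, target=0.9)
calls = ["likely pathogenic" if s >= threshold else "uncertain" for s in scores]
print(threshold, calls)
```

A stricter precision target pushes the cutoff higher, so fewer variants receive a confident call. That trade-off is why a continuous score is more flexible than a fixed yes-or-no label.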
AlphaMissense achieves state-of-the-art predictions across a wide range of genetic and experimental benchmarks without being explicitly trained on such data. Our tool outperformed other computational methods when used to classify variants in ClinVar, a public data archive on the relationships between human variants and diseases. Our model was also the most accurate method for predicting laboratory results, which shows that it is consistent with different ways of measuring pathogenicity.
Building community resources
AlphaMissense builds on AlphaFold to advance the world’s understanding of proteins. A year ago, we published 200 million protein structures predicted using AlphaFold. These structures are helping millions of scientists around the world accelerate their research and are paving the way for new discoveries. We look forward to seeing how AlphaMissense can help address open questions across genomics and biology.
We have made AlphaMissense’s predictions freely available to the scientific community. Together with EMBL-EBI, we are also making them easier for researchers to use through the Ensembl Variant Effect Predictor.
In addition to the lookup table of missense mutations, we have shared expanded predictions for all 216 million possible single amino acid substitutions across more than 19,000 human proteins. We have also included the average prediction for each gene, which is similar to measuring a gene’s evolutionary constraint and indicates how essential the gene is to the survival of the organism.
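A per-gene average of this kind can be computed by grouping variant scores by gene. The sketch below uses invented gene names, variants and scores, and an assumed row layout; it does not reflect the format of the released files.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (gene, variant, pathogenicity score). Values are made up.
rows = [
    ("GENE_A", "A10V", 0.91),
    ("GENE_A", "R25H", 0.77),
    ("GENE_B", "K3N", 0.12),
    ("GENE_B", "T8S", 0.08),
    ("GENE_B", "L40P", 0.40),
]

# Collect every score under its gene, then average each group.
scores_by_gene = defaultdict(list)
for gene, _variant, score in rows:
    scores_by_gene[gene].append(score)

gene_average = {gene: mean(s) for gene, s in scores_by_gene.items()}

for gene, avg in sorted(gene_average.items()):
    print(f"{gene}: {avg:.2f}")
```

In this toy data, GENE_A has the higher average, meaning that substitutions in it tend to be scored as more damaging. That is the sense in which a gene-level average resembles a measure of evolutionary constraint.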
Accelerating genetic disease research
A key step in translating this research is collaboration with the scientific community. We have been working with Genomics England to explore how these predictions could help study the genetics of rare diseases. Genomics England cross-referenced AlphaMissense’s findings with previously aggregated variant pathogenicity data from human participants. Their evaluation confirmed that our predictions were accurate and consistent, providing another real-world benchmark for AlphaMissense.
Although our predictions are not designed for direct clinical use and should be interpreted in conjunction with other sources of evidence, this work has the potential to improve the diagnosis of rare genetic disorders and to help discover new disease-causing genes.
Ultimately, we hope that AlphaMissense can be used in conjunction with other tools to help researchers better understand diseases and develop new, life-saving treatments.