(T) In the next three blog posts, I will try to summarize the present state of the art regarding the applications of deep learning techniques to structural biology research, and in particular 1) protein structure, 2) protein sequencing, and 3) virus mutation, a subject of prime interest due to the recent deadly evolutions of the SARS-CoV-2.
DeepMind solved the protein structure prediction problem for single protein chains at CASP14.
Let’s try to understand how DeepMind solved the prediction of protein structures, from the basics of what proteins are to the (known) details of DeepMind’s deep learning system.
What are proteins?
“Proteins are large biomolecules, or macromolecules, consisting of one or more long chains of amino acid residues. Proteins perform a vast array of functions within organisms, including catalysing metabolic reactions, DNA replication, responding to stimuli, providing structure to cells and organisms, and transporting molecules from one location to another. Proteins differ from one another primarily in their sequence of amino acids, which is dictated by the nucleotide sequence of their genes, and which usually results in protein folding into a specific 3D structure that determines its activity.”
“The sequence of amino acid residues in a protein is defined by the sequence of a gene, which is encoded in the genetic code, which is the set of rules used by living cells to translate information encoded within genetic material (DNA or mRNA sequences of nucleotide triplets, or codons) into proteins. Translation is accomplished by the ribosome, which links proteinogenic amino acids in an order specified by messenger RNA (mRNA), using transfer RNA (tRNA) molecules to carry amino acids and to read the mRNA three nucleotides at a time. The genetic code is highly similar among all organisms and can be expressed in a simple table with 64 entries.”
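To make the "simple table with 64 entries" concrete, here is a minimal Python sketch of translation. It builds the standard genetic code as a dictionary and reads an mRNA string three nucleotides at a time. The function name `translate` and the toy sequence are my own illustrations, and the sketch assumes the reading frame starts at the first nucleotide, which a real ribosome does not (it looks for a start codon first).

```python
# Standard genetic code written compactly: the 64 codons are enumerated
# in U, C, A, G order at each position; "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(mrna):
    """Translate an mRNA string into one-letter amino acid codes.

    Assumption for this sketch: the reading frame starts at index 0
    and translation ends at the first stop codon.
    """
    protein = []
    for pos in range(0, len(mrna) - 2, 3):
        residue = CODON_TABLE[mrna[pos:pos + 3]]
        if residue == "*":  # stop codon: the chain is released
            break
        protein.append(residue)
    return "".join(protein)


# Toy sequence (hypothetical): start codon, two codons, stop codon.
print(translate("AUGUUUGGCUGA"))
```

The resulting string of one-letter codes is the amino acid sequence that structure prediction methods take as input.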
“Protein folding is the physical process by which a protein chain acquires its native three-dimensional structure, a conformation that is usually biologically functional, in an expeditious and reproducible manner. It is the physical process by which a polypeptide folds into its characteristic and functional three-dimensional structure from a random coil. Each protein exists as an unfolded polypeptide or random coil when translated from a sequence of mRNA to a linear chain of amino acids. This polypeptide lacks any stable (long-lasting) three-dimensional structure (the left hand side of the first figure). As the polypeptide chain is being synthesized by a ribosome, the linear chain begins to fold into its three-dimensional structure.”
Protein folding – Source: Wikipedia
Protein Structure Prediction and the CASP
Protein structure prediction is one of the most important goals pursued by computational biology. A protein’s shape is closely linked with its function, and the ability to predict this structure unlocks a greater understanding of what it does and how it works. Many of the world’s greatest challenges, whether in medicine, like developing treatments for diseases, or in biotechnology, like finding enzymes that break down industrial waste, are fundamentally tied to proteins and the role they play.
“The CASP (Critical Assessment of protein Structure Prediction) is a community-wide, worldwide experiment for protein structure prediction taking place every two years since 1994. CASP provides research groups with an opportunity to objectively test their structure prediction methods and delivers an independent assessment of the state of the art in protein structure modeling to the research community and software users.”
DeepMind’s approach to the protein folding problem
“We (DeepMind) first entered CASP13 in 2018 with our initial version of AlphaFold, which achieved the highest accuracy among participants. Afterwards, we published a paper on our CASP13 methods in Nature with associated code, which has gone on to inspire other work and community-developed open source implementations. Now, new deep learning architectures we’ve developed have driven changes in our methods for CASP14, enabling us to achieve unparalleled levels of accuracy. These methods draw inspiration from the fields of biology, physics, and machine learning, as well as of course the work of many scientists in the protein folding field over the past half-century.
A folded protein can be thought of as a “spatial graph”, where residues are the nodes and edges connect the residues in close proximity. This graph is important for understanding the physical interactions within proteins, as well as their evolutionary history. For the latest version of AlphaFold, used at CASP14, we created an attention-based neural network system, trained end-to-end, that attempts to interpret the structure of this graph, while reasoning over the implicit graph that it’s building. It uses evolutionarily related sequences, multiple sequence alignment (MSA), and a representation of amino acid residue pairs to refine this graph.
By iterating this process, the system develops strong predictions of the underlying physical structure of the protein and is able to determine highly-accurate structures in a matter of days. Additionally, AlphaFold can predict which parts of each predicted protein structure are reliable using an internal confidence measure.
We trained this system on publicly available data consisting of ~170,000 protein structures from the protein data bank together with large databases containing protein sequences of unknown structure. It uses approximately 16 TPUv3s (which is 128 TPUv3 cores or roughly equivalent to ~100-200 GPUs) run over a few weeks, a relatively modest amount of compute in the context of most large state-of-the-art models used in machine learning today. As with our CASP13 AlphaFold system, we are preparing a paper on our system to submit to a peer-reviewed journal in due course.”
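The "spatial graph" mentioned in the quote above can be illustrated with a few lines of Python: residues are nodes, and an edge joins two residues whose positions are closer than some cutoff. This is only a sketch of the data structure, not of AlphaFold itself. The 8 Å cutoff, the use of one coordinate per residue, and the toy coordinates are all assumptions I made for the illustration.

```python
import math


def contact_graph(coords, cutoff=8.0):
    """Build an adjacency map: residue index -> set of nearby residue indices.

    coords: one (x, y, z) position per residue, in angstroms.
    cutoff: distance threshold for an edge (8.0 is an assumed value).
    """
    graph = {i: set() for i in range(len(coords))}
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if math.dist(coords[i], coords[j]) <= cutoff:
                graph[i].add(j)
                graph[j].add(i)
    return graph


# Hypothetical positions for a five-residue chain that bends back on itself.
toy_coords = [
    (0.0, 0.0, 0.0),
    (3.8, 0.0, 0.0),
    (7.6, 0.0, 0.0),
    (7.6, 3.8, 0.0),
    (3.8, 3.8, 0.0),
]

graph = contact_graph(toy_coords)
for residue, neighbours in graph.items():
    print(residue, sorted(neighbours))
```

In this toy chain, residues 0 and 4 are far apart in the sequence but close in space, so they share an edge. Long-range contacts of that kind are what make the graph informative about the fold.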
An overview of DeepMind’s deep learning model
The model operates over evolutionarily related protein sequences as well as amino acid residue pairs, iteratively passing information between both representations to generate a structure.
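To give a feel for why evolutionarily related sequences say something about residue pairs, here is a small Python sketch of a classical idea: two alignment columns that mutate together tend to correspond to residues that are close in the folded structure. The sketch scores column pairs of a multiple sequence alignment with mutual information. It is not AlphaFold's network, only an illustration of the kind of signal an MSA carries. The toy alignment and the function name are hypothetical.

```python
import math
from collections import Counter


def mutual_information(msa, i, j):
    """Mutual information (in bits) between alignment columns i and j."""
    n = len(msa)
    col_i = Counter(seq[i] for seq in msa)
    col_j = Counter(seq[j] for seq in msa)
    pairs = Counter((seq[i], seq[j]) for seq in msa)
    mi = 0.0
    for (a, b), count in pairs.items():
        p_ab = count / n
        mi += p_ab * math.log2(p_ab / ((col_i[a] / n) * (col_j[b] / n)))
    return mi


# Hypothetical alignment: columns 0 and 2 always change together,
# while columns 1 and 3 never change.
toy_msa = [
    "ACDA",
    "ACDA",
    "GCEA",
    "GCEA",
]

print(mutual_information(toy_msa, 0, 2))  # covarying columns
print(mutual_information(toy_msa, 0, 1))  # one column is constant
```

A high score for a column pair is a hint, not a proof, that the two residues interact. Real alignments are far larger and noisier than this toy case.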
DeepMind also used AlphaFold to predict the structures of several understudied SARS-CoV-2 proteins, including Nsp2, Nsp4, Nsp6, and Papain-like proteinase (C terminal domain).
The following video by DeepMind tells the story of how AlphaFold was developed…
To dive deeper into AlphaFold
To dive deeper into AlphaFold’s machine learning algorithm
- AlphaFold2 @ CASP14: “It feels like one’s child has left home” by Mohammed AlQuraishi
- AlphaFold @ CASP13: “What just happened?” by Mohammed AlQuraishi
- AlphaFold 2 & Equivariance by Justas Dauparas & Fabian Fuchs
Note: The picture above is from DeepMind and shows two examples of protein targets, with their experimental and computational results.
Copyright © 2005-2021 by Serge-Paul Carrasco. All rights reserved.
Contact Us: asvinsider at gmail dot com