Generative Modeling for Engineering Protein Sequences

In this second post about the applications of deep learning techniques to structural biology research, I will try to describe the state of the art in protein sequencing.

Protein sequencing is the process of determining the amino acid sequence of all or part of a protein. The two legacy methods of protein sequencing are “mass spectrometry and Edman degradation using a protein sequenator. Mass spectrometry methods are now the most widely used for protein sequencing and identification, but Edman degradation remains a valuable tool for characterizing a protein’s N-terminus.”

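Deep learning does not replace these laboratory methods, but it gives us a new way to work with protein sequences: modeling and generating them. Welcome to our new world of artificial intelligence!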

Protein sequences can be modeled with natural language processing (NLP) techniques, in particular language models, much like French or English: amino acids play the role of words in a dictionary, and they are strung together to form a sentence (a protein).
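To make the analogy concrete, here is a minimal sketch in Python of how a protein sequence might be turned into tokens for a language model. The special tokens (<pad>, <bos>, <eos>, <unk>) and the function names encode and decode are assumptions made for this illustration; real models define their own vocabularies.

```python
# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical special tokens, chosen for this illustration only.
SPECIAL_TOKENS = ["<pad>", "<bos>", "<eos>", "<unk>"]

# Map every token to an integer id, special tokens first.
VOCAB = {tok: i for i, tok in enumerate(SPECIAL_TOKENS + list(AMINO_ACIDS))}
ID_TO_TOKEN = {i: tok for tok, i in VOCAB.items()}


def encode(sequence):
    """Turn a protein string into a list of ids, wrapped in <bos>/<eos>."""
    ids = [VOCAB["<bos>"]]
    for residue in sequence.upper():
        # Non-standard letters (e.g. X) fall back to <unk>.
        ids.append(VOCAB.get(residue, VOCAB["<unk>"]))
    ids.append(VOCAB["<eos>"])
    return ids


def decode(ids):
    """Turn ids back into a protein string, dropping special tokens."""
    residues = set(AMINO_ACIDS)
    return "".join(ID_TO_TOKEN[i] for i in ids if ID_TO_TOKEN[i] in residues)


if __name__ == "__main__":
    peptide = "MKTAYIAK"  # an arbitrary example peptide, not a real protein
    ids = encode(peptide)
    print(ids)
    print(decode(ids))
```

With only about twenty "words", the vocabulary is tiny compared to a natural language; the difficulty lies in the grammar, not the dictionary.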

This is what Professor Mohammed AlQuraishi from Columbia University calls “protein linguistics”:

“The space of naturally occurring proteins may occupy a very special “manifold”, one that exhibits a hierarchical organization spanning small fragments to entire domains...

The end result of all this would be the emergence of something resembling a linguistic structure, a grammar that defines the reusable parts and how these parts can be combined to form larger assemblies. Given that this is biology, it’s unlikely to be rigid or minimal. It would be messy and hacky, with many exceptions and ad hoc evolutionary optimizations. But the manifold would be there, potentially discoverable and learnable.”

In 2019, a team of PhD students and their professors at UC Berkeley published an early work, TAPE (Tasks Assessing Protein Embeddings), which applies NLP techniques to protein sequence prediction.

TAPE introduced “a set of five biologically relevant semi-supervised learning tasks spread across different domains of protein biology. It curates tasks into specific training, validation, and test splits to ensure that each task tests biologically relevant generalization that transfers to real-life scenarios. It benchmarks a range of approaches to semi-supervised protein representation learning, which span recent work as well as canonical sequence learning techniques.”

Another team led by Professor George Church at Harvard’s Wyss Institute for Biologically Inspired Engineering and Harvard Medical School (HMS) developed “UniRep, trained on about 24 million protein sequences to enable it to predict sequences and their relationship to features like protein stability, secondary structure, and accessibility of internal sequences to surrounding solvents within proteins it had never seen before” as shown in the picture above.

UniRep is based on an LSTM recurrent neural network (RNN) and robustly quantified the effects of single amino acid mutations in eight different proteins with diverse biological functions including enzyme catalysis, DNA binding, and molecular sensing. In addition, “using the Aequorea victoria green fluorescent protein (GFP) as a model,” the research team “tasked UniRep to analyze 64,800 variants of the protein, each carrying 1-12 mutations, which demonstrated that it could accurately anticipate how the distribution and relative burden of mutations changed the protein’s brightness.”
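To give a feel for what "variants carrying mutations" means in practice, here is a small Python sketch that enumerates every single amino acid substitution of a sequence and counts how many positions separate a variant from the original. It is not UniRep code: the function names are made up for this post, and the scoring of each variant, which is what a trained model would provide, is deliberately left out.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def single_mutants(sequence):
    """Yield (label, mutant) for every single amino acid substitution.

    Labels use the common wild-type/position/mutant convention with
    1-based positions, e.g. "K2A" means K at position 2 replaced by A.
    """
    for pos, wild_type in enumerate(sequence):
        for new in AMINO_ACIDS:
            if new == wild_type:
                continue  # not a mutation
            mutant = sequence[:pos] + new + sequence[pos + 1:]
            yield f"{wild_type}{pos + 1}{new}", mutant


def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


if __name__ == "__main__":
    wild_type = "MKTAYIAK"  # an arbitrary example peptide, not GFP
    mutants = dict(single_mutants(wild_type))
    print(len(mutants))    # 19 substitutions at each of the 8 positions
    print(mutants["K2A"])
```

A sequence of length L has 19 × L single mutants, and the count grows combinatorially with the number of simultaneous mutations, which is why a model that can rank variants before any lab work is attractive.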

Note that Professor AlQuraishi was part of the UniRep research team and wrote a blog post about it: “The Future of Protein Science will not be Supervised But it may well be semi-supervised”.

In 2020, a team of researchers at Salesforce led by Ali Madani developed ProGen, a language model for protein generation.

ProGen is a conditional transformer language model for predicting amino acid sequences. It has 1.2 billion parameters and was trained on a dataset of 280 million protein sequences together with conditioning tags that encode a variety of annotations, such as taxonomic, functional, and locational information.

By conditioning on these tags, ProGen provides a way to generate proteins tailored to desired properties, as shown in the picture above.
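One simple way to picture this conditioning is to treat each tag as an extra token placed in front of the amino acids, so that the model sees the annotations before it predicts any residue. The Python sketch below illustrates that idea only. It is not ProGen's code; the tag names and the functions build_vocab and conditional_input are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_vocab(tags):
    """Assign ids to conditioning tags first, then to amino acids."""
    tokens = [f"<{t}>" for t in sorted(set(tags))] + list(AMINO_ACIDS)
    return {tok: i for i, tok in enumerate(tokens)}


def conditional_input(tags, sequence, vocab):
    """Prepend tag ids to the amino acid ids of one example."""
    tag_ids = [vocab[f"<{t}>"] for t in tags]
    seq_ids = [vocab[residue] for residue in sequence]
    return tag_ids + seq_ids


if __name__ == "__main__":
    # Hypothetical tags, invented for this example.
    tags = ["taxon:bacteria", "function:hydrolase"]
    vocab = build_vocab(tags)
    ids = conditional_input(tags, "MKT", vocab)
    print(ids)  # two tag ids followed by three residue ids
```

At generation time, the same trick works in reverse: feed the model only the tags you want, then let it continue with amino acids one token at a time.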


Some tutorials, background, and recent publications on NLP systems:

NLP 101:

Stanford University’s Winter 2021 NLP class:

Google’s Transformer, BERT, and Attention architectures:

Google’s latest research in NLP:

OpenAI’s GPT-3:

Note: The picture above is the UniRep latent representation of protein sequence space.

Copyright © 2005-2021 by Serge-Paul Carrasco. All rights reserved.