Predicting SARS-CoV-2, HIV, and Influenza Viral Mutations with Language Models

In this third blog post about the applications of deep learning techniques to structural biology research, I will summarize a recent research study on virus mutation. The Covid-19 pandemic has turned into a race: not only must vaccines be delivered quickly to populations worldwide to stop the pandemic, but those vaccines must also remain effective against new variants of the SARS-CoV-2 virus.

A team of researchers from MIT led by Professor Bonnie Berger proposes, in “Learning the language of viral evolution and escape”, a language model to predict mutations that would let existing infectious viruses escape the immune system (note that the researchers submitted a previous version of the paper at NeurIPS 2020: “Learning mutational semantics”).

To that end, they consider that immune system responses to viruses can be modeled in the same way that human language is modeled in Natural Language Processing (NLP) systems, which analyze the patterns of human language to predict words.

NLP systems and language models, leveraging LSTM, BERT, and transformer architectures, have also been used by various research teams in the engineering of protein sequences.

The genetics of a virus, which define how good it is at infecting a given host, can be interpreted in terms of “grammatical correctness (or correct syntax)”. A successful, infectious virus is grammatically correct, while an unsuccessful one is not.

The mutations of a virus can be interpreted in terms of “semantics (or meaning)”. Mutations that result in changes in its surface proteins, and make it invisible to certain antibodies, have altered its meaning. Viruses with different mutations can have different meanings, and a virus with a different meaning may need different antibodies to read it.

A virus that causes infection has a “grammar (or syntax)” that is biologically correct, and it also has “semantics (or meaning)” to which the immune system does or does not respond.

The end goal is to identify mutations that might let a virus escape an immune system without making it less infectious, that is, mutations that change a virus’ meaning without making it grammatically incorrect.

The search for mutations that combine high semantic change with high grammaticality, referred to as constrained semantic change search (CSCS) in the study, helps predict viral escape.

The researchers analyzed three virus proteins: influenza A hemagglutinin (HA), HIV-1 envelope glycoprotein (Env), and SARS-CoV-2 spike glycoprotein (Spike). All three proteins are found on the viral surface, are responsible for binding host cells, are targeted by antibodies, and are drug targets.

The researchers trained a separate language model for each protein using a corpus of virus-specific amino acid sequences.

The language model is built with bidirectional LSTMs and trained on thousands of sequences taken from the three viruses: 45,000 for influenza, 60,000 for HIV, and between 3,000 and 4,000 for SARS-CoV-2. The embeddings of the sequences grouped viruses according to how similar their mutations were.

The model generates mutated protein sequences by changing one amino acid at a time. To rank a given mutation, they took a weighted sum of the likelihood that the mutated virus retained an infectious grammar, and the degree of semantic difference between the original and mutated sequence’s embeddings.
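As a rough illustration of the ranking step described above, here is a minimal Python sketch. It enumerates every single-residue substitution and scores each one with a weighted sum of semantic change and grammaticality. The names `rank_mutations`, `toy_embed`, `toy_grammaticality`, and the weight `beta` are my own assumptions, and the two toy functions merely stand in for the embedding and probability outputs of a trained language model. This is not the researchers' code.

```python
# The 20 standard amino acids, in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def single_site_mutants(sequence):
    """Yield (position, new_residue, mutated_sequence) for every single substitution."""
    for pos, original in enumerate(sequence):
        for residue in AMINO_ACIDS:
            if residue != original:
                yield pos, residue, sequence[:pos] + residue + sequence[pos + 1:]


def l1_distance(u, v):
    """Sum of absolute coordinate differences between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def rank_mutations(sequence, embed, grammaticality, beta=1.0):
    """Rank single-residue mutations, highest score first.

    score = semantic_change + beta * grammaticality
    `embed` and `grammaticality` are placeholders for a trained model's outputs.
    """
    base = embed(sequence)
    scored = []
    for pos, residue, mutant in single_site_mutants(sequence):
        semantic_change = l1_distance(embed(mutant), base)
        score = semantic_change + beta * grammaticality(sequence, pos, residue)
        scored.append((score, pos, residue))
    scored.sort(reverse=True)
    return scored


# Toy stand-ins (hypothetical): a real system would query the language model.
def toy_embed(sequence):
    """Embed a sequence as the alphabet index of each residue."""
    return [float(AMINO_ACIDS.index(c)) for c in sequence]


def toy_grammaticality(sequence, pos, residue):
    """Pretend that residues close to the original in the alphabet are more likely."""
    d = abs(AMINO_ACIDS.index(residue) - AMINO_ACIDS.index(sequence[pos]))
    return 1.0 / (1.0 + d)
```

Calling `rank_mutations("ACD", toy_embed, toy_grammaticality)` returns a list of `(score, position, residue)` tuples. Raising `beta` favors mutations that keep the sequence "grammatical", while lowering it favors mutations that change its "meaning" the most.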

The researchers compared their model’s highest-ranked mutations to those of actual viruses according to the area under the curve (AUC), where 0.5 is random and 1.0 is perfect. The model achieved 0.85 AUC in predicting SARS-CoV-2 variants that were highly infectious and capable of evading antibodies. It achieved 0.77 AUC and 0.83 AUC respectively for two strains of influenza, and 0.69 AUC for HIV.
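To make the AUC numbers more concrete, here is a small Python sketch of a standard ROC-style AUC: the probability that a randomly chosen known escape mutation gets a higher score than a randomly chosen non-escape mutation. I am assuming this common definition for illustration only; the exact curve used in the study may differ, and the function name `roc_auc` and its inputs are hypothetical.

```python
def roc_auc(labels, scores):
    """Return the ROC AUC for binary labels (1 = escape, 0 = not) and model scores.

    Computed as the fraction of (positive, negative) pairs in which the
    positive example has the higher score; ties count as half a win.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

With this definition, a model that scores every escape mutation above every non-escape mutation gets 1.0, and a model that assigns the same score to everything gets 0.5, which matches the "random" and "perfect" reference points mentioned above.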

Once the model was trained (according to the MIT press release), “the model’s analysis of coronaviruses suggested that a part of the spike protein called the S2 subunit is least likely to generate escape mutations.

The question still remains as to how rapidly the SARS-CoV-2 virus mutates, so it is unknown how long the vaccines now being deployed to combat the Covid-19 pandemic will remain effective. Initial evidence suggests that the virus does not mutate as rapidly as influenza or HIV.

However, the researchers recently identified new mutations that have appeared in Singapore, South Africa, and Malaysia, that they believe should be investigated for potential viral escape (these new data are not yet peer-reviewed).”

I am looking forward to learning how that model could be used and improved.

We need to develop vaccines and adapt those vaccines to new variants at the speed of light!


Note: The picture above shows the viral protein language model developed by the team led by Professor Bonnie Berger.

Copyright © 2005-2021 by Serge-Paul Carrasco. All rights reserved.
Contact Us: asvinsider at gmail dot com