State of the Art for NLP Models

NLP models are still the “big thing” in machine learning, even though they cannot understand the meaning of the text or speech that they are supposed to decipher for us, and even though we do not really know how they work. But because large tech companies are pouring phenomenal research effort and funding into the field, these models, which in most cases are complex self-supervised models pre-trained on massive amounts of data, are improving at a remarkable pace.

Here is a quick summary of the state of the art:

Stanford University’s faculty and students call them foundation models, and last year they published a long report (212 pages to read!) about them:

  • “AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are trained on broad data at scale and are adaptable to a wide range of downstream tasks. We call these models foundation models to underscore their critically central yet incomplete character. This report provides a thorough account of the opportunities and risks of foundation models, ranging from their capabilities (e.g., language, vision, robotics, reasoning, human interaction) and technical principles (e.g., model architectures, training procedures, data, systems, security, evaluation, theory) to their applications (e.g., law, healthcare, education) and societal impact (e.g., inequity, misuse, economic and environmental impact, legal and ethical considerations). Though foundation models are based on standard deep learning and transfer learning, their scale results in new emergent capabilities, and their effectiveness across so many tasks incentivizes homogenization. Homogenization provides powerful leverage but demands caution, as the defects of the foundation model are inherited by all the adapted models downstream. Despite the impending widespread deployment of foundation models, we currently lack a clear understanding of how they work, when they fail, and what they are even capable of due to their emergent properties. To tackle these questions, we believe much of the critical research on foundation models will require deep interdisciplinary collaboration commensurate with their fundamentally sociotechnical nature.”

Earlier this year, Google NLP researcher Sebastian Ruder summarized the key research areas for NLP on his blog around the following themes:

  • Introduction
  • Best practices
      • Word embeddings
      • Depth
      • Layer connections
      • Dropout
      • Multi-task learning
      • Attention
      • Optimization
      • Ensembling
      • Hyperparameter optimization
      • LSTM tricks
  • Task-specific best practices
      • Classification
      • Sequence labelling
      • Natural language generation
      • Neural machine translation

Jeff Dean, the tech lead of the Google Research and Health Teams, and probably one of the most famous Silicon Valley engineers, shared some of the key achievements of his team for 2021 in a recent blog post:

  • “Researchers are training larger, more capable machine learning models than ever before. For example, just in the last couple of years models in the language domain have grown from billions of parameters trained on tens of billions of tokens of data (e.g., the 11B parameter T5 model), to hundreds of billions or trillions of parameters trained on trillions of tokens of data (e.g., dense models such as OpenAI’s 175B parameter GPT-3 model and DeepMind’s 280B parameter Gopher model, and sparse models such as Google’s 600B parameter GShard model and 1.2T parameter GLaM model). These increases in dataset and model size have led to significant increases in accuracy for a wide variety of language tasks, as shown by across-the-board improvements on standard natural language processing (NLP) benchmark tasks (as predicted by work on neural scaling laws for language models and machine translation models). Many of these advanced models are focused on the single but important modality of written language and have shown state-of-the-art results in language understanding benchmarks and open-ended conversational abilities, even across multiple tasks in a domain. They have also shown exciting capabilities to generalize to new language tasks with relatively little training data, in some cases, with few to no training examples for a new task. A couple of examples include improved long-form question answering, zero-label learning in NLP, and our LaMDA model, which demonstrates a sophisticated ability to carry on open-ended conversations that maintain significant context across multiple turns of dialog. Transformer models are also having a major impact in image, video, and speech models, all of which also benefit significantly from scale, as predicted by work on scaling laws for visual transformer models.
Transformers for image recognition and for video classification are achieving state-of-the-art results on many benchmarks, and we’ve also demonstrated that co-training models on both image data and video data can improve performance on video tasks compared with video data alone. We’ve developed sparse, axial attention mechanisms for image and video transformers that use computation more efficiently, found better ways of tokenizing images for visual transformer models, and improved our understanding of visual transformer methods by examining how they operate compared with convolutional neural networks. Combining transformer models with convolutional operations has shown significant benefits in visual as well as speech recognition tasks.”

Two papers from last year that I found mind-blowing came from a group of researchers at UC Berkeley, Facebook AI, and Google Brain. The first paper shows how a transformer pre-trained for NLP applications can generalize to non-NLP tasks such as numerical computation, vision, and protein fold prediction. The second paper re-imagines the typical reinforcement learning architecture, based on rewards, policies, and actions, as a sequence model built on a transformer that outputs the actions:

  • Pretrained Transformers as Universal Computation Engines: “We investigate the capability of a transformer pretrained on natural language to generalize to other modalities with minimal finetuning — in particular, without finetuning of the self-attention and feedforward layers of the residual blocks. We consider such a model, which we call a Frozen Pretrained Transformer (FPT), and study finetuning it on a variety of sequence classification tasks spanning numerical computation, vision, and protein fold prediction. In contrast to prior works which investigate finetuning on the same modality as the pretraining dataset, we show that pretraining on natural language can improve performance and compute efficiency on non-language downstream tasks. Additionally, we perform an analysis of the architecture, comparing the performance of a random initialized transformer to a random LSTM. Combining the two insights, we find language-pretrained transformers can obtain strong performance on a variety of non-language tasks.”
  • Decision Transformer: Reinforcement Learning via Sequence Modeling: “We introduce a framework that abstracts Reinforcement Learning (RL) as a sequence modeling problem. This allows us to draw upon the simplicity and scalability of the Transformer architecture, and associated advances in language modeling such as GPT-x and BERT. In particular, we present Decision Transformer, an architecture that casts the problem of RL as conditional sequence modeling. Unlike prior approaches to RL that fit value functions or compute policy gradients, Decision Transformer simply outputs the optimal actions by leveraging a causally masked Transformer. By conditioning an autoregressive model on the desired return (reward), past states, and actions, our Decision Transformer model can generate future actions that achieve the desired return. Despite its simplicity, Decision Transformer matches or exceeds the performance of state-of-the-art model-free offline RL baselines on Atari, OpenAI Gym, and Key-to-Door tasks.”
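
To make the Decision Transformer idea a little more concrete, here is a toy sketch in plain Python of how a trajectory might be turned into the kind of sequence the abstract describes: each step contributes the return still to be collected (the "return-to-go"), the state, and the action, and a causal mask keeps every position from looking ahead. This is only an illustration under my own assumptions: the function names (returns_to_go, build_sequence, causal_mask), the tuple-based token format, and the toy trajectory are all hypothetical, and there is no actual Transformer here, only the data layout that one would be trained on.

```python
from typing import Any, List, Sequence, Tuple


def returns_to_go(rewards: Sequence[float]) -> List[float]:
    """Return-to-go at step t: the sum of rewards from step t to the end."""
    rtg: List[float] = []
    running = 0.0
    # Walk the episode backwards, accumulating the remaining reward.
    for r in reversed(rewards):
        running += r
        rtg.append(running)
    rtg.reverse()
    return rtg


def build_sequence(
    states: Sequence[Any],
    actions: Sequence[Any],
    rewards: Sequence[float],
) -> List[Tuple[str, Any]]:
    """Interleave (return-to-go, state, action) for every step of a trajectory.

    The ("kind", value) tuples are a hypothetical stand-in for the embedded
    tokens a real model would consume.
    """
    tokens: List[Tuple[str, Any]] = []
    for g, s, a in zip(returns_to_go(rewards), states, actions):
        tokens.append(("return", g))
        tokens.append(("state", s))
        tokens.append(("action", a))
    return tokens


def causal_mask(n: int) -> List[List[bool]]:
    """Lower-triangular mask: position i may attend only to positions 0..i."""
    return [[j <= i for j in range(n)] for i in range(n)]


# Toy trajectory (made up for illustration).
states = ["s0", "s1", "s2"]
actions = ["left", "right", "left"]
rewards = [1.0, 0.0, 2.0]

tokens = build_sequence(states, actions, rewards)
mask = causal_mask(len(tokens))

for kind, value in tokens:
    print(kind, value)
```

At inference time, the first "return" token would be set to the return we want the agent to achieve, and the model would be asked to predict each "action" token given everything before it. That is the "conditioning on the desired return" part of the abstract.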

And last, here is a wonderful blog post from Eric Jang, a Google robotics researcher, who boldly argues that “to understand language is to understand generalization”:

  • “We all want ML models to generalize better, but defining “generalization” is hard. I suggest that the structure of language is the structure of generalization. If language models also capture the underlying structure of generalization vis-à-vis language, then perhaps we can use language models to “bolt generalization” onto non-verbal domains, such as robotics.
  • ...In my essay “Just ask for Generalization”, I argued that some optimization capabilities, such as reinforcement learning from sub-optimal trajectories, might be better implemented by generalization than by construction. We have to generalize to unseen situations at deployment time anyway, so why not focus on generalization capability as the first class citizen, and then “just ask for optimality” as an unseen case?
  • the aforementioned properties of generalization we seek can be understood as nothing more than the structure of human language. Before you think “ew, linguistics” and close this webpage, I promise that I’m not advocating for hard-coding formal grammars as inductive biases into our neural networks. To the contrary, I argue that considering generalization as being equivalent to language opens up exciting opportunities to scale up non-NLP models the way we have done for language.”

Note: The picture above is Le Baiser by Camille Claudel.

Copyright © 2005-2022 by Serge-Paul Carrasco. All rights reserved.