The Latest Nuances of the Transformer Architecture

Quentin Fournier and Gaétan Marceau Caron, from the Mila lab of the University of Montreal led by Yoshua Bengio, and Daniel Aloise, from Polytechnique Montréal, have surveyed the latest evolutions of the transformer architecture in a new paper, “A Practical Survey on Faster and Lighter Transformers”:

“Recurrent neural networks are effective models to process sequences. However, they are unable to learn long-term dependencies because of their inherent sequential nature. As a solution, Vaswani et al. introduced the Transformer, a model solely based on the attention mechanism that is able to relate any two positions of the input sequence, hence modelling arbitrary long dependencies. The Transformer has improved the state-of-the-art across numerous sequence modelling tasks. However, its effectiveness comes at the expense of a quadratic computational and memory complexity with respect to the sequence length, hindering its adoption. Fortunately, the deep learning community has always been interested in improving the models’ efficiency, leading to a plethora of solutions such as parameter sharing, pruning, mixed-precision, and knowledge distillation. Recently, researchers have directly addressed the Transformer’s limitation by designing lower-complexity alternatives such as the Longformer, Reformer, Linformer, and Performer. However, due to the wide range of solutions, it has become challenging for researchers and practitioners to determine which methods to apply in practice in order to meet the desired trade-off between capacity, computation, and memory. This survey addresses this issue by investigating popular approaches to make Transformers faster and lighter and by providing a comprehensive explanation of the methods’ strengths, limitations, and underlying assumptions.”

This paper is quite important, and a must-read, as many research groups have come up with variations of the transformer architecture for specific applications.

The Batch, the newsletter from DeepLearning.ai, has an excellent summary of it (probably a good read before getting deep into the paper):

“Taming Transformers

The transformer architecture is astonishingly powerful but notoriously slow. Researchers have developed numerous tweaks to accelerate it — enough to warrant a look at how these alternatives work, their strengths, and their weaknesses.

What’s new: Quentin Fournier, Gaétan Marceau Caron, and Daniel Aloise surveyed variations on the transformer, evaluating methods designed to make it faster and more efficient. This summary focuses on the variations designed to accelerate it.

The cost of attention: The attention mechanism in the original transformer places a huge burden on computation and memory: an O(n²) cost, where n is the length of the input sequence. As a transformer processes each token (often a word or pixel) in an input sequence, it concurrently processes — or “attends” to — every other token in the sequence. Attention is calculated by multiplying two large matrices of weights before passing the resulting matrix through a softmax function. The softmax function normalizes the matrix values to a probability distribution, bringing higher values closer to 1 and lower values near 0. This enables the transformer, when encoding a token, to use relevant tokens and ignore irrelevant tokens.
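
To make the quadratic cost concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention (an illustration, not code from the survey): the score matrix has shape (n, n), so both compute and memory grow quadratically with the sequence length. The shapes and random inputs are toy assumptions.

```python
# Minimal single-head scaled dot-product attention in NumPy (toy illustration).
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Q, K, V: (n, d) arrays; in a real transformer they are linear
    # projections of the token embeddings.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)       # (n, n): quadratic in sequence length
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # (n, d): weighted sum of the values

n, d = 8, 16                            # toy sizes
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
print(attention(Q, K, V).shape)         # (8, 16)
```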

(Modified) attention is all you need: The authors identify three approaches to accelerating transformers. Two of them optimize the attention mechanism and the third optimizes other parts of the architecture.

  • Sparse attention. These approaches simplify the attention calculation by using a subset of weights and setting the rest to 0. They mix and match three general patterns in which the position of a given token in a sequence determines how it attends to other tokens: (i) a token attends to all other tokens, (ii) a token attends only to directly neighboring tokens, or (iii) a token attends to a random selection of tokens. For instance, in Star Transformer, the first token attends to all other tokens and the other tokens attend only to neighbors (a mask in this spirit is sketched after this list). Calculating attention with sparse matrices is faster than usual thanks to fast sparse matrix multiplication algorithms. However, because it processes only a subset of the original attention weights, this approach degrades performance slightly. Further, because sparse attention patterns are handcrafted, they may not work well with all data and tasks.
  • Factorized attention. Approaches in this category modify attention calculations by approximating individual matrices as the product of two (or more) smaller matrices (a Linformer-style low-rank projection is included in the sketch after this list). This technique enables Linformer to cut memory requirements by a factor of 10 compared to the original transformer. Factorized attention methods outperform sparse attention in some tasks, such as determining whether two dots in an image are connected by a path that consists of dashes. However, they’re less effective in other areas, such as classifying images and compressing long sequences for retrieval.
  • Architectural changes. These approaches retain the original attention mechanism while altering other aspects of transformer architecture. One example is adding an external memory. With the original transformer, if an input sequence is too long, the model breaks it into smaller parts and processes them independently. Given a long document, by the time it reaches the end, it doesn’t have a memory of what happened at the beginning. Transformer-XL and Compressive Transformer store embeddings of earlier parts of the input and use them to embed the current part. Compared to the original transformer of the same size, Transformer-XL was able to improve its performance based on training examples that were 4.5 times longer.
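
As an illustration only (not the actual Star Transformer or Linformer code), the sketch below shows the two attention-side ideas: a sparse mask in which the first token attends globally while the others attend to a local window, and a low-rank projection of the keys and values along the sequence axis so the score matrix shrinks from (n, n) to (n, k). The window size, projection dimension k, and random projection matrix are all toy assumptions; in a real model the projection would be learned.

```python
# Toy sketches of sparse and factorized attention (illustrative, not from any paper's code).
import numpy as np

def star_like_mask(n, window=1):
    # mask[i, j] = True means token i may attend to token j.
    mask = np.zeros((n, n), dtype=bool)
    mask[0, :] = True                       # first token attends to everything
    mask[:, 0] = True                       # everything attends to the first token
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        mask[i, lo:hi] = True               # local neighborhood
    return mask

def low_rank_attention(Q, K, V, k=4, rng=None):
    # Project K and V along the sequence axis, (n, d) -> (k, d), so the
    # score matrix is (n, k) instead of (n, n).
    if rng is None:
        rng = np.random.default_rng(0)
    n, d = K.shape
    E = rng.standard_normal((k, n)) / np.sqrt(n)    # learned in a real model
    Kp, Vp = E @ K, E @ V
    scores = Q @ Kp.T / np.sqrt(d)                  # (n, k) instead of (n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ Vp                             # (n, d)

print(star_like_mask(6, window=1).astype(int))
n, d = 8, 16
rng = np.random.default_rng(1)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
print(low_rank_attention(Q, K, V, k=4).shape)       # (8, 16)
```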

Yes, but: It’s difficult to compare the results achieved by these variations due to differences in model size and hyperparameters (which affect performance) and hardware used (which affects speed). Further, some transformer variations utilize multiple modifications, making it hard to isolate the benefit of any particular one.

Why it matters: These variations can help machine learning engineers manage compute requirements while taking advantage of state-of-the-art approaches.

We’re thinking: The authors of Long Range Arena built a dashboard that reports performance of various transformers depending on the task. We welcome further efforts to help developers understand the tradeoffs involved in different variations.”

Note that I would also like to add to the content of that survey a paper that proposes to “combine the efficient parallelizable training of Transformers with the efficient inference of RNNs”: “RWKV: Reinventing RNNs for the Transformer Era”.

Note: The picture above is a pond filled with water lilies.

Copyright © 2005-2023 by Serge-Paul Carrasco. All rights reserved.
Contact Us: asvinsider at gmail dot com.