Is the Transformer Architecture Replacing CNNs?

Most of the innovations in deep learning over the last few years have come from new natural language processing (NLP) techniques. Google’s Transformer, with its self-attention mechanisms, has changed the landscape, and OpenAI’s GPT-3 has certainly made good use of it.

So, it might not be a big surprise that Google is now experimenting with bringing the Transformer architecture to computer vision, in the form of the Vision Transformer (ViT) for image classification:

“While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.”
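To make the quoted idea concrete, here is a minimal sketch in PyTorch of a ViT-style classifier. This is not the paper’s code, and all names, dimensions, and hyperparameters below are illustrative: the image is cut into 16×16 patches, each patch is linearly projected into a token, a learnable class token and position embeddings are added, and a standard Transformer encoder does the rest. (The actual ViT uses pre-layer-norm blocks and GELU activations, which this sketch simplifies away.)

```python
# Minimal ViT-style sketch (illustrative only, not the authors' implementation)
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """Toy Vision Transformer: patchify the image, embed patches as tokens,
    run a standard Transformer encoder, classify from a [CLS] token."""
    def __init__(self, image_size=224, patch_size=16, dim=192,
                 depth=4, heads=3, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # A strided convolution is equivalent to flattening each 16x16 patch
        # and applying one shared linear projection to all patches.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                    # x: (batch, 3, H, W)
        x = self.patch_embed(x)              # (batch, dim, H/16, W/16)
        x = x.flatten(2).transpose(1, 2)     # (batch, num_patches, dim) token sequence
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)                  # pure self-attention, no convolutions
        return self.head(x[:, 0])            # classify from the [CLS] token

logits = TinyViT()(torch.randn(1, 3, 224, 224))  # -> shape (1, 1000)
```

The point of the sketch is the absence of any convolutional backbone: aside from the patch-projection step, the model is exactly the Transformer encoder used in NLP, applied to a sequence of image patches instead of word tokens.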

For more, read Google’s blog article “Transformers for Image Recognition at Scale” and Google’s paper “An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale” on arXiv.

Note: The picture above is from Palo Alto.

Copyright © 2005-2019 by Serge-Paul Carrasco. All rights reserved.
Contact Us: asvinsider at gmail dot com.