OpenAI’s GPT-3: An Insane Number of Parameters!


I finally had a chance to start studying OpenAI’s GPT-3. GPT-3, like GPT-2, supports many natural language processing (NLP) tasks, in particular reading comprehension, summarization, translation, and question answering.

GPT-3 and GPT-2 share many characteristics with Google’s Transformer, which introduced attention mechanisms that directly model relationships between all words in a sentence, and with Google’s BERT, which added pre-trained models to the Transformer architecture. Those pre-trained models can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of NLP tasks.

Before the Transformer, those NLP tasks were usually tackled with massively deep recurrent neural networks (RNNs) and convolutional neural networks (CNNs).

Google’s original Transformer architecture uses a stack of encoders and decoders, while GPT uses a multi-layer Transformer-decoder architecture, and BERT (and, I believe, Google’s subsequent developments Transformer-XL and XLNet) uses a multi-layer Transformer-encoder architecture.

Another difference between BERT and GPT is that BERT learns bidirectional representations of unlabeled text, i.e., it conditions jointly on the left and right context of a word in a sentence.

So here are what seem to be the key innovations and takeaways of GPT-3, in particular compared to GPT-2:

GPT-3 has no less than 175 billion parameters! Yes, 175 billion parameters! For comparison, the largest version of GPT-2 had 1.5 billion parameters, and the world’s largest Transformer-based language model before it, introduced by Microsoft earlier in May, had 17 billion parameters.

GPT-3 demonstrates that a language model trained on enough data can solve an NLP task it has never encountered before. The model learns a new task by seeing zero, one, or a few examples of that task; there is no longer any need for gradient updates, parameter updates, or fine-tuning to use the model on those new tasks.

You can just interact with the model in natural language and/or provide a few examples of the task you are trying to do, and the model will do it!
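To make this concrete, here is a minimal sketch of what such a few-shot prompt looks like for a translation task. The helper function and the prompt layout below are illustrative, not OpenAI’s API; the point is only that the "training" consists of plain text handed to the model.

```python
# Hypothetical sketch of few-shot prompting: the task is described
# entirely in the prompt text, and the model is expected to continue
# the pattern. No gradient updates or fine-tuning are involved.

def build_few_shot_prompt(task_description, examples, query):
    """Assemble a few-shot prompt: a task description, K solved
    examples, then the unsolved query for the model to complete."""
    lines = [task_description, ""]
    for source, target in examples:
        lines.append(f"English: {source}")
        lines.append(f"French: {target}")
        lines.append("")
    lines.append(f"English: {query}")
    lines.append("French:")
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Translate English to French.",
    [("cheese", "fromage"), ("sea otter", "loutre de mer")],  # K = 2
    "peppermint",
)
print(prompt)
```

Sending this text to the model, it completes the final line with the translation, having inferred the task purely from the pattern in context.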

GPT-3 shows that language-model performance scales as a power law of model size, dataset size, and the amount of compute.


Figure: performance versus the number of examples in context, K (from the paper).

But the cost of training those NLP models is increasing significantly, and certainly at a faster rate than GPU memory, requiring many innovative data- and model-parallelization techniques.


Training the 175 billion GPT-3 parameters on a single GPU instance would cost $4.3M. Assuming now that the brain has 100 trillion parameters, how much would it cost to train a language model the size of the human brain?
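A back-of-the-envelope version of that question, assuming training cost scales linearly with parameter count (a simplification, since dataset size and compute grow too, so this is really a lower bound):

```python
# Naive linear extrapolation of training cost with parameter count.
gpt3_params = 175e9      # GPT-3 parameter count
gpt3_cost = 4.3e6        # the $4.3M figure quoted above
brain_params = 100e12    # assumed "parameters" in the human brain

scale = brain_params / gpt3_params   # how many times larger
brain_cost = gpt3_cost * scale
print(f"{scale:.0f}x larger -> ~${brain_cost / 1e9:.1f} billion")
```

Roughly a 571x jump in parameters, so on the order of a few billion dollars even under this optimistic linear assumption.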

MIT’s Lex Fridman has the answer to that question:


Note: The picture above is “Oben und Links” from Wassily Kandinsky

Copyright © 2005-2020 by Serge-Paul Carrasco. All rights reserved.
Contact Us: asvinsider at gmail dot com.