
(T) “chatGPT has taken the world by storm”. That is what every one of us has been reading everywhere, every day, for over two months now. And every one of us has been discussing what chatGPT can do (log of best prompts around the Web – prompts per persona) or cannot do, who has blocked its usage and why (see for instance Stack Overflow – ChatGPT is banned and NYC education department blocks ChatGPT on school devices), and how to detect it.
Is chatGPT a tipping point in our journey toward artificial general intelligence (AGI)?
Probably not for deep learning scientists and engineers, as chatGPT does not bring any significant new innovation compared to previous generative language models such as its sibling model, OpenAI’s InstructGPT, or DeepMind’s Sparrow, or probably compared to newer models such as those from Character.ai, a Silicon Valley start-up founded by a team of Xooglers who led the development of Google’s Meena and LaMDA, or Claude from Anthropic, a start-up founded by ex-OpenAI researchers and engineers. The challenge for the Character.ai and Anthropic teams is to build a large-scale machine learning platform that can meet the training requirements of large-scale data pipelines (training GPT-3 is estimated to have cost around $4 million).
But yes for the general public: chatGPT is a tipping point! Every one of us can play with chatGPT’s prompt without having to pay a lot of $$$ to use OpenAI’s GPT-3 APIs, thanks to Microsoft, which is funding OpenAI’s usage of Azure to run chatGPT without any advertising. We do not know how long this free experimentation will last, but Microsoft will definitely try to monetize chatGPT soon, and the race between Google and OpenAI/Microsoft is just starting. Note also that you.com, the search engine started by Richard Socher, has already integrated a chatbot, you.chat, although its capabilities do not yet seem to match those of chatGPT.
Let’s rewind – State of AI today in a few quotes
Andrew Ng: “Artificial Intelligence is the New Electricity”
Chris Manning: “Electricity is the new AI?”
Gary Marcus: “To achieve artificial general intelligence, AI needs to excel at more than just learning…AI needs to get better at abstraction”
What about chatGPT and language models?
Gary Marcus and Ernest Davis on chatGPT: “…has been trained to optimize the goal of producing one plausible-sounding word after another rather than actually engage with the meaning of language”
Let’s get a little bit more technical on what neural networks are and how GPT-3 works
François Chollet in conversation with Lex Fridman:
- “FC: GPT-3 can learn new tasks after just being shown a few examples, but I am not convinced that it is doing that. It is very likely, given the amount of its training data, that what it is actually doing is pattern matching a new task with a task that it has been exposed to in its training data; it is just recognizing the task instead of developing a model for a new task
- LF: Is it possible to see GPT-3 as a machine where the prompt is a kind of SQL query into what it has learned, so that language is used to query its memory and the neural network is a giant memorization “thing”, and if it gets sufficiently giant it will memorize sufficiently large amounts of world data that its intelligence becomes equivalent to a querying machine?
- FC: A significant chunk of its intelligence is its giant associative memory, but I do not believe that intelligence is just a giant memory but it may be a big component
- LF: Do you think GPT-n will reason – oops – that’s a bad question – a better question – what do you think is the ceiling of GPT-n?
- FC: I believe that GPT-n will improve on the strengths of GPT-2 and GPT-3 in generating more plausible text in context. If you train your model on more data, your model becomes more context-aware and increasingly more plausible. But I do not believe that more transformer layers and more training data will address the flaws of GPT-3: it can generate plausible text, but that text is not constrained by anything other than plausibility. In particular, it is not constrained by factualness or even consistency, which is why it is very easy to get GPT-3 to generate statements that are factually untrue or even self-contradictory. Its only training goal is plausibility and it has no other constraints; it is not constrained to be self-consistent, for instance. So you can predetermine the answer it will give you by asking the question in a specific way, because it is very responsive to the way you ask the question, since it has no understanding of the content of the question. If you ask the same question in two different ways that are basically adversarially engineered to produce certain answers, you will get two different answers that are in contradiction
- LF: It is very sensitive to adversarial attacks
- FC: In general, the problem with these generative models is that they are very good at generating plausible text, but that is just not enough. One interesting avenue would be to make it possible to write programs over the latent spaces that these models operate on; you would rely on these self-supervised models to generate a pool of knowledge and concepts, and you would write explicit reasoning programs over it
- FC: If you try to use GPT-3 to generate programs, it will perform well for any program that it has seen in its training data, but because program space is not interpolative, it will not be able to generalize to problems it has not seen before
- FC: If you want to use generative models in real products, the quality of the training data is important – high quality meaning as factual and as unbiased as possible, assuming there is such a thing as unbiased data. Quality training data with human labels is better than large noisy data; more noisy data will improve the model predictions, but you will quickly reach diminishing returns”
Let’s recap: what are language models?
- Language models (the large pre-trained ones are often called foundation models) are used in natural language understanding/processing (NLU/NLP) applications
- They generally predict the conditional probability of the next word (or token = {character, sub-word, or word}) in a sentence given the k previous tokens
- Language models have evolved from continuous word representations such as Word2vec (2013), to contextual word representations based on recurrent neural networks (RNNs) (2014), to bidirectional contextual word representations based on Google’s transformer architecture (2017)
- Scaling language models (e.g. increasing the number of layers and the number of parameters of the neural network) to large language models (LLMs) has significantly helped improve the accuracy of their predictions
- LLMs are self-supervised, i.e. they derive the labels (e.g. what the right next word should be) from the text itself, and are trained on very large text corpora, generally using Adam or stochastic gradient descent
- All LLMs now leverage Google’s transformer architecture, proposed in the famous paper “Attention Is All You Need” – self-attention is the mechanism that helps the model learn the relationships between all words in a sentence, regardless of their respective positions – as an example, the meaning of “bank” in the sentence “I arrived at the bank after crossing the…” depends on the last word, which could be “road” or “river”
- A transformer is a deep stack of “self-attention” layers. One architecture of a transformer is a sequence-to-sequence model that involves an encoder and a decoder, where attention is used in both the encoder and the decoder – as an example, for machine translation the input to the encoder will be English sentences and the decoder will output those sentences in French – but transformers can also be encoder-only (such as Google’s BERT) or decoder-only (such as OpenAI’s GPT-3)
- Google’s BERT is an autoencoding model that uses an encoder-only architecture. Autoencoding models use a bidirectional representation of the sentence (i.e. both left and right contexts, as in “I went to the river bank” and “I need to go to the bank to drop a check”)
- OpenAI’s GPT-3 is an autoregressive model that uses a decoder-only architecture. Autoregressive models use the predictions from previous time steps of the decoder to generate the prediction at the current time step
- Google’s BERT is pre-trained and, through transfer learning, can be fine-tuned to a given downstream task (such as sentiment classification, named entity recognition, question answering, or machine translation)
- OpenAI’s GPT-3 can learn a new task using few-shot learning, i.e. with only a few examples given in the prompt and without any fine-tuning or parameter update, a capability called in-context learning that we still cannot fully explain
- Google has proposed to reduce the insane cost of training LLMs while still increasing the number of parameters by leveraging a sparsely activated mixture-of-experts architecture – that is, only a small part of the network (a few of the experts) is activated for a given input
- Google’s LLM predictions have recently been improved using “chain-of-thought prompting”, in particular for multi-step math problems
- Google’s latest update to its search engine, the Multitask Unified Model (MUM), uses its text-to-text T5 transformer model and is, according to Google, 1,000 times more powerful than BERT. T5 is a sequence-to-sequence model with an encoder-decoder architecture, pre-trained with both supervised and self-supervised learning before fine-tuning. MUM is multi-modal (text and images).
- Google’s vision for LLMs is Pathways: a single model trained on millions of tasks, i.e. massive multi-task learning, that is multimodal (it accepts multiple types of inputs: text, images, and sounds) and sparsely activated. The Pathways Language Model (PaLM) is a step toward that vision. PaLM is auto-regressive like GPT-3 and benefits from chain-of-thought prompting. It has 540 billion parameters and was trained on a new Google machine learning platform called the Pathways system, which trained PaLM on 6,144 TPU v4 chips across multiple TPU pods
- Recent dialogue agents such as OpenAI’s chatGPT or DeepMind’s Sparrow are fine-tuned with reinforcement learning (RL) from human feedback – the human feedback provides the reward signal to the RL algorithm – which helps the agent, and eventually any AGI system, to be more aligned in its responses with humans
- While past research on LLMs has been quite successful at developing models with an insane number of parameters trained on large data sets, much more research is needed to better understand the generalization of those models, and in particular compositional generalization
- Last, transformers, the architecture behind LLMs, have emerged over the years as the lingua franca of deep learning models, not only for NLU/NLP and computer vision, but also in some subtle ways for RL and graph neural network (GNN) applications, and for other large-scale deep learning models such as protein folding prediction
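To make the self-attention bullets in the recap above a bit more concrete, here is a minimal sketch in plain Python of single-head scaled dot-product attention, with an optional causal mask for the decoder-only (GPT-style) case. It is only an illustration under simplifying assumptions: the function names `softmax` and `self_attention` are mine, the queries, keys, and values are the raw token vectors (a real transformer first applies learned projection matrices), and there is a single head and a single layer.

```python
import math


def softmax(scores):
    """Turn a list of scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]  # exp(-inf) is 0.0
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens, causal=False):
    """Single-head scaled dot-product self-attention.

    tokens: list of equal-length vectors, one per position.
    Returns (outputs, weights); weights[i][j] is how much
    position i attends to position j.
    """
    dim = len(tokens[0])
    outputs, weights = [], []
    for i, query in enumerate(tokens):
        scores = []
        for j, key in enumerate(tokens):
            if causal and j > i:
                # A decoder-only model may not look at future tokens.
                scores.append(-math.inf)
            else:
                dot = sum(q * k for q, k in zip(query, key))
                scores.append(dot / math.sqrt(dim))
        row = softmax(scores)
        weights.append(row)
        # Each output is a weighted average of all token vectors.
        outputs.append(
            [sum(w * tokens[j][d] for j, w in enumerate(row)) for d in range(dim)]
        )
    return outputs, weights


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy 2-d "embeddings"
_, full = self_attention(tokens)                # encoder-style: sees both sides
_, masked = self_attention(tokens, causal=True)  # decoder-style: sees the past only
print(full[0])
print(masked[0])
```

Without the mask, the first position can attend to words on its right (the BERT-style bidirectional case); with the mask, it can only attend to itself, which is what lets a GPT-style model be trained to predict the next token.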
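The sparsely activated mixture-of-experts idea from the recap above can be sketched in a few lines of plain Python. This is a toy illustration, not Google’s implementation: the experts are hypothetical one-line functions, the gate scores are given by hand (in a real model a small learned router produces them for each token), and `top_k_gate` and `moe_forward` are names I made up.

```python
import math


def top_k_gate(gate_scores, k):
    """Keep the k highest-scoring experts and softmax their scores.

    Experts that are not selected get no weight and are never run.
    """
    order = sorted(range(len(gate_scores)), key=lambda i: gate_scores[i], reverse=True)
    chosen = order[:k]
    peak = max(gate_scores[i] for i in chosen)
    exps = {i: math.exp(gate_scores[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_forward(x, experts, gate_scores, k=2):
    """Run only the selected experts and mix their outputs with the gate weights."""
    gate = top_k_gate(gate_scores, k)
    output = 0.0
    used = []
    for index, weight in gate.items():
        used.append(index)
        output += weight * experts[index](x)
    return output, sorted(used)


# Four toy "experts"; only two of them are evaluated for this input.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: x * x, lambda x: -x]
output, used = moe_forward(3.0, experts, gate_scores=[0.1, 2.0, 2.0, -1.0], k=2)
print(used)    # indices of the experts that were actually evaluated
print(output)  # equal-weight mix of expert 1 (2 * 3) and expert 2 (3 * 3)
```

The point of the design is that the parameter count grows with the number of experts, while the compute per input grows only with k.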
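Few-shot, in-context learning as described in the recap above happens entirely in the prompt: the examples are just text placed in front of the question, and no parameter is updated. Here is a small sketch of how such a prompt could be assembled. The `Review:` / `Sentiment:` layout and the function name `build_few_shot_prompt` are my own assumptions, and no model or API is called.

```python
def build_few_shot_prompt(task, examples, query):
    """Concatenate a task description, solved examples, and the unsolved query."""
    lines = [task, ""]
    for text, label in examples:
        lines.append(f"Review: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    lines.append(f"Review: {query}")
    lines.append("Sentiment:")  # the model is expected to continue from here
    return "\n".join(lines)


examples = [
    ("I loved every minute of it.", "positive"),
    ("A complete waste of time.", "negative"),
]
prompt = build_few_shot_prompt(
    "Classify the sentiment of each review as positive or negative.",
    examples,
    "The plot was thin but the acting was great.",
)
print(prompt)
```

The resulting string would be sent as-is to an autoregressive model, which completes the last line by predicting the most plausible next tokens.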
From GPT-3 to chatGPT – key materials from OpenAI
- chatGPT:
  - Blog article: “Optimizing Language Models for Dialogue”
- Proximal Policy Optimization:
  - Blog article: “Proximal Policy Optimization”
  - Paper: “Proximal Policy Optimization Algorithms”
  - GitHub: github.com/openai/baselines
- InstructGPT:
  - Blog article: “Aligning Language Models to Follow Instructions”
- Codex:
  - Blog article: “Powering Next Generation Applications with OpenAI Codex”
  - Video: OpenAI Codex Live Demo
  - GitHub Copilot
- GPT-3:
  - OpenAI’s API for GPT-3, Codex, and DALL-E
A good read is “How does GPT Obtain its Ability? Tracing Emergent Abilities of Language Models to their Sources” by Yao Fu, Hao Peng, and Tushar Khot, which helps to understand where the key abilities of GPT-3.5, and as a result those of chatGPT, come from.
Redefining language models as decision-making models based on reinforcement learning
From Sergey Levine, this is probably still a work-in-progress but quite novel…
Finally, here is how you can build a nano GPT with PyTorch in Colab from your home office:
Andrej Karpathy – Let’s build GPT: from scratch, in code, spelled out
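As a warm-up before watching the video, here is a much smaller sketch of the same autoregressive idea: predict the next character from the previous one, sample it, and feed it back in as context. It deliberately swaps the neural network for simple bigram counts and uses only the Python standard library (no PyTorch), so it is not the code from the video; the function names and the toy corpus are mine.

```python
import random
from collections import Counter, defaultdict


def train_bigram(text):
    """Count how often each character follows each other character."""
    counts = defaultdict(Counter)
    for current, following in zip(text, text[1:]):
        counts[current][following] += 1
    return counts


def next_char_probs(counts, current):
    """Conditional probability P(next | current) estimated from the counts."""
    total = sum(counts[current].values())
    return {ch: n / total for ch, n in counts[current].items()}


def generate(counts, start, length, seed=0):
    """Autoregressive sampling: each new character becomes the next context."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length):
        probs = next_char_probs(counts, out[-1])
        if not probs:  # the last character was never followed by anything
            break
        chars, weights = zip(*probs.items())
        out.append(rng.choices(chars, weights=weights)[0])
    return "".join(out)


corpus = "the bank by the river and the bank by the road"
model = train_bigram(corpus)
print(next_char_probs(model, "t"))  # in this corpus "t" is always followed by "h"
print(generate(model, "t", 20))
```

A GPT replaces the count table with a transformer that conditions on many previous tokens instead of one, but the sampling loop keeps the same shape.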
Note: The picture above shows a few elephant seals resting on one of the beaches of the Año Nuevo State Park.
Copyright © 2005-2023 by Serge-Paul Carrasco. All rights reserved.