Let’s Be Clear About the Capabilities of Generative Models and Dialogue Agents

(T) “chatGPT has taken the world by storm.” That is what every one of us has been reading everywhere, every day, for over two months now. And every one of us has been discussing, over those two months, what chatGPT can do (log of best prompts around the Web, prompts per persona) or cannot do, who has blocked its usage and why (see for instance “Stack Overflow: ChatGPT is banned” and “NYC education department blocks ChatGPT on school devices”), and how to detect it.

Is chatGPT a tipping point in our journey toward artificial general intelligence (AGI)?

Probably not for deep learning scientists and engineers, as chatGPT does not bring any significant new innovation compared to previous generative language models such as its sibling model, OpenAI’s InstructGPT, or DeepMind’s Sparrow, or compared to newer models such as the one from Character.ai, a Silicon Valley start-up founded by a team of Xooglers who led the development of Google’s Meena and LaMDA, or Claude from Anthropic, a start-up founded by former OpenAI researchers and engineers. The challenge for the Character.ai and Anthropic teams is to build a large-scale machine learning platform that can meet the training requirements of large-scale data pipelines (training GPT-3 is estimated to have cost around $4 million).

But yes for the general public, chatGPT is a tipping point! Every one of us can play with the chatGPT prompt without having to pay a lot of $$$ to use OpenAI’s GPT-3 APIs, thanks to Microsoft, which is funding OpenAI’s hosting of chatGPT on Azure without any advertising. We do not know how long this free experimentation will last, but Microsoft will definitely try to monetize chatGPT soon, and the race between Google and OpenAI/Microsoft is just starting. Note also that you.com, the search engine started by Richard Socher, has already integrated a chat assistant (you.chat), although its capabilities do not yet seem to match those of chatGPT.

Let’s rewind: the state of AI today in a few quotes

Andrew Ng: “Artificial Intelligence is the New Electricity”

Chris Manning: “Electricity is the new AI?”

Andrej Karpathy: “GitHub Copilot has dramatically accelerated my coding, it’s hard to imagine going back to “manual coding”. Still learning to use it but it already writes ~80% of my code, ~80% accuracy. I don’t even really code, I prompt. & edit.”

Gary Marcus: “To achieve artificial general intelligence, AI needs to excel at more than just learning…AI needs to get better at abstraction.”

Chris Manning: “One of the philosophical questions is whether a computer can ever truly understand happiness or despair without an emotional response system as humans have. I’m not sure of the answer to that question.”

What about chatGPT and language models?

Gary Marcus and Ernest Davis: chatGPT “…has been trained to optimize the goal of producing one plausible-sounding word after another rather than actually engage with the meaning of language.”

Christopher Potts: “We don’t presently know of any compelling arguments that language models are incapable of achieving language understanding. I defined understanding in terms of robustly acquiring very complex language capabilities, rather than in terms of human-like capacities or behaviors, but I think the same basic considerations apply in either case.”

Let’s get a little more technical on what neural networks are and how GPT-3 works

Ilya Sutskever on OpenAI’s Codex: “I want to point out how insane it is that what we are showing to you works at all. It is fundamentally impossible to build such a system except by training a large neural network to do really good code autocomplete. And that is all we did. It is very simple conceptually, although perhaps not in practice, to just set up a large neural network, which is a large digital brain which has a mathematically sound learning procedure. And that part can be understood and it is relatively simple. And then you make the neural network work, and you train it on code autocomplete, and by being good enough at code autocomplete we get the capabilities that you are seeing here. It actually reads all the letters and all the words that we are giving to it, and it chews and digests them inside its neural activations, and then it generates the code that we see, and because the autocomplete is so accurate, the code actually runs and it runs correctly.”

Judea Pearl: “That sounds like sacrilege, to say that all the impressive achievements of deep learning amount to just fitting a curve to data. From the point of view of the mathematical hierarchy, no matter how skillfully you manipulate the data and what you read into the data when you manipulate it, it’s still a curve-fitting exercise, albeit complex and nontrivial. …To build truly intelligent machines, teach them cause and effect.”

Kevin Murphy: “Much of current machine learning focuses on the task of mapping inputs to outputs (i.e., approximating functions of the form f : X → Y), often using “deep learning”. Judea Pearl, a well known AI researcher, has called this “glorified curve fitting”. This is a little unfair, since when X and/or Y are high-dimensional spaces – such as images, sentences, graphs, or sequences of decisions/actions – then the term “curve fitting” is rather misleading, since one-dimensional intuitions often do not work in higher-dimensional settings. Nevertheless, the quote gets at what many feel is lacking in current attempts to “solve AI” using machine learning techniques, namely that they are too focused on prediction of observable patterns, and not focused enough on “understanding” the underlying latent structure behind these patterns.”

François Chollet in conversation with Lex Fridman:

  • FC: GPT-3 can learn new tasks after just being shown a few examples, but I am not convinced that it is really doing that. Given the amount of its training data, it is very likely that what it is actually doing is pattern matching a new task with a task that it has been exposed to in its training data; it is just recognizing the task instead of developing a model for a new task
  • LF: Is it possible to see a GPT-3 prompt as a kind of SQL query into what it has learned, so that language is used to query its memory and the neural network is a giant memorization “thing”, and if it gets sufficiently giant it will memorize sufficiently large amounts of world data, to the point where its intelligence is equivalent to that of a querying machine?
  • FC: A significant chunk of its intelligence is its giant associative memory, but I do not believe that intelligence is just a giant memory but it may be a big component
  • LF: Do you think GPT-n will reason – oops – that’s a bad question – a better question – what do you think is the ceiling of GPT-n?
  • FC: I believe that GPT-n will improve on the strengths of GPT-2 and GPT-3 in generating more plausible text in context. If you train your model on more data, it becomes more context-aware and increasingly plausible. But I do not believe that more transformer layers and more training data will address the flaws of GPT-3: it can generate plausible text, but that text is not constrained by anything other than plausibility. In particular, it is not constrained by factualness or even consistency, which is why it is very easy to get GPT-3 to generate statements that are factually untrue or even self-contradictory. Its only training goal is plausibility; it is not constrained to be self-consistent, for instance. So you can steer the answer it will give you by asking the question in a specific way, because it is very responsive to the way you ask the question and has no understanding of its content. If you ask the same question in two different ways that are adversarially engineered to produce certain answers, you will get two different answers that contradict each other
  • LF: It is very sensitive to adversarial attacks
  • FC: In general, the problem with these generative models is that they are very good at generating plausible text, but that is just not enough. One interesting avenue would be to make it possible to write programs over the latent spaces that these models operate on; you would rely on these self-supervised models to generate a pool of knowledge and concepts, and you would write explicit reasoning programs over it
  • FC: If you try to use GPT-3 to generate programs, it will perform well for any program that it has seen in its training data, but because program space is not interpolative, it will not be able to generalize to problems it has not seen before
  • FC: If you want to use generative models in real products, the quality of the training data is important: it should be as factual and as unbiased as possible, assuming there is such a thing as unbiased data. Quality training data with human labels is better than large noisy data; more noisy data will improve the model predictions, but you will quickly reach diminishing returns

Let’s recap: what are language models?

  1. Large language models, which belong to the broader family now called foundation models, are used in natural language understanding/processing (NLU/NLP) applications
  2. They generally predict the conditional probability of the next word (or token = {character, sub-word, or word}) in a sentence given the k previous tokens
  3. Language models have evolved from continuous word representations such as word2vec (2013), to contextual word representations based on recurrent neural networks (RNNs) (2014), to bidirectional contextual word representations based on Google’s transformer architecture (2017)
  4. Scaling language models (e.g. increasing the number of layers and the number of parameters of the neural network) into large language models (LLMs) has significantly improved the accuracy of their predictions
  5. LLMs are self-supervised, i.e. they derive the labels (what the right next word should be) from the text itself, and are trained on very large text corpora, generally using Adam or stochastic gradient descent
  6. All LLMs now leverage Google’s transformer architecture, proposed in the famous paper “Attention Is All You Need”. Self-attention is the mechanism that helps the model learn the relationships between all words in a sentence, regardless of their respective positions. As an example, the meaning of “bank” in the sentence “I arrived at the bank after crossing the…” depends on the last word, which could be “road” or “river”
  7. A transformer is a deep stack of “self-attention” layers. One transformer architecture is a sequence-to-sequence model with an encoder and a decoder, where attention is used in both – as an example, for machine translation the input to the encoder will be English sentences and the decoder will output those sentences in French – but transformers can also be encoder-only (such as Google’s BERT) or decoder-only (such as OpenAI’s GPT-3)
  8. Google’s BERT is an autoencoding model that uses an encoder-only architecture. Autoencoding models use a bidirectional representation of the sentence, i.e. both the left and the right context of each word (compare “I went to the river bank” and “I need to go to the bank to drop a check”)
  9. OpenAI’s GPT-3 is an autoregressive model that uses a decoder-only architecture. Autoregressive models use the predictions from previous time steps of the decoder to generate the prediction at the current time step
  10. Google’s BERT is pre-trained and, through transfer learning, can be fine-tuned to a given downstream task (such as sentiment classification, named entity recognition, question answering, or machine translation)
  11. OpenAI’s GPT-3 can learn a new task through few-shot learning, i.e. with only a few examples in the prompt and without any fine-tuning or parameter update, a capability called in-context learning that we still cannot fully explain
  12. Google has proposed to reduce the insane cost of training LLMs while still increasing the number of parameters by leveraging a sparsely activated mixture-of-experts architecture, i.e. only a small part of the network (a few experts) is activated for any given input
  13. The predictions of Google’s LLMs have recently been improved using “chain-of-thought prompting”, in particular for multi-step math problems
  14. Google’s latest update to its search engine, the Multitask Unified Model (MUM), is built on its text-to-text T5 transformer framework; Google describes MUM as 1,000 times more powerful than BERT. T5 is a sequence-to-sequence model with an encoder-decoder architecture, pre-trained with both supervised and self-supervised learning before fine-tuning. MUM is multimodal (text and images)
  15. Google’s vision for LLMs is Pathways: a single, sparsely activated, multimodal model (text, images, and sounds) trained on millions of tasks, i.e. massive multi-task learning. Its first large language model, the Pathways Language Model (PaLM), is a dense, decoder-only model that is autoregressive like GPT-3 and benefits strongly from chain-of-thought prompting. PaLM has 540 billion parameters and was trained with Google’s new machine learning platform, the Pathways system, on 6,144 TPU v4 chips across multiple TPU pods
  16. Recent dialogue agents such as OpenAI’s chatGPT or DeepMind’s Sparrow are fine-tuned with reinforcement learning (RL) based on human feedback: the feedback is used to train a reward model, which provides the reward to the RL algorithm. This enables the agent to be more aligned with humans in its responses
  17. While past research on LLMs has been quite successful at developing models with an insane number of parameters trained on large data sets, much more research is needed to better understand the generalization of those models, and in particular compositional generalization
  18. Last, transformers, the architecture behind LLMs, have emerged over the years as the lingua franca of deep learning models, not only for NLU/NLP and computer vision, but also in some subtle ways for RL and graph neural network (GNN) applications, and for other large-scale deep learning models such as protein folding prediction
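
To make the second point of the recap concrete, here is a minimal sketch of estimating the conditional probability of the next token given the k previous tokens. It assumes a simple count-based estimator rather than a neural network, so it only illustrates the quantity an LLM learns to predict, not how GPT-3 computes it. The toy corpus and the function names `train_ngram` and `next_token_probs` are made up for this illustration.

```python
from collections import Counter, defaultdict


def train_ngram(tokens, k):
    """Count how often each token follows each context of k previous tokens."""
    counts = defaultdict(Counter)
    for i in range(k, len(tokens)):
        context = tuple(tokens[i - k:i])
        counts[context][tokens[i]] += 1
    return counts


def next_token_probs(counts, context):
    """Return P(next token | context) as a dict, or {} for an unseen context."""
    followers = counts.get(tuple(context), Counter())
    total = sum(followers.values())
    return {tok: n / total for tok, n in followers.items()}


# Toy corpus (hypothetical), tokenized at the word level.
corpus = "i crossed the road to the bank . i crossed the river to the bank .".split()
model = train_ngram(corpus, k=2)
probs = next_token_probs(model, ["crossed", "the"])
print(probs)  # {'road': 0.5, 'river': 0.5}
```

A neural language model replaces the count table with a parameterized function, which lets it assign sensible probabilities to contexts it has never seen verbatim; the count-based version above simply returns an empty result in that case.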

From GPT-3 to chatGPT – key materials from OpenAI

A good read is “How does GPT Obtain its Ability? Tracing Emergent Abilities of Language Models to their Sources” by Yao Fu, Hao Peng, and Tushar Khot, which helps to understand where the key abilities of the GPT-3.5 model family, and as a result those of chatGPT, come from.

Redefining language models as decision-making models based on reinforcement learning

From Sergey Levine, this is probably still a work-in-progress but quite novel…

What precisely is it that a language model actually models? It models people. Most literally, it models the buttons that human beings press on keyboards when sitting in front of a computer. This is obviously not the same as a system that can fluently converse with humans, fulfill commands and tasks, or understand the world. But perhaps in some ways it’s better: if we set aside our expectations that language models should actually produce language and view them through the lens of reinforcement learning and optimal control, it becomes clear that what a language model actually provides us with is a way to model and predict the behavior of humans, and such a predictive model can be tremendously useful in the context of a reinforcement learning system that is tasked with interacting with humans in a goal-directed and intentional manner. Since humans use computers for so many day-to-day tasks, a model that can accurately predict which buttons humans will push on a computer keyboard in fact represents a powerful (if incomplete) model of general human behavior.

This suggests that perhaps language models can serve as the foundation for very powerful conversational agents, which are aware of human tendencies, social norms, and some of the deeper aspects of our psychology, but would require integration with an effective reinforcement learning framework that can leverage such models to make effective and goal-directed decisions.

Finally, here is how you can build a nano GPT-3 with PyTorch in Colab in your home office

Andrej Karpathy – Let’s build GPT: from scratch, in code, spelled out:
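
The video itself is not reproduced here. As a rough, independent sketch of the building block that a decoder-only model like GPT stacks many times, the snippet below implements a single causal self-attention head in plain Python. It is not the code from the video and does not use PyTorch; the function names, the 2-dimensional toy embeddings, and the identity matrices standing in for learned projection weights are all assumptions made for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def causal_self_attention(xs, w_q, w_k, w_v):
    """Single attention head: each position attends to itself and earlier positions only."""
    queries = [matvec(w_q, x) for x in xs]
    keys = [matvec(w_k, x) for x in xs]
    values = [matvec(w_v, x) for x in xs]
    scale = math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for t, q in enumerate(queries):
        # Causal mask: only positions 0..t are visible at step t.
        scores = [sum(a * b for a, b in zip(q, keys[s])) / scale for s in range(t + 1)]
        attn = softmax(scores)
        weights.append(attn)
        # Output at step t is the attention-weighted average of the visible values.
        outputs.append([
            sum(a * values[s][d] for s, a in enumerate(attn))
            for d in range(len(values[0]))
        ])
    return outputs, weights


# Hypothetical 2-dimensional embeddings for a 3-token sequence.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for learned projection matrices
outputs, weights = causal_self_attention(tokens, identity, identity, identity)
for t, row in enumerate(weights):
    print(t, [round(a, 3) for a in row])
```

Each printed row has one more entry than the previous one, which is the causal mask at work: the first token can only attend to itself, while the last token distributes its attention over the whole prefix. A real implementation would batch this with tensor operations, use several heads, and learn the projection matrices by gradient descent.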


Note: The picture above shows a few elephant seals resting on one of the beaches of Año Nuevo State Park.

Copyright © 2005-2023 by Serge-Paul Carrasco. All rights reserved.