Teaching Common Sense and Maths to LLMs

(T) The most common complaint users have about large language models (LLMs) is hallucination. In addition, LLMs are not good at maths. Here are a few resources for exploring the state of the art in improving LLMs in those two areas…

Common sense:

OpenAI’s GPT-3 and Google’s PaLM 2/PaLM models can learn a new task through few-shot learning, i.e. from only a few examples given in the prompt, without any fine-tuning or parameter updates. This capability, called in-context learning, is one that we still cannot fully explain.
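As a minimal sketch of what few-shot, in-context learning looks like from the user's side, the snippet below only assembles a prompt. The helper `build_few_shot_prompt` and the sentiment examples are hypothetical, and no model is called.

```python
def build_few_shot_prompt(examples, query, instruction=""):
    """Assemble a few-shot prompt: the task is conveyed only through examples.

    No model weights are changed; the examples live entirely in the prompt.
    """
    blocks = [instruction] if instruction else []
    for text, label in examples:
        blocks.append(f"Input: {text}\nOutput: {label}")
    # The final block leaves the output empty for the model to complete.
    blocks.append(f"Input: {query}\nOutput:")
    return "\n\n".join(blocks)


# Hypothetical sentiment examples (illustration only).
examples = [
    ("The food was wonderful.", "positive"),
    ("I waited an hour and left hungry.", "negative"),
]
prompt = build_few_shot_prompt(examples, "Great service and a fair price.")
print(prompt)
```

The same model can be pointed at a different task simply by swapping the examples, which is what distinguishes this from fine-tuning.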

However, LLMs can be uncertain about their answers and may hallucinate in order to complete their responses.

Both OpenAI and Google have proposed a few approaches to increase correct responses and reduce wrong ones in human dialogues with an LLM, but we are still far from “the Magic Recipe to Make AI Systems ‘Intelligent’”.

As of today, the approaches used for GPT-3 and PaLM 2/PaLM are:

  • Reinforcement learning from human feedback (RLHF) (OpenAI’s approach)
  • Fine-tuning the model first and improving it with RLHF second (Google’s approach)

A few new approaches have been proposed beyond what OpenAI and Google have done so far:

  • Knowledge distillation and knowledge graph (An approach from Professor Yejin Choi and her team at the University of Washington)
  • A supervised learning phase combined with a reinforcement learning phase that improves the predictions of the self-supervised LLM based on a set of principles and reduces the effort required from human labelers (Anthropic’s constitutional AI)

Maths:

  • ML models have been extremely successful, even outperforming humans, at playing games such as chess (IBM’s Deep Blue) and Go (DeepMind’s AlphaGo), and at generating and debugging source code (GitHub Copilot, built on OpenAI’s Codex; Google’s Bard), where the rules are very “formal”, i.e. “small to large search space with finite action space”
  • But generating solutions to complex “informal/intuitive” math problems, i.e. “large search space with infinite action space”, is still an area of research, although a neural theorem prover has been shown to solve a variety of challenging high-school olympiad problems

Teaching common sense

OpenAI – Reinforcement Learning from Human Feedback: Progress and Challenges

Dialogue agents such as OpenAI’s ChatGPT or DeepMind’s Sparrow are fine-tuned with reinforcement learning (RL) to improve the model’s predictions based on human feedback. Human labelers compare or rank candidate responses, and these preferences are used to train a reward model that supplies the reward signal for RL. This makes the agent/chatbot’s responses better aligned with what humans want.
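To make the reward-model step a little more concrete, here is a minimal sketch of a pairwise preference loss of the kind commonly used to train reward models from human comparisons. The scores passed in are hypothetical numbers, not outputs of any real model, and the sketch leaves out the RL step entirely.

```python
import math


def preference_loss(reward_preferred, reward_rejected):
    """Pairwise logistic loss: -log(sigmoid(r_preferred - r_rejected)).

    The loss is small when the reward model scores the human-preferred
    response higher than the rejected one, and large otherwise.
    """
    margin = reward_preferred - reward_rejected
    # log(1 + exp(-margin)) equals -log(sigmoid(margin)).
    return math.log1p(math.exp(-margin))


# Hypothetical reward-model scores for two candidate responses.
agrees = preference_loss(reward_preferred=2.0, reward_rejected=-1.0)
disagrees = preference_loss(reward_preferred=-1.0, reward_rejected=2.0)
print(f"reward model agrees with the labeler:    {agrees:.4f}")
print(f"reward model disagrees with the labeler: {disagrees:.4f}")
```

Minimizing this loss over many labeled comparisons pushes the reward model to rank responses the way the labelers did.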

OpenAI’s three papers on RLHF are “Learning to summarize from human feedback”:

“As language models become more powerful, training and evaluation are increasingly bottlenecked by the data and metrics used for a particular task. For example, summarization models are often trained to predict human reference summaries and evaluated using ROUGE, but both of these metrics are rough proxies for what we really care about — summary quality. In this work, we show that it is possible to significantly improve summary quality by training a model to optimize for human preferences. We collect a large, high-quality dataset of human comparisons between summaries, train a model to predict the human-preferred summary, and use that model as a reward function to fine-tune a summarization policy using reinforcement learning. We apply our method to a version of the TL;DR dataset of Reddit posts and find that our models significantly outperform both human reference summaries and much larger models fine-tuned with supervised learning alone. Our models also transfer to CNN/DM news articles, producing summaries nearly as good as the human reference without any news-specific fine-tuning. We conduct extensive analyses to understand our human feedback dataset and fine-tuned models. We establish that our reward model generalizes to new datasets, and that optimizing our reward model results in better summaries than optimizing ROUGE according to humans. We hope the evidence from our paper motivates machine learning researchers to pay closer attention to how their training loss affects the model behavior they actually want.”

“Training language models to follow instructions with human feedback”:

“Making language models bigger does not inherently make them better at following a user’s intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 using supervised learning. We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback. We call the resulting models InstructGPT. In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters. Moreover, InstructGPT models show improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public NLP datasets. Even though InstructGPT still makes simple mistakes, our results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent.”

And, “Scaling Laws for Reward Model Overoptimization”:

“In reinforcement learning from human feedback, it is common to optimize against a reward model trained to predict human preferences. Because the reward model is an imperfect proxy, optimizing its value too much can hinder ground truth performance, in accordance with Goodhart’s law. This effect has been frequently observed, but not carefully measured due to the expense of collecting human preference data. In this work, we use a synthetic setup in which a fixed “gold-standard” reward model plays the role of humans, providing labels used to train a proxy reward model. We study how the gold reward model score changes as we optimize against the proxy reward model using either reinforcement learning or best-of-n sampling. We find that this relationship follows a different functional form depending on the method of optimization, and that in both cases its coefficients scale smoothly with the number of reward model parameters. We also study the effect on this relationship of the size of the reward model dataset, the number of reward model and policy parameters, and the coefficient of the KL penalty added to the reward in the reinforcement learning setup. We explore the implications of these empirical results for theoretical considerations in AI alignment.”
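The best-of-n sampling mentioned in this last abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration: a "response" is just a number, and `gold_reward` and `proxy_reward` are hand-written functions, not trained reward models. The sketch does not reproduce the paper's functional forms; it only shows how optimizing harder against an imperfect proxy can eventually hurt the score we actually care about.

```python
import random


def best_of_n(sample_fn, proxy_reward, n, rng):
    """Draw n candidates and return the one the proxy reward scores highest."""
    candidates = [sample_fn(rng) for _ in range(n)]
    return max(candidates, key=proxy_reward)


# Toy stand-ins (assumptions): a "response" is a number between 0 and 3.
def sample_fn(rng):
    return rng.uniform(0.0, 3.0)


def gold_reward(x):
    # What we actually care about: best at x = 2, worse on either side.
    return -(x - 2.0) ** 2


def proxy_reward(x):
    # Imperfect proxy: agrees with gold up to x = 2, then keeps rewarding larger x.
    return x


rng = random.Random(0)
results = {}
for n in (1, 4, 16, 64):
    picks = [best_of_n(sample_fn, proxy_reward, n, rng) for _ in range(1000)]
    avg_proxy = sum(proxy_reward(p) for p in picks) / len(picks)
    avg_gold = sum(gold_reward(p) for p in picks) / len(picks)
    results[n] = (avg_proxy, avg_gold)
    print(f"n={n:3d}  proxy={avg_proxy:.3f}  gold={avg_gold:.3f}")
```

In this toy, the proxy score keeps rising with n, while the gold score first improves and then degrades once the selected samples overshoot what the gold reward prefers.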

Here is a talk at Berkeley by John Schulman, who leads the development of RL algorithms at OpenAI, on possible areas of research to improve LLM predictions:

Note that John believes that the model knows about the uncertainty of its responses, that it has an internal knowledge graph, and that the best way to mitigate the model’s uncertainty is through RL. John does not believe in behavior cloning based on a human knowledge graph.

Google – Fine Tuning Language Models

Google has proposed to use “instruction tuning” to train the LLM, an approach that it calls FLAN, introduced in the paper “Finetuned Language Models Are Zero-Shot Learners” (FLAN blog post, FLAN paper). Instruction tuning fine-tunes the model on a collection of NLP tasks, each phrased as a natural language instruction in the prompt, so that at inference time the model performs better on unseen tasks, i.e. in zero-shot learning.

FLAN is used for Bard powered by LaMDA, and probably also for Bard powered by PaLM 2. From the FLAN paper:

“This paper explores a simple method for improving the zero-shot learning abilities of language models. We show that instruction tuning — finetuning language models on a collection of tasks described via instructions — substantially improves zero-shot performance on unseen tasks.
We take a 137B parameter pretrained language model and instruction-tune it on over 60 NLP tasks verbalized via natural language instruction templates. We evaluate this instruction-tuned model, which we call FLAN, on unseen task types. FLAN substantially improves the performance of its unmodified counterpart and surpasses zero-shot 175B GPT-3 on 20 of 25 tasks that we evaluate. FLAN even outperforms few-shot GPT-3 by a large margin on ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze. Ablation studies reveal that number of finetuning datasets, model scale, and natural language instructions are key to the success of instruction tuning.”
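As a rough sketch of what “verbalized via natural language instruction templates” can mean in practice, the snippet below turns one labeled example into several instruction-style training pairs. The templates, the `verbalize` helper, and the example are all hypothetical; they are not taken from the FLAN paper.

```python
# Hypothetical instruction templates for one task (natural language inference).
NLI_TEMPLATES = [
    "Premise: {premise}\nHypothesis: {hypothesis}\n"
    "Does the premise entail the hypothesis? Answer yes or no.",
    "Read the premise and decide whether the hypothesis follows from it.\n"
    "Premise: {premise}\nHypothesis: {hypothesis}\nAnswer yes or no.",
]


def verbalize(example, templates):
    """Turn one labeled example into (instruction prompt, target) training pairs."""
    return [(t.format(**example["inputs"]), example["target"]) for t in templates]


# Hypothetical labeled example (illustration only).
example = {
    "inputs": {
        "premise": "A dog is running through a field.",
        "hypothesis": "An animal is outdoors.",
    },
    "target": "yes",
}
pairs = verbalize(example, NLI_TEMPLATES)
for prompt, target in pairs:
    print(prompt, "->", target, end="\n\n")
```

Phrasing the same example in several ways is one plausible way to keep the model from latching onto a single wording of the task.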

Following is what Google said about FLAN and RLHF in its description of Bard based on LaMDA:

“Our early work on instruction fine-tuning on demonstrated that fine-tuning with a relatively small amount of human assistance and feedback, as well as additional engineering, provided in various forms (e.g., fine-tuning, well-designed prompt engineering and user prompting, corrections or modeling of what a high-quality response would look like, or even users simply giving thumbs up or down) can help a model learn and improve. So if responses are flagged in Bard, trained human reviewers look at them to assess their quality related to the input prompt and determine if Bard’s response is low-quality, inaccurate or harmful. From there, trained evaluators suggest higher-quality responses in line with a defined set of policies, and these are then used as fine-tuning data to provide Bard a better dataset to learn from so it can produce improved responses in the future. To further improve Bard, we use a technique called Reinforcement Learning on Human Feedback (RLHF), which improves LLMs based on human preference feedback. And while we’ve learned a lot through the and our programs, the next critical step in meaningfully improving Bard is getting a wider range of experts’ and users’ feedback and evaluation.”

Symbolic Knowledge Distillation: from General Language Models to Commonsense Models

Professor Yejin Choi and her team at the University of Washington have worked on an interesting approach to improve the capabilities of LLMs with a framework called “symbolic knowledge distillation”, which builds on Professor Hinton’s research on “knowledge distillation”.

The approach is to train a smaller commonsense model (the student) that ends up with more common sense than the much larger general LLM it learns from (the teacher, here GPT-3). GPT-3 is prompted to generate commonsense knowledge as text, and a separately trained critic model, a classifier based on human inputs, filters out the low-quality generations. The student is then trained on the resulting knowledge graph, which the authors report surpasses a human-authored one.

Symbolic knowledge distillation demonstrates, according to its authors, that “humans and LMs can be effective collaborators for curating commonsense knowledge graphs and training efficient and performant commonsense models”.

“The common practice for training commonsense models has gone from-human-to-corpus-to-machine: humans author commonsense knowledge graphs in order to train commonsense models. In this work, we investigate an alternative, from-machine-to-corpus-to-machine: general language models author these commonsense knowledge graphs to train commonsense models. Our study leads to a new framework, Symbolic Knowledge Distillation. As with prior art in Knowledge Distillation (Hinton et al., 2015), our approach uses larger models to teach smaller models. A key difference is that we distill knowledge symbolically – as text – in addition to the neural model. We also distill only one aspect – the commonsense of a general language model teacher, allowing the student to be a different type, a commonsense model. Altogether, we show that careful prompt engineering and a separately trained critic model allow us to selectively distill high-quality causal commonsense from GPT-3, a general language model. Empirical results demonstrate that, for the first time, a human-authored commonsense knowledge graph is surpassed by our automatically distilled variant in all three criteria: quantity, quality, and diversity. In addition, it results in a neural commonsense model that surpasses the teacher model’s commonsense capabilities despite its 100x smaller size. We apply this to the ATOMIC resource, and share our new symbolic knowledge graph and commonsense models.”
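The generate-then-filter idea in this abstract can be sketched as follows. This is only an illustration under stated assumptions: `teacher_generate` and `critic_score` are canned stand-ins for GPT-3 and the trained critic, the event and candidate inferences are made up, and the final step of training the student on the filtered corpus is not shown.

```python
def distill_corpus(teacher_generate, critic_score, prompts, threshold=0.5):
    """Keep only the teacher generations that the critic accepts."""
    corpus = []
    for prompt in prompts:
        for candidate in teacher_generate(prompt):
            if critic_score(prompt, candidate) >= threshold:
                corpus.append((prompt, candidate))
    return corpus


# Canned stand-ins (assumptions) for the teacher LLM and the critic classifier.
CANNED_GENERATIONS = {
    "X pays for Y's coffee": ["X wants to be kind", "X is a coffee machine"],
}
CANNED_CRITIC_SCORES = {
    "X wants to be kind": 0.9,
    "X is a coffee machine": 0.1,
}


def teacher_generate(prompt):
    return CANNED_GENERATIONS[prompt]


def critic_score(prompt, candidate):
    return CANNED_CRITIC_SCORES[candidate]


corpus = distill_corpus(teacher_generate, critic_score, list(CANNED_GENERATIONS))
print(corpus)
```

The filtered pairs play the role of the symbolic, text-form knowledge graph on which a smaller student model would then be trained.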

Here is an IEEE article and video about symbolic knowledge distillation, and a video about Professor Choi’s paper from Yannic Kilcher:

Professor Choi recently gave an accessible talk at TED, “Why AI Is Incredibly Smart and Shockingly Stupid”:

Anthropic – Constitutional AI

Anthropic has recently proposed supervising the harmfulness of an LLM’s responses with another model, guided by a set of principles, an approach called “constitutional AI”:

Anthropic’s approach is to partially replace and improve “RLHF” with “RLAIF” (i.e. RL from “AI” feedback) with the following steps:

  • First, a supervised learning phase in which the model critiques and revises its own responses to harmful prompts according to the principles, and is then fine-tuned on the revised responses
  • Second, a reinforcement learning phase in which a preference model, trained on a data set of AI-generated preferences for harmless responses, provides the reward used to further fine-tune the supervised model

“As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as ‘Constitutional AI’. The process involves both a supervised learning and a reinforcement learning phase. In the supervised phase we sample from an initial model, then generate self-critiques and revisions, and then finetune the original model on revised responses. In the RL phase, we sample from the finetuned model, use a model to evaluate which of the two samples is better, and then train a preference model from this dataset of AI preferences. We then train with RL using the preference model as the reward signal, i.e. we use ‘RL from AI Feedback’ (RLAIF). As a result we are able to train a harmless but non-evasive AI assistant that engages with harmful queries by explaining its objections to them. Both the SL and RL methods can leverage chain-of-thought style reasoning to improve the human-judged performance and transparency of AI decision making. These methods make it possible to control AI behavior more precisely and with far fewer human labels.”
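The supervised phase described in this abstract (sample, self-critique, revise) can be sketched as a simple loop. This is a hedged illustration, not Anthropic's implementation: `EchoModel` is a stand-in that returns numbered placeholders instead of text, and the prompt wording and the two principles are made up. The RL phase is not shown.

```python
def critique_and_revise(model, prompt, principles):
    """Sample a response, then self-critique and revise it once per principle."""
    response = model(f"Human: {prompt}\nAssistant:")
    for principle in principles:
        critique = model(
            f"Response: {response}\n"
            f"Critique this response using the principle: {principle}"
        )
        response = model(
            f"Response: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    # (prompt, revised response) pairs would become supervised fine-tuning data.
    return prompt, response


class EchoModel:
    """Stand-in for an LLM: records prompts and returns a numbered placeholder."""

    def __init__(self):
        self.calls = []

    def __call__(self, text):
        self.calls.append(text)
        return f"<output {len(self.calls)}>"


# Hypothetical principles (illustration only).
principles = ["Do not help with illegal activity.", "Explain any objection politely."]
model = EchoModel()
pair = critique_and_revise(model, "How do I pick a lock?", principles)
print(pair)
print(len(model.calls), "model calls")
```

Note that the only human input in this loop is the list of principles; the critiques and revisions themselves come from the model.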

Anthropic Constitution AI Models

Teaching maths

Capabilities of LLMs in maths

Here is a talk at the Collège de France by ex-OpenAI researcher Stanislas Polu on the capabilities of LLMs in maths:

Stanislas believes that the most promising area of research is one where humans generate the maths problems and the model generates the solutions, which are then reviewed by humans and fed back to the model as new training data (this does not sound too far from RLHF?). Stanislas also believes that training a model on “informal” math questions will help it solve “formal” math questions better than training it on “formal” math data sets alone.

Formal and Informal Maths Problems

Stanislas led the work on a neural theorem prover (blog post, paper) at OpenAI to solve math olympiad problems:

“We built a neural theorem prover for Lean that learned to solve a variety of challenging high-school olympiad problems, including problems from the AMC12 and AIME competitions, as well as two problems adapted from the IMO. These problems are not standard math exercises, they are used to let the best high-school students from the US (AMC12, AIME) or the world (IMO) compete against each other. The prover uses a language model to find proofs of formal statements. Each time we find a new proof, we use it as new training data, which improves the neural network and enables it to iteratively find solutions to harder and harder statements.”
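For readers who have not seen a “formal statement”, here is a deliberately tiny example, written in Lean 4 syntax and not taken from the paper or its benchmarks. The theorem name is made up; `Nat.add_comm` is a lemma from Lean's core library.

```lean
-- A formal statement (the type after the colon) and a proof of it.
-- Lean's kernel checks mechanically that the proof establishes the statement.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

A prover of the kind described above is given only the statement and has to search for a proof that Lean accepts; olympiad-level statements are of course far harder than this one.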

Here is an interview of Stanislas by Yannic:

Here is Yannic’s video on Stanislas’s paper:

And here is an earlier paper from Stanislas, “Generative Language Modeling for Automated Theorem Proving”:

“We explore the application of transformer-based language models to automated theorem proving. This work is motivated by the possibility that a major limitation of automated theorem provers compared to humans — the generation of original mathematical terms — might be addressable via generation from language models. We present an automated prover and proof assistant, GPT-f, for the Metamath formalization language, and analyze its performance. GPT-f found new short proofs that were accepted into the main Metamath library, which is to our knowledge, the first time a deep-learning based system has contributed proofs that were adopted by a formal mathematics community.”

Google PaLM 2 on reasoning and math

Google’s technical report on PaLM 2 has a section on reasoning:

“The ability of large models to reason, to combine multiple pieces of information, and to make logical inferences is one of their most important capabilities. We evaluate PaLM 2’s reasoning capabilities on representative reasoning datasets in a few-shot setting including WinoGrande (Sakaguchi et al., 2021), ARC-C (Clark et al., 2018), DROP (Dua et al., 2019), StrategyQA (Geva et al., 2021), CommonsenseQA (CSQA; Talmor et al., 2019), XCOPA (Ponti et al., 2020), and BIG-Bench (BB) Hard (Suzgun et al., 2022). We compare to PaLM, GPT-4 (OpenAI, 2023b), and the state of the art (SOTA) for each dataset. We employ the instruction-tuned version of PaLM 2 (see Appendix A.2 for the detailed instruction tuning results) except for the multilingual XCOPA dataset.

PaLM 2 outperforms PaLM across all datasets and achieves results competitive with GPT-4. On the multilingual XCOPA dataset, PaLM 2 achieves particularly strong improvements on under-represented languages such as Swahili, Quechua, and Haitian and establishes a new state of the art even without chain-of-thought prompting (Wei et al., 2022) (see Appendix A.3 for the detailed results). On BIG-Bench Hard, PaLM 2 outperforms PaLM on every task, often by a large margin. We discuss improvements on the challenging BIG-Bench Hard tasks below.”

and on math reasoning:

“We evaluate PaLM 2 on MATH (Hendrycks et al., 2021), which contains 12,500 problems from high school competitions in 7 mathematics subject areas, GSM8K (Cobbe et al., 2021), a dataset of 8,500 grade school math word problems, and MGSM (Shi et al., 2023), a multilingual version of GSM8K with translations of a subset of examples into ten typologically diverse languages. We compare PaLM 2 to PaLM, Minerva (Lewkowycz et al., 2022), GPT-4 (OpenAI, 2023b), and the state of the art for each dataset.

For MATH, we follow Lewkowycz et al. (2022) and use the same 4-shot chain-of-thought prompt, combined with self-consistency (Wang et al., 2023) utilizing 64 sample paths. For GSM8K, we use the same 8-shot chain-of-thought prompt as in (Wei et al., 2022), and self-consistency with 40 sample paths. We use the SymPy library (Meurer et al., 2017) to compare answers and guard against false negatives, which arise from equivalent answers with different surface forms. For MGSM, we use 8-shot chain-of-thought prompts and in-language exemplars provided by Shi et al. (2023).

We show the results in Table 7. PaLM 2 outperforms PaLM dramatically on all datasets. On MATH, PaLM 2 is competitive with the state-of-the-art performance achieved by the dedicated Minerva model. On GSM8K, PaLM 2 outperforms Minerva and GPT-4 while on MGSM, it surpasses the state of the art even without self-consistency.”
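The self-consistency step and the answer-equivalence check mentioned in the report can be sketched in a few lines. Two things here are assumptions: the sampled answers are made up, and Python's standard `fractions.Fraction` is swapped in for SymPy, so this sketch only recognizes equivalent rational numbers, not general symbolic expressions.

```python
from collections import Counter
from fractions import Fraction


def canonical(answer):
    """Map equivalent surface forms ("1/2", "0.5", "2/4") to one value."""
    return Fraction(answer.strip())


def self_consistent_answer(sampled_answers):
    """Majority vote over the final answers of several sampled reasoning paths."""
    votes = Counter(canonical(a) for a in sampled_answers)
    answer, _count = votes.most_common(1)[0]
    return answer


# Hypothetical final answers extracted from five sampled chains of thought.
samples = ["1/2", "0.5", "3/4", "2/4", "0.75"]
winner = self_consistent_answer(samples)
print(winner)
```

Without the canonicalization step, “1/2”, “0.5”, and “2/4” would be counted as three different answers, which is the kind of false negative the report guards against.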

Note: The picture above is from Eataly at the Prudential Center in Boston.

Copyright © 2005-2023 by Serge-Paul Carrasco. All rights reserved.
Contact Us: asvinsider at gmail dot com