Grab some popcorn 🍿! Today we’ll look at how OpenAI turned the monster one gets from training on raw internet data into something useful. We’ll walk through three key stages:
- Text-continuation model (GPT-3, June 2020)
- Instruction-following aligned model (InstructGPT, January 2022)
- Chatbot (ChatGPT, November 2022)
By “text-continuation model” I am referring to the models we get by “just” optimizing for next-token prediction on a huge corpus of generic text. The loss function used is categorical cross-entropy on the correct next token.
Without heavy prompt engineering or finetuning (like the one we will see in this post), these models aren’t good at question-answering and aren’t necessarily aligned with their users. As a result of being trained on huge amounts of unsupervised data from the internet, they often return untruthful, toxic, or biased outputs.
In this post, we’ll see how these problems got addressed.
2017: Reinforcement Learning from Human Preferences
Before directly jumping into LLMs, let’s first go back and review the paper RL from human preferences from 2017¹ (context: Tip 1).
1 This might look a bit disconnected from the main topic but bear with me.
For me this was the golden age of Reinforcement Learning (RL) Tip 2:
New algorithms were being published left and right to solve OpenAI’s Gym environment tasks.
AlphaGo² from DeepMind had just defeated Lee Sedol at Go: a game with a search space so big that it was considered almost impossible for a machine to achieve human-level performance.
DeepMind and OpenAI were battling to master StarCraft and Dota respectively with RL-based algorithms.
OpenAI was about to publish some research on impressive dexterity learning skills, and agents learning to play hide and seek.
2 I can’t recommend enough the linked documentary, super emotional portrait of human vs machine capabilities.
3 I’m actually not sure how conscious they were of it at that moment…
4 For instance, think of gymnastics: for most of us it is very hard to demonstrate what a good move looks like 🤸♀️. However, it is very easy to tell apart a pathetic one from a decent one.
RL from human preferences laid the foundations of what would become key to later align language models³. In particular, they focus on hard-to-specify but easy-to-judge tasks: oftentimes we can recognize a desired behavior but not necessarily demonstrate it⁴.
Reinforcement Learning (RL) is a branch of Machine Learning concerned with how artificial agents ought to take actions in an environment in order to maximize some notion of reward. The objective for the agent is to learn an optimal policy (how to act in the given environment so that reward is maximized).
Arguably, this is what living beings also evolved to do: Act in a way that maximizes personal well-being and species continuation.
During my master’s I became a bit obsessed with RL (and the reward hypothesis) and I have a lot of posts, paper summaries, and experiments around it in my old blog: CampusAI/RL.
How does it work?
Consider a basic RL problem for which, for whatever reason⁵, we don’t have a specified reward function.
5 Such as it being too hard, too sparse or too open-ended.
6 E.g. a high reward for a good action, a low one for a bad action.
7 Remember that in the context of RL, a trajectory is the sequence of states and actions the agent goes through in the environment.
One solution would be to have a human manually giving feedback (aka rewards) to the agent⁶. While this would work, it isn’t sample efficient: it takes too many human inputs, plus it might be hard to consistently give an absolute value to a particular trajectory⁷.
Their proposal is to train a new model (the reward predictor) which approximates the reward function⁸ in a supervised way. It is trained on pairs of agent trajectories labeled with a human “preferability” judgment, and learns to output a score for each trajectory. The agent then updates its policy according to the reward given by this reward predictor model. This generalizes from few training examples, so the human doesn’t need to label as much data 🥳.
8 Mapping from trajectories to desirability.
This is the overall idea:
This is all very nice conceptually, but a question remains: how does one map “I prefer this one over that one” to an actual scalar that can be used as a reward?
Simple: we assume that this scalar exists and we approximate it. You can think of it as some latent factor $\hat{r}$ assigning a total “desirability” to each trajectory $\sigma$:

$$\hat{r}(\sigma) = \sum_t \hat{r}(o_t, a_t)$$

Where $\hat{r}(o_t, a_t)$ is the predicted reward for the observation-action pair at step $t$.

We can then write the probability of preferring trajectory $\sigma^1$ over $\sigma^2$ as a softmax of the latent factor (expected reward):

$$\hat{P}\left[\sigma^1 \succ \sigma^2\right] = \frac{\exp \hat{r}(\sigma^1)}{\exp \hat{r}(\sigma^1) + \exp \hat{r}(\sigma^2)}$$

To learn the reward (latent factor) we can minimize a cross-entropy loss on the human preference labels $\mu$:

$$\mathcal{L}(\hat{r}) = - \sum_{(\sigma^1, \sigma^2, \mu)} \mu(1) \log \hat{P}\left[\sigma^1 \succ \sigma^2\right] + \mu(2) \log \hat{P}\left[\sigma^2 \succ \sigma^1\right]$$
SUMMARY: We are doing standard binary classification and using the logits as the reward value.
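To make this concrete, here is a minimal sketch of that pairwise loss in PyTorch (my own illustration, not OpenAI’s code; `reward_model` stands in for any network mapping a trajectory, or later a prompt-response pair, to a single scalar):

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_model, preferred, rejected):
    """Pairwise (Bradley-Terry) loss: the preferred trajectory should get the higher scalar reward.

    `preferred` and `rejected` are batches of trajectories already encoded as
    tensors; `reward_model` maps each one to a single logit (the latent reward).
    """
    r_preferred = reward_model(preferred)  # shape: (batch, 1)
    r_rejected = reward_model(rejected)    # shape: (batch, 1)
    # -log sigmoid(r_p - r_r) is exactly the cross-entropy of the two-way softmax
    # above, with the human-preferred option as the "correct class".
    return -F.logsigmoid(r_preferred - r_rejected).mean()
```

Note how the difference of the two predicted rewards plays the role of the logit of a binary classifier, which is exactly what the summary above says.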
The reward predictor and the main agent’s policy are trained simultaneously. Thus, to train the agent they focus on RL methods which are robust to changes in the reward function, such as policy gradient methods. These methods directly learn the best policy, so a change in reward isn’t as critical as in methods that attempt to learn the “value” of states or actions, such as value-based or model-based methods.
Consider a simulated robot whose goal is to learn how to do a backflip. Interestingly, it was able to learn this trick with 1 hour of non-expert human feedback. By contrast, it took an expert human 2 hours to write a reward function which achieved the same goal.
2021: From text-continuation to instruction-following
Alright, fast forward to the Large Language Model (LLM) era (aka 2021). We now have very powerful LLMs which do next-token prediction very well (such as GPT-3). These are great starting points to finetune for different tasks, one of which is question-answering.
How do we make sure their answers are factually correct, unbiased, and aligned with our intentions? This actually falls into the realm of things that are hard to define but (relatively) easy to evaluate. Wait!? Didn’t we just talk about a possible solution to this type of problem? 😱🤯 What were the chances??⁹
9 Actually quite high, given I planned this minimally.
Anyway, this is how they do it:
- Step 0: Create a dataset of instructions and desired answers (humans do this)¹⁰.
- Step 1: Good old supervised finetuning (SFT) of the Language Model (LM) using the previous dataset.
- Step 2: Sample prompts from the dataset and generate several answers per prompt with the finetuned model¹¹. Humans rank these answers from best to worst, and a reward model is trained on these comparisons¹², exactly as in the 2017 paper.
- Step 3: Further finetune the LM with RL:
- Sample questions from the dataset and generate new answers using the finetuned model.
- Forward these question-answer pairs through the reward model, obtaining a “score” (aka reward) for each one.
- Adjust the LM weights so that high-score answers are more likely than low-score ones using an RL algorithm. The algorithm they used is Proximal Policy Optimization (PPO) Tip 5. A minimal sketch of this loop follows below.
10 Allegedly (and sadly not-very-surprisingly) under bad working conditions.
11 I assume they do this by randomizing the model outputs a bit. For instance, by playing with the slack given by temperature or other sampling techniques (i.e. not greedily choosing the most likely next token).
12 Same as if they were agent trajectories in an environment as we saw previously.
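If the steps above feel abstract, here is a rough end-to-end sketch of Steps 2 and 3 in Python. Everything in it is a hypothetical stand-in (`sft_generate`, `human_rank`, `train_reward_model`, `ppo_update` are made-up stubs showing the data flow, not a real API):

```python
import random

prompts = ["Explain gravity", "Write a haiku about RL"]  # toy instruction dataset

def sft_generate(prompt, n=4):
    """Stub for the supervised-finetuned LM returning n candidate answers."""
    return [f"{prompt} -> answer #{i}" for i in range(n)]

def human_rank(answers):
    """Stub for the human labelers ranking answers from best to worst."""
    return sorted(answers, key=lambda _: random.random())

def train_reward_model(comparisons):
    """Stub for Step 2: fit a model on (better, worse) pairs, e.g. with the pairwise loss above."""
    return lambda prompt, answer: random.random()  # maps (prompt, answer) -> scalar score

def ppo_update(policy, prompt, answer, score):
    """Stub for Step 3: nudge the LM so that high-score answers become more likely."""
    pass

# Step 2: collect human comparisons and train the reward model.
comparisons = []
for prompt in prompts:
    ranked = human_rank(sft_generate(prompt))
    comparisons += [(ranked[i], ranked[j])  # every (better, worse) pair in the ranking
                    for i in range(len(ranked)) for j in range(i + 1, len(ranked))]
reward_model = train_reward_model(comparisons)

# Step 3: RL finetuning loop.
policy = "the SFT model"  # placeholder
for prompt in prompts:
    answer = sft_generate(prompt, n=1)[0]
    score = reward_model(prompt, answer)
    ppo_update(policy, prompt, answer, score)
```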
Alright, let’s shift into first gear… This is the steepest part of the post. It might seem like going a bit too deep, but I still wanted to refresh my memory on these algorithms (they are very cool).
Basics
Consider a standard RL problem¹⁴ defined by:
- A state space $\mathcal{S}$: the set of all possible states the agent can be in.
- An action space $\mathcal{A}$: the set of all possible actions the agent can take in the given environment.
- A reward function $r(s_t, a_t)$: the reward given for taking action $a_t$ in state $s_t$ at time $t$.
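To make the notation concrete, here is a minimal interaction loop using the Gymnasium library (CartPole is just an arbitrary example environment): states come from `env.reset()` / `env.step()`, actions live in `env.action_space`, and the reward is returned by the environment at every step.

```python
import gymnasium as gym

env = gym.make("CartPole-v1")           # the environment defines S, A and r
state, info = env.reset(seed=0)         # initial state s_0

total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()  # a_t (here chosen by a random policy)
    state, reward, terminated, truncated, info = env.step(action)  # r(s_t, a_t) and s_{t+1}
    total_reward += reward
    done = terminated or truncated

env.close()
print(f"Total reward of this trajectory: {total_reward}")
```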
Policy Gradient Methods
Consider an explicit, differentiable policy parametrized¹³ by $\theta$: $\pi_\theta(a \mid s)$.

Usually we write the objective as the expected return:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[R(\tau)\right]$$

Policy gradient methods find the optimal policy parameters $\theta^\star$ by performing gradient ascent directly on this objective:

$$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$$

Where $\tau = (s_0, a_0, s_1, a_1, \dots)$ is a trajectory obtained by running the policy in the environment, and $R(\tau) = \sum_t r(s_t, a_t)$ is its total reward.

Exploiting the log gradient trick we obtain that:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(\tau)\, R(\tau)\right]$$

Then, by plugging in $\pi_\theta(\tau) = p(s_0) \prod_t \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$ (the environment dynamics don’t depend on $\theta$, so they vanish under the gradient), we get the REINFORCE estimator:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\left(\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right) R(\tau)\right]$$
This is roughly how the algorithm looks; below is a minimal sketch (not the exact pseudocode from the paper):
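The sketch assumes PyTorch and the CartPole environment from the Gymnasium example above; it samples a trajectory with the current policy, computes its total reward, and takes one gradient step on the REINFORCE objective:

```python
import gymnasium as gym
import torch
from torch import nn, optim
from torch.distributions import Categorical

env = gym.make("CartPole-v1")
# A tiny policy network pi_theta(a|s): maps a state (4 numbers) to a distribution over 2 actions.
policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
optimizer = optim.Adam(policy.parameters(), lr=1e-2)

for episode in range(500):
    log_probs, rewards = [], []
    state, _ = env.reset()
    done = False
    while not done:                                   # 1. sample a trajectory with the current policy
        dist = Categorical(logits=policy(torch.as_tensor(state, dtype=torch.float32)))
        action = dist.sample()
        log_probs.append(dist.log_prob(action))       # log pi_theta(a_t|s_t)
        state, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(reward)
        done = terminated or truncated

    total_return = sum(rewards)                             # 2. R(tau): total reward of the trajectory
    loss = -torch.stack(log_probs).sum() * total_return     # 3. -log pi_theta(tau) * R(tau), negated for ascent
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```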
Actor-Critic Methods
Purely policy-gradient-based methods such as REINFORCE suffer from high variance in their gradient estimates in practice. This makes learning unstable and slow.
Oftentimes, they get combined with actor-critic methods. You won’t believe it, but those methods actually have two components:
- The actor: the policy function $\pi_\theta(a \mid s)$ that selects the actions based on the current state.
- The critic: a function evaluating how good the state (or the action being taken) is. Usually $V(s)$ if evaluating the state-value, $Q(s, a)$ if evaluating the action-value.
In the case of TRPO and PPO, they make use of the advantage function as a critic (A2C paper):

$$A(s, a) = Q(s, a) - V(s)$$
It models how advantageous each action is in a given state.
It turns out we can measure how advantageous some new parameters $\theta'$ are with respect to the current ones $\theta$ using only the advantage function of the current policy (the so-called performance difference lemma):

$$J(\theta') - J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta'}}\left[\sum_t \gamma^t A^{\pi_\theta}(s_t, a_t)\right]$$

Where $\gamma$ is the discount factor. Iteratively finding better parameters that maximize this difference is known as Policy Improvement. It can be shown that this improvement objective is actually the same one as in Policy Iteration, thus guaranteeing convergence to the global optimum. Thus, we want to optimize:

$$\max_{\theta'}\; \mathbb{E}_{\tau \sim \pi_{\theta'}}\left[\sum_t \gamma^t A^{\pi_\theta}(s_t, a_t)\right]$$
TRPO (Trust Region Policy Optimization)
TRPO (2015) was one of the first very successful policy gradient methods, achieving much better sample efficiency than its value-based counterparts. Its main contribution is to turn the algorithm from an on-policy method into an off-policy method¹⁵. It does so by taking advantage of importance sampling:

$$\mathbb{E}_{x \sim p}\left[f(x)\right] = \mathbb{E}_{x \sim q}\left[\frac{p(x)}{q(x)} f(x)\right]$$
The objective function then becomes:

$$\max_\theta\; \mathbb{E}_{s, a \sim \pi_{\theta_{old}}}\left[\frac{\pi_\theta(a \mid s)}{\pi_{\theta_{old}}(a \mid s)}\, A^{\pi_{\theta_{old}}}(s, a)\right]$$

Subject to:

$$\mathbb{E}_{s \sim \pi_{\theta_{old}}}\left[D_{KL}\left(\pi_{\theta_{old}}(\cdot \mid s)\,\|\,\pi_\theta(\cdot \mid s)\right)\right] \le \delta$$

Where $\pi_{\theta_{old}}$ is the policy that collected the trajectories, $D_{KL}$ is the Kullback-Leibler divergence, and $\delta$ is a hyperparameter limiting how far the new policy is allowed to move away from the old one (the “trust region”).
This algorithm works pretty well, but the constrained optimization is hard to implement.
PPO (Proximal Policy Optimization)
PPO (2017) addresses the previous issue by introducing a slight change in the objective function. Let’s first define the probability ratio:

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}$$

The objective function of PPO is¹⁶:

$$L^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\, \hat{A}_t,\; \text{clip}\left(r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon\right) \hat{A}_t\right)\right]$$

Where $\hat{A}_t$ is the estimated advantage at timestep $t$ and $\epsilon$ is a small hyperparameter (e.g. 0.2).
The idea is to discourage the model from making big policy changes in a single update by limiting the objective it can obtain (done by the clip and the min operators).
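In code, the clipped objective is just a few lines. Here is a minimal PyTorch sketch (tensor names are mine): it computes the loss for a batch of timesteps, given the log-probabilities of the taken actions under the old and the new policy and the estimated advantages.

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, epsilon=0.2):
    """Clipped PPO surrogate (negated so it can be minimized with gradient descent).

    logp_new:   log prob of the taken actions under the policy being optimized
    logp_old:   log prob under the policy that collected the data (detached)
    advantages: the estimated advantages A_t
    """
    ratio = torch.exp(logp_new - logp_old)                      # r_t(theta)
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon)  # clip(r_t, 1-eps, 1+eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```

Because large ratios stop contributing to the objective, the same batch of trajectories can safely be reused for several optimization epochs, which is where much of PPO’s sample efficiency comes from.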
16 If you are trying to parse this, keep in mind that the clip operator simply constrains $r_t(\theta)$ to the interval $[1 - \epsilon, 1 + \epsilon]$.
15 Essentially allowing the model to learn from trajectories sampled by a policy different from the one currently being trained.
14 I’m simplifying some things like considering finite state-spaces, action-spaces, and trajectories.
13 E.g. by the weights of an ANN.
Ok, so we learned how the PPO algorithm works for agents interacting with a generic environment Tip 5. Let’s now see how this maps onto today’s problem (aligning LLMs):
- State space: the input prompt given to the model.
- Action space: the response given by the model.
- Reward function $r(\text{prompt}, \text{response})$: provided by the reward estimator; given a prompt and a model output, it returns a scalar of “how good” it thinks the pairing is.
- Policy: now the parametrized policy is the LLM itself: an “agent” which takes an input state (prompt) and outputs an action (response).
Notice that the temporal aspect from before is now gone: we are dealing with a one-step (static) problem. We don’t have a notion of “trajectories” we need to maximize the reward over. With all this in mind, the reward expectation looks something like this:

$$J(\theta) = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\left[r_\phi(x, y)\right]$$

Where $x$ is a prompt sampled from the dataset $\mathcal{D}$, $y$ is a response sampled from the current LLM policy $\pi_\theta$, and $r_\phi$ is the reward model. I assume they estimate this expectation by Monte Carlo: sampling a batch of prompts and generating a response for each one with the current policy.
Thus, we can just do a gradient step as we would in the PPO case Tip 5. If by now you are thinking something like “Wait… What are we even doing here? Can whatever-this-is be called RL??” Don’t worry, you are not alone. Here you have some twitter drama.
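To tie the mapping together, here is a heavily simplified sketch assuming PyTorch and a Hugging Face causal LM (gpt2 and the toy reward function are placeholders, and I use a plain reward-weighted log-likelihood step instead of the full PPO-clipped update from above): the “action” is the whole generated response, its log-probability is the sum of the per-token log-probabilities, and the reward model provides the scalar that weights the update.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")      # stand-in for the SFT model
policy = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-5)

def reward_model(prompt, response):
    """Placeholder for the trained reward model: (prompt, response) -> scalar."""
    return torch.tensor(1.0 if "because" in response else -1.0)  # toy heuristic

prompt = "Why is the sky blue?"
prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids

# "Action": sample a full response from the current policy.
with torch.no_grad():
    full_ids = policy.generate(prompt_ids, max_new_tokens=30, do_sample=True,
                               pad_token_id=tokenizer.eos_token_id)
response = tokenizer.decode(full_ids[0, prompt_ids.shape[1]:])

# log pi_theta(response | prompt) = sum of the log-probs of the generated tokens.
logits = policy(full_ids).logits[:, :-1, :]            # predictions for tokens 1..T
targets = full_ids[:, 1:]
token_logps = torch.log_softmax(logits, dim=-1).gather(-1, targets.unsqueeze(-1)).squeeze(-1)
response_logp = token_logps[:, prompt_ids.shape[1] - 1:].sum()

# One reward-weighted policy-gradient step (InstructGPT uses the PPO-clipped objective instead).
loss = -reward_model(prompt, response) * response_logp
optimizer.zero_grad()
loss.backward()
optimizer.step()
```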
2022: From text-continuation to chatbot
Ok nice, we now know how to align a model to output less terrible answers. But how do we make it able to follow a conversation?
Actually super easy, barely an inconvenience 😉. We use the same technique as before, but now we add conversations to the dataset. To do so, humans mimic conversations (in which they play both sides)¹⁷. Everything else is the same.
17 They get suggestions from the LLM to more efficiently provide answers.
After ChatGPT, a lot of other LLMs used ChatGPT answers to create alignment datasets, thus reducing the amount of work required from humans.
Epilogue
As I write this, OpenAI announced a new model: o1. They claim “it spends more time thinking through problems before it responds” but don’t give much information on how they do it. There is some speculation which looks interesting. I might add an entry about it in the future if I find reliable insights on how they do it. That’s it for today folks.
To create this blog I “consumed” and summarized these amazing resources:
- OpenAI blog: Learning from human preferences
- OpenAI blog: Aligning language models to follow instructions
- OpenAI blog: Introducing ChatGPT
- Deep reinforcement learning from human preferences paper
- PPO video by Edan Meyer.