Grab some popcorn 🍿! Today we’ll look at how OpenAI turned the monster one gets from training on raw internet data into something useful. We’ll walk through three key stages:
- Text-continuation model (GPT-3, June 2020)
- Instruction-following aligned model (InstructGPT, January 2022)
- Chatbot (ChatGPT, November 2022)
By “text-continuation model” I am referring to the models we get by “just” optimizing for next-token prediction on a huge corpus of generic text. The loss function used is categorical cross-entropy on the correct next token.
Without heavy prompt engineering or finetuning (like the one we will see in this post), these models aren’t good at question-answering and aren’t necessarily aligned with their users. As a result of being trained on huge amounts of unsupervised data from the internet, they often return untruthful, toxic, or biased outputs.
In this post, we’ll see how these problems got addressed.
2017: Reinforcement Learning from Human Preferences
Before directly jumping into LLMs, let’s first go back and review the paper RL from human preferences from 2017¹ (context: Tip 1).
1 This might look a bit disconnected from the main topic but bear with me.
For me this was the golden age of Reinforcement Learning (RL) Tip 2:
New algorithms were being published left and right to solve OpenAI’s Gym environment tasks.
AlphaGo² from DeepMind had just defeated Lee Sedol at Go: a game with a search space so big that it was considered almost impossible for a machine to achieve human-level performance.
DeepMind and OpenAI were battling to master StarCraft and Dota respectively with RL-based algorithms.
OpenAI was about to publish some research on impressive dexterity learning skills, and agents learning to play hide and seek.
2 I can’t recommend enough the linked documentary, super emotional portrait of human vs machine capabilities.
3 I’m actually not sure how conscious they were of it at that moment…
4 For instance, think of gymnastics: for most of us it is very hard to demonstrate what a good move looks like 🤸♀️. However, it is very easy to tell apart a pathetic one from a decent one.
RL from human preferences laid the foundations of what would become key to later align language models³. In particular, they focus on hard-to-specify but easy-to-judge tasks: oftentimes we can recognize a desired behavior but not necessarily demonstrate it⁴.
Reinforcement Learning (RL) is a branch of Machine Learning concerned with how artificial agents ought to take actions in an environment in order to maximize some notion of reward. The objective for the agent is to learn an optimal policy (how to act in the given environment so that reward is maximized).
Arguably, this is what living beings also evolved to do: Act in a way that maximizes personal well-being and species continuation.
During my master’s I became a bit obsessed with RL (and the reward hypothesis) and I have a lot of posts, paper summaries, and experiments around it in my old blog: CampusAI/RL.
How does it work?
Consider a basic RL problem for which, for whatever reason⁵, we don’t have a specified reward function.
5 Such as it being too hard, too sparse or too open-ended.
6 E.g. a high reward for a good action, a low one for a bad action.
7 Remember that in the context of RL, a trajectory is the sequence of states and actions the agent goes through in the environment.
One solution would be to have a human manually giving feedback (aka rewards) to the agent⁶. While this would work, it isn’t sample efficient: it takes too many human inputs, plus it might be hard to consistently give an absolute value to a particular trajectory⁷.
Their proposal is to train a new model (the reward predictor) which approximates the reward function⁸ in a supervised way. It is trained on pairs of agent trajectories labeled with a human “preferability” judgment, and learns to output a score for each trajectory. The agent then updates its policy according to the reward given by this reward predictor model. This generalizes from few training examples, so the human doesn’t need to label as much data 🥳.
8 Mapping from trajectories to desirability.
This is the overall idea:
This is all very nice conceptually, but a question remains: how does one map “I prefer this one over that one” to an actual scalar that can be used as a reward?
Simple: we assume that this scalar exists and we approximate it. You can think of it as some latent factor $\hat{r}$ assigning a total “desirability” to each trajectory $\sigma$:

$$\hat{r}(\sigma) = \sum_t \hat{r}(o_t, a_t)$$

Where $\hat{r}(o_t, a_t)$ is the predicted reward for the observation-action pair at step $t$.

We can then write the probability of preferring trajectory $\sigma^1$ over $\sigma^2$ as a softmax of the latent factor (expected reward):

$$\hat{P}\left[\sigma^1 \succ \sigma^2\right] = \frac{\exp \hat{r}(\sigma^1)}{\exp \hat{r}(\sigma^1) + \exp \hat{r}(\sigma^2)}$$

To learn the reward (latent factor) we can minimize a cross-entropy loss on the human preference labels $\mu$:

$$\mathcal{L}(\hat{r}) = - \sum_{(\sigma^1, \sigma^2, \mu)} \mu(1) \log \hat{P}\left[\sigma^1 \succ \sigma^2\right] + \mu(2) \log \hat{P}\left[\sigma^2 \succ \sigma^1\right]$$
SUMMARY: We are doing standard binary classification and using the logits as the reward value.
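To make this concrete, here is a minimal sketch of that pairwise loss in PyTorch (my own illustration, not OpenAI’s code; `reward_model` stands in for any network mapping a trajectory, or later a prompt-response pair, to a single scalar):

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_model, preferred, rejected):
    """Pairwise (Bradley-Terry) loss: the preferred trajectory should get the higher scalar reward.

    `preferred` and `rejected` are batches of trajectories already encoded as
    tensors; `reward_model` maps each one to a single logit (the latent reward).
    """
    r_preferred = reward_model(preferred)  # shape: (batch, 1)
    r_rejected = reward_model(rejected)    # shape: (batch, 1)
    # -log sigmoid(r_p - r_r) is exactly the cross-entropy of the two-way softmax
    # above, with the human-preferred option as the "correct class".
    return -F.logsigmoid(r_preferred - r_rejected).mean()
```

Note how the difference of the two predicted rewards plays the role of the logit of a binary classifier, which is exactly what the summary above says.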
The reward predictor and the main agent’s policy are trained simultaneously. Thus, to train the agent they focus on RL methods which are robust to changes in the reward function, such as policy gradient methods. These methods directly learn the best policy, so a change in reward isn’t as critical as in methods that attempt to learn the “value” of states or actions, such as value-based or model-based methods.
Consider a simulated robot whose goal is to learn how to do a backflip. Interestingly, it was able to learn this trick with 1 hour of non-expert human feedback. By contrast, it took an expert human 2 hours to write a reward function which achieved the same goal.
2021: From text-continuation to instruction-following
Alright, fast forward to the Large Language Model (LLM) era (aka 2021). We now have very powerful LLMs which do next-token prediction very well (such as GPT-3). These are great starting points to finetune for different tasks, one of which is question-answering.
How do we make sure their answers are factually correct, unbiased, and aligned with our intentions? This actually falls into the realm of things that are hard to define but (relatively) easy to evaluate. Wait!? Didn’t we just talk about a possible solution to this type of problem? 😱🤯 What were the chances??⁹
9 Actually quite high, given I planned this minimally.
Anyway, this is how they do it:
- Step 0: Create a dataset of instructions and desired answers (humans do this)¹⁰.
- Step 1: Good old supervised finetuning (SFT) of the Language Model (LM) using the previous dataset.
- Step 2: Sample prompts from the dataset and generate several answers per prompt with the finetuned model¹¹. Humans rank these answers from best to worst, and a reward model is trained on these comparisons¹², exactly as in the 2017 paper.
- Step 3: Further finetune the LM with RL:
- Sample questions from the dataset and generate new answers using the finetuned model.
- Forward these question-answer pairs through the reward model, obtaining a “score” (aka reward) for each one.
- Adjust the LM weights so that high-score answers are more likely than low-score ones using an RL algorithm. The algorithm they used is Proximal Policy Optimization (PPO) Tip 5. A minimal sketch of this loop follows below.
10 Allegedly (and sadly not-very-surprisingly) under bad working conditions.
11 I assume they do this by randomizing the model outputs a bit. For instance, by playing with the slack given by temperature or other sampling techniques (i.e. not greedily choosing the most likely next token).
12 Same as if they were agent trajectories in an environment as we saw previously.
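If the steps above feel abstract, here is a rough end-to-end sketch of Steps 2 and 3 in Python. Everything in it is a hypothetical stand-in (`sft_generate`, `human_rank`, `train_reward_model`, `ppo_update` are made-up stubs showing the data flow, not a real API):

```python
import random

prompts = ["Explain gravity", "Write a haiku about RL"]  # toy instruction dataset

def sft_generate(prompt, n=4):
    """Stub for the supervised-finetuned LM returning n candidate answers."""
    return [f"{prompt} -> answer #{i}" for i in range(n)]

def human_rank(answers):
    """Stub for the human labelers ranking answers from best to worst."""
    return sorted(answers, key=lambda _: random.random())

def train_reward_model(comparisons):
    """Stub for Step 2: fit a model on (better, worse) pairs, e.g. with the pairwise loss above."""
    return lambda prompt, answer: random.random()  # maps (prompt, answer) -> scalar score

def ppo_update(policy, prompt, answer, score):
    """Stub for Step 3: nudge the LM so that high-score answers become more likely."""
    pass

# Step 2: collect human comparisons and train the reward model.
comparisons = []
for prompt in prompts:
    ranked = human_rank(sft_generate(prompt))
    comparisons += [(ranked[i], ranked[j])  # every (better, worse) pair in the ranking
                    for i in range(len(ranked)) for j in range(i + 1, len(ranked))]
reward_model = train_reward_model(comparisons)

# Step 3: RL finetuning loop.
policy = "the SFT model"  # placeholder
for prompt in prompts:
    answer = sft_generate(prompt, n=1)[0]
    score = reward_model(prompt, answer)
    ppo_update(policy, prompt, answer, score)
```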
Alright, let’s shift into first gear… This is the steepest part of the post. It might seem like going a bit too deep, but I still wanted to refresh my memory on these algorithms (they are very cool).
Basics
Consider a standard RL problem¹⁴ defined by:
- A state space $\mathcal{S}$: the set of all possible states the agent can be in.
- An action space $\mathcal{A}$: the set of all possible actions the agent can take in the given environment.
- A reward function $r(s_t, a_t)$: the reward given for taking action $a_t$ in state $s_t$ at time $t$.
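To make the notation concrete, here is a minimal interaction loop using the Gymnasium library (CartPole is just an arbitrary example environment): states come from `env.reset()` / `env.step()`, actions live in `env.action_space`, and the reward is returned by the environment at every step.

```python
import gymnasium as gym

env = gym.make("CartPole-v1")           # the environment defines S, A and r
state, info = env.reset(seed=0)         # initial state s_0

total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()  # a_t (here chosen by a random policy)
    state, reward, terminated, truncated, info = env.step(action)  # r(s_t, a_t) and s_{t+1}
    total_reward += reward
    done = terminated or truncated

env.close()
print(f"Total reward of this trajectory: {total_reward}")
```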
Policy Gradient Methods
Consider an explicit, differentiable policy parametrized¹³ by $\theta$: $\pi_\theta(a \mid s)$.

Usually we write the objective as the expected return:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[R(\tau)\right]$$

Policy gradient methods find the optimal policy parameters $\theta^\star$ by performing gradient ascent directly on this objective:

$$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$$

Where $\tau = (s_0, a_0, s_1, a_1, \dots)$ is a trajectory obtained by running the policy in the environment, and $R(\tau) = \sum_t r(s_t, a_t)$ is its total reward.

Exploiting the log gradient trick we obtain that:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(\tau)\, R(\tau)\right]$$

Then, by plugging in $\pi_\theta(\tau) = p(s_0) \prod_t \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$ (the environment dynamics don’t depend on $\theta$, so they vanish under the gradient), we get the REINFORCE estimator:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\left(\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right) R(\tau)\right]$$
This is roughly how the algorithm looks; below is a minimal sketch (not the exact pseudocode from the paper):
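The sketch assumes PyTorch and the CartPole environment from the Gymnasium example above; it samples a trajectory with the current policy, computes its total reward, and takes one gradient step on the REINFORCE objective:

```python
import gymnasium as gym
import torch
from torch import nn, optim
from torch.distributions import Categorical

env = gym.make("CartPole-v1")
# A tiny policy network pi_theta(a|s): maps a state (4 numbers) to a distribution over 2 actions.
policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
optimizer = optim.Adam(policy.parameters(), lr=1e-2)

for episode in range(500):
    log_probs, rewards = [], []
    state, _ = env.reset()
    done = False
    while not done:                                   # 1. sample a trajectory with the current policy
        dist = Categorical(logits=policy(torch.as_tensor(state, dtype=torch.float32)))
        action = dist.sample()
        log_probs.append(dist.log_prob(action))       # log pi_theta(a_t|s_t)
        state, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(reward)
        done = terminated or truncated

    total_return = sum(rewards)                             # 2. R(tau): total reward of the trajectory
    loss = -torch.stack(log_probs).sum() * total_return     # 3. -log pi_theta(tau) * R(tau), negated for ascent
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```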
Actor-Critic Methods
Purely policy-gradient-based methods such as REINFORCE suffer from high variance in their gradient estimates in practice. This makes learning unstable and slow.
Oftentimes, they get combined with actor-critic methods. You won’t believe it, but those methods actually have two components:
- The actor: the policy function $\pi_\theta(a \mid s)$ that selects the actions based on the current state.
- The critic: a function evaluating how good the state (or the action being taken) is. Usually $V(s)$ if evaluating the state-value, $Q(s, a)$ if evaluating the action-value.
In the case of TRPO and PPO, they make use of the advantage function as a critic (A2C paper):

$$A(s, a) = Q(s, a) - V(s)$$
It models how advantageous each action is in a given state.
It turns out we can measure how advantageous some new parameters $\theta'$ are with respect to the current ones $\theta$ using only the advantage function of the current policy (the so-called performance difference lemma):

$$J(\theta') - J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta'}}\left[\sum_t \gamma^t A^{\pi_\theta}(s_t, a_t)\right]$$

Where $\gamma$ is the discount factor. Iteratively finding better parameters that maximize this difference is known as Policy Improvement. It can be shown that this improvement objective is actually the same one as in Policy Iteration, thus guaranteeing convergence to the global optimum. Thus, we want to optimize:

$$\max_{\theta'}\; \mathbb{E}_{\tau \sim \pi_{\theta'}}\left[\sum_t \gamma^t A^{\pi_\theta}(s_t, a_t)\right]$$
TRPO (Trust Region Policy Optimization)
TRPO (2015) was one of the first very successful policy gradient methods, achieving much better sample efficiency than its value-based counterparts. Its main contribution is to turn the algorithm from an on-policy method into an off-policy method¹⁵. It does so by taking advantage of importance sampling:

$$\mathbb{E}_{x \sim p}\left[f(x)\right] = \mathbb{E}_{x \sim q}\left[\frac{p(x)}{q(x)} f(x)\right]$$
The objective function then becomes:

$$\max_\theta\; \mathbb{E}_{s, a \sim \pi_{\theta_{old}}}\left[\frac{\pi_\theta(a \mid s)}{\pi_{\theta_{old}}(a \mid s)}\, A^{\pi_{\theta_{old}}}(s, a)\right]$$

Subject to:

$$\mathbb{E}_{s \sim \pi_{\theta_{old}}}\left[D_{KL}\left(\pi_{\theta_{old}}(\cdot \mid s)\,\|\,\pi_\theta(\cdot \mid s)\right)\right] \le \delta$$

Where $\pi_{\theta_{old}}$ is the policy that collected the trajectories, $D_{KL}$ is the Kullback-Leibler divergence, and $\delta$ is a hyperparameter limiting how far the new policy is allowed to move away from the old one (the “trust region”).
This algorithm works pretty well, but the constrained optimization is hard to implement.
PPO (Proximal Policy Optimization)
PPO (2017) addresses the previous issue by introducing a slight change in the objective function. Let’s first define the probability ratio:

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}$$

The objective function of PPO is¹⁶:

$$L^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\, \hat{A}_t,\; \text{clip}\left(r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon\right) \hat{A}_t\right)\right]$$

Where $\hat{A}_t$ is the estimated advantage at timestep $t$ and $\epsilon$ is a small hyperparameter (e.g. 0.2).
The idea is to discourage the model from making big policy changes in a single update by limiting the objective it can obtain (done by the clip and the min operators).
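In code, the clipped objective is just a few lines. Here is a minimal PyTorch sketch (tensor names are mine): it computes the loss for a batch of timesteps, given the log-probabilities of the taken actions under the old and the new policy and the estimated advantages.

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, epsilon=0.2):
    """Clipped PPO surrogate (negated so it can be minimized with gradient descent).

    logp_new:   log prob of the taken actions under the policy being optimized
    logp_old:   log prob under the policy that collected the data (detached)
    advantages: the estimated advantages A_t
    """
    ratio = torch.exp(logp_new - logp_old)                      # r_t(theta)
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon)  # clip(r_t, 1-eps, 1+eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```

Because large ratios stop contributing to the objective, the same batch of trajectories can safely be reused for several optimization epochs, which is where much of PPO’s sample efficiency comes from.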
16 If you are trying to parse this, keep in mind that the clip operator simply constrains $r_t(\theta)$ to the interval $[1 - \epsilon, 1 + \epsilon]$.
15 Essentially allowing the model to learn from trajectories sampled by a policy different from the one currently being trained.
14 I’m simplifying some things like considering finite state-spaces, action-spaces, and trajectories.
13 E.g. by the weights of an ANN.
Ok, so we learned how the PPO algorithm works for agents interacting with a generic environment Tip 5. Let’s now see how this maps onto today’s problem (aligning LLMs):
- State space: the input prompt given to the model.
- Action space: the response given by the model.
- Reward function $r(\text{prompt}, \text{response})$: provided by the reward estimator; given a prompt and a model output, it returns a scalar of “how good” it thinks the pairing is.
- Policy: now the parametrized policy is the LLM itself: an “agent” which takes an input state (prompt) and outputs an action (response).
Notice that the temporal aspect from before is now gone: we are dealing with a one-step (static) problem. We don’t have a notion of “trajectories” we need to maximize the reward over. With all this in mind, the reward expectation looks something like this:

$$J(\theta) = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\left[r_\phi(x, y)\right]$$

Where $x$ is a prompt sampled from the dataset $\mathcal{D}$, $y$ is a response sampled from the current LLM policy $\pi_\theta$, and $r_\phi$ is the reward model. I assume they estimate this expectation by Monte Carlo: sampling a batch of prompts and generating a response for each one with the current policy.
Thus, we can just do a gradient step as we would in the PPO case Tip 5. If by now you are thinking something like “Wait… What are we even doing here? Can whatever-this-is be called RL??” Don’t worry, you are not alone. Here you have some twitter drama.
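To tie the mapping together, here is a heavily simplified sketch assuming PyTorch and a Hugging Face causal LM (gpt2 and the toy reward function are placeholders, and I use a plain reward-weighted log-likelihood step instead of the full PPO-clipped update from above): the “action” is the whole generated response, its log-probability is the sum of the per-token log-probabilities, and the reward model provides the scalar that weights the update.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")      # stand-in for the SFT model
policy = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-5)

def reward_model(prompt, response):
    """Placeholder for the trained reward model: (prompt, response) -> scalar."""
    return torch.tensor(1.0 if "because" in response else -1.0)  # toy heuristic

prompt = "Why is the sky blue?"
prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids

# "Action": sample a full response from the current policy.
with torch.no_grad():
    full_ids = policy.generate(prompt_ids, max_new_tokens=30, do_sample=True,
                               pad_token_id=tokenizer.eos_token_id)
response = tokenizer.decode(full_ids[0, prompt_ids.shape[1]:])

# log pi_theta(response | prompt) = sum of the log-probs of the generated tokens.
logits = policy(full_ids).logits[:, :-1, :]            # predictions for tokens 1..T
targets = full_ids[:, 1:]
token_logps = torch.log_softmax(logits, dim=-1).gather(-1, targets.unsqueeze(-1)).squeeze(-1)
response_logp = token_logps[:, prompt_ids.shape[1] - 1:].sum()

# One reward-weighted policy-gradient step (InstructGPT uses the PPO-clipped objective instead).
loss = -reward_model(prompt, response) * response_logp
optimizer.zero_grad()
loss.backward()
optimizer.step()
```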
2022: From text-continuation to chatbot
Ok nice, we now know how to align a model to output less terrible answers. But how do we make it able to follow a conversation?
Actually super easy, barely an inconvenience 😉. We use the same technique as before, but now we add conversations to the dataset. To do so, humans mimic conversations (in which they play both sides)¹⁷. Everything else is the same.
17 They get suggestions from the LLM to more efficiently provide answers.
After ChatGPT, a lot of other LLMs used ChatGPT answers to create alignment datasets, thus reducing the amount of work required from humans.
Epilogue
As I write this, OpenAI announced a new model: o1. They claim “it spends more time thinking through problems before it responds” but don’t give much information on how they do it. There is some speculation which looks interesting. I might add an entry about it in the future if I find reliable insights on how they do it. That’s it for today folks.
To create this blog I “consumed” and summarized these amazing resources:
- OpenAI blog: Learning from human preferences
- OpenAI blog: Aligning language models to follow instructions
- OpenAI blog: Introducing ChatGPT
- Deep reinforcement learning from human preferences paper
- PPO video by Edan Meyer.