Reinforcement learning with human feedback
Posted on Fri 12 May 2023 in reinforcement learning Updated: Fri 12 May 2023 • 5 min read
In basic terms, reinforcement learning (RL) is a technique that allows machine learning models to learn through trial and observation. It is especially useful when we cannot clearly define what the optimal behavior for our model is - ChatGPT, for example. There is no single optimal answer; instead, many different answers could give the user a good experience. It is also a sequential decision-making problem, where at each step we predict what the next word in our text should be. Although we do not know the optimal answers the model should provide, we can still evaluate its behavior as good or bad (according to our "human preference" over the generated texts).
That is why RL is very well suited for language models. A few concepts should be defined before we start digging into RLHF. Our large language model is considered the agent. The chat interface we interact with is the environment. The model performs actions, which in our case means predicting text sequentially. It acts according to a policy - essentially the way the model maps the current state of the environment to a response. Finally, we can optimize the policy by giving a positive or negative reward whenever it makes good or bad decisions. An episode is defined as all the steps the agent takes during a single run.
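To make these terms concrete, here is a minimal toy sketch of the agent-environment loop. The `ToyTextEnvironment` and `ToyAgent` classes and their reward rule are hypothetical, invented purely to illustrate the vocabulary (state, action, policy, reward, episode); they are not part of any real RL library.

```python
import random

class ToyTextEnvironment:
    """Hypothetical environment: rewards the agent for emitting the token 'good'."""
    def reset(self):
        self.steps = 0
        return ""                                   # initial state: empty text

    def step(self, action):
        self.steps += 1
        reward = 1.0 if action == "good" else -1.0  # toy reward signal
        done = self.steps >= 5                      # the episode ends after 5 actions
        return action, reward, done                 # next state, reward, done flag

class ToyAgent:
    """Hypothetical agent with a trivial random policy over two tokens."""
    def policy(self, state):
        return random.choice(["good", "bad"])       # pick the next "word"

def run_episode(env, agent):
    """One episode: all the steps the agent takes during a single run."""
    state, total_reward, done = env.reset(), 0.0, False
    while not done:
        action = agent.policy(state)                # the policy maps state -> action
        state, reward, done = env.step(action)      # the environment gives feedback
        total_reward += reward
    return total_reward

print(run_episode(ToyTextEnvironment(), ToyAgent()))
```

In RLHF the loop has the same shape, but the environment feedback ultimately comes from a model trained on human preferences rather than a hand-written rule.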
Let's say you work at a company in the game industry that needs to develop bots to fight against players. You don't want fragile bots, since they may disrupt the player's journey during a match, so you define metrics such as bot mortality rate to make the bots (our agents) a little more challenging. The bots start performing actions according to the policy - i.e., given the current state of my environment, what should the next step be - until they perform better according to the specified metric, and the reward system preserves the characteristics that led to that improvement. A bot may learn that a certain weapon is more powerful or that avoiding crowded places leads to longer survival time.
Check out this PUBG bots review
But where do humans come into play? For examples such as the one above, the metric the model optimizes is well aligned with the model's final purpose. Unfortunately, when generating text there is no metric that is truly aligned with human preferences, which is why we bring human feedback into the loop.
In applying reinforcement learning to language models, the context comes from the need to evaluate models on human feedback and improve them accordingly. The variety of tasks these models are asked to perform requires truthfulness, creativity, and spontaneity, and automatic metrics very often correlate poorly with human judgment of these qualities.
Usually, the metrics used to evaluate language models (BLEU, ROUGE, perplexity, and so on) don't reflect human preference, so the models produced and measured by these evaluations were misaligned with their purpose. The main goal of reinforcement learning with human feedback is to take a model that already knows a lot about language in general (such as GPT-3) and improve its responses according to human preference.
The experiments conducted by the OpenAI team in the InstructGPT paper showed that a 100x smaller model (1.3B parameters) could produce outputs preferred over those of GPT-3 (175B parameters). Annotators also found that the technique improved truthfulness on the TruthfulQA benchmark and showed small improvements in reducing the toxicity seen in GPT-3.
Using RLHF involves multiple training steps, which are:
- Having an excellent pre-trained language model (GPT-3.5 for the first version of ChatGPT)
- Collecting data and training a reward model
- Finetuning the language model with reinforcement learning
Different kinds of data can be used to augment the model. For ChatGPT, OpenAI used human-generated text that was judged "preferable." In contrast, the Anthropic model used other criteria to select its finetuning data, according to helpful, honest, and harmless parameters it defined.
Then we have to finetune the reward model. The data used for this model combines both humans and machines in the process. OpenAI, for example, took prompts that users had sent to previous GPT models and asked a language model to generate several responses for each of those prompts. Humans were then asked to evaluate the generations and rank them by preference (which answer best addresses what the prompt asked). An initial idea was to have annotators assign "scores" directly to each generated text, but that process turned out to be highly subjective and produced a lot of noisy data. They therefore moved forward with ranking the model outputs and only afterwards converting the ranks into scalars - this is crucial for integrating existing reinforcement learning algorithms into the system seamlessly.
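As a rough illustration of how one human ranking becomes reward-model training data, the sketch below turns a best-first ranking of K completions into K-choose-2 preference pairs. The function name, field names, and example strings are hypothetical; they only show the idea of converting a subjective ranking into pairwise comparisons.

```python
from itertools import combinations

def ranking_to_pairs(prompt, ranked_completions):
    """Turn one human ranking (best first) into K-choose-2 preference pairs.

    Each pair (prompt, chosen, rejected) becomes one training example for the
    reward model; this is how a subjective ranking is converted into a signal
    the model can learn from.
    """
    pairs = []
    for winner, loser in combinations(ranked_completions, 2):
        # combinations preserves list order, so `winner` is always ranked above `loser`
        pairs.append({"prompt": prompt, "chosen": winner, "rejected": loser})
    return pairs

# Example: one prompt with K = 3 completions ranked by an annotator
pairs = ranking_to_pairs(
    "Explain RLHF in one sentence.",
    ["answer ranked 1st", "answer ranked 2nd", "answer ranked 3rd"],
)
print(len(pairs))  # 3 pairs = C(3, 2)
```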
To finetune our model with RL, we follow the steps below (a small numeric sketch of steps 3-5 follows the list):
1. Take a prompt x from our dataset
2. Generate text from both the policy model (the copy being finetuned) and our initial LM
3. Pass the text generated by the policy model into the reward model and take its "score"
4. Compare the two generations using the Kullback-Leibler divergence (which measures how far the policy's per-token distributions have drifted from the initial LM's)
NOTE: This is very important: without it, our policy model could start generating text that fools the reward model without being truthful.
5. Combine the divergence penalty and the "preference" score into the reward used to update the tuned model with the RL policy
6. Update the RL policy model (InstructGPT uses PPO for this step)
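Here is a minimal numeric sketch of how the preference score and the KL penalty are combined into a single reward. The per-token log-probabilities, the `beta` coefficient, and the function names are hypothetical toy values, not the actual training code; the KL term is the usual per-sample approximation (log-prob under the policy minus log-prob under the frozen initial LM, summed over generated tokens).

```python
def per_token_kl(policy_logprobs, ref_logprobs):
    """Approximate KL between the tuned policy and the frozen initial LM,
    summed over the generated tokens (log p_policy - log p_ref)."""
    return sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))

def rlhf_reward(preference_score, policy_logprobs, ref_logprobs, beta=0.02):
    """Combine the reward model's preference score with a KL penalty.

    The penalty keeps the policy close to the initial LM so it cannot drift
    into text that fools the reward model. beta is a hypothetical penalty
    coefficient; real systems tune or adapt it.
    """
    kl = per_token_kl(policy_logprobs, ref_logprobs)
    return preference_score - beta * kl

# Toy numbers standing in for one generated response
policy_lp = [-1.2, -0.8, -2.1]   # log-probs of generated tokens under the tuned policy
ref_lp    = [-1.5, -1.0, -2.0]   # log-probs of the same tokens under the frozen LM
print(rlhf_reward(preference_score=0.7, policy_logprobs=policy_lp, ref_logprobs=ref_lp))
```

This combined reward is what the RL algorithm (PPO in InstructGPT) maximizes when it updates the policy in step 6.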
More specifically, the loss function used to train the reward model is:

$$\text{loss}(\theta) = -\frac{1}{\binom{K}{2}} \, E_{(x,\, y_w,\, y_l) \sim D}\Big[\log\big(\sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\big)\Big]$$

Here $r_\theta(x, y)$ is the reward model's scalar score for prompt $x$ and completion $y$ ($y_w$ is the preferred output of the pair, $y_l$ the less preferred one), $\sigma$ is the sigmoid function, $D$ is the dataset of human comparisons, and $K$ is the number of candidate completions shown to labelers to rank for a single prompt.
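The sketch below computes this loss for a single prompt in plain Python, assuming the reward model's scalar scores are already available and ordered best-first; the scores themselves are made-up toy values.

```python
import math
from itertools import combinations

def reward_model_loss(scores_by_rank):
    """Pairwise ranking loss for one prompt, following the formula above.

    `scores_by_rank` are the reward model's scalar scores r_theta(x, y) for the
    K completions, ordered from most preferred (y_w) to least preferred (y_l).
    The loss averages -log(sigmoid(r(y_w) - r(y_l))) over all K-choose-2 pairs.
    """
    pairs = list(combinations(scores_by_rank, 2))    # (winner, loser) score pairs
    total = 0.0
    for r_w, r_l in pairs:
        sigmoid = 1.0 / (1.0 + math.exp(-(r_w - r_l)))
        total += -math.log(sigmoid)                  # penalize pairs ranked "backwards"
    return total / len(pairs)

# Toy scores for K = 3 completions (hypothetical values)
print(reward_model_loss([1.8, 0.5, -0.3]))
```

Intuitively, the loss is small when the reward model scores the preferred completion well above the rejected one, and grows when the ordering is violated.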
Limitations
Even after these improvements, the models still have limitations that should be pointed out and handled. They can produce untruths while still sounding convincing, creating potential harms such as fake news. Another limitation is that the models may produce very different results after only a slight rephrasing of the prompt, so the user may need to "craft" optimal prompts to get the desired results - although this has improved a lot since GPT-3. Finally, when the model receives an ambiguous prompt, instead of interacting with the user to clarify it (by asking clarifying questions) and therefore producing a more accurate response, it tries to guess what the prompt is asking for. This may produce text misaligned with the actual user need and lead to a poor experience.
References
- Hugging Face RLHF blog post
- InstructGPT paper
- OpenAI ChatGPT blog post
- OpenAI blog post on instruction following
- Blog post on RL basics