Reinforcement learning from human feedback (RLHF) for LLMs - Part 1

In this post we will introduce reinforcement learning from human feedback (RLHF), an important method used to improve the performance and alignment of modern large language models.
natural-language-processing
deep-learning
fine-tuning
Author

Pranath Fernando

Published

July 15, 2023

1 Introduction

In this post we will introduce reinforcement learning from human feedback (RLHF), an important method used to improve the performance and alignment of modern large language models.

In particular we will look at:

  • Describing how RLHF uses human feedback to improve the performance and alignment of large language models
  • Explaining how data gathered from human labelers is used to train a reward model for RLHF

2 Aligning models with human values

In earlier articles we examined fine-tuning methods in detail. The purpose of further training models on instructions, including parameter-efficient fine-tuning (PEFT) techniques, is to help them better understand human prompts and produce more human-like responses. This can result in a model that performs far better than its initial pre-trained base version and produces language that sounds more natural. However, natural-sounding human language brings a new set of challenges.

You’ve probably seen plenty of headlines by now about how badly large language models can behave. Problems include models responding in confrontational or aggressive tones, using toxic language in their completions, and divulging detailed information about dangerous topics. These issues arise because large models are trained on enormous amounts of text from the Internet, where such language appears regularly. Here are several examples of models behaving badly.

Let’s say you ask your LLM to tell you a knock-knock joke, and the model’s response is just “clap, clap”. While funny in its own way, it’s not really what you were looking for: the completion is not a helpful answer for the given task. Similarly, the LLM might give misleading or simply incorrect answers. If you ask the LLM about a disproven piece of health advice, such as coughing to stop a heart attack, the model should refute this story.

Instead, the model might respond with confidence and an outright false statement, which is definitely not the truthful and honest answer a person is looking for. Additionally, the LLM shouldn’t produce harmful completions, such as ones that are offensive, discriminatory, or that encourage illegal activity, as is the case when you ask the model how to hack your neighbor’s WiFi and it provides a working method in its response. Ideally, it would instead offer an answer that does not cause harm.

The acronym HHH, which stands for helpful, honest, and harmless, refers to a set of principles that guide developers in the responsible use of AI. Additional fine-tuning with human feedback helps to better align models with human preferences and to increase the helpfulness, honesty, and harmlessness of their completions. This additional training can also help to lower the toxicity of model responses and reduce the generation of incorrect information.

3 Reinforcement learning from human feedback (RLHF)

Consider the task of text summarisation, where you use the model to produce a short piece of text that captures the key ideas in a longer document. By giving the model examples of summaries created by humans, you hope to improve its ability to summarise. In 2020, OpenAI published research that explored the use of fine-tuning with human feedback to train a model to produce concise summaries of text articles.

Here, you can see that a model fine-tuned on human feedback produced better-quality responses than a pre-trained model, an instruction fine-tuned model, and even the reference human baseline. Reinforcement learning from human feedback, or RLHF for short, is a popular technique for fine-tuning large language models with human feedback.

As the name suggests, RLHF uses reinforcement learning, or RL for short, to fine-tune the LLM with human feedback data, producing a model that is better aligned with human preferences. You can use RLHF to make sure that your model generates outputs that maximise usefulness and relevance to the input prompt. Perhaps most importantly, RLHF can help reduce the potential for harm: you can train your model to avoid toxic language and to steer clear of topics that should be off-limits.

One potentially exciting application of RLHF is the personalisation of LLMs, where models learn the preferences of each individual user through a continuous feedback process. This could lead to new technologies such as customised AI assistants or personalised learning plans. But to see how these potential future uses might become practical, let’s start by looking more closely at how RLHF works. In case you aren’t familiar with reinforcement learning, here is a high-level summary of the key concepts. Reinforcement learning is a type of machine learning in which an agent learns to make decisions by taking actions in an environment, with the goal of maximising some notion of a cumulative reward.

In this framework, the agent continually learns from its experiences by taking actions, observing the resulting changes in the environment, and receiving rewards or penalties based on the outcomes of those actions. Through this iterative process, the agent gradually refines its strategy, or policy, to make better decisions and increase its chances of success.

Training a model to play tic-tac-toe is a useful way to illustrate these concepts. In this example, the agent is a model, or policy, acting as a tic-tac-toe player. Its objective is to win the game. The environment is the three-by-three game board, and the state at any moment is the current configuration of the board. The action space comprises all the possible positions a player can choose based on the current board state. The agent makes decisions by following a strategy known as the RL policy. As the agent takes actions, it collects rewards based on how effectively those actions move it towards a win.

The goal of reinforcement learning is for the agent to learn the optimal policy for a given environment, the one that maximises its rewards. This learning is iterative and involves trial and error. Initially, the agent takes a random action that leads to a new state. From this state, the agent goes on to explore subsequent states through further actions. The series of actions and corresponding states forms a playout, often called a rollout. As the agent accumulates experience, it gradually uncovers which actions yield the highest long-term rewards, ultimately leading to success in the game.
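To make these ideas concrete, here is a minimal Python sketch of the agent-environment loop described above, using a simplified tic-tac-toe setup. The random policy, the reward logic, and the rollout structure are illustrative placeholders rather than a full game implementation.

```python
import random

def random_policy(state, action_space):
    """Stand-in for the RL policy: pick any empty square at random."""
    return random.choice(action_space)

def is_winning(board):
    """Check rows, columns and diagonals of the flattened 3x3 board."""
    lines = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]
    return any(board[a] == board[b] == board[c] != " " for a, b, c in lines)

def play_rollout(policy):
    """One rollout: a sequence of (state, action, reward) tuples."""
    state = [" "] * 9                 # 3x3 board flattened into 9 cells
    trajectory = []
    for turn in range(9):
        action_space = [i for i, cell in enumerate(state) if cell == " "]
        action = policy(state, action_space)          # agent takes an action
        state[action] = "X" if turn % 2 == 0 else "O" # environment changes state
        reward = 1 if is_winning(state) else 0        # reward of 1 only if a line is completed
        trajectory.append((list(state), action, reward))
        if reward == 1:
            break
    return trajectory

rollout = play_rollout(random_policy)
print(f"Rollout length: {len(rollout)}, final reward: {rollout[-1][2]}")
```

A learning algorithm would use many such rollouts to update the policy towards actions that earn higher long-term rewards; here the policy stays random purely for illustration.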

Now let’s look at how the tic-tac-toe example maps onto the case of fine-tuning large language models with RLHF. In this case, the agent’s policy that guides the actions is the LLM, and its objective is to generate text that is perceived as being aligned with human preferences. This could mean, for example, that the text is helpful, accurate, and non-toxic. The environment is the context window of the model, the space in which text can be entered via a prompt. The state that the model considers before taking an action is the current context, that is, any text currently contained in the context window. The action here is the act of generating text.

Depending on the task specified by the user, this could be a single word, a sentence, or a longer piece of text. The action space is the token vocabulary, meaning all the possible tokens from which the model can choose to generate the completion. How an LLM decides to generate the next token in a sequence depends on the statistical representation of language that it learned during training. At any given moment, the action the model takes, meaning which token it selects next, depends on the prompt text in the context and the probability distribution over the vocabulary space. The reward is assigned based on how closely the completions align with human preferences.
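As a rough illustration of what “taking an action” means here, the sketch below samples the next token from a probability distribution over a tiny made-up vocabulary. The vocabulary and logits are invented for demonstration; a real LLM produces a distribution over tens of thousands of tokens conditioned on the full context.

```python
import numpy as np

# Tiny made-up vocabulary standing in for the model's action space.
vocab = ["the", "house", "hot", "open", "window", "fan", "<eos>"]

def softmax(logits):
    exp = np.exp(logits - np.max(logits))   # subtract max for numerical stability
    return exp / exp.sum()

# Pretend these logits came from the LLM after reading the prompt (the state).
logits = np.array([0.2, 0.1, 0.3, 1.5, 1.2, 1.0, 0.1])
probs = softmax(logits)

# Taking an action = selecting one token from the distribution over the vocabulary.
next_token = np.random.choice(vocab, p=probs)
print(dict(zip(vocab, probs.round(3))))
print("next token:", next_token)
```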

Determining the reward is more complicated than in the tic-tac-toe example because of the wide variation in human responses to language. One way to do this is to have a human evaluate each model completion against some alignment metric, such as determining whether or not the generated text is toxic. This feedback can be represented as a scalar value, either a zero or a one. The LLM weights are then updated iteratively to maximise the reward obtained from the human classifier, enabling the model to generate non-toxic completions.

However, obtaining human feedback can be time-consuming and expensive. As a practical and scalable alternative, you can use a second model, known as the reward model, to classify the outputs of the LLM and evaluate how closely they align with human preferences. You’ll start with a smaller number of human examples and train this secondary model using traditional supervised learning techniques. Once trained, you can use the reward model to assess the output of the LLM and assign it a reward value, which in turn is used to update the weights of the LLM and train a new human-aligned version.
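To give a sense of what scoring a completion with a reward model might look like, here is a sketch using the Hugging Face transformers library with a sequence-classification head as the reward model. The checkpoint name "my-org/reward-model" is a placeholder for a reward model you would have trained on your own preference data, and treating the single output logit as the scalar reward is one common convention rather than the only option.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder checkpoint: substitute a reward model trained on your preference data.
checkpoint = "my-org/reward-model"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
reward_model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=1   # a single scalar reward head
)

prompt = "My house is too hot."
completion = "You could open a window or turn on a fan to cool it down."

# Encode the prompt and completion as a text pair and score them.
inputs = tokenizer(prompt, completion, return_tensors="pt")
with torch.no_grad():
    reward = reward_model(**inputs).logits[0, 0].item()   # scalar reward value
print(f"reward: {reward:.3f}")
```

During RLHF fine-tuning, this scalar would feed into the policy-optimisation algorithm that updates the LLM’s weights.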

Exactly how the weights are updated as the model completions are assessed depends on the algorithm used to optimise the policy. In the context of language modelling, the sequence of actions and states is called a rollout, rather than the term playout used in classic reinforcement learning. The reward model is the central component of the reinforcement learning process: it encodes all of the preferences learned from human feedback, and it plays a central role in how the model updates its weights over many iterations. In the next section, you’ll see how this reward model is trained and how it is used to classify the model’s outputs during the reinforcement learning process.

4 Obtaining feedback from humans

The first step in fine-tuning an LLM with RLHF is to select a model to work with and use it to prepare a dataset for human feedback. The model you choose should already have some capability to carry out the task you are interested in, whether that’s text summarisation, question answering, or something else. In general, you may find it easier to start with an existing model that has already been fine-tuned across many tasks and has some general capabilities. You’ll then use this LLM together with a prompt dataset to generate a number of different responses for each prompt. The prompt dataset is composed of multiple prompts, each of which is processed by the LLM to produce a set of completions.
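Here is a brief sketch of what this generation step might look like, using the Hugging Face transformers library. The model "gpt2" is used only so the snippet runs; in practice you would start from an instruction-tuned model with some capability on your target task, and the prompts shown are placeholders.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Placeholder prompt dataset: each prompt gets several candidate completions.
prompts = ["My house is too hot.", "Summarize: The quick brown fox ..."]

feedback_dataset = []
for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        do_sample=True,            # sample so the completions differ from each other
        max_new_tokens=30,
        num_return_sequences=3,    # several completions per prompt, ready for ranking
        pad_token_id=tokenizer.eos_token_id,
    )
    completions = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
    feedback_dataset.append({"prompt": prompt, "completions": completions})

print(feedback_dataset[0]["completions"])
```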

The next step is to collect feedback from human labelers on the completions generated by the LLM. This is the human feedback portion of reinforcement learning with human feedback. First, you must decide what criterion you want the humans to assess the completions on. This could be any of the issues discussed so far, such as helpfulness or toxicity. Once you’ve decided, you ask the labelers to assess each completion in the dataset based on that criterion.

Let’s take a look at an example. In this case, the prompt is: “My house is too hot.” You pass this prompt to the LLM, which then generates three different completions.

The task for your labelers is to rank the three completions in order of helpfulness, from most helpful to least helpful. In this case, the labeler will most likely decide that completion two is the most helpful: it tells the user something that can actually cool their house, so they rank it first. Neither completion one nor completion three is particularly helpful, but the labeler will probably decide that three is the worse of the two, because the model actively disagrees with the user’s input. So the labeler ranks the first completion second and the last completion third.

This process is then repeated for many prompt-completion sets, building up a dataset that can be used to train the reward model that will eventually take over this task from the humans. The same prompt-completion sets are typically assigned to multiple human labelers to establish consensus and minimise the impact of any poor labelers, such as the third labeler in the example above, whose responses disagree with the others and may indicate that they misunderstood the instructions. This is actually an important point: the quality of the human feedback you receive can depend greatly on how well your instructions are written. Labelers are often drawn from demographic samples that represent a diverse range of worldviews.
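One simple way to establish consensus across labelers, sketched below, is to average the rank each labeler assigned to a completion and then re-rank by that mean. This is an illustration of the idea rather than the specific aggregation method used in the course, and the labeler IDs and rankings are made up.

```python
# Rankings from four hypothetical labelers; labeler_4 is an outlier who
# disagrees with the others (lower rank number = more preferred).
labeler_rankings = {
    "labeler_1": {"completion_1": 2, "completion_2": 1, "completion_3": 3},
    "labeler_2": {"completion_1": 2, "completion_2": 1, "completion_3": 3},
    "labeler_3": {"completion_1": 2, "completion_2": 1, "completion_3": 3},
    "labeler_4": {"completion_1": 1, "completion_2": 3, "completion_3": 2},  # outlier
}

completions = ["completion_1", "completion_2", "completion_3"]
mean_rank = {
    c: sum(ranks[c] for ranks in labeler_rankings.values()) / len(labeler_rankings)
    for c in completions
}
consensus_order = sorted(completions, key=lambda c: mean_rank[c])

print(mean_rank)        # completion_2 has the lowest (best) mean rank
print(consensus_order)  # ['completion_2', 'completion_1', 'completion_3']
```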

Here is an example of a set of instructions written for human labelers. These would be given to the labelers to read before starting the task and would remain available for them to refer back to as they work through the dataset. The instructions begin with the overall task the labeler must carry out: in this case, to choose the best completion for the prompt. The instructions then provide additional details to guide the labeler in completing the task.

In general, the more detailed these instructions are, the more likely it is that the labelers will understand the task at hand and perform it correctly. For instance, the second instruction item tells the labelers to base their decisions on their assessment of the correctness and informativeness of the response. They are advised that they can fact-check and look up additional information online.

Labelers are also given detailed guidance on what to do if they come across two completions that they believe to be equally correct and informative, known as a tie. The labelers are told that ranking two completions equally is acceptable, but that they should only do so occasionally.

A final instruction worth noting is what to do in the case of a nonsensical, confusing, or irrelevant response. In this situation, labelers should select the “F” option rather than rank the completion, so that low-quality responses can be easily removed. A thorough set of instructions like this one increases the likelihood that the responses will be of good quality and that individual labelers will carry out the task in a way that is consistent with one another. This helps ensure that the collection of labelled completions accurately reflects a consensus point of view.

Once your human labelers have completed their assessments of the prompt-completion sets, you have all the data you need to train the reward model, which you will use instead of humans to classify model completions during the reinforcement learning fine-tuning phase. Before you can begin training the reward model, however, you must convert the ranking data into a pairwise comparison of completions. In other words, every possible pair of completions for a given prompt should be classified, with a score of 0 or 1 assigned to each completion in the pair. In the example presented here, there are three completions to a prompt, and the human labelers assigned a ranking of 2, 1, and 3, as displayed, with 1 being the highest rank and denoting the most preferred response.

With three completions, there are three possible pairs: purple-yellow, purple-green, and yellow-green. In general, for N alternative completions per prompt, you will have N choose 2 combinations. For each pair, you assign a reward of 1 to the preferred response and a reward of 0 to the less preferred response. You then reorder the pairs so that the preferred completion comes first.

This is necessary because the reward model expects the preferred completion, referred to as y_j, to come first. Once this data restructuring is complete, the human responses are in the correct format for training the reward model. Note that while thumbs-up, thumbs-down feedback is often easier to collect than ranking feedback, ranked feedback gives you more prompt-completion data to train your reward model: as you can see, each human ranking of three completions yields three prompt-completion pairs.
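The sketch below shows one way this restructuring might be done in code: each ranked set of completions is expanded into its N choose 2 pairs, with the preferred completion (y_j) placed first in each pair. The completion texts and ranks are placeholders for illustration.

```python
from itertools import combinations

# One ranked set of completions (rank 1 = most preferred), placeholder text.
ranked = [
    {"text": "completion_purple", "rank": 2},
    {"text": "completion_yellow", "rank": 1},
    {"text": "completion_green",  "rank": 3},
]

pairs = []
for a, b in combinations(ranked, 2):
    # Reorder each pair so the preferred (lower-ranked number) completion comes first.
    preferred, rejected = (a, b) if a["rank"] < b["rank"] else (b, a)
    pairs.append({
        "y_j": preferred["text"],   # preferred completion, reward 1
        "y_k": rejected["text"],    # less preferred completion, reward 0
    })

for p in pairs:
    print(p)
# A ranking of 3 completions yields 3 pairwise training examples, as noted above.
```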

5 Acknowledgements

I’d like to express my thanks to the wonderful Generative AI with Large Language Models Course by DeepLearning.ai and AWS, which I completed, and acknowledge the use of some images and other materials from the course in this article.
