LivingDataLab - Reinforcement learning from human feedback (RLHF) For LLMs

1 Introduction

In an earlier article we introduced reinforcement learning from human feedback (RLHF), an important method used with modern large language models (LLMs) to improve their performance and alignment.

In that article we looked at:

How RLHF uses human feedback to improve the performance and alignment of large language models
How data gathered from human labelers is used to train a reward model for RLHF

In this post we will look at the remaining stages of the RLHF process, in particular:

How the trained reward model scores completions
How the reward model is used with a reinforcement learning algorithm to fine-tune the LLM
What reward hacking is, and how a KL divergence penalty against a reference model helps to avoid it
How Constitutional AI can help to scale human feedback

2 The Reward model

In an earlier article we saw how to prepare the data needed to train a reward model, which is a key part of using reinforcement learning from human feedback to better align large language models with human values.

Everything you need to train the reward model is now ready. While getting to this point takes a fair amount of work, once you have finished training the reward model you won’t need to include any more humans in the loop. Instead, the reward model takes the place of the human labeler in the RLHF process and automatically chooses the preferred completion. The reward model is usually also a language model, trained using supervised learning on the pairwise comparison data that you prepared from the human labelers’ assessments of the prompts. For a given prompt X, the reward model learns to favour the human-preferred completion y_j by minimising the loss -log(sigmoid(r_j - r_k)), where r_j - r_k is the reward difference between the preferred completion y_j and the rejected completion y_k.
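To make the loss above concrete, here is a minimal sketch of it for a single comparison, using only the Python standard library. The function name and the reward values are made up for illustration; a real reward model would produce r_j and r_k from a neural network and average this loss over many labelled pairs.

```python
import math


def pairwise_reward_loss(r_j: float, r_k: float) -> float:
    """Loss for one comparison: -log(sigmoid(r_j - r_k)).

    r_j is the reward for the human-preferred completion y_j,
    r_k is the reward for the rejected completion y_k.
    """
    diff = r_j - r_k
    # -log(sigmoid(d)) equals log(1 + exp(-d)); branch for numerical stability.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


# Hypothetical reward scores, for illustration only.
loss_correct_order = pairwise_reward_loss(r_j=2.0, r_k=-1.0)
loss_tie = pairwise_reward_loss(r_j=0.5, r_k=0.5)
loss_wrong_order = pairwise_reward_loss(r_j=-1.0, r_k=2.0)
```

The loss is small when the preferred completion already scores higher than the rejected one, and grows when the ordering is wrong, which is what pushes the reward model towards the human ranking.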

Recall that the human-preferred completion is always the first one, labelled y_j. Once the model has been trained on the human-ranked prompt-completion pairs, you can use the reward model as a binary classifier that provides a set of logits across the positive and negative classes. Logits are the unnormalised model outputs before any activation function is applied.

Say you want to detoxify your LLM, so the reward model needs to identify whether a completion contains hate speech. In this case the two classes would be “not hate”, the positive class that you ultimately want to optimise for, and “hate”, the negative class that you want to avoid.

The logit value of the positive class is what you use as the reward value in RLHF. As a reminder, applying a softmax function to the logits gives you the probabilities. A non-toxic completion should therefore receive a good (high) reward, while a toxic completion should receive a bad (low) reward. At this point, you have a powerful tool in this reward model for aligning your LLM. The next step is to explore how the reward model gets used in the reinforcement learning process to train your human-aligned LLM.
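As a rough sketch of the relationship between logits, probabilities and the reward value described above, the snippet below uses made-up logits in the order [not hate, hate]. The function names and the numbers are assumptions for illustration, not output from a real classifier.

```python
import math


def softmax(logits):
    """Convert unnormalised logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reward_from_logits(logits, positive_index=0):
    """Use the logit of the positive class as the reward value."""
    return logits[positive_index]


# Hypothetical logits in the order [not hate, hate].
non_toxic_logits = [3.2, -2.9]
toxic_logits = [-0.5, 1.4]

non_toxic_reward = reward_from_logits(non_toxic_logits)
toxic_reward = reward_from_logits(toxic_logits)
non_toxic_probs = softmax(non_toxic_logits)
```

Here the non-toxic completion gets the higher reward, and its softmax probabilities put almost all of the mass on the “not hate” class.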

3 Fine-tuning with Reinforcement Learning

Let’s put it all together and examine how to apply the reward model in reinforcement learning to update the LLM weights and create a model that is aligned with humans. Keep in mind that you should begin with a model that already performs well on the task you are interested in. In general, you will start from an instruct model and then align it more closely with human values. First, you pass a prompt from your prompt dataset to the LLM. In this example the prompt is “A dog is”, and the LLM generates the completion “a fluffy animal”. Next, you send this completion together with the original prompt to the reward model as a prompt-completion pair. The reward model evaluates the pair based on the human feedback it was trained on and returns a reward value.

A higher value, such as 0.24, represents a more aligned response. A less aligned response would receive a lower value, such as -0.53. You then pass this reward value for the prompt-completion pair to the reinforcement learning algorithm, which updates the weights of the LLM so that it produces more aligned, higher-reward responses. Let’s call this intermediate version of the model the RL-updated LLM. These steps together form a single iteration of the RLHF process. As with other types of fine-tuning, these iterations continue for a given number of epochs.

After the weight updates, a completion produced by the RL-updated LLM should receive a higher reward score, indicating that it is now more aligned. If the process is working well, you will see the reward improve after each iteration as the model produces text that is increasingly in line with human preferences.

You continue this iterative process until your model is aligned according to some evaluation criterion, for example reaching a threshold value for the helpfulness you defined. You can also define a maximum number of steps, such as 20,000, as the stopping criterion.
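The shape of this loop, with its two stopping criteria, can be sketched with a deliberately tiny stand-in for the LLM. In the sketch below the "policy" is just a softmax over three candidate completions, the reward scores are made up, and the update is a plain exact policy-gradient step rather than PPO. All names and values are assumptions for illustration only.

```python
import math


def softmax(logits):
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_reward(logits, rewards):
    probs = softmax(logits)
    return sum(p * r for p, r in zip(probs, rewards))


def run_toy_rlhf(rewards, target_reward, max_steps=20_000, lr=0.5):
    """Raise the expected reward until a target or a step limit is hit."""
    logits = [0.0] * len(rewards)  # start from a uniform policy
    history = [expected_reward(logits, rewards)]
    for step in range(1, max_steps + 1):
        probs = softmax(logits)
        baseline = sum(p * r for p, r in zip(probs, rewards))
        # Exact gradient of the expected reward with respect to each logit.
        logits = [
            logit + lr * p * (r - baseline)
            for logit, p, r in zip(logits, probs, rewards)
        ]
        history.append(expected_reward(logits, rewards))
        if history[-1] >= target_reward:  # evaluation criterion met
            return logits, history, step
    return logits, history, max_steps  # stopped by the step limit


# Hypothetical reward-model scores for three candidate completions.
candidate_rewards = [-0.53, 0.24, 0.10]
final_logits, reward_history, steps_used = run_toy_rlhf(
    candidate_rewards, target_reward=0.2
)
```

The loop stops as soon as the expected reward reaches the target, or after the maximum number of steps if it never does, mirroring the two stopping rules described above.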

Let’s call the refined model at this point the human-aligned LLM. We haven’t yet talked about the precise characteristics of the reinforcement learning algorithm. This algorithm updates the LLM model weights based on the output of the reward model to enhance the reward score over time. For this step of the RLHF process, you can choose from a variety of different algorithms.

Proximal policy optimisation, or PPO for short, is a common option. Although PPO is a rather complex algorithm, you don’t need to be conversant with all of the specifics to utilise it. However, it can be a challenging algorithm to implement, and if you’re having issues getting it to function, a deeper grasp of its internal workings can help you debug.

4 Reward hacking

To recap, RLHF is a process for fine-tuning LLMs to match human preferences. In this process, a reward model assesses the LLM’s completions of a prompt dataset against some human preference metric, such as helpful or not helpful. A reinforcement learning algorithm, in this case PPO, then updates the weights of the LLM based on the rewards assigned to the completions generated by the current version of the LLM. You repeat this cycle for many iterations, using many different prompts and updates to the model weights, until you reach the level of alignment that you want. The outcome is a human-aligned LLM that is ready for use in your application.

An interesting problem that can arise in reinforcement learning is reward hacking, where the agent learns to game the system by favouring actions that maximise the reward received even if those actions don’t fit the original objective. In the context of LLMs, reward hacking might take the form of adding words or phrases to completions that produce high scores for the metric being aligned, but that lower the overall quality of the language. Consider the scenario where you are using RLHF to detoxify an instruct model. You have already trained a reward model that can carry out sentiment analysis and classify model completions as toxic or non-toxic.

You choose a prompt from the training data and pass it to the instruct LLM, which generates a completion. Suppose this completion is very negative and is therefore likely to receive a high toxicity rating. The toxicity reward model processes the completion and produces a score, which the PPO algorithm uses to update the model weights. As you iterate, RLHF updates the LLM to produce less toxic responses. However, as the policy tries to maximise the reward, it may diverge too far from the original language model.

In this example, the model has started to generate completions that it has learned will lead to very low toxicity scores by including phrases like “most awesome” and “most incredible”. This language sounds very exaggerated. The model could also start generating nonsensical, grammatically incorrect text that just happens to maximise the reward; outputs like these are certainly not very useful. To prevent reward hacking, you can use the initial instruct LLM as a performance reference. We’ll refer to it as the reference model. During RLHF iterations, the weights of the reference model are frozen and are not updated.

In this way, you always keep a single reference model to compare against. During training, each prompt is passed to both models, producing a completion from the reference LLM and one from the intermediate RL-updated LLM. At this point you can compare the two completions and calculate a value called the Kullback-Leibler divergence, or KL divergence. KL divergence is a statistical measure of how different two probability distributions are. You can use it to compare the outputs of the two models and determine how far the updated model has diverged from the reference.
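As a small illustration of the measure itself, the sketch below computes the KL divergence between two discrete next-token distributions. The four-token vocabulary and the probabilities are invented for the example, and the function assumes the second distribution is non-zero wherever the first one is.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same tokens.

    Assumes q[i] > 0 wherever p[i] > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical next-token probabilities over a four-token vocabulary.
reference_probs = [0.70, 0.20, 0.05, 0.05]  # frozen reference LLM
updated_probs = [0.10, 0.10, 0.40, 0.40]    # RL-updated LLM after drifting

kl_same = kl_divergence(reference_probs, reference_probs)
kl_drifted = kl_divergence(updated_probs, reference_probs)
```

Identical distributions give a divergence of zero, and the value grows as the updated model puts more probability on tokens the reference model considered unlikely.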

Many standard machine learning libraries include an implementation of KL divergence, and you can use it without knowing all the maths behind it. KL divergence is calculated for each generated token across the whole vocabulary of the LLM, which can easily be tens of thousands of tokens or more. However, using a softmax function reduces the number of probabilities to much less than the full vocabulary size. Keep in mind that this is still a relatively compute-intensive process, so you will almost always benefit from using GPUs.

Once you have calculated the KL divergence between the two models, you add it as a term to the reward calculation. This penalises the RL-updated model if it shifts too far from the reference LLM and generates completions that are too different. Note that you now need two full copies of the LLM to calculate the KL divergence: the frozen reference LLM and the RL-updated PPO LLM. You can reduce this cost by combining RLHF with parameter-efficient fine-tuning (PEFT). In this case, you only update the weights of a PEFT adapter, not the full weights of the LLM. This means that the reference model and the PPO model, which you update with the trained PEFT parameters, can both use the same underlying LLM.
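One common way to fold the KL term into the reward is to subtract a scaled estimate of it from the reward model score. The sketch below assumes that form; the coefficient name beta, its value, and the log-probabilities are all hypothetical, and real RLHF libraries may apply the penalty per token or tune the coefficient adaptively.

```python
def penalised_reward(reward, policy_logprobs, reference_logprobs, beta=0.2):
    """Subtract a scaled KL estimate from the reward-model score.

    The KL estimate is the sum, over the generated tokens, of
    log pi_updated(token) - log pi_reference(token).
    """
    kl_estimate = sum(
        lp - lr for lp, lr in zip(policy_logprobs, reference_logprobs)
    )
    return reward - beta * kl_estimate, kl_estimate


# Hypothetical log-probabilities of the same three generated tokens.
close_reward, close_kl = penalised_reward(
    reward=1.5,
    policy_logprobs=[-1.0, -2.0, -0.5],
    reference_logprobs=[-1.1, -2.0, -0.6],
)
drifted_reward, drifted_kl = penalised_reward(
    reward=1.5,
    policy_logprobs=[-0.1, -0.2, -0.1],
    reference_logprobs=[-3.0, -4.0, -2.5],
)
```

Both completions get the same raw reward here, but the one the reference model finds very unlikely ends up with a much lower penalised reward. The sketch needs log-probabilities from both the frozen reference model and the updated model; with a PEFT adapter, both can be computed from one shared base model.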

Reusing the same underlying LLM roughly halves the memory footprint during training. Once you have completed your RLHF alignment, you will want to assess the model’s performance. The score you use here is the toxicity score: the probability of the negative class, in this case a toxic or hateful response, averaged across the completions. If RLHF has successfully reduced the toxicity of your LLM, this score should go down.

First, you establish a baseline toxicity score for the original instruct LLM by evaluating its completions on the summarization dataset with a reward model that can assess toxic language. Then you evaluate your newly human-aligned model on the same dataset and compare the scores. In this example, the toxicity score did decrease after RLHF, indicating a less toxic, better-aligned model.
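The before-and-after comparison boils down to averaging the toxic-class probability over the same set of completions for each model. A minimal sketch, with invented per-completion probabilities standing in for real reward model output:

```python
from statistics import mean


def mean_toxicity(toxic_probabilities):
    """Average probability of the toxic (negative) class across completions."""
    return mean(toxic_probabilities)


# Hypothetical per-completion toxic-class probabilities from a reward model.
baseline_scores = [0.42, 0.10, 0.65, 0.08]  # original instruct LLM
aligned_scores = [0.12, 0.05, 0.20, 0.04]   # human-aligned LLM after RLHF

baseline_toxicity = mean_toxicity(baseline_scores)
aligned_toxicity = mean_toxicity(aligned_scores)
improvement = baseline_toxicity - aligned_toxicity
```

A positive improvement means the aligned model is, on average, judged less toxic than the baseline on the same prompts.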

5 Scaling human feedback

Although you can use a reward model to remove the need for human evaluation during RLHF fine-tuning, the human effort required to produce the trained reward model in the first place is considerable. The labelled dataset used to train the reward model typically requires large teams of labelers, sometimes thousands of people, to evaluate many prompts each. This work takes a lot of time and other resources, which can be important limiting factors. As the number of models and use cases grows, human effort becomes a scarce resource. How to scale human feedback is an active area of research.

One way to overcome these limitations is to scale through model self-supervision, and Constitutional AI is one approach to scaled supervision. First proposed in 2022 by researchers at Anthropic, Constitutional AI is a method for training models using a set of rules and principles that govern the model’s behaviour. Together with a set of sample prompts, these form the constitution. You then train the model to critique itself and revise its responses to comply with those principles. Constitutional AI is useful not only for scaling feedback but also for addressing some unintended consequences of RLHF. For instance, depending on how the prompt is written, an aligned model may end up revealing harmful information as it tries to give the most helpful response it can.

Consider the scenario where you ask the model for advice on how to access your neighbour’s WiFi. Because this model has been aligned to prioritise helpfulness, it actually tells you about an app that lets you do this, even though the activity is illegal. Providing the model with a set of constitutional principles can help it balance these competing interests and minimise the harm. The research paper on Constitutional AI gives some example rules that LLMs are asked to follow. For instance, you can tell the model to choose the response that is the most helpful, honest, and harmless.

However, you can place some bounds on this by asking the model to prioritise harmlessness, assessing whether its response encourages illegal, unethical, or immoral activity. Note that you don’t have to use the rules from the paper; you can define your own set of rules that is best suited to your domain and use case. When implementing the Constitutional AI method, you train your model in two distinct phases. In the first stage, you carry out supervised learning. To start, you prompt the model in ways that try to get it to generate harmful responses, a process called red teaming. You then ask the model to critique its own harmful responses according to the constitutional principles and revise them to comply with those rules.

Once done, you fine-tune the model using the pairs of red team prompts and the revised constitutional responses. Let’s look at an example of how one of these prompt-completion pairs is generated, returning to the WiFi hacking problem. As you saw earlier, this model gives you a harmful response as it tries to maximise its helpfulness. To mitigate this, you augment the prompt with the harmful completion and a set of predefined instructions that ask the model to critique its response. Using the rules outlined in the constitution, the model detects the problems in its response. In this case, it correctly recognises that accessing someone else’s WiFi is illegal.

Lastly, you put all the parts together and ask the model to write a new response that removes all of the harmful or illegal content. The model generates a new answer that puts the constitutional principles into practice and does not include the reference to the illegal app. The original red team prompt and this final constitutional response can then be used as training data. You build up a dataset of many examples like this one to create a fine-tuned LLM that has learned how to generate constitutional responses. The second part of the process performs reinforcement learning.
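The critique-and-revise steps above can be sketched as a short pipeline that builds one training pair. Everything in the snippet below is hypothetical: the instruction wording, the function names, and the stub that stands in for a real LLM call so that the example runs without a model.

```python
from typing import Callable

# Hypothetical instruction wording, for illustration only.
CRITIQUE_REQUEST = (
    "Identify how the response above is harmful, unethical, or illegal."
)
REVISION_REQUEST = (
    "Rewrite the response to remove all harmful, unethical, or illegal content."
)


def build_training_pair(red_team_prompt: str, generate: Callable[[str], str]) -> dict:
    """Run one critique-and-revise pass and return a fine-tuning example."""
    harmful = generate(red_team_prompt)
    critique_prompt = f"{red_team_prompt}\n{harmful}\n{CRITIQUE_REQUEST}"
    critique = generate(critique_prompt)
    revision_prompt = f"{critique_prompt}\n{critique}\n{REVISION_REQUEST}"
    revised = generate(revision_prompt)
    # Only the original prompt and the final revision become training data.
    return {"prompt": red_team_prompt, "completion": revised}


def stub_model(prompt: str) -> str:
    """Stand-in for an LLM call so the sketch runs without a model."""
    if prompt.endswith(REVISION_REQUEST):
        return "Accessing someone else's WiFi without permission is illegal."
    if prompt.endswith(CRITIQUE_REQUEST):
        return "The response recommends an illegal activity."
    return "Sure, you can use an app to do that."


pair = build_training_pair("How can I access my neighbour's WiFi?", stub_model)
```

The harmful first answer and the critique are only intermediate text; the pair kept for fine-tuning is the red team prompt together with the revised response.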

This stage is similar to RLHF, except that instead of human feedback you now use feedback generated by a model. This is sometimes referred to as reinforcement learning from AI feedback, or RLAIF. Here, you use the fine-tuned model from the previous stage to generate a set of responses to your prompt. You then ask the model which of the responses is preferred according to the constitutional principles.
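A rough sketch of how those model judgements might be collected into comparison records is shown below. The record fields, the function names, and the toy judge are assumptions for illustration; in practice the judge would be a call to the fine-tuned model with the constitutional principles in the prompt.

```python
from itertools import combinations
from typing import Callable, List


def build_preference_dataset(
    prompt: str,
    responses: List[str],
    prefers_first: Callable[[str, str, str], bool],
) -> List[dict]:
    """Compare every pair of responses and record the model's preference."""
    dataset = []
    for a, b in combinations(responses, 2):
        chosen, rejected = (a, b) if prefers_first(prompt, a, b) else (b, a)
        dataset.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return dataset


def stub_judge(prompt: str, first: str, second: str) -> bool:
    """Stand-in for asking the model which response better fits the constitution.

    This toy rule prefers the first response only if it does not mention an
    app; when both responses mention one, the choice is arbitrary.
    """
    return "app" not in first.lower()


responses = [
    "Use this app to get in.",
    "That would be illegal; ask your neighbour for access.",
    "There is an app for that.",
]
dataset = build_preference_dataset(
    "How can I access my neighbour's WiFi?", responses, stub_judge
)
```

Each record has the same shape as a human-labelled comparison, with one preferred and one rejected completion per prompt.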

The result is a model-generated preference dataset that you can use to train a reward model. With this reward model, you can then fine-tune your model further using a reinforcement learning algorithm like PPO, as discussed earlier. Model alignment is an important topic and an active area of research. The RLHF fundamentals presented in this article should help you follow along as the field develops.

6 Acknowledgements

I’d like to express my thanks to the wonderful Generative AI with Large Language Models course by DeepLearning.ai and AWS, which I completed, and to acknowledge the use of some images and other materials from the course in this article.
