Reinforcement learning from human feedback (RLHF) using Proximal Policy Optimisation

In this post we will look at Proximal Policy Optimisation, which is a powerful algorithm for solving reinforcement learning problems.

Pranath Fernando


July 17, 2023

1 Introduction

In an earlier article we introduced reinforcement learning from human feedback (RLHF), an important method used to improve the performance and alignment of modern large language models (LLMs).

In this post we will look at Proximal Policy Optimisation, which is a powerful algorithm for solving reinforcement learning problems.

2 Proximal Policy Optimisation (PPO)

Proximal Policy Optimisation, or PPO, is a powerful algorithm for solving reinforcement learning problems. As the name implies, PPO optimises a policy, in this case the LLM, to be more aligned with human preferences. PPO updates the LLM over many iterations. The name comes from the fact that each update is small and stays within a bounded region, producing an updated LLM that is close to the previous version. Keeping the changes within this small region leads to more stable learning. The goal is to update the policy so that the reward is maximised.

2.1 Phase 1

In Phase 1 we need an estimate of the expected reward of a completion. We estimate this quantity through the value function, which is a separate head of the LLM. Let’s examine the value function and the value loss in more detail. Assume that a number of prompts are given. First, the LLM generates responses to the prompts; then the reward model calculates the reward for each prompt completion.

For instance, the reward for the first prompt completion might be 1.87. The reward for the next one might be -1.24, and so on. You end up with a set of prompt completions and their corresponding rewards. The value function estimates the expected total reward for a given state S.

In other words, as the LLM generates each token of a completion, you want to estimate the total future reward based on the current sequence of tokens. You can think of this as a baseline against which to evaluate the quality of completions with respect to your alignment criteria. Let’s say that at this step of the completion, the estimated future total reward is 0.34.

With the next generated token, the estimated future total reward increases to 1.23. The goal is to minimise the value loss, which is the difference between the actual future total reward, 1.87 in this example, and its approximation by the value function, 1.23. Minimising the value loss makes the estimates of future rewards more accurate.
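As a rough illustration of the numbers above, the value loss for a single step can be sketched as a squared error between the value head's estimate and the reward actually received. The function name `value_loss` and the factor of one half are assumptions made for this sketch; the text only says that the loss measures the difference between the two quantities.

```python
def value_loss(estimated_reward: float, actual_reward: float) -> float:
    """Squared-error value loss for one step (illustrative form, an assumption)."""
    return 0.5 * (estimated_reward - actual_reward) ** 2


# Estimates from the example: 0.34 at one token, 1.23 at the next.
# The actual total reward of the finished completion is 1.87.
early = value_loss(0.34, 1.87)
later = value_loss(1.23, 1.87)

print(f"loss with estimate 0.34: {early:.4f}")
print(f"loss with estimate 1.23: {later:.4f}")
```

The loss shrinks as the estimate approaches the actual reward, which is the signal used to train the value head.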

The value function is then used in the advantage estimation in Phase 2, which we will discuss shortly. This is similar to starting to write a piece and having a rough idea of how it will turn out before you have written it. The losses and rewards determined in Phase 1 are then used in Phase 2 to update the weights, resulting in an updated LLM.

2.2 Phase 2

In Phase 2, you make small updates to the model and evaluate the impact of those updates on your alignment goal for the model. The prompt completions, losses, and rewards guide the model weight updates. PPO also ensures that the model updates stay within a small region called the trust region.

This is where the proximal aspect of PPO comes into play. Ideally, this series of small updates moves the model towards higher rewards. The PPO policy objective is the main ingredient of this method. Remember that the goal is to find a policy whose expected reward is high. In other words, you are trying to update the LLM weights so that completions are more aligned with human preferences and so receive a higher reward. The policy loss is the main objective that the PPO algorithm optimises during training.

Although the maths of the policy objective looks complicated, it is simpler than it appears. Let’s break it down step by step. For now, focus on the most important expression and ignore the rest. In the context of an LLM, π(A_t | S_t) is the probability of the next token A_t given the current prompt S_t. The action A_t is the next token, and the state S_t is the prompt completed up to token t. The denominator is the probability of the next token under the initial version of the LLM, which is frozen. The numerator is the probability of the next token under the updated LLM, which we can change to obtain a better reward. Â_t is the estimated advantage term of a given choice of action.
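For reference, the clipped policy objective from the original PPO paper has the form below, and the terms described above appear to map onto it. This is the standard formulation, not necessarily the exact equation this post refers to; ε is the clipping hyperparameter and θ_old denotes the frozen weights.

```latex
L^{\text{POLICY}} = \min\left(
  \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)} \, \hat{A}_t ,\;
  \operatorname{clip}\left(
    \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)},\,
    1 - \epsilon,\, 1 + \epsilon
  \right) \hat{A}_t
\right)
```

The first argument of the min is the probability ratio multiplied by the estimated advantage, which is the expression examined first.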

The advantage term estimates how much better or worse the current action is compared to all possible actions at that state. We look at the expected future rewards of a completion following the new token, and estimate how advantageous this completion is compared to the rest. There is a recursive formula to estimate this quantity based on the value function discussed earlier. Here, we focus on an intuitive understanding. The different paths in the figure illustrate the different ways you can complete the prompt S. The advantage term tells us how much better or worse the current token A_t is with respect to all the possible tokens.
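The post does not say which recursive formula is used. One widely used choice is generalised advantage estimation (GAE), sketched below in plain Python under that assumption. The names `gae_advantages`, `gamma` and `lam`, and the per-token rewards and value estimates in the example, are illustrative only.

```python
def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalised advantage estimation, computed backwards over one completion.

    rewards[t] is the reward received after token t, and values[t] is the
    value-head estimate at token t. The value after the last token is taken as 0.
    """
    advantages = [0.0] * len(rewards)
    next_value = 0.0
    next_advantage = 0.0
    for t in reversed(range(len(rewards))):
        # One-step error: what happened versus what the value head expected.
        delta = rewards[t] + gamma * next_value - values[t]
        # Recursive step: blend in the discounted advantage of the next token.
        next_advantage = delta + gamma * lam * next_advantage
        advantages[t] = next_advantage
        next_value = values[t]
    return advantages


# Hypothetical completion of three tokens, rewarded only at the end.
rewards = [0.0, 0.0, 1.87]
values = [0.34, 1.23, 1.50]
print(gae_advantages(rewards, values, gamma=1.0, lam=1.0))
```

With `gamma` and `lam` both set to 1, each advantage reduces to the total remaining reward minus the value estimate at that token, which matches the intuition of comparing an outcome against a baseline.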

In this visualisation, the top path goes higher: it is a better completion and receives a higher reward. The bottom path goes down and is the worst completion.

So why does maximising this term lead to higher rewards? Consider the case where the advantage of the suggested token is positive. A positive advantage means that the suggested token is better than average. Increasing the probability of the current token is therefore a good strategy that leads to higher rewards, and this translates to maximising the expression. If the suggested token is worse than average, the advantage is negative. Again, maximising the expression is the right thing to do, because it demotes the token. The overall conclusion is that maximising this expression results in a better-aligned LLM.

So, let’s just maximise this expression. However, maximising it directly can lead to problems, because our calculations are reliable only under the assumption that our advantage estimates are valid. The advantage estimates are valid only when the old and new policies are close to each other. This is where the rest of the terms come into play. Stepping back and looking at the whole equation again, what happens is that we pick the smaller of two terms: the one we just discussed and a second, modified version of it. Notice that this second expression defines a region where the two policies are near each other.

These extra terms are guardrails: they define a region in proximity to the LLM where our estimates have small errors. This is called the trust region. The extra terms make it unlikely that we leave the trust region. In summary, optimising the PPO policy objective results in a better LLM without overshooting into unreliable regions.
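A minimal sketch of how the guardrail works for a single token, assuming the standard clipped form of the PPO objective with a clipping parameter `epsilon` (0.2 is a common default, not a value given in this post). The function and argument names are illustrative.

```python
def clipped_policy_objective(p_new, p_old, advantage, epsilon=0.2):
    """Per-token PPO objective: the smaller of the raw and the clipped term."""
    ratio = p_new / p_old  # updated LLM probability over frozen LLM probability
    # Restrict the ratio to the interval [1 - epsilon, 1 + epsilon].
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)


# Small change in token probability: the ratio is used as-is.
print(clipped_policy_objective(p_new=0.22, p_old=0.20, advantage=1.0))
# Large change: the ratio is capped at 1 + epsilon, so the gain stops growing.
print(clipped_policy_objective(p_new=0.30, p_old=0.20, advantage=1.0))
# Negative advantage with a raised probability: the more pessimistic term is kept.
print(clipped_policy_objective(p_new=0.30, p_old=0.20, advantage=-1.0))
```

Once the ratio moves beyond `1 + epsilon` with a positive advantage, the objective stops increasing, so there is no incentive to push the updated policy further away from the frozen one.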

In addition, there is the entropy loss. While the policy loss moves the model towards the alignment goal, entropy allows the model to maintain creativity. If entropy were kept low, the model might end up always completing the prompt in the same way. Higher entropy guides the LLM towards more creativity. This is similar to the temperature setting of an LLM.
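To make the idea concrete, the entropy of a next-token distribution can be computed as below. This is the standard Shannon entropy; the four-token distributions are made-up examples, not output from a real model.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution over tokens."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# A model that almost always picks the same token has low entropy.
peaked = [0.97, 0.01, 0.01, 0.01]
# A model that spreads probability evenly has the highest possible entropy.
uniform = [0.25, 0.25, 0.25, 0.25]

print(f"peaked:  {entropy(peaked):.4f}")
print(f"uniform: {entropy(uniform):.4f}")
```

An entropy term in the training objective rewards distributions closer to the second case, which discourages the policy from collapsing onto a single completion.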

The difference is that temperature influences model creativity at inference time, whereas entropy influences model creativity during training. Putting all the terms together as a weighted sum gives the PPO objective, which updates the model steadily towards human preference. This is the overall PPO objective. The coefficients C1 and C2 are hyperparameters. The PPO objective updates the model weights through backpropagation over several steps. Once the model weights are updated, PPO starts a new cycle.
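Written out, the weighted sum has the form below. The notation is an assumption; sign conventions vary between presentations, and the original PPO paper subtracts the value-loss term when the objective is maximised.

```latex
L^{\text{PPO}} = L^{\text{POLICY}} + c_1 \, L^{\text{VF}} + c_2 \, L^{\text{ENT}}
```

Here L^POLICY is the policy loss, L^VF is the value loss from Phase 1, and L^ENT is the entropy term.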

2.3 Iterate to produce Human-Aligned LLM

For the next iteration, the LLM is replaced with the updated LLM, and a new PPO cycle starts. After many iterations, you arrive at the human-aligned LLM. Are there other reinforcement learning techniques that can be used for RLHF? Yes.

For example, Q-learning is an alternative technique for fine-tuning LLMs through RL, but PPO is currently the most popular method. PPO is popular because it strikes a good balance between complexity and performance. That said, fine-tuning LLMs through human or AI feedback is an active area of research.

We can expect many more developments in this area in the near future. For example, researchers at Stanford recently published a paper describing a technique called direct preference optimisation (DPO) as a simpler alternative to RLHF. More work is needed to better understand the benefits of new methods like this, but it is a very exciting area of research.

3 Acknowledgements

I’d like to express my thanks to the wonderful Generative AI with Large Language Models Course by and AWS - which I completed, and acknowledge the use of some images and other materials from the course in this article.