Improve Large Language Models with Instruction Fine-Tuning

In this article we will look at methods you can use to improve the performance of an existing large language model for your specific use case using instruction fine-tuning.

Pranath Fernando


July 10, 2023

1 Introduction

In this post, we’ll look at techniques you might employ to make an existing large language model more effective for your particular use case using a method called instruction fine-tuning. We will also see how this differs from using prompts and in-context prompt learning.

2 Limitations of In-Context Prompt Learning

Some models can carry out zero-shot inference correctly when they are able to recognise the instructions in a prompt, but smaller LLMs can fall short. One-shot or few-shot inference, which involves giving the model one or more examples of what you want it to do, can be enough for it to recognise the task and produce a good completion. This strategy, however, has a few drawbacks.

First, even with five or six samples, it doesn’t always work for smaller models. Second, any examples you give in your prompt consume important context window real estate, leaving less space for other important information.
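The context window cost of few-shot examples can be illustrated with a small sketch. The reviews and labels below are invented for illustration, and the whitespace word count is only a crude stand-in for a real tokenizer, so treat the numbers as indicative rather than exact.

```python
# Hypothetical reviews and labels, invented for illustration only.
EXAMPLES = [
    ("I loved this DVD!", "Positive"),
    ("The battery died after two days.", "Negative"),
    ("Great value and fast delivery.", "Positive"),
]


def build_prompt(review, examples=()):
    """Build a zero-shot prompt (no examples) or a few-shot prompt."""
    blocks = [
        f"Classify this review:\n{text}\nSentiment: {label}"
        for text, label in examples
    ]
    # The final block leaves the sentiment blank for the model to complete.
    blocks.append(f"Classify this review:\n{review}\nSentiment:")
    return "\n\n".join(blocks)


def rough_token_count(prompt):
    """Whitespace word count: a crude stand-in for a real tokenizer."""
    return len(prompt.split())


zero_shot = build_prompt("Not what I expected, but it works.")
few_shot = build_prompt("Not what I expected, but it works.", EXAMPLES)
print("zero-shot length:", rough_token_count(zero_shot))
print("few-shot length:", rough_token_count(few_shot))
```

Every example added to the prompt makes it longer, which is exactly the space that is no longer available for other information.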

Fortunately, there is another option: you can use fine-tuning to further train a base model.

3 Instruction Fine-Tuning

Fine-tuning is a supervised learning method in which you use a dataset of labelled examples to update the weights of the LLM. This is in contrast to pre-training, where you train the LLM on enormous volumes of unstructured text via self-supervised learning.

The labelled examples are prompt-completion pairs, and the fine-tuning process extends the model’s training to improve its ability to produce good completions for a specific task. One technique, known as instruction fine-tuning, is particularly good at improving a model’s performance across a variety of tasks. Let’s examine this more closely. Instruction fine-tuning trains the model using examples that show how it should respond to a specific instruction. Here are a few examples of prompts to illustrate this concept.

In both examples the instruction is “Classify this review”, and the desired completion is a text string that starts with “Sentiment” followed by either “Positive” or “Negative”. The data set you use for training contains many such prompt-completion pairs for the task you are interested in, each of which includes an instruction.

For instance, if you wanted to fine-tune your model to improve its ability to summarise, you would build a data set of examples that begin with the instruction “Summarise”, “Summarise the following text”, or a similar phrase. Likewise, if you want to improve the model’s translation abilities, your examples should include instructions like “Translate this sentence”.
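A minimal sketch of what such prompt-completion pairs might look like as data is shown below. The `InstructionExample` class, the helper functions, the sample texts, and the choice of French as the target language are all assumptions made for illustration; they do not come from any particular library or dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstructionExample:
    prompt: str      # the instruction plus the input text
    completion: str  # the response the model should learn to produce


def classification_example(review, sentiment):
    return InstructionExample(
        prompt=f"Classify this review:\n{review}",
        completion=f"Sentiment: {sentiment}",
    )


def summarisation_example(text, summary):
    return InstructionExample(
        prompt=f"Summarise the following text:\n{text}",
        completion=summary,
    )


def translation_example(sentence, translation):
    # French is an arbitrary choice of target language for this sketch.
    return InstructionExample(
        prompt=f"Translate this sentence to French:\n{sentence}",
        completion=translation,
    )


# Invented sample data, for illustration only.
dataset = [
    classification_example("I loved this DVD!", "Positive"),
    summarisation_example(
        "The kettle boils quickly, is quiet, and has not leaked in a year.",
        "A fast, quiet and reliable kettle.",
    ),
    translation_example("The cat is asleep.", "Le chat dort."),
]

for example in dataset:
    print(repr(example.prompt), "->", repr(example.completion))
```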

These prompt-completion examples allow the model to learn to generate responses that follow the given instructions. Instruction fine-tuning in which all of the model’s weights are updated is known as full fine-tuning. The process results in a new version of the model with updated weights. Remember that, just like pre-training, full fine-tuning needs enough memory and compute to store and process all the gradients, optimizer states, and other components that are updated during training. The memory optimisation and parallel computing techniques used for pre-training can therefore be useful here too.
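To give a feel for the memory involved, here is a back-of-the-envelope sketch. It assumes 32-bit values (4 bytes each) and an Adam-style optimizer that keeps two extra values per parameter; it ignores activations and temporary buffers, so real requirements are higher. The function name and the one-billion-parameter model are hypothetical.

```python
def full_fine_tuning_memory_gb(num_params, bytes_per_value=4,
                               optimizer_states_per_param=2):
    """Rough memory estimate in GB (1 GB = 1e9 bytes).

    Assumes every weight has one gradient and a fixed number of
    optimizer state values. Activations are NOT included.
    """
    weights = num_params * bytes_per_value
    gradients = num_params * bytes_per_value
    optimizer = num_params * bytes_per_value * optimizer_states_per_param
    return {
        "weights": weights / 1e9,
        "gradients": gradients / 1e9,
        "optimizer_states": optimizer / 1e9,
        "total": (weights + gradients + optimizer) / 1e9,
    }


# A hypothetical model with one billion parameters.
estimate = full_fine_tuning_memory_gb(1_000_000_000)
for name, gigabytes in estimate.items():
    print(f"{name}: {gigabytes:.1f} GB")
```

Under these assumptions the training state takes roughly four times the memory of the weights alone, before counting activations.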

4 Creating Datasets for Instruction Fine-Tuning

So how do you actually go about instruction fine-tuning an LLM? Preparing your training data is the first step. There are several publicly available datasets that have been used to train earlier generations of language models, although most of them are not formatted as instructions. Fortunately, developers have created prompt template libraries that can be used to turn existing datasets, such as the large data set of Amazon product reviews, into instruction prompt datasets for fine-tuning. Prompt template libraries include many templates for different tasks and data sets.

Here are three prompt templates that are designed to work with the Amazon reviews dataset and can be used to fine-tune models for classification, text generation, and text summarisation tasks. In each case you pass the original review, here called review_body, to the template, where it is inserted into text that starts with an instruction such as “predict the associated rating”, “generate a star review”, or “give a short sentence describing the following product review”. As a result, the prompt now contains both an instruction and the example from the data set. Once the instruction data set is ready, you divide it into training, validation, and test splits, as with conventional supervised learning.
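The templating and splitting steps can be sketched as follows. The template wording paraphrases the instructions quoted above and is not copied from any specific prompt template library. Apart from `review_body`, the field names (`star_rating`, `review_headline`), the sample records, and the 80/10/10 split are assumptions made for illustration.

```python
import random

# Each template maps a record to a (prompt, completion) pair.
TEMPLATES = {
    "classification": (
        "Predict the associated rating from 1 to 5 for the following "
        "review:\n{review_body}",
        "{star_rating}",
    ),
    "text_generation": (
        "Generate a {star_rating}-star review.",
        "{review_body}",
    ),
    "summarisation": (
        "Give a short sentence describing the following product "
        "review:\n{review_body}",
        "{review_headline}",
    ),
}


def apply_template(task, record):
    """Fill one template with the fields of a single dataset record."""
    prompt_template, completion_template = TEMPLATES[task]
    return (prompt_template.format(**record),
            completion_template.format(**record))


def split_dataset(examples, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle a copy of the examples and cut it into train/val/test."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


# Invented records standing in for real product reviews.
records = [
    {"review_body": f"Sample review text number {i}.",
     "star_rating": 1 + i % 5,
     "review_headline": f"Headline {i}"}
    for i in range(10)
]

instruction_data = [apply_template("classification", r) for r in records]
train, validation, test = split_dataset(instruction_data)
print(instruction_data[0])
print(len(train), len(validation), len(test))
```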

During fine-tuning, you select prompts from your training data set and pass them to the LLM, which then generates completions. Each completion is compared with the response specified in the training data. In the example here, the model did a poor job: it classified the review as neutral, which is a bit of an understatement, because the review is clearly very positive. Remember that the output of an LLM is a probability distribution over tokens. You can therefore compare the distribution of the completion with that of the training label and use the standard cross-entropy function to calculate the loss between the two token distributions. You then use the calculated loss to update your model weights with standard backpropagation.
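The loss calculation described above can be sketched in plain Python. The three-word vocabulary and the logits are invented for illustration; a real LLM produces logits over tens of thousands of tokens at every position, and a deep learning framework would compute this loss and its gradients for you.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    largest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits_per_position, target_ids):
    """Mean negative log-probability of the target token at each position."""
    losses = []
    for logits, target in zip(logits_per_position, target_ids):
        probabilities = softmax(logits)
        losses.append(-math.log(probabilities[target]))
    return sum(losses) / len(losses)


# Toy vocabulary and made-up logits for a one-token completion.
VOCAB = ["Positive", "Neutral", "Negative"]
target = [VOCAB.index("Positive")]

favours_neutral = [[0.5, 3.0, 0.2]]   # model leans towards "Neutral"
favours_positive = [[3.0, 0.5, 0.2]]  # model leans towards "Positive"

print("loss when favouring Neutral: ", cross_entropy(favours_neutral, target))
print("loss when favouring Positive:", cross_entropy(favours_positive, target))
```

The loss is larger when the model puts most of its probability on the wrong token, and that is the signal backpropagation uses to adjust the weights.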

You repeat this for many batches of prompt-completion pairs and over several epochs, updating the weights so that the model’s performance on the task improves. As in conventional supervised learning, you can define separate evaluation steps that measure your LLM’s performance on the holdout validation data set. This gives you the validation accuracy. After you have finished fine-tuning, you can carry out a final performance evaluation using the holdout test data set.

The final evaluation on the holdout test data set gives you the test accuracy. The fine-tuning process results in a new version of the base model, often called an instruct model, that is better at the tasks you are interested in. Fine-tuning with instruction prompts is the most common way to fine-tune LLMs today. From this point on, whenever you hear or see the term “fine-tuning”, you can assume that it means instruction fine-tuning.
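The validation and test evaluation steps described above can be sketched as a simple exact-match accuracy. Here `toy_model` is a keyword-based stub standing in for a fine-tuned LLM, and the example sets are invented; exact match is just one possible metric and suits short classification labels better than free-form text.

```python
def exact_match_accuracy(generate, examples):
    """Fraction of prompts whose generated completion matches the label."""
    if not examples:
        raise ValueError("examples must not be empty")
    correct = sum(
        generate(prompt).strip().lower() == label.strip().lower()
        for prompt, label in examples
    )
    return correct / len(examples)


def toy_model(prompt):
    """Keyword stub standing in for a fine-tuned LLM; not a real model."""
    return "Positive" if "love" in prompt.lower() else "Negative"


# Invented holdout sets, for illustration only.
validation_set = [
    ("Classify this review:\nI love it.", "Positive"),
    ("Classify this review:\nBroke in a week.", "Negative"),
    ("Classify this review:\nWorks great!", "Positive"),  # the stub misses this
    ("Classify this review:\nWould not buy again.", "Negative"),
]
test_set = [
    ("Classify this review:\nLove the colour.", "Positive"),
    ("Classify this review:\nArrived damaged.", "Negative"),
]

# Validation accuracy is checked during training; test accuracy once at the end.
print("validation accuracy:", exact_match_accuracy(toy_model, validation_set))
print("test accuracy:", exact_match_accuracy(toy_model, test_set))
```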

5 Instruction Fine-Tuning on a Single Task

While LLMs have become well known for their ability to handle many different language tasks within a single model, your application might only need to perform one. In this situation, you can fine-tune a pre-trained model to improve performance on just the task that interests you, for instance summarisation, using a dataset of examples for that task. Interestingly, good results can be achieved with relatively few examples. In contrast to the billions of pieces of text that the model saw during pre-training, good performance is often reached with just 500–1,000 examples. However, fine-tuning on a single task has a potential drawback. The process may lead to a phenomenon known as catastrophic forgetting.

Catastrophic forgetting happens because the full fine-tuning process modifies the weights of the original LLM. While this can lead to excellent performance on the single fine-tuning task, it can degrade performance on other tasks. For instance, while fine-tuning can improve a model’s ability to carry out sentiment analysis on a review and produce a quality completion, the model may forget how to do other tasks. Before fine-tuning, this model was able to carry out named entity recognition, correctly identifying Charlie as the name of the cat in the sentence.

After fine-tuning, however, the model can no longer carry out this task: it confuses the entity it is supposed to identify and exhibits behaviour related to the new task instead. What options do you have for avoiding catastrophic forgetting? Before making any decisions, it is important to consider whether catastrophic forgetting actually affects your use case. If all you need is reliable performance on the single task you fine-tuned on, it may not be a problem. If you want or need the model to keep its generalised multitask capabilities, you can fine-tune on several tasks at the same time.
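One way to judge whether catastrophic forgetting matters for your use case is to score the model on several tasks before and after fine-tuning and flag any large drops. The sketch below assumes you already have such per-task scores; the task names, the numbers, and the `tolerance` threshold are placeholders invented for illustration, not measurements.

```python
def find_regressions(before, after, tolerance=0.02):
    """Return tasks whose score fell by more than `tolerance`."""
    return {
        task: (before[task], after[task])
        for task in before
        if task in after and before[task] - after[task] > tolerance
    }


# Placeholder scores invented for illustration; they are not measurements.
scores_before = {"sentiment": 0.70, "ner": 0.80, "summarisation": 0.60}
scores_after = {"sentiment": 0.95, "ner": 0.35, "summarisation": 0.59}

regressions = find_regressions(scores_before, scores_after)
for task, (old, new) in regressions.items():
    print(f"{task}: {old:.2f} -> {new:.2f}")
```

If the flagged tasks are ones your application never uses, the regression may be acceptable; if not, multitask fine-tuning is one way to address it.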

Good multitask fine-tuning may require 50,000 to 100,000 examples spread across many tasks, so it needs more data and more compute to train. As an alternative to full fine-tuning, you can use parameter-efficient fine-tuning, or PEFT. PEFT is a set of techniques that preserves the weights of the original LLM and trains only a small number of task-specific adapter layers and parameters. Because most of the pre-trained weights are left unchanged, PEFT is more robust to catastrophic forgetting. PEFT is a fascinating and active area of research.
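To illustrate why PEFT trains so few parameters, the sketch below counts the trainable values for one weight matrix under a low-rank adapter, a PEFT approach that is not covered in this article and is used here only as an example. The matrix size and rank are arbitrary choices for illustration.

```python
def full_matrix_params(d_out, d_in):
    """Values updated when the whole weight matrix is fine-tuned."""
    return d_out * d_in


def low_rank_adapter_params(d_out, d_in, rank):
    """Values in two small matrices of shape (d_out, rank) and (rank, d_in).

    The original matrix stays frozen; only these adapter values are trained.
    """
    return rank * (d_out + d_in)


# Arbitrary example sizes, chosen only for illustration.
d_out, d_in, rank = 4096, 4096, 8

full = full_matrix_params(d_out, d_in)
adapter = low_rank_adapter_params(d_out, d_in, rank)
print(f"full fine-tuning: {full:,} trainable values")
print(f"low-rank adapter: {adapter:,} trainable values")
print(f"adapter share:    {adapter / full:.2%}")
```

With these sizes the adapter holds well under one percent of the values in the original matrix, which stays untouched.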

6 Acknowledgements

I’d like to express my thanks to the wonderful Generative AI with Large Language Models Course by and AWS, which I completed, and acknowledge the use of some images and other materials from the course in this article.