Parameter Efficient Fine-Tuning (PEFT) for Large Language Models

Training large language models can be computationally and financially expensive. Parameter efficient fine tuning techniques modify only a restricted number of parameters and can drastically reduce costs and training time.
natural-language-processing
deep-learning
fine-tuning
Author

Pranath Fernando

Published

July 13, 2023

1 Introduction

It takes a lot of computation to train LLMs. Memory is needed for full fine-tuning not just to store the model but also a number of other training-related components. Even if your computer can hold the model weights, which are now on the order of hundreds of gigabytes for the largest models, you must also be able to allocate memory for optimizer states, gradients, forward activations, and temporary memory throughout the training process. These extra parts may be many times bigger than the model and can easily outgrow the capabilities of consumer hardware.
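As a rough illustration (the exact figures depend heavily on the optimizer, precision, batch size, and whether activation checkpointing is used), a back-of-the-envelope estimate of the training memory for full fine-tuning with Adam in 32-bit precision looks something like this:

```python
# Rough sketch: training memory for full fine-tuning, assuming fp32 weights
# and the Adam optimizer, and ignoring activations and temporary buffers.
def training_memory_gb(num_params: float) -> float:
    bytes_weights = 4 * num_params      # fp32 model weights
    bytes_gradients = 4 * num_params    # one gradient per weight
    bytes_optimizer = 8 * num_params    # Adam keeps two moment estimates per weight
    return (bytes_weights + bytes_gradients + bytes_optimizer) / 1e9

print(training_memory_gb(1e9))  # ~16 GB for a 1-billion-parameter model, before activations
```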

Parameter efficient fine tuning techniques only modify a restricted number of parameters, as opposed to full fine-tuning, which modifies every model weight during supervised learning.

2 Parameter Efficient Fine Tuning (PEFT)

Some PEFT strategies freeze the majority of the model weights and concentrate on fine-tuning a subset of the existing model parameters, such as specific layers or components. Other methods add a small number of new parameters or layers and fine-tune only those, leaving the existing model weights untouched. Most of the LLM weights, if not all of them, are kept frozen with PEFT. As a result, there are far fewer trained parameters than in the original LLM, in some cases only 15–25% of the original LLM weights. This makes the memory needs for training much more manageable.

In practice, PEFT can frequently be completed on a single GPU. Additionally, because the original LLM is only marginally changed or left unchanged, PEFT is less vulnerable to the catastrophic forgetting issues of full fine-tuning. With full fine-tuning, every task you train on generates a new version of the model, and because each copy is the same size as the original, fine-tuning for several tasks can become a costly storage problem. Let's look at how PEFT can help to make things better. With parameter efficient fine-tuning, you train fewer weights overall, resulting in a considerably smaller footprint; depending on the task, this footprint can be as small as a few gigabytes.
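A minimal sketch of what "freezing most of the weights" looks like in PyTorch (the toy model and the substring-matching helper are illustrative, not taken from the course):

```python
import torch.nn as nn

def freeze_all_but(model: nn.Module, trainable_substrings: list[str]) -> None:
    """Freeze every parameter except those whose name contains one of the given substrings."""
    for name, param in model.named_parameters():
        param.requires_grad = any(s in name for s in trainable_substrings)

def count_parameters(model: nn.Module) -> tuple[int, int]:
    """Return (trainable, total) parameter counts."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total

# Toy example; in practice `model` would be your LLM.
model = nn.Sequential(nn.Linear(512, 512), nn.Linear(512, 512))
freeze_all_but(model, ["1."])  # keep only the second layer trainable
print(count_parameters(model))
```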

The original LLM weights and the additional parameters are merged for inference. Because the PEFT weights are trained for each task and are simple to swap out at inference time, the original model can be efficiently adapted to many tasks. There are a number of parameter efficient fine-tuning approaches you may apply, each with trade-offs in terms of parameter efficiency, memory efficiency, training speed, model quality, and inference costs. Let's examine the three primary categories of PEFT approaches. Selective techniques fine-tune only a subset of the original LLM parameters.

You can choose from a number of methods to determine the parameters you wish to alter. You can choose to train only a portion of the model, a set of layers, or even a single kind of parameter.

Researchers have found that the performance of these selective approaches is mixed, with noticeable trade-offs between parameter efficiency and compute efficiency. Reparameterization techniques also work with the original LLM parameters, but they reduce the training burden by creating new low-rank transformations of the original network weights.

LoRA is one such approach that is frequently employed. Lastly, additive techniques perform fine-tuning by keeping all of the original LLM weights frozen and adding new trainable components. There are two main strategies here. Adapter methods add new trainable layers to the architecture of the model, typically inside the encoder or decoder components after the attention or feed-forward layers.

On the other hand, soft prompt approaches maintain a fixed and frozen model architecture and concentrate on modifying the input to enhance performance. This can be accomplished by either maintaining the input constant and retraining the embedding weights, or by adding trainable parameters to the prompt embeddings.

3 PEFT Method 1: LoRA

Low-rank Adaptation, or LoRA for short, is a parameter-efficient fine-tuning technique that falls into the re-parameterization category. Let's examine how it works. Here is the diagram of the transformer architecture. Tokens created from the input prompt are converted into embedding vectors and passed to the encoder and/or decoder sections of the transformer. Both of these components contain two kinds of neural networks: self-attention and feed-forward networks. The weights of these networks are learned during the pre-training phase.

The embedding vectors are fed into the self-attention layers, where a series of weights are applied to calculate the attention scores. During full fine-tuning, every parameter in these layers is updated. LoRA reduces the number of parameters that must be trained during fine-tuning by freezing all of the original model parameters and then injecting a pair of rank-decomposition matrices alongside the original weights. The dimensions of the smaller matrices are chosen so that their product has the same dimensions as the weights they are modifying.
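A minimal sketch of this setup in PyTorch (an illustration of the technique, not the reference LoRA implementation; the class name, initialisation, and scaling convention are assumptions):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer augmented with trainable low-rank matrices A and B."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad = False            # freeze the original weights
        if self.base.bias is not None:
            self.base.bias.requires_grad = False
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)  # rank x d_in
        self.B = nn.Parameter(torch.zeros(d_out, rank))         # d_out x rank, starts at zero
        self.scale = alpha / rank

    def forward(self, x):
        # Frozen path plus the trainable low-rank update applied to x.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

    def merged_weight(self):
        # For inference: fold B @ A into the frozen weights so there is no extra latency.
        return self.base.weight + self.scale * (self.B @ self.A)
```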

The smaller matrices are trained with a supervised learning procedure while the original weights of the LLM are kept frozen. For inference, the two low-rank matrices are multiplied together to produce a matrix with the same dimensions as the frozen weights. This product is then added to the original weights, and the model is updated with these new values. You now have a LoRA model that has been optimised for your specific task. Because this model has the same number of parameters as the original, inference latency is hardly affected. Researchers have found that applying LoRA to just the self-attention layers of the model is often enough to fine-tune for a task and achieve good performance.

However, in theory, LoRA can also be applied to other components, such as the feed-forward layers. Since most of an LLM's parameters are in the attention layers, applying LoRA to these weight matrices gives the biggest reductions in trainable parameters. Let's use the transformer architecture described in the Attention Is All You Need paper as an example to illustrate. According to the original paper, the transformer weights have dimensions of 512 by 64, which means each weights matrix has 32,768 trainable parameters. Alternatively, if we use LoRA as a fine-tuning strategy with a rank of eight, we train two small rank-decomposition matrices whose smaller dimension is eight.

This means that Matrix A will have dimensions of 8 by 64, giving 512 total parameters, and Matrix B will be 512 by 8, or 4,096 trainable parameters. By updating the weights of these new low-rank matrices instead of the original weights, you train 86% fewer parameters. Because LoRA drastically reduces the number of trainable parameters, you don't always need a distributed cluster of GPUs to carry out this kind of parameter efficient fine tuning. Since a separate set of rank-decomposition matrices can be fine-tuned for each task, you can switch between tasks at inference time by updating the weights.
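The arithmetic from this example, written out (using the dimensions quoted above from the Attention Is All You Need paper):

```python
d_model, d_head, rank = 512, 64, 8

full = d_model * d_head        # 512 * 64 = 32,768 trainable parameters in the full matrix
lora_a = rank * d_head         # 8 * 64   = 512
lora_b = d_model * rank        # 512 * 8  = 4,096
lora = lora_a + lora_b         # 4,608 trainable parameters with LoRA

print(f"reduction: {100 * (1 - lora / full):.0f}%")  # ~86%
```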

Let's say you train two LoRA matrices for Task A. To carry out inference on this task, you would multiply the two matrices together and then add the resulting matrix to the original frozen weights. You then replace the original weights wherever they occur in your model with this updated weights matrix.

After that, you can perform inference on Task A using this model. If you instead want to carry out a different task, say Task B, you simply take the LoRA matrices you trained for that task, compute their product, add it to the original weights, and update the model again. These LoRA matrices require only a tiny amount of memory to store.
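A sketch of that swap in code (the matrix names are illustrative; the random values stand in for trained LoRA weights):

```python
import torch

d_out, d_in, rank = 512, 512, 8
W0 = torch.randn(d_out, d_in)                      # frozen pretrained weights

# Task-specific low-rank pairs, each tiny compared to W0.
B_task_a, A_task_a = torch.randn(d_out, rank), torch.randn(rank, d_in)
B_task_b, A_task_b = torch.randn(d_out, rank), torch.randn(rank, d_in)

W_task_a = W0 + B_task_a @ A_task_a                # weights to use for Task A inference
W_task_b = W0 + B_task_b @ A_task_b                # swap in for Task B
```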

So in principle, LoRA can be used to train for a variety of tasks: you simply swap out the weights as needed, avoiding the need to store numerous full-size versions of the LLM. How well do these models perform? Let's use the ROUGE metric to compare a LoRA fine-tuned model with both the original base model and a fully fine-tuned version. We'll concentrate on fine-tuning FLAN-T5 for dialogue summarisation. As a reminder, the FLAN-T5 base model was created through an initial round of full fine-tuning on a substantial instruction data set.
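As an aside, ROUGE scores like the ones discussed below are typically computed with the Hugging Face `evaluate` library; a quick sketch (the example strings are made up):

```python
import evaluate

rouge = evaluate.load("rouge")
scores = rouge.compute(
    predictions=["the cat sat on the mat"],         # model-generated summaries (illustrative)
    references=["the cat was sitting on the mat"],  # human baseline summaries
)
print(scores["rouge1"])
```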

Let's first establish a baseline score for the FLAN-T5 base model on the summarisation data set we previously described. The ROUGE scores for the base model are shown below, with higher values indicating better performance. For this discussion, focus on the ROUGE-1 score, although you can compare any of these scores. As you can see, the scores are fairly low. Next, check the results for a model that has had additional full fine-tuning for dialogue summarisation. Remember that although FLAN-T5 is a capable model, some additional task-specific fine-tuning can be beneficial. With full fine-tuning, supervised learning is used to update every weight of the model.

As you can see, this causes the ROUGE-1 score to increase by 0.19 over the baseline FLAN-T5 model. The second round of fine-tuning has significantly improved the model's performance on the summarising task. Now let's examine the results for the LoRA fine-tuned model. As you can see, this procedure also significantly improved performance: the ROUGE-1 score has increased by 0.17 over the baseline. This is only a little less than full fine-tuning. However, LoRA fine-tuning trained far fewer parameters than full fine-tuning and used much less compute, so this small performance trade-off may well be worthwhile.

You may be wondering how to choose the rank of the LoRA matrices. It's a good question, and this is still an active area of research. In general, the lower the rank, the fewer the trainable parameters and the greater the compute savings. There are, however, some model performance considerations to take into account. In the paper that first introduced LoRA, researchers at Microsoft investigated how different choices of rank affected model performance on language generation tasks. The table here summarises the findings.

The table shows the rank of the LoRA matrices in the first column, the final loss value of the model, and the scores for other metrics, including BLEU and ROUGE. The bold values indicate the best result for each metric. The authors found that the loss value plateaued for ranks greater than 16; in other words, using larger LoRA matrices did not improve performance. The takeaway here is that ranks in the range of 4 to 32 can offer a good trade-off between reducing the number of trainable parameters and maintaining performance.
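In practice, the rank is a single hyperparameter in libraries such as Hugging Face `peft`; a sketch of a typical configuration (the target module names vary by model family, so treat them as assumptions):

```python
from peft import LoraConfig, TaskType, get_peft_model

lora_config = LoraConfig(
    r=8,                              # rank of the decomposition matrices
    lora_alpha=32,                    # scaling factor for the LoRA update
    target_modules=["q", "v"],        # query and value projections (model-dependent)
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM,  # e.g. for FLAN-T5
)

# peft_model = get_peft_model(base_model, lora_config)
# peft_model.print_trainable_parameters()
```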

As more practitioners employ LoRA, best practices for choosing the rank may continue to evolve. LoRA is an effective fine-tuning technique that achieves excellent performance, and the principles behind the method apply to training models across domains, not just LLMs.

4 PEFT Method 2: Soft Prompts

With LoRA, we were able to update the model's weights efficiently without having to retrain every parameter. PEFT also includes additive techniques that aim to improve model performance without altering the original weights at all. Here you'll learn about prompt tuning, also known as soft prompts, a second technique for parameter-efficient fine tuning. Though they sound similar, prompt tuning and prompt engineering are very distinct from one another. Prompt engineering means modifying the language of your prompt to get the desired completion. This can be as simple as changing the words or phrases you use, or more involved, such as providing examples for one-shot or few-shot inference.

The objective is to help the model understand the nature of the task you are asking it to perform and to improve the completion. Prompt engineering has significant drawbacks, though: creating and testing various prompts can be labour-intensive, the length of the context window is limited, and you may still not achieve the performance you need for your task using this approach.

Prompt tuning involves adding more trainable tokens to your prompt and letting the supervised learning procedure decide their ideal values. This collection of trainable tokens, called a soft prompt, is prepended to the embedding vectors that represent the text in your input.

The soft prompt vectors are the same size as the language token embedding vectors, and between 20 and 100 virtual tokens can be sufficient for good performance. The tokens used to represent natural language are fixed to specific words, since each token corresponds to a fixed location in the embedding vector space. The soft prompt tokens, on the other hand, are not fixed, discrete words of natural language; instead, think of them as virtual tokens that can take on any value within the continuous multidimensional embedding space. Through supervised learning, the model learns the values of these virtual tokens that maximise performance for a particular task.
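A minimal sketch of the mechanics: a small matrix of trainable virtual-token embeddings is prepended to the (frozen) input embeddings. The dimensions here are illustrative:

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Trainable virtual tokens prepended to the frozen input embeddings."""
    def __init__(self, num_virtual_tokens: int = 20, d_embed: int = 1024):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(num_virtual_tokens, d_embed) * 0.01)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, d_embed) from the frozen embedding layer
        batch = input_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)  # (batch, 20 + seq_len, d_embed)
```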

In full fine tuning, the training data set consists of input prompts and output completions or labels, and the large language model's weights are updated during supervised learning. In contrast, with prompt tuning the large language model's weights are frozen and the underlying model is not changed. Instead, the embedding vectors of the soft prompt are updated over time to improve the model's completion of the prompt.
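With the Hugging Face `peft` library, this kind of prompt tuning is configured in a few lines; a sketch under assumed settings:

```python
from peft import PromptTuningConfig, PromptTuningInit, TaskType, get_peft_model

prompt_config = PromptTuningConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    prompt_tuning_init=PromptTuningInit.RANDOM,  # random initialisation of the virtual tokens
    num_virtual_tokens=20,                       # length of the soft prompt
)

# peft_model = get_peft_model(base_model, prompt_config)  # base_model weights stay frozen
```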

Given that only a small number of parameters are being learned, prompt tuning is a very parameter-efficient method compared to the millions to billions of parameters updated in full fine tuning, similar to what we observed with LoRA.
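To make that concrete with rough numbers (the embedding dimension and model size below are assumptions for illustration):

```python
num_virtual_tokens = 100
d_embed = 1024                                 # assumed embedding dimension

prompt_params = num_virtual_tokens * d_embed   # 102,400 trainable parameters
full_model_params = 11e9                       # e.g. an 11-billion-parameter model

# The soft prompt is roughly 0.001% of the parameters touched by full fine-tuning.
print(f"{100 * prompt_params / full_model_params:.5f}%")
```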

You can train a separate set of soft prompts for each task and swap them out at inference time: train one set of soft prompts for one task and a different set for another, and to move to another task you just change the soft prompt. To use them for inference, you prepend the learned tokens to your input prompt. Since soft prompts take up very little space on disk, this kind of fine tuning is very efficient and flexible. The LLM itself is the same for all tasks; all you have to do is change the soft prompts when it comes time for inference. So how effective is prompt tuning? Brian Lester and colleagues at Google looked at this in the original paper.

The authors compared prompt tuning with a few alternative techniques for a variety of model sizes. In the figure from the paper, the model size is on the X axis and the SuperGLUE score is on the Y axis; this is an evaluation benchmark that grades model performance on a variety of different language tasks. The red line shows the results of models that underwent full fine tuning on a single task, and the orange line shows the scores for models developed using multitask fine tuning. The green line shows the performance of prompt tuning, and the blue line shows scores for prompt engineering alone.

As you can see, prompt tuning is less effective than full fine tuning for smaller LLMs. However, as the model size increases, so does the performance of prompt tuning. Once models have around 10 billion parameters, prompt tuning can be just as successful as full fine tuning, while providing a considerable performance improvement over prompt engineering alone. One potential issue to consider is the interpretability of the learned virtual tokens. Keep in mind that the soft prompt tokens can take on any value in the continuous embedding vector space, so the trained tokens do not correspond to any known token, word, or phrase in the LLM's vocabulary.

However, a closer look at the tokens nearest to the soft prompt location reveals that they form tight semantic clusters. In other words, the words closest to the soft prompt tokens have meanings most similar to them, and these words are often related to the task in some way, suggesting that the prompts are learning word-like representations. In this post, you looked at two PEFT techniques: LoRA, which uses rank-decomposition matrices to update the model parameters efficiently, and prompt tuning, which adds trainable tokens while leaving the model weights alone.

Both techniques let you fine-tune models to perform your tasks more effectively while using far less compute than full fine-tuning. LoRA is widely employed in practice because its performance is on par with full fine tuning for a wide range of tasks and data sets.

5 Acknowledgements

I'd like to express my thanks to the wonderful Generative AI with Large Language Models Course by DeepLearning.ai and AWS, which I completed, and acknowledge the use of some images and other materials from the course in this article.
