Evaluating Fine-Tuned Large Language Models

In this article we explore several metrics used by developers of large language models, which you can use to assess the performance of your own models and compare them with other models out in the world.

Pranath Fernando


July 12, 2023

1 Introduction

When discussing large language models such as ChatGPT, or models we have customised and fine-tuned ourselves, we often say things like “the model demonstrated good performance on this task” or “this fine-tuned model showed a large improvement in performance over the base model”. But what do statements like this mean? How can you formalize the improvement in performance of your fine-tuned model over the pre-trained model you started with? In this article we explore several metrics used by developers of large language models, which you can use to assess the performance of your own models and compare them with other models out in the world.

2 Basic LLM Evaluation Metrics

In conventional machine learning, you can evaluate a model’s performance by examining how well it does on training and validation data sets where the correct output is already known. Because these models are deterministic, you can compute straightforward measures such as accuracy, which is the fraction of all predictions that are correct. Large language models, however, present a far greater challenge because the output is non-deterministic.

Take, for example, the sentence “Mike really loves drinking tea”. This is quite similar to “Mike adores sipping tea”. But how do you measure the similarity? Let’s look at these other two sentences: “Mike does not drink coffee” and “Mike does drink coffee”. There is only one word of difference between these two sentences. However, the meaning is completely different. Now, for humans like us with squishy organic brains, we can see the similarities and differences. But when you train a model on millions of sentences, you need an automated, structured way to make measurements.

ROUGE and BLEU are two widely used evaluation metrics for different tasks. ROUGE, or Recall-Oriented Understudy for Gisting Evaluation, is primarily used to assess the quality of automatically generated summaries by comparing them with reference summaries created by humans. BLEU, or Bilingual Evaluation Understudy, is an algorithm designed to assess the quality of machine-translated text, once again by comparing it with translations produced by humans.

In the terminology of language, a unigram is a single word, a bigram is a pair of words, and an n-gram is a group of n words.
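As a minimal sketch of these terms, the snippet below extracts n-grams from one of the example sentences. The helper name `ngrams` and the choice to lowercase and split on whitespace are assumptions made for illustration; real evaluation libraries usually apply their own tokenization.

```python
def ngrams(text, n):
    """Return the n-grams in `text` as a list of word tuples.

    Assumption: words are found by lowercasing and splitting on whitespace.
    """
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


sentence = "Mike really loves drinking tea"
print(ngrams(sentence, 1))  # unigrams: one tuple per word
print(ngrams(sentence, 2))  # bigrams: each pair of adjacent words
```

A five-word sentence therefore has five unigrams and four bigrams, and a sentence shorter than n words has no n-grams at all.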

3 ROUGE Scores

Let’s look at a human-generated reference sentence, “It is cold outside”, and a generated output, “It is very cold outside”. Using recall, precision, and F1, you can carry out straightforward metric computations similar to other machine-learning tasks. The recall metric is the number of words or unigrams that match between the reference and the generated output, divided by the number of words or unigrams in the reference. As all four reference words appear in the generated output, recall receives a perfect score of one in this instance.

Precision is the number of unigram matches divided by the output size, here 4/5 = 0.8. The harmonic mean of these two numbers is the F1 score, here about 0.89. These ROUGE-1 scores are very basic metrics that, as the name indicates, only focus on individual words and ignore the word order. This can be misleading: it is easy to generate sentences that score well but are subjectively poor. Imagine for a moment that the model had produced a sentence that differed by only one word, for example “It is not cold outside”. The scores would be identical, even though the meaning is reversed. By considering bigrams, which are groups of two adjacent words, from the reference and generated sentence, you can get a somewhat more informative score.
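The unigram calculation can be sketched in a few lines of Python. This is an illustrative, unclipped version that assumes whitespace tokenization and non-empty sentences; the function name `rouge_1` is my own, and the reference “It is cold outside” and generated output “It is very cold outside” are spelled out here so the example stands on its own. The second candidate, “It is not cold outside”, is an illustrative one-word variation.

```python
def rouge_1(reference, generated):
    """Naive ROUGE-1 without clipping: returns (recall, precision, f1)."""
    ref = reference.lower().split()
    gen = generated.lower().split()
    # Recall: share of reference words that also occur in the generated output.
    recall = sum(word in gen for word in ref) / len(ref)
    # Precision: share of generated words that also occur in the reference.
    precision = sum(word in ref for word in gen) / len(gen)
    if recall + precision == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1


reference = "It is cold outside"
for generated in ("It is very cold outside", "It is not cold outside"):
    recall, precision, f1 = rouge_1(reference, generated)
    # Both candidates print recall=1.00 precision=0.80 f1=0.89
    print(f"{generated!r}: recall={recall:.2f} precision={precision:.2f} f1={f1:.2f}")
```

The two candidates get the same scores even though one of them says the opposite of the reference, which is exactly the weakness of word-level matching.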

Working with pairs of words lets you acknowledge the sentence’s word order in a very basic way. Using bigrams, you can calculate ROUGE-2: recall, precision, and F1 are now computed from bigram matches instead of individual word matches. As you will see, the scores are lower than the ROUGE-1 results. With longer sentences there is a greater chance that bigrams won’t match, which could result in even lower scores. Rather than continuing to increase the n-gram size to three or four, let’s adopt a different strategy.
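The bigram version looks much the same. Again this is only a sketch under the same assumptions (whitespace tokenization, no clipping, hypothetical function names), using the reference “It is cold outside” and the generated output “It is very cold outside”.

```python
def bigrams(text):
    """Return the list of adjacent word pairs in `text`."""
    words = text.lower().split()
    return [tuple(words[i:i + 2]) for i in range(len(words) - 1)]


def rouge_2(reference, generated):
    """Naive ROUGE-2 without clipping: returns (recall, precision, f1)."""
    ref = bigrams(reference)
    gen = bigrams(generated)
    if not ref or not gen:
        return 0.0, 0.0, 0.0
    recall = sum(pair in gen for pair in ref) / len(ref)
    precision = sum(pair in ref for pair in gen) / len(gen)
    if recall + precision == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1


recall, precision, f1 = rouge_2("It is cold outside", "It is very cold outside")
# Two of the three reference bigrams match, and two of the four generated ones.
print(f"recall={recall:.2f} precision={precision:.2f} f1={f1:.2f}")
```

Only “it is” and “cold outside” match, so recall falls to 2/3 and precision to 2/4, both lower than the unigram scores for the same pair.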

Instead, you look for the longest common subsequence (LCS) that appears in both the reference and the generated output. A subsequence keeps the words in order but does not have to be contiguous. In this instance, the longest contiguous matches, “It is” and “cold outside”, each have a length of two, but because skipped words are allowed, the whole reference “It is cold outside” is a common subsequence, so the LCS length under the standard definition is four. The recall, precision, and F1 score can now be calculated using the LCS length as the numerator in both the recall and precision calculations. Collectively, these three quantities are known as the ROUGE-L score. As with all of the ROUGE scores, you must consider the values in context. You can only use the scores to compare the capabilities of different models if they were determined for the same task.
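A rough sketch of the LCS-based calculation follows, using the textbook dynamic-programming algorithm and whitespace tokenization (both are my assumptions, as are the function names). Note that a subsequence keeps word order but may skip words, so for the reference “It is cold outside” and the output “It is very cold outside” this sketch finds a longest common subsequence of length four, whereas the contiguous runs “It is” and “cold outside” each have length two.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two word lists."""
    # table[i][j] holds the LCS length of a[:i] and b[:j].
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, word_a in enumerate(a, 1):
        for j, word_b in enumerate(b, 1):
            if word_a == word_b:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


def rouge_l(reference, generated):
    """ROUGE-L sketch: returns (recall, precision, f1) based on the LCS."""
    ref = reference.lower().split()
    gen = generated.lower().split()
    lcs = lcs_length(ref, gen)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    recall = lcs / len(ref)
    precision = lcs / len(gen)
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1


recall, precision, f1 = rouge_l("It is cold outside", "It is very cold outside")
print(f"recall={recall:.2f} precision={precision:.2f} f1={f1:.2f}")
```

Library implementations may differ in tokenization and in how they weight recall against precision, so treat the numbers from this sketch as illustrative.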

Summarization is one example. ROUGE scores for different tasks are not comparable to one another. As you have seen, one issue with basic ROUGE scores is that a poor completion can yield a good score. Take, for example, a generated output that just repeats one word from the reference, such as “cold cold cold cold”. Because every generated word appears in the reference phrase, this output scores highly even though the same word is repeated many times: its ROUGE-1 precision is perfect. One way to address this problem is a clipping function that limits the number of unigram matches to the maximum count of that unigram within the reference.

There is only one instance of “cold” in the reference in this situation, so a modified precision with a clip on the unigram matches yields a significantly lower score. You will still face a challenge, however, if the generated words are all present in the reference but in a different order. For instance, take a generated sentence such as “outside cold it is”. Because all of the words in the generated output are present in the reference, this sentence scores perfectly even with the modified precision and its clipping function. While experimenting with different ROUGE scores can help, the n-gram size that produces the most useful score depends on the language, the sentence length, and your use case.
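The effect of clipping can be sketched as follows. The two candidate outputs, “cold cold cold cold” and “outside cold it is”, are illustrative examples of a repeated-word output and a reordered output, the function names are hypothetical, and whitespace tokenization is assumed.

```python
from collections import Counter


def unclipped_precision(reference, generated):
    """Share of generated words that occur anywhere in the reference."""
    ref = reference.lower().split()
    gen = generated.lower().split()
    if not gen:
        return 0.0
    return sum(word in ref for word in gen) / len(gen)


def clipped_precision(reference, generated):
    """Like unclipped precision, but each word can match at most as many
    times as it appears in the reference."""
    ref_counts = Counter(reference.lower().split())
    gen_counts = Counter(generated.lower().split())
    total = sum(gen_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[word]) for word, count in gen_counts.items())
    return clipped / total


reference = "It is cold outside"
print(unclipped_precision(reference, "cold cold cold cold"))  # 1.0
print(clipped_precision(reference, "cold cold cold cold"))    # 0.25
print(clipped_precision(reference, "outside cold it is"))     # 1.0
```

Clipping removes the reward for repetition, but the reordered sentence still gets a perfect unigram precision, since word order is not considered at all.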

You should be aware that many language model libraries, including those from Hugging Face, provide implementations of the ROUGE score that you can use to quickly assess the output of your model.

4 BLEU Scores

The BLEU score, which stands for bilingual evaluation understudy, is another metric that can help in assessing the performance of your model. As a reminder, the BLEU score is used to assess the quality of machine-translated text. The score itself is calculated as the average precision over multiple n-gram sizes, much like the ROUGE-1 precision that we looked at earlier, but computed for a range of n-gram sizes and then averaged. Let’s examine this measurement’s characteristics and methodology in more detail. The BLEU score quantifies the quality of a translation by checking how many n-grams in the machine-generated translation match those in the reference translation.

To calculate the score, you average precision across a range of different n-gram sizes. If you were to calculate this by hand, you would carry out multiple calculations and then average all of the results to find the BLEU score. For this example, let’s take a look at a longer sentence so that you can get a better sense of the score’s value. Since you have already seen these individual calculations in depth for ROUGE, I will show you the results of BLEU using a standard library.

Calculating the BLEU score is easy with pre-written libraries from providers like Hugging Face, and I’ve done just that for each of our candidate sentences. The first candidate is “I am very happy that I am drinking a cup of tea”. Its BLEU score is 0.495. As the candidates get closer and closer to the reference sentence, the score gets closer and closer to one.
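To give a feel for what such a library computes, here is a simplified single-reference BLEU sketch. It follows the commonly used formulation, which combines clipped n-gram precisions for n = 1 to 4 with a geometric mean and multiplies by a brevity penalty for short candidates. Whitespace tokenization, the absence of smoothing, the function names, and the example sentences are all assumptions of this sketch, so its numbers will not necessarily match a library’s output or the score quoted above.

```python
import math
from collections import Counter


def ngram_counts(words, n):
    """Count the n-grams in a list of words."""
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def bleu(reference, candidate, max_n=4):
    """Simplified BLEU against one reference, without smoothing."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    log_precisions = []
    for n in range(1, max_n + 1):
        ref_counts = ngram_counts(ref, n)
        cand_counts = ngram_counts(cand, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count to its count in the reference.
        clipped = sum(min(c, ref_counts[gram]) for gram, c in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, one empty level zeroes the score
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: penalise candidates shorter than the reference.
    if len(cand) >= len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the n-gram precisions.
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


reference = "the cat sat on the mat today"  # illustrative sentences
print(bleu(reference, reference))                  # identical text scores 1.0
print(bleu(reference, "the cat sat on the mat"))   # shorter candidate is penalised
```

In this toy pair every n-gram of the shorter candidate appears in the reference, so the only deduction comes from the brevity penalty.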

Both ROUGE and BLEU are quite simple metrics and are inexpensive to calculate. You can use them as quick references while you iterate over your models, but you shouldn’t rely on them alone to report the final evaluation of a large language model. Use ROUGE for diagnostic evaluation of summarization tasks and BLEU for translation tasks. To assess the overall performance of your model, however, you should use one of the evaluation benchmarks that researchers have developed.

5 Benchmarks for Evaluation

Because LLMs are complex, simple evaluation metrics like the ROUGE and BLEU scores can only tell you so much about the capabilities of your model. To measure and compare LLMs more comprehensively, you can use pre-existing datasets and associated benchmarks that LLM researchers have established specifically for this purpose. Choosing the right evaluation dataset is crucial if you want to evaluate an LLM’s performance accurately and understand its true capabilities. You’ll find it helpful to choose datasets that isolate particular model skills, such as reasoning or common-sense knowledge, as well as those that focus on potential risks, such as disinformation or copyright infringement.

An important issue to consider is whether the model has seen your evaluation data during training. By assessing the model’s performance on data that it has never seen before, you’ll gain a more accurate and meaningful understanding of its capabilities. Benchmarks like GLUE, SuperGLUE, or HELM cover a wide range of tasks and scenarios. They do this by designing or collecting datasets that test specific aspects of an LLM. GLUE, or General Language Understanding Evaluation, was introduced in 2018.

GLUE is a collection of natural language tasks, including sentiment analysis and question-answering. You can use the benchmark to measure and compare model performance. GLUE was created to encourage the development of models that can generalise across multiple tasks. SuperGLUE, a successor to GLUE, was introduced in 2019 to address the limitations of its predecessor. It consists of a series of tasks, some of which are more challenging versions of earlier tasks and others of which are not included in GLUE. SuperGLUE includes tasks such as multi-sentence reasoning and reading comprehension. Both the GLUE and SuperGLUE benchmarks have leaderboards that can be used to compare and contrast evaluated models.

The benchmarks’ results pages are another excellent resource for tracking the progress of LLMs. As models get larger, their performance on benchmarks like SuperGLUE starts to match human performance on specific tasks. In other words, models are able to perform as well as humans on the benchmark tests, but subjectively we can see that they are not performing at a human level on tasks in general. There is essentially an arms race between the emergent properties of LLMs and the benchmarks that aim to measure them. Here are a few recent benchmarks that are pushing LLMs further. Massive Multitask Language Understanding, or MMLU, was designed specifically for modern LLMs.

To perform well, models must have strong problem-solving skills and a broad knowledge of the world. Models are tested on a variety of subjects, including elementary mathematics, US history, computer science, and law. In other words, these are tasks that go well beyond basic language understanding. BIG-bench currently consists of 204 tasks, covering topics including linguistics, child development, math, common-sense reasoning, biology, physics, social bias, software development, and more. BIG-bench comes in three different sizes, partly to keep costs achievable, since running such large benchmarks can incur high inference costs. A final benchmark you should know about is the Holistic Evaluation of Language Models, or HELM.

The HELM framework aims to improve the transparency of models and to offer guidance on which models perform well for specific tasks. HELM takes a multimetric approach, measuring seven metrics across 16 core scenarios, so that trade-offs between models and metrics are clearly exposed. A key feature of HELM is that it assesses metrics beyond basic accuracy measures, such as precision or the F1 score.

The benchmark also includes metrics for fairness, bias, and toxicity, which are becoming increasingly important to assess as LLMs become more capable of human-like language generation and, in turn, of exhibiting potentially harmful behaviour. HELM is a living benchmark that aims to evolve continuously with the addition of new scenarios, metrics, and models. You can explore the LLMs that have been evaluated on the results page and look at any scores that are relevant to the requirements of your project.

6 LLM Based Evaluation

More recently, some have highlighted the very obvious drawbacks of the BLEU and ROUGE approaches, and of course even fixed benchmarks have drawbacks, as they struggle to keep pace with rapid developments in LLMs. Some have suggested a radically new approach, which is to use LLMs themselves to evaluate the outputs of other LLMs.

As Ehud Reiter highlights in a recent article:

‘…At the time of writing, there is a lot of excitement about using LLMs to evaluate generated texts. I was especially impressed by a recent paper by Kocmi and Federman, which showed that GPT 3.5 could evaluate machine translation texts better than existing metrics, using straightforward prompts, no examples, and no reference texts. I know a lot of other people are exploring this space, and it seems plausible to me that LLM-based evaluation could replace BLEU, BLEURT, lower-quality human evaluations, etc. Which I think overall would be a good thing, maybe we’ll finally see the end of BLEU…..maybe I’m being naive, but I especially hope that we will finally end the use of BLEU (and ROUGE). I’ve been complaining about BLEU for almost 20 years (paper), as have others. There is no scientific justification for its use in 2023, but unfortunately many researchers are reluctant to change. Sometimes a “shock” is the best agent of change, and perhaps LLMs can provide such a shock!’

I have myself illustrated how this could be done in an earlier project, where I used LangChain and an LLM to evaluate another LLM application, and where I argued:

‘…So if we were going to do some kind of string matching for evaluation or one based on similar words such as the NLP text similarity metric BLEU score it would not work because the similarity is not based on superficial aspects of language such as words but deeper aspects of language such as meaning. And this is exactly the kind of understanding that language models can do, which are not based on any kind of specific rule. This is what makes evaluation of language models so hard in the first place, but ironically enables us to use language models to solve it. This makes previous NLP evaluation metrics such as the BLEU score inadequate for evaluating these more complex models, so we need to invent new ones such as this method - which is one of the most popular methods currently.’
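The general shape of LLM-based evaluation can be sketched as below. This is not the LangChain implementation from the earlier project; it is a hypothetical outline in which `ask_llm` stands for any function that sends a prompt to a grading model and returns its text reply, and `fake_llm` is a stand-in so that the example runs without a model. The prompt wording and function names are my own assumptions.

```python
def build_grading_prompt(question, reference_answer, model_answer):
    """Build a prompt asking a grader model to compare meaning, not wording."""
    return (
        "You are grading an answer for meaning, not exact wording.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference_answer}\n"
        f"Model answer: {model_answer}\n"
        "Reply with exactly one word: CORRECT or INCORRECT."
    )


def grade(question, reference_answer, model_answer, ask_llm):
    """Return True if the grader model judges the answer to be correct."""
    reply = ask_llm(build_grading_prompt(question, reference_answer, model_answer))
    return reply.strip().upper().startswith("CORRECT")


def fake_llm(prompt):
    """Stand-in grader that always approves; replace with a real model call."""
    return "CORRECT"


print(grade(
    "What does Mike like to drink?",
    "Mike really loves drinking tea.",
    "Mike adores sipping tea.",
    fake_llm,
))
```

Because the grader compares meaning, a paraphrase such as “Mike adores sipping tea” can be accepted even though it shares few words with the reference, which is where n-gram metrics struggle. In practice the grader’s own reliability also needs checking.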

7 Acknowledgements

I’d like to express my thanks to the wonderful Generative AI with Large Language Models Course by DeepLearning.ai and AWS, which I completed, and acknowledge the use of some images and other materials from the course in this article.