Evaluating the Outputs of Large Language Model Applications Against Clear Criteria

Here we look at some best practices for evaluating the outputs of an LLM application when you have a clear sense of the right output, to help us know, both before and after deployment, how well it's working
natural-language-processing
deep-learning
openai
llm-evaluation
Author

Pranath Fernando

Published

June 25, 2023

1 Introduction

Large language models such as ChatGPT can generate text responses based on a given prompt or input. Writing prompts allow users to guide the language model’s output by providing a specific context or topic for the response. This feature has many practical applications, such as generating creative writing prompts, assisting in content creation, and even aiding in customer service chatbots.

In earlier articles I've looked at how you can use ChatGPT to solve some of these tasks with simple prompts. But in many use cases, what is required is not just one prompt but a sequence of prompts, where we also need to check the outputs at each stage before producing a final output - for example, in a customer service chatbot.

We have also seen previously how to build an LLM application end to end: evaluating the inputs, processing them, and checking the final output before showing it to the user. Once you've built such a system, how do you know whether it's working? And once you deploy it and let users interact with it, how can you track how it's doing, find any shortcomings, and continue to improve the quality of its answers?

In this article, we will look at some best practices for evaluating the outputs of an LLM when we have a clearer sense of the outputs we want, and show what it feels like to build one of these systems.

2 Setup

First we need to load some Python libraries and connect to the OpenAI API.

The OpenAI API library needs to be configured with your account's secret key, which is available on the OpenAI website.

You can either set it as the OPENAI_API_KEY environment variable before using the library: !export OPENAI_API_KEY='sk-...'

Or, set openai.api_key to its value:

import openai
openai.api_key = "sk-..."
import os
import openai
import sys
sys.path.append('../..')
import utils
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

openai.api_key  = os.environ['OPENAI_API_KEY']
# Define helper function
def get_completion_from_messages(messages, model="gpt-3.5-turbo", temperature=0, max_tokens=500):
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=temperature, 
        max_tokens=max_tokens, 
    )
    return response.choices[0].message["content"]
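
As a quick sanity check of this helper, you can call it directly with a list of chat messages. This is a minimal sketch; the message content below is purely illustrative and not part of the application built later in the article.

# Minimal usage sketch of the helper defined above (illustrative content only).
messages = [
    {'role': 'system', 'content': "You are a helpful assistant."},
    {'role': 'user', 'content': "Reply with the single word OK."},
]
print(get_completion_from_messages(messages))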

3 Best Practices for Evaluating Large Language Models

3.1 Differences between traditional machine learning and LLM-based applications

A significant difference between evaluating LLM-based applications and evaluating conventional supervised learning applications is that, because LLM applications can be built so quickly, the evaluation methodology frequently does not begin with a test set. Instead, you often find yourself progressively assembling a collection of test examples. Let's examine what this implies.

Recall from earlier articles how prompt-based development reduces the key phases of model development from possibly months to only a few minutes, hours, or at most a few days. In the conventional supervised learning approach, the incremental cost of collecting an additional 1,000 test examples isn't that high if you already needed to gather, say, 10,000 labelled examples. Therefore, it was common practice in the classic supervised learning setting to gather a training set, a development set (or hold-out cross-validation set) and a test set, and then draw on all of those throughout the development process.

However, if you can write a prompt in a matter of minutes and get something up and running in a matter of hours, it would seem like a major inconvenience to stop for a considerable amount of time to gather 1,000 test examples, given that you can now get something working with no training examples at all. So this is what it typically feels like to develop an application using an LLM: you first tune the prompt on just a handful of examples, perhaps one, three, or five, and try to find a prompt that works for all of them. Then, as you put the system through additional testing, you encounter a few challenging examples that the prompt does not handle.

In that scenario, you take these extra one, two, three, or five examples and add them to the set that you're testing against. Once you've added enough of these examples to your gradually expanding development set, it gets a little tedious to manually run every example through the prompt each time the prompt changes.

At that point, you start defining metrics to gauge performance on this small set of examples, such as average accuracy.
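
As a minimal sketch of that idea (the names here are my own and hypothetical; a full implementation against a real development set appears later in this article):

# Hypothetical sketch: a gradually growing development set plus an average-accuracy metric.
dev_set = [
    {'customer_msg': "Which TV can I buy if I'm on a budget?", 'ideal': "..."},
    # append each new tricky example you come across to this list
]

def average_accuracy(dev_set, run_prompt, score_fn):
    # run_prompt(msg) calls your prompt; score_fn(response, ideal) returns 1 or 0
    scores = [score_fn(run_prompt(ex['customer_msg']), ex['ideal']) for ex in dev_set]
    return sum(scores) / len(scores)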

3.2 Iterative LLM testing

An interesting aspect of this process is that you can stop at any stage if you decide the system is working well enough. In fact, many deployed applications stop at the first or second stage and continue to work perfectly well. If your hand-built development set isn't giving you enough confidence in the system's performance, the next stage is to collect a randomly sampled set of examples to tune the model against.

Since it would be normal practice to keep tailoring your prompt to this set, it remains a development set (or hold-out cross-validation set). Only if you require an even more unbiased estimate of the system's performance should you go further and collect a hold-out test set that you don't look at at all while tuning the model. This later stage matters most when, for example, your system gives the correct answer 91% of the time and you want to tune it to reach 92% or 93%; to measure a difference that small reliably, you need a considerably larger sample.
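
To see why, a quick back-of-the-envelope calculation: the standard error of an accuracy p measured on n examples is roughly sqrt(p(1-p)/n), so telling 91% apart from 93% (a two-point gap) needs that error well under a percentage point, which a handful of examples cannot give you.

import math

# Standard error of a measured accuracy p over n evaluation examples.
def accuracy_std_error(p, n):
    return math.sqrt(p * (1 - p) / n)

for n in [10, 100, 1000]:
    print(n, round(accuracy_std_error(0.91, n), 3))
# 10 examples -> ~0.09 (9 points), 100 -> ~0.029, 1000 -> ~0.009:
# only the larger samples get the error small enough to see a ~2-point difference.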

The only time you would need to collect a hold-out test set in addition to the development set is when you truly need an unbiased, fair assessment of how the system is performing. One important caveat: everything above assumes applications where the risk of harm from an occasional incorrect answer is low.

However, it goes without saying that for any high-stakes applications, if there is a possibility of bias or an inappropriate output harming someone, the responsibility to gather a test set and rigorously evaluate your system’s performance to ensure it is acting correctly before you use it becomes much more important.

If, on the other hand, you are using it, for instance, to summarise articles solely for your own reading and no one else's, then the risk of harm is less significant, and you can stop this process early without incurring the cost of the later stages of gathering larger datasets on which to evaluate your system.

4 Get the relevant products and categories

So for our example, we will start with the usual helper functions, and use the utils module to get a list of products and categories.

The utils Python module and JSON file used for this example can be found in this GitHub location.

Here is the list of products and categories that are in the product catalog.

products_and_category = utils.get_products_and_category()
products_and_category
{'Computers and Laptops': ['TechPro Ultrabook',
  'BlueWave Gaming Laptop',
  'PowerLite Convertible',
  'TechPro Desktop',
  'BlueWave Chromebook'],
 'Smartphones and Accessories': ['SmartX ProPhone',
  'MobiTech PowerCase',
  'SmartX MiniPhone',
  'MobiTech Wireless Charger',
  'SmartX EarBuds'],
 'Televisions and Home Theater Systems': ['CineView 4K TV',
  'SoundMax Home Theater',
  'CineView 8K TV',
  'SoundMax Soundbar',
  'CineView OLED TV'],
 'Gaming Consoles and Accessories': ['GameSphere X',
  'ProGamer Controller',
  'GameSphere Y',
  'ProGamer Racing Wheel',
  'GameSphere VR Headset'],
 'Audio Equipment': ['AudioPhonic Noise-Canceling Headphones',
  'WaveSound Bluetooth Speaker',
  'AudioPhonic True Wireless Earbuds',
  'WaveSound Soundbar',
  'AudioPhonic Turntable'],
 'Cameras and Camcorders': ['FotoSnap DSLR Camera',
  'ActionCam 4K',
  'FotoSnap Mirrorless Camera',
  'ZoomMaster Camcorder',
  'FotoSnap Instant Camera']}

The category “Computers and Laptops” contains a list of computers and laptops, “Smartphones and Accessories” contains a list of smartphones and accessories, and so on for the other categories. The task we’re going to tackle is to extract the relevant categories and products so that we have the information needed to respond to a user query such as, “What TV can I buy if I’m on a budget?”

5 Find relevant product and category names

The prompt below describes a set of instructions and also gives the language model one example of a desirable output. Because we include a user message plus an assistant message to show it one example of a suitable response, this is often referred to as few-shot prompting, or technically one-shot prompting. In that example the user says, “I want the most expensive computer.”; we have no pricing information, so the desired response is simply to return all the computers. We then use this prompt with customer message 0, “Which TV can I buy if I’m on a budget?”, passing in the customer message together with the products and categories - the data that the utils function retrieved at the top.

This could be the version that is running in production.

def find_category_and_product_v1(user_input,products_and_category):

    delimiter = "####"
    system_message = f"""
    You will be provided with customer service queries. \
    The customer service query will be delimited with {delimiter} characters.
    Output a python list of json objects, where each object has the following format:
        'category': <one of Computers and Laptops, Smartphones and Accessories, Televisions and Home Theater Systems, \
    Gaming Consoles and Accessories, Audio Equipment, Cameras and Camcorders>,
    AND
        'products': <a list of products that must be found in the allowed products below>


    Where the categories and products must be found in the customer service query.
    If a product is mentioned, it must be associated with the correct category in the allowed products list below.
    If no products or categories are found, output an empty list.
    

    List out all products that are relevant to the customer service query based on how closely it relates
    to the product name and product category.
    Do not assume, from the name of the product, any features or attributes such as relative quality or price.

    The allowed products are provided in JSON format.
    The keys of each item represent the category.
    The values of each item is a list of products that are within that category.
    Allowed products: {products_and_category}
    

    """
    
    few_shot_user_1 = """I want the most expensive computer."""
    few_shot_assistant_1 = """ 
    [{'category': 'Computers and Laptops', \
'products': ['TechPro Ultrabook', 'BlueWave Gaming Laptop', 'PowerLite Convertible', 'TechPro Desktop', 'BlueWave Chromebook']}]
    """
    
    messages =  [  
    {'role':'system', 'content': system_message},    
    {'role':'user', 'content': f"{delimiter}{few_shot_user_1}{delimiter}"},  
    {'role':'assistant', 'content': few_shot_assistant_1 },
    {'role':'user', 'content': f"{delimiter}{user_input}{delimiter}"},  
    ] 
    return get_completion_from_messages(messages)

6 Evaluate on some queries

For the first query below, it lists the relevant information under the category “Televisions and Home Theater Systems”: a list of TVs and home theater systems that seem relevant. To see how well the prompt is doing, we can evaluate it on a second query, where the customer says, “I need a charger for my smartphone.”; it correctly retrieves the category “Smartphones and Accessories” and lists the relevant products. And here’s a third one, “What computers do you have?”, where we hope to retrieve the list of computers.

customer_msg_0 = f"""Which TV can I buy if I'm on a budget?"""

products_by_category_0 = find_category_and_product_v1(customer_msg_0,
                                                      products_and_category)
print(products_by_category_0)
    [{'category': 'Televisions and Home Theater Systems', 'products': ['CineView 4K TV', 'SoundMax Home Theater', 'CineView 8K TV', 'SoundMax Soundbar', 'CineView OLED TV']}]
customer_msg_1 = f"""I need a charger for my smartphone"""

products_by_category_1 = find_category_and_product_v1(customer_msg_1,
                                                      products_and_category)
print(products_by_category_1)
    [{'category': 'Smartphones and Accessories', 'products': ['MobiTech PowerCase', 'MobiTech Wireless Charger', 'SmartX EarBuds']}]
customer_msg_2 = f"""
What computers do you have?"""

products_by_category_2 = find_category_and_product_v1(customer_msg_2,
                                                      products_and_category)
products_by_category_2
"    [{'category': 'Computers and Laptops', 'products': ['TechPro Ultrabook', 'BlueWave Gaming Laptop', 'PowerLite Convertible', 'TechPro Desktop', 'BlueWave Chromebook']}]"

Here I have three customer queries, and if you’re developing this prompt for the first time, it would be reasonable to have one, two, or three examples like this and keep tuning the prompt until it produces the right results - that is, until it retrieves the appropriate products and categories for all of them, in this case all three. If the prompt had been deficient in any way, such as missing some products, we would go back and revise it a few times until it was accurate on all three queries. Once you’ve reached this stage, you might start putting the system through further testing.

Eventually you run across a query that the prompt fails on. Here’s an example: “tell me about the smartx pro phone and the fotosnap camera, the dslr one. Also, what TVs do you have?”.

customer_msg_3 = f"""
tell me about the smartx pro phone and the fotosnap camera, the dslr one.
Also, what TVs do you have?"""

products_by_category_3 = find_category_and_product_v1(customer_msg_3,
                                                      products_and_category)
print(products_by_category_3)
    [{'category': 'Smartphones and Accessories', 'products': ['SmartX ProPhone']},
     {'category': 'Cameras and Camcorders', 'products': ['FotoSnap DSLR Camera']},
     {'category': 'Televisions and Home Theater Systems', 'products': ['CineView 4K TV', 'SoundMax Home Theater', 'CineView 8K TV', 'SoundMax Soundbar', 'CineView OLED TV']}]
     
    Note: The query mentions "smartx pro phone" and "fotosnap camera, the dslr one", so the output includes the relevant categories and products. The query also asks about TVs, so the relevant category is included in the output.

So even though it produces the correct data for this query, it also appends extra text and explanation that we don’t need, which makes parsing the output into a Python list of dictionaries harder. We don’t want it producing this extra junk. Standard practice, when the system fails on an example like this, is simply to note it down as a hard example and add it to the list of examples we’ll use to systematically test the system.
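
For reference, when the model does return only the clean JSON-like list (with single quotes, as in the earlier examples), it can be parsed into Python objects straightforwardly. The sketch below uses ast.literal_eval, while the evaluation function later in this article takes the json.loads route after swapping the quotes.

import ast

# Sketch: parse a clean, Python-literal-style model response into a list of dicts.
response_str = """[{'category': 'Smartphones and Accessories', 'products': ['MobiTech PowerCase', 'MobiTech Wireless Charger', 'SmartX EarBuds']}]"""
parsed = ast.literal_eval(response_str.strip())
print(parsed[0]['category'])   # Smartphones and Accessories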

If you keep running the system a little longer, you may well come across further cases like this. We adapted the prompt to three examples, but that doesn’t guarantee it will work on every input, and you may stumble, more or less by accident, on another query where it fails - as in the next example, where the system again outputs unwanted extra text after the requested JSON.

7 Harder test cases

Let's look at another query of the kind you might find in production, where the model is not working as expected.

customer_msg_4 = f"""
tell me about the CineView TV, the 8K one, Gamesphere console, the X one.
I'm on a budget, what computers do you have?"""

products_by_category_4 = find_category_and_product_v1(customer_msg_4,
                                                      products_and_category)
print(products_by_category_4)
    [{'category': 'Televisions and Home Theater Systems', 'products': ['CineView 8K TV']},
     {'category': 'Gaming Consoles and Accessories', 'products': ['GameSphere X']},
     {'category': 'Computers and Laptops', 'products': ['BlueWave Chromebook']}]
     
    Note: The CineView TV mentioned is the 8K one, and the Gamesphere console mentioned is the X one. 
    For the computer category, since the customer mentioned being on a budget, we cannot determine which specific product to recommend. 
    Therefore, we have included all the products in the Computers and Laptops category in the output.

At this point, you may have run this prompt on hundreds of examples, or had test users try it out, but you would only keep the tricky examples where it performed poorly. That gives me a set of five examples, numbered 0 through 4, which we can use to further improve the prompt. In both of the last two cases the LLM generated extra text that we don’t need, so after some trial and error you might change the prompt as follows.

8 Modify the prompt to work on the hard test cases

So here’s a new prompt, prompt v2. What we did was add the instruction “Do not output any additional text that is not in JSON format.”, to emphasize that it should not output anything besides the JSON. We also added a second few-shot example, using a user and assistant message pair, where the user asks for the cheapest computer. In both few-shot examples, we demonstrate to the system a response that contains only JSON output. So the prompt now includes that extra instruction, and we use “few_shot_user_1”, “few_shot_assistant_1”, and “few_shot_user_2”, “few_shot_assistant_2” to give it the two few-shot examples.

def find_category_and_product_v2(user_input,products_and_category):
    """
    Added: Do not output any additional text that is not in JSON format.
    Added a second example (for few-shot prompting) where user asks for 
    the cheapest computer. In both few-shot examples, the shown response 
    is the full list of products in JSON only.
    """
    delimiter = "####"
    system_message = f"""
    You will be provided with customer service queries. \
    The customer service query will be delimited with {delimiter} characters.
    Output a python list of json objects, where each object has the following format:
        'category': <one of Computers and Laptops, Smartphones and Accessories, Televisions and Home Theater Systems, \
    Gaming Consoles and Accessories, Audio Equipment, Cameras and Camcorders>,
    AND
        'products': <a list of products that must be found in the allowed products below>
    Do not output any additional text that is not in JSON format.
    Do not write any explanatory text after outputting the requested JSON.


    Where the categories and products must be found in the customer service query.
    If a product is mentioned, it must be associated with the correct category in the allowed products list below.
    If no products or categories are found, output an empty list.
    

    List out all products that are relevant to the customer service query based on how closely it relates
    to the product name and product category.
    Do not assume, from the name of the product, any features or attributes such as relative quality or price.

    The allowed products are provided in JSON format.
    The keys of each item represent the category.
    The values of each item is a list of products that are within that category.
    Allowed products: {products_and_category}
    

    """
    
    few_shot_user_1 = """I want the most expensive computer. What do you recommend?"""
    few_shot_assistant_1 = """ 
    [{'category': 'Computers and Laptops', \
'products': ['TechPro Ultrabook', 'BlueWave Gaming Laptop', 'PowerLite Convertible', 'TechPro Desktop', 'BlueWave Chromebook']}]
    """
    
    few_shot_user_2 = """I want the cheapest computer. What do you recommend?"""
    few_shot_assistant_2 = """ 
    [{'category': 'Computers and Laptops', \
'products': ['TechPro Ultrabook', 'BlueWave Gaming Laptop', 'PowerLite Convertible', 'TechPro Desktop', 'BlueWave Chromebook']}]
    """
    
    messages =  [  
    {'role':'system', 'content': system_message},    
    {'role':'user', 'content': f"{delimiter}{few_shot_user_1}{delimiter}"},  
    {'role':'assistant', 'content': few_shot_assistant_1 },
    {'role':'user', 'content': f"{delimiter}{few_shot_user_2}{delimiter}"},  
    {'role':'assistant', 'content': few_shot_assistant_2 },
    {'role':'user', 'content': f"{delimiter}{user_input}{delimiter}"},  
    ] 
    return get_completion_from_messages(messages)

8.1 Evaluate the modified prompt on the hard tests cases

If you manually run this prompt on each of the five example user inputs, including the one that previously produced broken output, you will find that it now produces the desired result. In particular, re-running the updated prompt v2 on the customer message that previously produced extra text after the JSON now gives a clean output.

customer_msg_3 = f"""
tell me about the smartx pro phone and the fotosnap camera, the dslr one.
Also, what TVs do you have?"""

products_by_category_3 = find_category_and_product_v2(customer_msg_3,
                                                      products_and_category)
print(products_by_category_3)
    [{'category': 'Smartphones and Accessories', 'products': ['SmartX ProPhone']}, {'category': 'Cameras and Camcorders', 'products': ['FotoSnap DSLR Camera']}, {'category': 'Televisions and Home Theater Systems', 'products': ['CineView 4K TV', 'SoundMax Home Theater', 'CineView 8K TV', 'SoundMax Soundbar', 'CineView OLED TV']}]

8.2 Regression testing: verify that the model still works on previous test cases

Let’s check that modifying the model to fix the hard test cases does not negatively affect its performance on previous test cases.

Of course, when you modify the prompt, it’s also useful to do a bit of regression testing to make sure that fixing the incorrect outputs on examples 3 and 4 didn’t break the output on example 0. You can do this by copy-pasting the five customer messages, 0 through 4, into a Jupyter notebook, running them, and manually checking the outputs: “Yep, category Televisions and Home Theater Systems, products - looks like it got all of them.”.

customer_msg_0 = f"""Which TV can I buy if I'm on a budget?"""

products_by_category_0 = find_category_and_product_v2(customer_msg_0,
                                                      products_and_category)
print(products_by_category_0)
    [{'category': 'Televisions and Home Theater Systems', 'products': ['CineView 4K TV', 'SoundMax Home Theater', 'CineView 8K TV', 'SoundMax Soundbar', 'CineView OLED TV']}]

9 Gather development set for automated testing

But it’s a little painful to do this manually - to inspect each output by eye and confirm that it is exactly right. So once the development set you’re tuning against grows beyond a small handful of examples, it becomes useful to start automating the testing process. Below is a set of 10 examples, each specifying a customer message, such as “Which TV can I buy if I’m on a budget?”, together with the ideal answer. Think of this as the right answer in the test set - or really the development set, because we’re actually tuning against it. The 10 examples are indexed 0 through 9; in the last one the user says, “I would like a hot tub time machine.”, for which we have no relevant products, so the ideal answer is the empty list.

If you then want to automatically evaluate what the prompt is doing on any of these 10 examples, a function to do so is defined in the next section; it’s a fairly long function.

msg_ideal_pairs_set = [
    
    # eg 0
    {'customer_msg':"""Which TV can I buy if I'm on a budget?""",
     'ideal_answer':{
        'Televisions and Home Theater Systems':set(
            ['CineView 4K TV', 'SoundMax Home Theater', 'CineView 8K TV', 'SoundMax Soundbar', 'CineView OLED TV']
        )}
    },

    # eg 1
    {'customer_msg':"""I need a charger for my smartphone""",
     'ideal_answer':{
        'Smartphones and Accessories':set(
            ['MobiTech PowerCase', 'MobiTech Wireless Charger', 'SmartX EarBuds']
        )}
    },
    # eg 2
    {'customer_msg':f"""What computers do you have?""",
     'ideal_answer':{
           'Computers and Laptops':set(
               ['TechPro Ultrabook', 'BlueWave Gaming Laptop', 'PowerLite Convertible', 'TechPro Desktop', 'BlueWave Chromebook'
               ])
                }
    },

    # eg 3
    {'customer_msg':f"""tell me about the smartx pro phone and \
    the fotosnap camera, the dslr one.\
    Also, what TVs do you have?""",
     'ideal_answer':{
        'Smartphones and Accessories':set(
            ['SmartX ProPhone']),
        'Cameras and Camcorders':set(
            ['FotoSnap DSLR Camera']),
        'Televisions and Home Theater Systems':set(
            ['CineView 4K TV', 'SoundMax Home Theater','CineView 8K TV', 'SoundMax Soundbar', 'CineView OLED TV'])
        }
    }, 
    
    # eg 4
    {'customer_msg':"""tell me about the CineView TV, the 8K one, Gamesphere console, the X one.
I'm on a budget, what computers do you have?""",
     'ideal_answer':{
        'Televisions and Home Theater Systems':set(
            ['CineView 8K TV']),
        'Gaming Consoles and Accessories':set(
            ['GameSphere X']),
        'Computers and Laptops':set(
            ['TechPro Ultrabook', 'BlueWave Gaming Laptop', 'PowerLite Convertible', 'TechPro Desktop', 'BlueWave Chromebook'])
        }
    },
    
    # eg 5
    {'customer_msg':f"""What smartphones do you have?""",
     'ideal_answer':{
           'Smartphones and Accessories':set(
               ['SmartX ProPhone', 'MobiTech PowerCase', 'SmartX MiniPhone', 'MobiTech Wireless Charger', 'SmartX EarBuds'
               ])
                    }
    },
    # eg 6
    {'customer_msg':f"""I'm on a budget.  Can you recommend some smartphones to me?""",
     'ideal_answer':{
        'Smartphones and Accessories':set(
            ['SmartX EarBuds', 'SmartX MiniPhone', 'MobiTech PowerCase', 'SmartX ProPhone', 'MobiTech Wireless Charger']
        )}
    },

    # eg 7 # this will output a subset of the ideal answer
    {'customer_msg':f"""What Gaming consoles would be good for my friend who is into racing games?""",
     'ideal_answer':{
        'Gaming Consoles and Accessories':set([
            'GameSphere X',
            'ProGamer Controller',
            'GameSphere Y',
            'ProGamer Racing Wheel',
            'GameSphere VR Headset'
     ])}
    },
    # eg 8
    {'customer_msg':f"""What could be a good present for my videographer friend?""",
     'ideal_answer': {
        'Cameras and Camcorders':set([
        'FotoSnap DSLR Camera', 'ActionCam 4K', 'FotoSnap Mirrorless Camera', 'ZoomMaster Camcorder', 'FotoSnap Instant Camera'
        ])}
    },
    
    # eg 9
    {'customer_msg':f"""I would like a hot tub time machine.""",
     'ideal_answer': []
    }
    
]

10 Evaluate test cases by comparing to the ideal answers

To use it, we take a customer message - for example 0 this is “Which TV can I buy if I’m on a budget?” - and print out the ideal answer, which lists all the TVs we want the prompt to retrieve. We then call prompt v2 on that customer message together with the products-and-category information, print the response, and pass it to the evaluation function.

To determine how closely the response matches the ideal answer, we use the eval_response_with_ideal function defined below. On example 0 it returns the desired category and the whole list of products, so it receives a score of 1.0. To show one more case, I happen to know that example 7 is where it goes wrong, so below is what happens if we change the index from 0 to 7 and run it.

import json
def eval_response_with_ideal(response,
                              ideal,
                              debug=False):
    
    if debug:
        print("response")
        print(response)
    
    # json.loads() expects double quotes, not single quotes
    json_like_str = response.replace("'",'"')
    
    # parse into a list of dictionaries
    l_of_d = json.loads(json_like_str)
    
    # special case when response is empty list
    if l_of_d == [] and ideal == []:
        return 1
    
    # otherwise, response is empty 
    # or ideal should be empty, there's a mismatch
    elif l_of_d == [] or ideal == []:
        return 0
    
    correct = 0    
    
    if debug:
        print("l_of_d is")
        print(l_of_d)
    for d in l_of_d:

        cat = d.get('category')
        prod_l = d.get('products')
        if cat and prod_l:
            # convert list to set for comparison
            prod_set = set(prod_l)
            # get ideal set of products
            ideal_cat = ideal.get(cat)
            if ideal_cat:
                prod_set_ideal = set(ideal.get(cat))
            else:
                if debug:
                    print(f"did not find category {cat} in ideal")
                    print(f"ideal: {ideal}")
                continue
                
            if debug:
                print("prod_set\n",prod_set)
                print()
                print("prod_set_ideal\n",prod_set_ideal)

            if prod_set == prod_set_ideal:
                if debug:
                    print("correct")
                correct +=1
            else:
                print("incorrect")
                print(f"prod_set: {prod_set}")
                print(f"prod_set_ideal: {prod_set_ideal}")
                if prod_set <= prod_set_ideal:
                    print("response is a subset of the ideal answer")
                elif prod_set >= prod_set_ideal:
                    print("response is a superset of the ideal answer")

    # count correct over total number of items in list
    pc_correct = correct / len(l_of_d)
        
    return pc_correct
print(f'Customer message: {msg_ideal_pairs_set[7]["customer_msg"]}')
print(f'Ideal answer: {msg_ideal_pairs_set[7]["ideal_answer"]}')
Customer message: What Gaming consoles would be good for my friend who is into racing games?
Ideal answer: {'Gaming Consoles and Accessories': {'GameSphere X', 'GameSphere Y', 'GameSphere VR Headset', 'ProGamer Racing Wheel', 'ProGamer Controller'}}
response = find_category_and_product_v2(msg_ideal_pairs_set[7]["customer_msg"],
                                         products_and_category)
print(f'Response: {response}')

eval_response_with_ideal(response,
                              msg_ideal_pairs_set[7]["ideal_answer"])
Response:     [{'category': 'Gaming Consoles and Accessories', 'products': ['ProGamer Controller', 'ProGamer Racing Wheel', 'GameSphere VR Headset']}]
incorrect
prod_set: {'GameSphere VR Headset', 'ProGamer Racing Wheel', 'ProGamer Controller'}
prod_set_ideal: {'GameSphere X', 'GameSphere Y', 'GameSphere VR Headset', 'ProGamer Racing Wheel', 'ProGamer Controller'}
response is a subset of the ideal answer
0.0

The ideal answer for this customer message is the full list of gaming consoles and accessories. The response, however, listed only three products when it really ought to have listed all five, so some of the products are missing and the example scores 0.0.
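
As an aside, the evaluation function above requires an exact set match per category, which is why this partially correct answer scores 0.0. If you wanted to give partial credit instead, one hypothetical variant (not what the function above does, just an illustration) is to score each category by recall against the ideal product set:

# Hypothetical alternative scoring: recall of the ideal product set per category.
def recall_score(prod_set, prod_set_ideal):
    if not prod_set_ideal:
        return 1.0 if not prod_set else 0.0
    return len(prod_set & prod_set_ideal) / len(prod_set_ideal)

print(recall_score(
    {'GameSphere VR Headset', 'ProGamer Racing Wheel', 'ProGamer Controller'},
    {'GameSphere X', 'GameSphere Y', 'GameSphere VR Headset',
     'ProGamer Racing Wheel', 'ProGamer Controller'}))   # 0.6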

To tune the prompt the way I normally would, we can use a for loop to go through all 10 examples in the development set: for each one we pull out the customer message and its ideal answer, get a response from the prompt, score it against the ideal answer, and accumulate the scores into an average.

# Note, this will not work if any of the api calls time out
score_accum = 0
for i, pair in enumerate(msg_ideal_pairs_set):
    print(f"example {i}")
    
    customer_msg = pair['customer_msg']
    ideal = pair['ideal_answer']
    
    # print("Customer message",customer_msg)
    # print("ideal:",ideal)
    response = find_category_and_product_v2(customer_msg,
                                                      products_and_category)

    
    # print("products_by_category",products_by_category)
    score = eval_response_with_ideal(response,ideal,debug=False)
    print(f"{i}: {score}")
    score_accum += score
    

n_examples = len(msg_ideal_pairs_set)
fraction_correct = score_accum / n_examples
print(f"Fraction correct out of {n_examples}: {fraction_correct}")
example 0
0: 1.0
example 1
1: 1.0
example 2
2: 1.0
example 3
3: 1.0
example 4
4: 1.0
example 5
5: 1.0
example 6
6: 1.0
example 7
incorrect
prod_set: {'GameSphere VR Headset', 'ProGamer Racing Wheel', 'ProGamer Controller'}
prod_set_ideal: {'GameSphere X', 'GameSphere Y', 'GameSphere VR Headset', 'ProGamer Racing Wheel', 'ProGamer Controller'}
response is a subset of the ideal answer
7: 0.0
example 8
8: 1.0
example 9
9: 1
Fraction correct out of 10: 0.9
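
Note the comment at the top of the loop: a single API timeout will abort the whole run. A simple way to make the loop more robust (a sketch, using a wrapper of my own rather than anything from the course) is to retry each call a few times before giving up:

import time

# Sketch: retry a transient API failure a few times before treating it as an empty response.
def get_response_with_retry(customer_msg, products_and_category, retries=3):
    for attempt in range(retries):
        try:
            return find_category_and_product_v2(customer_msg, products_and_category)
        except Exception as e:
            print(f"attempt {attempt + 1} failed: {e}")
            time.sleep(2 ** attempt)   # simple exponential backoff
    return "[]"   # score a persistent failure as an empty response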

11 Run evaluation on all test cases and calculate the fraction of cases that are correct

Looking at the results above, the prompt got 90% of the examples correct. If we adjust the prompt, we can rerun this evaluation to see whether the fraction correct goes up or down.

If you wanted a higher level of rigour, we now have the code needed to evaluate a randomly sampled set of perhaps 100 examples with their ideal outputs, and you could go beyond that to a hold-out test set that you don’t even look at while tuning the prompt.
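
If you did want that extra rigour, here is a sketch of the split (the sizes and the 70/30 ratio are arbitrary choices of mine): shuffle the collected examples once and set part of them aside as a hold-out test set that you never look at while tuning the prompt.

import random

# Sketch: split collected examples into a development set (used while tuning
# the prompt) and a hold-out test set (used only for the final estimate).
all_examples = msg_ideal_pairs_set   # imagine this were ~100 labelled examples
random.seed(42)
shuffled = list(all_examples)
random.shuffle(shuffled)

split = int(0.7 * len(shuffled))
dev_set = shuffled[:split]
test_set = shuffled[split:]   # don't look at these while iterating on the prompt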

Again, it’s important to point out that if you’re working on a safety-critical application or an application where there is a non-trivial risk of harm, it would be prudent to obtain a much larger test set to thoroughly confirm the performance before using it anywhere.

The pace of iteration seems to be significantly faster when creating applications utilising prompts and LLMs than when creating applications using supervised learning. And if you haven’t done it before, you might be amazed at how effective an evaluation process based on only a handful of carefully chosen challenging instances can be.

With only 10 examples, the resulting estimate is of course not statistically robust. But once you put this process into practice, you may be surprised at how much adding just a few challenging examples to the development set helps you and your team converge on an effective set of prompts and an effective system.

12 Evaluating LLM Applications more automatically using LangChain

We have seen in this article how we can use OpenAI and GPT alone to evaluate the outputs of these models. However, there are other tools, like LangChain used together with OpenAI, that can make LLM application evaluation even easier and faster, as can be seen in this previous article.

13 Acknowledgements

I’d like to express my thanks to the wonderful Building Systems with the ChatGPT API Course by DeepLearning.ai and OpenAI - which I completed, and acknowledge the use of some images and other materials from the course in this article.
