Checking Outputs of Large Language Models like ChatGPT

In this article we will focus on checking outputs generated by an LLM before showing them to users, which can be important for ensuring the quality, relevance, and safety of the responses shown to them or used in automation flows.
Author

Pranath Fernando

Published

June 23, 2023

1 Introduction

Large language models such as ChatGPT can generate text responses based on a given prompt or input. Writing prompts allow users to guide the language model’s output by providing a specific context or topic for the response. This feature has many practical applications, such as generating creative writing prompts, assisting in content creation, and even aiding in customer service chatbots.

In earlier articles I’ve looked at how you can use ChatGPT to solve some of these tasks with simple prompts. But many use cases need not just one prompt but a sequence of prompts, where we also have to consider the output at each stage before providing a final output, for example with a customer service chatbot.

In this article, we’ll focus on checking outputs generated by an LLM. Checking outputs before showing them to users can be important for ensuring the quality, relevance and safety of the responses provided to them or used in automation flows. We’ll learn how to use the OpenAI Moderation API, this time on outputs, and how to use additional prompts to the model to evaluate output quality before displaying it.

2 Setup

2.1 Load the API key and relevant Python libraries

First we need to load certain Python libraries and connect to the OpenAI API.

The OpenAI API library needs to be configured with an account’s secret key, which is available on the website.

You can either set it as the OPENAI_API_KEY environment variable before using the library, by running the following in the shell that starts Python (a notebook !export runs in a subshell, so the variable does not persist):

export OPENAI_API_KEY='sk-...'

Or, set openai.api_key to its value:

import openai
openai.api_key = "sk-..."

# Alternatively, load the key from a local .env file (requires python-dotenv)
import os
import openai
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())  # read local .env file

openai.api_key = os.environ['OPENAI_API_KEY']
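# Define helper function
# Note: openai.ChatCompletion is the interface of openai package versions before 1.0
def get_completion_from_messages(messages, model="gpt-3.5-turbo", temperature=0, max_tokens=500):
    """Send a list of chat messages to the model and return the reply text."""
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=temperature,  # 0 keeps the output as deterministic as possible
        max_tokens=max_tokens,    # upper bound on the length of the reply
    )
    return response.choices[0].message["content"]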

3 Check output for potentially harmful content

We’ve previously looked at the Moderation API in the context of evaluating inputs. Let’s revisit it in the context of checking outputs: the OpenAI Moderation API can also be used to filter and moderate the outputs produced by the system itself. As an example, let’s check whether the following output is flagged.

final_response_to_customer = f"""
The SmartX ProPhone has a 6.1-inch display, 128GB storage, \
12MP dual camera, and 5G. The FotoSnap DSLR Camera \
has a 24.2MP sensor, 1080p video, 3-inch LCD, and \
interchangeable lenses. We have a variety of TVs, including \
the CineView 4K TV with a 55-inch display, 4K resolution, \
HDR, and smart TV features. We also have the SoundMax \
Home Theater system with 5.1 channel, 1000W output, wireless \
subwoofer, and Bluetooth. Do you have any specific questions \
about these products or any other products we offer?
"""
response = openai.Moderation.create(
    input=final_response_to_customer
)
moderation_output = response["results"][0]
print(moderation_output)
{
  "categories": {
    "hate": false,
    "hate/threatening": false,
    "self-harm": false,
    "sexual": false,
    "sexual/minors": false,
    "violence": false,
    "violence/graphic": false
  },
  "category_scores": {
    "hate": 4.313069e-07,
    "hate/threatening": 5.590539e-10,
    "self-harm": 2.91932e-10,
    "sexual": 2.1767946e-06,
    "sexual/minors": 1.2402804e-08,
    "violence": 5.962453e-06,
    "violence/graphic": 4.4420557e-07
  },
  "flagged": false
}

You can see that this output is not flagged and receives extremely low scores in every category, which makes sense given the content. In general, checking outputs can be important. For example, if you were building a chatbot for sensitive audiences, you could use lower thresholds for flagging outputs. If the moderation output indicates that the content is flagged, you can respond appropriately, either with a fallback answer or by generating a new response. Note that as the models improve, they are also becoming less likely to return harmful output.
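The idea of stricter thresholds with a fallback answer can be sketched as follows. This is a minimal illustration, not part of the Moderation API: the names STRICT_THRESHOLDS, DEFAULT_THRESHOLD, FALLBACK_RESPONSE, is_output_allowed and choose_response are hypothetical, and the threshold values are made up and would need tuning on real data. It assumes only that the moderation output is dictionary-like, with the "flagged" and "category_scores" fields shown above.

```python
# Hypothetical, stricter per-category limits for a sensitive audience.
# The numbers are illustrative assumptions, not recommended values.
STRICT_THRESHOLDS = {
    "hate": 0.01,
    "self-harm": 0.005,
    "sexual": 0.01,
    "violence": 0.01,
}
DEFAULT_THRESHOLD = 0.05  # applied to any category not listed above

FALLBACK_RESPONSE = "Sorry, I can't help with that. Is there anything else I can do for you?"


def is_output_allowed(moderation_output, thresholds=STRICT_THRESHOLDS,
                      default_threshold=DEFAULT_THRESHOLD):
    """Return False if the API flagged the text or any score exceeds our own limit."""
    if moderation_output["flagged"]:
        return False
    for category, score in moderation_output["category_scores"].items():
        if score > thresholds.get(category, default_threshold):
            return False
    return True


def choose_response(candidate, moderation_output):
    """Show the candidate only when it passes moderation, otherwise the fallback."""
    return candidate if is_output_allowed(moderation_output) else FALLBACK_RESPONSE
```

With the result above, choose_response(final_response_to_customer, moderation_output) would return the original response, since every score is far below these limits.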

4 Check if output is factually based on the provided product information

Another way to check output quality is to ask the model itself whether the generated output is satisfactory and meets the criteria you define. You do this by passing the generated output back to the model as part of the input and asking it to rate the output’s quality. There are many ways to do this. Let’s look at an example.

So, say our system message is:

“You are an assistant that evaluates whether customer service agent responses sufficiently answer customer questions and also validates that all the facts the assistant cites from the product information are correct. The product information and user and customer service agent messages will be delimited by three backticks. Respond with a Y or N character with no punctuation. Y if the output sufficiently answers the question and the response correctly uses product information and N otherwise. Output a single letter only.”

You could also use a chain-of-thought reasoning prompt for this. That is worth experimenting with, since the model might find it difficult to validate both conditions in one step. You could also add other kinds of guidelines, for example a rubric like one used for marking an essay or an exam. If brand voice is really important to you, you could use that format to ask whether the response’s tone is in line with your brand guidelines, and include some of those guidelines in the prompt.
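As a rough sketch of the chain-of-thought variant, the evaluator can be asked to reason in steps and then give its verdict after a delimiter, so that only the final letter is parsed. The "####" delimiter, the wording of cot_system_message and the parse_verdict helper are all assumptions made for illustration; they are not taken from the original prompt.

```python
DELIMITER = "####"  # hypothetical separator between the reasoning and the verdict

# Illustrative chain-of-thought version of the evaluator's system message.
cot_system_message = f"""
You are an assistant that evaluates customer service agent responses.
Work through the following steps, then give a verdict.
Step 1: List the facts the agent response cites from the product information.
Step 2: Check each of those facts against the product information.
Step 3: Decide whether the response sufficiently answers the customer question.
Finish with the delimiter {DELIMITER} followed by a single letter:
Y if every cited fact is correct and the question is answered, N otherwise.
"""


def parse_verdict(evaluator_output, delimiter=DELIMITER):
    """Extract the final Y/N verdict; treat anything unexpected as a failure."""
    if delimiter not in evaluator_output:
        return False
    verdict = evaluator_output.rsplit(delimiter, 1)[-1].strip().upper()
    return verdict == "Y"
```

Because the model now writes out its reasoning before the verdict, a call using this prompt needs a max_tokens value well above 1. Treating a missing delimiter as a failure is a deliberately conservative choice.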

Now let’s set up the comparison. We need the customer message, the product information, and the agent response, which is the response to the customer from the earlier cell. We’ll format these into a messages list and get the response from the model.

system_message = f"""
You are an assistant that evaluates whether \
customer service agent responses sufficiently \
answer customer questions, and also validates that \
all the facts the assistant cites from the product \
information are correct.
The product information and user and customer \
service agent messages will be delimited by \
3 backticks, i.e. ```.
Respond with a Y or N character, with no punctuation:
Y - if the output sufficiently answers the question \
AND the response correctly uses product information
N - otherwise

Output a single letter only.
"""
customer_message = f"""
tell me about the smartx pro phone and \
the fotosnap camera, the dslr one. \
Also tell me about your tvs"""
product_information = """{ "name": "SmartX ProPhone", "category": "Smartphones and Accessories", "brand": "SmartX", "model_number": "SX-PP10", "warranty": "1 year", "rating": 4.6, "features": [ "6.1-inch display", "128GB storage", "12MP dual camera", "5G" ], "description": "A powerful smartphone with advanced camera features.", "price": 899.99 } { "name": "FotoSnap DSLR Camera", "category": "Cameras and Camcorders", "brand": "FotoSnap", "model_number": "FS-DSLR200", "warranty": "1 year", "rating": 4.7, "features": [ "24.2MP sensor", "1080p video", "3-inch LCD", "Interchangeable lenses" ], "description": "Capture stunning photos and videos with this versatile DSLR camera.", "price": 599.99 } { "name": "CineView 4K TV", "category": "Televisions and Home Theater Systems", "brand": "CineView", "model_number": "CV-4K55", "warranty": "2 years", "rating": 4.8, "features": [ "55-inch display", "4K resolution", "HDR", "Smart TV" ], "description": "A stunning 4K TV with vibrant colors and smart features.", "price": 599.99 } { "name": "SoundMax Home Theater", "category": "Televisions and Home Theater Systems", "brand": "SoundMax", "model_number": "SM-HT100", "warranty": "1 year", "rating": 4.4, "features": [ "5.1 channel", "1000W output", "Wireless subwoofer", "Bluetooth" ], "description": "A powerful home theater system for an immersive audio experience.", "price": 399.99 } { "name": "CineView 8K TV", "category": "Televisions and Home Theater Systems", "brand": "CineView", "model_number": "CV-8K65", "warranty": "2 years", "rating": 4.9, "features": [ "65-inch display", "8K resolution", "HDR", "Smart TV" ], "description": "Experience the future of television with this stunning 8K TV.", "price": 2999.99 } { "name": "SoundMax Soundbar", "category": "Televisions and Home Theater Systems", "brand": "SoundMax", "model_number": "SM-SB50", "warranty": "1 year", "rating": 4.3, "features": [ "2.1 channel", "300W output", "Wireless subwoofer", "Bluetooth" ], "description": "Upgrade 
your TV's audio with this sleek and powerful soundbar.", "price": 199.99 } { "name": "CineView OLED TV", "category": "Televisions and Home Theater Systems", "brand": "CineView", "model_number": "CV-OLED55", "warranty": "2 years", "rating": 4.7, "features": [ "55-inch display", "4K resolution", "HDR", "Smart TV" ], "description": "Experience true blacks and vibrant colors with this OLED TV.", "price": 1499.99 }"""
q_a_pair = f"""
Customer message: ```{customer_message}```
Product information: ```{product_information}```
Agent response: ```{final_response_to_customer}```

Does the response use the retrieved information correctly?
Does the response sufficiently answer the question?

Output Y or N
"""
messages = [
    {'role': 'system', 'content': system_message},
    {'role': 'user', 'content': q_a_pair}
]

response = get_completion_from_messages(messages, max_tokens=1)
print(response)
Y

So the model says that the product information is used correctly and the question is sufficiently answered. In general, for these kinds of evaluation tasks it is better to use a more advanced model such as GPT-4, because it is better at reasoning. Let’s try one more example.

So say an example response is:

“life is like a box of chocolates”.

Let’s add this response to our output-checking message. As the result further down shows, the model determines that it does not sufficiently answer the question or use the retrieved information. This question:

“does it use the retrieved information correctly?”

This is a good prompt to use if you want to make sure that the model isn’t hallucinating, that is, making up things that aren’t true.

another_response = "life is like a box of chocolates"
q_a_pair = f"""
Customer message: ```{customer_message}```
Product information: ```{product_information}```
Agent response: ```{another_response}```

Does the response use the retrieved information correctly?
Does the response sufficiently answer the question?

Output Y or N
"""
messages = [
    {'role': 'system', 'content': system_message},
    {'role': 'user', 'content': q_a_pair}
]

response = get_completion_from_messages(messages)
print(response)
N

As you can see, the model can give feedback on the quality of an output that is generated, and you can use this feedback to determine whether to show the output to the user or to create a new answer. You could even try creating numerous model responses for each user inquiry and letting the model decide which one to present to the user. There are lots of different things you could try.
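The "show it or create a new answer" decision can be sketched as a small retry loop. This is only an illustration under assumptions: respond_with_check, its parameters and the fallback text are hypothetical names, and generate and evaluate are passed in as plain callables so the control flow can be tried without calling the API.

```python
def respond_with_check(generate, evaluate, max_attempts=3,
                       fallback="Sorry, I couldn't find a reliable answer. "
                                "Let me connect you with a human agent."):
    """Return the first candidate the evaluator accepts, else the fallback.

    generate: callable with no arguments that returns a candidate response.
    evaluate: callable that takes a candidate and returns "Y" or "N".
    """
    for _ in range(max_attempts):
        candidate = generate()
        # Accept only an explicit "Y"; anything else counts as a rejection.
        if evaluate(candidate).strip().upper().startswith("Y"):
            return candidate
    return fallback
```

In the setting of this article, evaluate would wrap get_completion_from_messages with the system_message above and a q_a_pair built from the candidate, and generate would call the model with a temperature above 0 so that retries can differ. Each attempt costs one generation call plus one evaluation call.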

In general, checking outputs with the Moderation API is good practice. Asking the model to evaluate its own output can give immediate feedback on response quality, but it is worth the effort only in a small number of cases.

It’s probably unnecessary most of the time, especially if you’re using a more advanced model like GPT-4.

It’s also unlikely to be appropriate in production, because it increases the latency and cost of your system: you have to wait for an additional model call, which also consumes additional tokens. If your app or product truly needs an error rate as close to zero as possible (say 0.0000001%), then this approach may be worth trying. But overall, we wouldn’t recommend doing this in practice.

5 Acknowledgements

I’d like to express my thanks to the wonderful Building Systems with the ChatGPT API course by DeepLearning.AI and OpenAI, which I completed, and acknowledge the use of some images and other materials from the course in this article.
