Using LangChain to Evaluate LLM Applications

In this article we look at how LangChain can help evaluate LLM performance for a specific application
natural-language-processing
deep-learning
langchain
openai
llm-evaluation
Author

Pranath Fernando

Published

June 5, 2023

1 Introduction

Large language models (LLMs) are emerging as a transformative technology, enabling developers to build applications that they previously could not. But using LLMs in isolation is often not enough in practice to create a truly powerful or useful business application - the real power comes when you can combine them with other sources of computation, services or knowledge. LangChain is an intuitive open-source Python framework created to simplify the development of useful applications using large language models (LLMs), such as those from OpenAI or Hugging Face.

In earlier articles we introduced the LangChain library and its key components.

In this article, we look at how LangChain can help evaluate LLM performance. This is useful for understanding how LLMs are performing in general, or for measuring the impact when you change some element of the application - such as using a different model, or a different type of vector store.

2 Setup

We will use OpenAI’s ChatGPT LLM for our examples, so let’s load in the required libraries.

import os

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file, which should contain OPENAI_API_KEY

3 Create our QandA application

First we need to define an LLM application that we want to evaluate. We are going to use the same kind of document question-answering chain that we used in this previous article, so we will load the same modules and data as in that previous example, as well as defining the retrieval QA chain.

The following code defines our application to evaluate.

from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import CSVLoader
from langchain.indexes import VectorstoreIndexCreator
from langchain.vectorstores import DocArrayInMemorySearch

# Load the product catalog and build an in-memory vector store index over it
file = 'OutdoorClothingCatalog_1000.csv'
loader = CSVLoader(file_path=file)
data = loader.load()
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch
).from_loaders([loader])

# Build the retrieval QA chain; the custom document separator makes it
# easier to spot document boundaries in the debug output later
llm = ChatOpenAI(temperature=0.0)
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=index.vectorstore.as_retriever(),
    verbose=True,
    chain_type_kwargs={
        "document_separator": "<<<<>>>>>"
    }
)

4 Coming up with Test Datapoints

First we need to figure out which data points to evaluate the application on.

One method is to come up with data points that we think are good examples ourselves. To do that, we can look at some of the documents, then write ground-truth question-answer pairs that we can later use to evaluate the application. So if we look at a few of the documents we have, we can get a sense of what’s there, and from these examples come up with some test question-answer pairs.

data[10]
Document(page_content=": 10\nname: Cozy Comfort Pullover Set, Stripe\ndescription: Perfect for lounging, this striped knit set lives up to its name. We used ultrasoft fabric and an easy design that's as comfortable at bedtime as it is when we have to make a quick run out.\n\nSize & Fit\n- Pants are Favorite Fit: Sits lower on the waist.\n- Relaxed Fit: Our most generous fit sits farthest from the body.\n\nFabric & Care\n- In the softest blend of 63% polyester, 35% rayon and 2% spandex.\n\nAdditional Features\n- Relaxed fit top with raglan sleeves and rounded hem.\n- Pull-on pants have a wide elastic waistband and drawstring, side pockets and a modern slim leg.\n\nImported.", metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 10})
data[11]
Document(page_content=': 11\nname: Ultra-Lofty 850 Stretch Down Hooded Jacket\ndescription: This technical stretch down jacket from our DownTek collection is sure to keep you warm and comfortable with its full-stretch construction providing exceptional range of motion. With a slightly fitted style that falls at the hip and best with a midweight layer, this jacket is suitable for light activity up to 20° and moderate activity up to -30°. The soft and durable 100% polyester shell offers complete windproof protection and is insulated with warm, lofty goose down. Other features include welded baffles for a no-stitch construction and excellent stretch, an adjustable hood, an interior media port and mesh stash pocket and a hem drawcord. Machine wash and dry. Imported.', metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 11})

4.1 Hard-coded examples

So let’s create some examples manually.

# NB: the backslash line continuations below fold the next line's leading
# whitespace into the string, which is why these queries appear with
# extra spaces in the debug output later on
examples = [
    {
        "query": "Do the Cozy Comfort Pullover Set\
        have side pockets?",
        "answer": "Yes"
    },
    {
        "query": "What collection is the Ultra-Lofty \
        850 Stretch Down Hooded Jacket from?",
        "answer": "The DownTek collection"
    }
]

The problem is that this method won’t scale well: if we want many test examples to gain more confidence, or if we have a large dataset, writing them by hand becomes impractical.

One way to make this scale is to use language models themselves to generate the examples.

4.2 LLM-Generated examples

We can use the QAGenerateChain object to do this - it creates a question-answer pair for each document in a set, using a supplied language model.

We also want to use apply_and_parse, as we want the output parsed into a dictionary rather than returned as a raw string.

from langchain.evaluation.qa import QAGenerateChain

# Generate a question-answer pair from each of the first five documents
example_gen_chain = QAGenerateChain.from_llm(ChatOpenAI())
new_examples = example_gen_chain.apply_and_parse(
    [{"doc": t} for t in data[:5]]
)
new_examples[0]
{'query': "What is the weight of each pair of Women's Campside Oxfords?",
 'answer': "The approximate weight of each pair of Women's Campside Oxfords is 1 lb. 1 oz."}
data[0]
Document(page_content=": 0\nname: Women's Campside Oxfords\ndescription: This ultracomfortable lace-to-toe Oxford boasts a super-soft canvas, thick cushioning, and quality construction for a broken-in feel from the first time you put them on. \n\nSize & Fit: Order regular shoe size. For half sizes not offered, order up to next whole size. \n\nSpecs: Approx. weight: 1 lb.1 oz. per pair. \n\nConstruction: Soft canvas material for a broken-in feel and look. Comfortable EVA innersole with Cleansport NXT® antimicrobial odor control. Vintage hunt, fish and camping motif on innersole. Moderate arch contour of innersole. EVA foam midsole for cushioning and support. Chain-tread-inspired molded rubber outsole with modified chain-tread pattern. Imported. \n\nQuestions? Please contact us for any inquiries.", metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 0})

4.3 Combine examples

So let’s combine all these examples together.

examples += new_examples
qa.run(examples[0]["query"])


> Entering new RetrievalQA chain...

> Finished chain.
'The Cozy Comfort Pullover Set, Stripe has side pockets on the pull-on pants.'

5 Manual Evaluation

So we now have our examples for our test dataset, but how do we actually evaluate what’s going on?

First, we can run one of our test examples through the chain and see what output it produces - which prompts and documents are being used? Let’s turn on LangChain’s debug mode to see in more detail what’s going on, in terms of the prompts and reference documents used to respond to the query.

import langchain
langchain.debug = True  # print a detailed trace of every chain and LLM call
qa.run(examples[0]["query"])
[chain/start] [1:chain:RetrievalQA] Entering Chain run with input:
{
  "query": "Do the Cozy Comfort Pullover Set        have side pockets?"
}
[chain/start] [1:chain:RetrievalQA > 2:chain:StuffDocumentsChain] Entering Chain run with input:
[inputs]
[chain/start] [1:chain:RetrievalQA > 2:chain:StuffDocumentsChain > 3:chain:LLMChain] Entering Chain run with input:
{
  "question": "Do the Cozy Comfort Pullover Set        have side pockets?",
  "context": ": 10\nname: Cozy Comfort Pullover Set, Stripe\ndescription: Perfect for lounging, this striped knit set lives up to its name. We used ultrasoft fabric and an easy design that's as comfortable at bedtime as it is when we have to make a quick run out.\n\nSize & Fit\n- Pants are Favorite Fit: Sits lower on the waist.\n- Relaxed Fit: Our most generous fit sits farthest from the body.\n\nFabric & Care\n- In the softest blend of 63% polyester, 35% rayon and 2% spandex.\n\nAdditional Features\n- Relaxed fit top with raglan sleeves and rounded hem.\n- Pull-on pants have a wide elastic waistband and drawstring, side pockets and a modern slim leg.\n\nImported.<<<<>>>>>: 73\nname: Cozy Cuddles Knit Pullover Set\ndescription: Perfect for lounging, this knit set lives up to its name. We used ultrasoft fabric and an easy design that's as comfortable at bedtime as it is when we have to make a quick run out. \n\nSize & Fit \nPants are Favorite Fit: Sits lower on the waist. \nRelaxed Fit: Our most generous fit sits farthest from the body. \n\nFabric & Care \nIn the softest blend of 63% polyester, 35% rayon and 2% spandex.\n\nAdditional Features \nRelaxed fit top with raglan sleeves and rounded hem. \nPull-on pants have a wide elastic waistband and drawstring, side pockets and a modern slim leg. \nImported.<<<<>>>>>: 632\nname: Cozy Comfort Fleece Pullover\ndescription: The ultimate sweater fleece \u2013 made from superior fabric and offered at an unbeatable price. \n\nSize & Fit\nSlightly Fitted: Softly shapes the body. Falls at hip. \n\nWhy We Love It\nOur customers (and employees) love the rugged construction and heritage-inspired styling of our popular Sweater Fleece Pullover and wear it for absolutely everything. From high-intensity activities to everyday tasks, you'll find yourself reaching for it every time.\n\nFabric & Care\nRugged sweater-knit exterior and soft brushed interior for exceptional warmth and comfort. Made from soft, 100% polyester. Machine wash and dry.\n\nAdditional Features\nFeatures our classic Mount Katahdin logo. Snap placket. Front princess seams create a feminine shape. Kangaroo handwarmer pockets. Cuffs and hem reinforced with jersey binding. Imported.\n\n \u2013 Official Supplier to the U.S. Ski Team\nTHEIR WILL TO WIN, WOVEN RIGHT IN. LEARN MORE<<<<>>>>>: 151\nname: Cozy Quilted Sweatshirt\ndescription: Our sweatshirt is an instant classic with its great quilted texture and versatile weight that easily transitions between seasons. With a traditional fit that is relaxed through the chest, sleeve, and waist, this pullover is lightweight enough to be worn most months of the year. The cotton blend fabric is super soft and comfortable, making it the perfect casual layer. To make dressing easy, this sweatshirt also features a snap placket and a heritage-inspired Mt. Katahdin logo patch. For care, machine wash and dry. Imported."
}
[llm/start] [1:chain:RetrievalQA > 2:chain:StuffDocumentsChain > 3:chain:LLMChain > 4:llm:ChatOpenAI] Entering LLM run with input:
{
  "prompts": [
    "System: Use the following pieces of context to answer the users question. \nIf you don't know the answer, just say that you don't know, don't try to make up an answer.\n----------------\n: 10\nname: Cozy Comfort Pullover Set, Stripe\ndescription: Perfect for lounging, this striped knit set lives up to its name. We used ultrasoft fabric and an easy design that's as comfortable at bedtime as it is when we have to make a quick run out.\n\nSize & Fit\n- Pants are Favorite Fit: Sits lower on the waist.\n- Relaxed Fit: Our most generous fit sits farthest from the body.\n\nFabric & Care\n- In the softest blend of 63% polyester, 35% rayon and 2% spandex.\n\nAdditional Features\n- Relaxed fit top with raglan sleeves and rounded hem.\n- Pull-on pants have a wide elastic waistband and drawstring, side pockets and a modern slim leg.\n\nImported.<<<<>>>>>: 73\nname: Cozy Cuddles Knit Pullover Set\ndescription: Perfect for lounging, this knit set lives up to its name. We used ultrasoft fabric and an easy design that's as comfortable at bedtime as it is when we have to make a quick run out. \n\nSize & Fit \nPants are Favorite Fit: Sits lower on the waist. \nRelaxed Fit: Our most generous fit sits farthest from the body. \n\nFabric & Care \nIn the softest blend of 63% polyester, 35% rayon and 2% spandex.\n\nAdditional Features \nRelaxed fit top with raglan sleeves and rounded hem. \nPull-on pants have a wide elastic waistband and drawstring, side pockets and a modern slim leg. \nImported.<<<<>>>>>: 632\nname: Cozy Comfort Fleece Pullover\ndescription: The ultimate sweater fleece \u2013 made from superior fabric and offered at an unbeatable price. \n\nSize & Fit\nSlightly Fitted: Softly shapes the body. Falls at hip. \n\nWhy We Love It\nOur customers (and employees) love the rugged construction and heritage-inspired styling of our popular Sweater Fleece Pullover and wear it for absolutely everything. From high-intensity activities to everyday tasks, you'll find yourself reaching for it every time.\n\nFabric & Care\nRugged sweater-knit exterior and soft brushed interior for exceptional warmth and comfort. Made from soft, 100% polyester. Machine wash and dry.\n\nAdditional Features\nFeatures our classic Mount Katahdin logo. Snap placket. Front princess seams create a feminine shape. Kangaroo handwarmer pockets. Cuffs and hem reinforced with jersey binding. Imported.\n\n \u2013 Official Supplier to the U.S. Ski Team\nTHEIR WILL TO WIN, WOVEN RIGHT IN. LEARN MORE<<<<>>>>>: 151\nname: Cozy Quilted Sweatshirt\ndescription: Our sweatshirt is an instant classic with its great quilted texture and versatile weight that easily transitions between seasons. With a traditional fit that is relaxed through the chest, sleeve, and waist, this pullover is lightweight enough to be worn most months of the year. The cotton blend fabric is super soft and comfortable, making it the perfect casual layer. To make dressing easy, this sweatshirt also features a snap placket and a heritage-inspired Mt. Katahdin logo patch. For care, machine wash and dry. Imported.\nHuman: Do the Cozy Comfort Pullover Set        have side pockets?"
  ]
}
[llm/end] [1:chain:RetrievalQA > 2:chain:StuffDocumentsChain > 3:chain:LLMChain > 4:llm:ChatOpenAI] [2.39s] Exiting LLM run with output:
{
  "generations": [
    [
      {
        "text": "Yes, the Cozy Comfort Pullover Set has side pockets.",
        "generation_info": null,
        "message": {
          "content": "Yes, the Cozy Comfort Pullover Set has side pockets.",
          "additional_kwargs": {},
          "example": false
        }
      }
    ]
  ],
  "llm_output": {
    "token_usage": {
      "prompt_tokens": 734,
      "completion_tokens": 13,
      "total_tokens": 747
    },
    "model_name": "gpt-3.5-turbo"
  }
}
[chain/end] [1:chain:RetrievalQA > 2:chain:StuffDocumentsChain > 3:chain:LLMChain] [2.39s] Exiting Chain run with output:
{
  "text": "Yes, the Cozy Comfort Pullover Set has side pockets."
}
[chain/end] [1:chain:RetrievalQA > 2:chain:StuffDocumentsChain] [2.39s] Exiting Chain run with output:
{
  "output_text": "Yes, the Cozy Comfort Pullover Set has side pockets."
}
[chain/end] [1:chain:RetrievalQA] [2.85s] Exiting Chain run with output:
{
  "result": "Yes, the Cozy Comfort Pullover Set has side pockets."
}
'Yes, the Cozy Comfort Pullover Set has side pockets.'

This has given us a lot more information about what’s going on - the context passed into the prompt, and so on. Sometimes when something goes wrong, it’s not the model that’s at fault but the retrieval, e.g. the document context returned for our prompt.

So taking a close look at the context used for the question can help debug any potential problems.
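
If we suspect retrieval specifically, we can also query the retriever directly, outside the chain. A minimal sketch, using the retriever we already built above:

# Query the retriever on its own to see which documents come back,
# independent of any LLM call
retriever = index.vectorstore.as_retriever()
docs = retriever.get_relevant_documents(examples[0]["query"])
for doc in docs:
    print(doc.metadata, doc.page_content[:80])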

We can then see the final prompt sent to the language model itself, ‘ChatOpenAI’. Here we can see the full prompt used, including the human part at the end, which is the question we asked. We can also see more detail about what we get back from the model, such as token usage and model name. This is useful for tracking the total number of tokens used, which correlates with the total cost (when using a paid service like OpenAI).
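
As an aside, if we want to track token usage (and hence cost) across calls without wading through debug output, LangChain provides a callback for aggregating OpenAI usage. A minimal sketch:

from langchain.callbacks import get_openai_callback

# Aggregate token usage and estimated cost over all calls made inside the block
with get_openai_callback() as cb:
    qa.run(examples[0]["query"])
print(cb.total_tokens, cb.total_cost)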

Back in the debug trace, we can see the answer gets bubbled up through the chain to become the final response to the user.

So that’s just one example - how are we going to evaluate multiple examples? We could of course repeat this process for each one, but again this won’t scale well.

6 LLM Assisted Evaluation

So can we again use a language model, this time to automate the evaluation of multiple examples? First, we need to generate predictions for our test examples.

We can use apply() to generate these predictions for all examples (having turned debug off).

# Turn off the debug mode
langchain.debug = False
predictions = qa.apply(examples)


> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.

We can then use QAEvalChain to evaluate these predictions, using a language model.

from langchain.evaluation.qa import QAEvalChain

# Grade each prediction against its ground-truth answer with an LLM
# (temperature 0 for consistent grading)
llm = ChatOpenAI(temperature=0)
eval_chain = QAEvalChain.from_llm(llm)
graded_outputs = eval_chain.evaluate(examples, predictions)
for i, eg in enumerate(examples):
    print(f"Example {i}:")
    print("Question: " + predictions[i]['query'])
    print("Real Answer: " + predictions[i]['answer'])
    print("Predicted Answer: " + predictions[i]['result'])
    print("Predicted Grade: " + graded_outputs[i]['text'])
    print()
Example 0:
Question: Do the Cozy Comfort Pullover Set        have side pockets?
Real Answer: Yes
Predicted Answer: Yes, the Cozy Comfort Pullover Set has side pockets.
Predicted Grade: CORRECT

Example 1:
Question: What collection is the Ultra-Lofty         850 Stretch Down Hooded Jacket from?
Real Answer: The DownTek collection
Predicted Answer: The Ultra-Lofty 850 Stretch Down Hooded Jacket is from the DownTek collection.
Predicted Grade: CORRECT

Example 2:
Question: What is the weight of each pair of Women's Campside Oxfords?
Real Answer: The approximate weight of each pair of Women's Campside Oxfords is 1 lb. 1 oz.
Predicted Answer: The weight of each pair of Women's Campside Oxfords is approximately 1 lb. 1 oz.
Predicted Grade: CORRECT

Example 3:
Question: What are the dimensions of the medium Recycled Waterhog dog mat?
Real Answer: The dimensions of the medium Recycled Waterhog dog mat are 22.5" x 34.5".
Predicted Answer: The dimensions of the medium Recycled Waterhog dog mat are 22.5" x 34.5".
Predicted Grade: CORRECT

Example 4:
Question: What is the fabric of the Infant and Toddler Girls' Coastal Chill Swimsuit made of?
Real Answer: The Infant and Toddler Girls' Coastal Chill Swimsuit is made of four-way-stretch and chlorine-resistant fabric.
Predicted Answer: The Infant and Toddler Girls' Coastal Chill Swimsuit is made of a four-way-stretch and chlorine-resistant fabric. The specific fabric material is not mentioned.
Predicted Grade: CORRECT

Example 5:
Question: What is the fabric composition of the Refresh Swimwear V-Neck Tankini Contrasts?
Real Answer: The body of the tankini is made of 82% recycled nylon and 18% Lycra® spandex, while the lining is made of 90% recycled nylon and 10% Lycra® spandex.
Predicted Answer: The Refresh Swimwear V-Neck Tankini Contrasts is made of 82% recycled nylon with 18% Lycra® spandex for the body and 90% recycled nylon with 10% Lycra® spandex for the lining.
Predicted Grade: CORRECT

Example 6:
Question: What is the main feature of the EcoFlex 3L Storm Pants?
Real Answer: The main feature of the EcoFlex 3L Storm Pants is the state-of-the-art TEK O2 technology that offers the most breathability ever tested.
Predicted Answer: The main feature of the EcoFlex 3L Storm Pants is the state-of-the-art TEK O2 technology that offers the most breathability ever tested, making them great for a variety of outdoor activities year-round.
Predicted Grade: CORRECT

So let’s look through these evaluated examples.

In the first example, we can see the prediction for ‘Do the Cozy Comfort Pullover Set have side pockets?’ is graded correct. But why are we using a language model to do the grading in the first place?

The actual strings of the real and predicted answers can be very different: one is a single word, the other a full sentence (in the earlier qa.run output, ‘Yes’ did not even appear in the predicted string). So exact string matching would not work as an evaluation method, and neither would a word-overlap metric such as the NLP text-similarity score BLEU, because the agreement here lies not in superficial aspects of language such as word choice but in deeper aspects such as meaning. And judging that kind of semantic agreement is exactly what language models can do, without relying on any specific rule.
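
To make this concrete, here is a tiny illustration of how naive string matching would mis-grade a semantically correct answer, using the earlier qa.run output:

# Naive string comparisons mark a semantically correct answer as wrong
real = "Yes"
predicted = "The Cozy Comfort Pullover Set, Stripe has side pockets on the pull-on pants."
print(real == predicted)                  # False - exact match fails
print(real.lower() in predicted.lower())  # False - "yes" never appears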

This is what makes evaluation of language models so hard in the first place - and, ironically, it is also what enables us to use language models to solve it. Earlier NLP evaluation metrics such as the BLEU score are inadequate for evaluating these more complex models, so we need new methods like the one shown here, which is currently one of the most popular.
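
Once we have the graded outputs, we can also roll them up into a single headline number. A minimal sketch, assuming the grade strings are ‘CORRECT’/‘INCORRECT’ as shown above:

# Summarise the LLM grades into a simple accuracy score
correct = sum(g['text'].strip().startswith("CORRECT") for g in graded_outputs)
print(f"{correct}/{len(graded_outputs)} correct ({correct / len(graded_outputs):.0%})")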

There is also the LangChain evaluation platform, which provides a way to do all of this and persist the results in a dedicated interface.

7 Acknowledgements

I’d like to express my thanks to the wonderful LangChain for LLM Application Development course by DeepLearning.ai, which I completed, and to acknowledge the use of some images and other materials from the course in this article.
