Measuring the Accuracy of an LLM-Based Question-Answering System

Evaluating a question-answering system can help you improve its system design as well as the prompt and model quality. We tend to improve what we can measure, so verifying correctness is a key focus. In this post, we will utilise LangSmith to test the accuracy of a Q&A system against an example dataset.
natural-language-processing
deep-learning
langchain
openai
llm-evaluation
Author

Pranath Fernando

Published

August 17, 2023

1 Introduction

Evaluating a question-answering system can help you improve its system architecture as well as the prompt and model quality. We tend to improve what we can measure, so verifying correctness is a key focus. One difficulty in gauging accuracy is that the responses are unstructured text. A Q&A system can generate lengthy responses, rendering typical metrics like BLEU or ROUGE unreliable. In this case, employing a well-labeled dataset and LLM-assisted evaluators can help you rate the response quality of your system. This supplements any human review and other measurements you may have already implemented.

In an earlier article, we introduced LangSmith and how it can help with evaluating LLM-based applications.

In this post, we will utilise LangSmith to validate a Q&A system against an example dataset. The main steps are as follows:

  1. Create a dataset of questions and answers.
  2. Define your question and answering system.
  3. Run evaluation using LangSmith.
  4. Iterate to improve the system.

The test run will be saved in a project along with all its feedback and links to every evaluator run.

Note 1: This walkthrough tests the end-to-end behavior of the system. Separately evaluating each component of the system is still important! Many components, such as the retrievers, can be tested separately using standard retrieval metrics to complement this full integration test (a minimal sketch follows these notes).

Note 2: If your knowledge base is changing, make sure your answers are still correct! You can avoid this through some combination of independent testing of chain components, freezing the knowledge source used during testing, and regularly updating your dataset.
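As an illustration of what a standalone retrieval check could look like, here is a minimal sketch using recall@k over hand-labelled (query, relevant source) pairs. The helper function and the example data are assumptions made for this sketch and are not part of the original walkthrough.

# Illustrative sketch only: recall@k for a retriever, given hand-labelled relevant sources.
def recall_at_k(retrieved_sources, relevant_sources, k=4):
    """Fraction of the labelled relevant sources that appear in the top-k retrieved sources."""
    hits = sum(1 for src in retrieved_sources[:k] if src in relevant_sources)
    return hits / max(len(relevant_sources), 1)

# Made-up example data for demonstration purposes.
retrieved = ["docs/feedback.html", "docs/datasets.html", "docs/tracing.html"]
relevant = {"docs/feedback.html"}
print(recall_at_k(retrieved, relevant))  # 1.0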

2 Prerequisites

This tutorial uses OpenAI for the model, ChromaDB to store documents, and LangChain to compose the chain. To make sure the tracing and evals are set up for LangSmith, please configure your API Key appropriately.

# %env LANGCHAIN_API_KEY=<YOUR_API_KEY>

Install the required packages. lxml and html2text are used to load and convert the documentation pages.

# %pip install -U "langchain[openai]" > /dev/null
# %pip install chromadb > /dev/null
# %pip install lxml > /dev/null
# %pip install html2text > /dev/null
# %env OPENAI_API_KEY=<YOUR-API-KEY>
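Alternatively, here is a minimal sketch of setting the same environment variables from Python rather than notebook magics. The tracing variable shown is the standard LangChain setting for sending traces to LangSmith; substitute your own key values.

# Sketch: configure the environment from Python instead of %env magics.
import os

os.environ["LANGCHAIN_API_KEY"] = "<YOUR_API_KEY>"    # LangSmith API key
os.environ["LANGCHAIN_TRACING_V2"] = "true"           # enable tracing to LangSmith
os.environ["OPENAI_API_KEY"] = "<YOUR-API-KEY>"       # OpenAI API key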

3 Create a Dataset

In our case, we will evaluate a Q&A system over the LangSmith documentation. To calculate aggregate accuracy, we’ll need to compile a list of example question-answer pairs. To demonstrate the procedure, we’ve hard-coded several examples below. In general, you’ll need a lot more pairs (>100) to get meaningful results. Drawing from actual user inquiries can help create a more accurate portrayal of the domain.

Below, we define the hard-coded question-answer pairs and create each example row using the client’s ‘create_example’ method.

# We have some hard-coded examples here.
examples = [
    ("What is LangChain?", "LangChain is an open-source framework for building applications using large language models. It is also the name of the company building LangSmith."),
    ("How might I query for all runs in a project?", "client.list_runs(project_name='my-project-name'), or in TypeScript, client.ListRuns({projectName: 'my-project-anme'})"),
    ("What's a langsmith dataset?", "A LangSmith dataset is a collection of examples. Each example contains inputs and optional expected outputs or references for that data point."),
    ("How do I use a traceable decorator?", """The traceable decorator is available in the langsmith python SDK. To use, configure your environment with your API key,\
import the required function, decorate your function, and then call the function. Below is an example:
```python
from langsmith.run_helpers import traceable
@traceable(run_type="chain") # or "llm", etc.
def my_function(input_param):
    # Function logic goes here
    return output
result = my_function(input_param)
```"""),
    ("Can I trace my Llama V2 llm?", "So long as you are using one of LangChain's LLM implementations, all your calls can be traced"),
    ("Why do I have to set environment variables?", "Environment variables can tell your LangChain application to perform tracing and contain the information necessary to authenticate to LangSmith."
     " While there are other ways to connect, environment variables tend to be the simplest way to configure your application."),
    ("How do I move my project between organizations?", "LangSmith doesn't directly support moving projects between organizations.")
]
from langsmith import Client

client = Client()
dataset_name = "Retrieval QA Questions"
dataset = client.create_dataset(dataset_name=dataset_name)
for q, a in examples:
    client.create_example(inputs={"question": q}, outputs={"answer": a}, dataset_id=dataset.id)

4 Define RAG Q&A System

Our Q&A system employs a straightforward retriever and an LLM response generator. To further simplify, the chain will consist of:

  1. A VectorStoreRetriever to retrieve documents. This uses:
    • An embedding model to vectorize documents and user queries for retrieval. In this case, the OpenAIEmbeddings model.
    • A vectorstore; in this case, we will use Chroma.
  2. A response generator. This uses:
    • A ChatPromptTemplate to combine the query and documents.
    • An LLM, in this case, the 16k token context window version of gpt-3.5-turbo via ChatOpenAI.

We will combine them using the LangChain Expression Language (LCEL).

First, load the documents to populate the vectorstore:

from langchain.document_loaders import RecursiveUrlLoader
from langchain.document_transformers import Html2TextTransformer
from langchain.text_splitter import TokenTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

api_loader = RecursiveUrlLoader("https://docs.smith.langchain.com")
text_splitter = TokenTextSplitter(
    model_name="gpt-3.5-turbo",
    chunk_size=2000,
    chunk_overlap=200,
)
doc_transformer = Html2TextTransformer()
raw_documents = api_loader.load()
transformed = doc_transformer.transform_documents(raw_documents)
documents = text_splitter.split_documents(transformed)

With the documents prepared, create the vectorstore retriever. This is what will be used to provide context when generating a response.

embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(documents, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
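Before wiring up the full chain, you can optionally sanity-check the retriever on a sample query. This check is an addition for illustration and is not part of the original walkthrough.

# Optional sanity check: inspect what the retriever returns for a sample query.
sample_docs = retriever.get_relevant_documents("How do I log user feedback to a run?")
print(len(sample_docs))
print(sample_docs[0].page_content[:200])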

Next up, we’ll define the response generator. This responds to the user by injecting the retrieved documents and the user query into a prompt template.

from langchain.prompts import ChatPromptTemplate
from langchain.chat_models import ChatOpenAI
from langchain.schema.output_parser import StrOutputParser

from datetime import datetime

prompt = ChatPromptTemplate.from_messages(
        [
            ("system", "You are a helpful documentation Q&A assistant, trained to answer"
            " questions from LangSmith's documentation."
            " LangChain is a framework for building applications using large language models."
            "\nThe current time is {time}.\n\nRelevant documents will be retrieved in the following messages."),
            ("system", "{context}"),
            ("human","{question}")
        ]
    ).partial(time=str(datetime.now()))
    
model = ChatOpenAI(model="gpt-3.5-turbo-16k", temperature=0)
response_generator = (
    prompt 
    | model 
    | StrOutputParser()
)
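You can also try the response generator on its own by passing a hand-written context. The context string below is made up purely for illustration.

# Optional standalone check of the response generator with a hand-written context.
print(response_generator.invoke({
    "context": "Feedback can be logged to a run with the LangSmith client's create_feedback method.",
    "question": "How do I log user feedback to a run?",
}))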

Finally, assemble the full chain!

# The full chain looks like the following
from operator import itemgetter

chain = (
    # The runnable map here routes the original inputs to a context and a question dictionary to pass to the response generator
    {
        "context": itemgetter("question") | retriever | (lambda docs: "\n".join([doc.page_content for doc in docs])),
        "question": itemgetter("question")
    }
    | response_generator
)
for tok in chain.stream({"question": "How do I log user feedback to a run?"}):
    print(tok, end="", flush=True)
To log user feedback to a run, you can use the LangSmith client or the REST API. Here's an example of how to log user feedback using the LangSmith client in Python:

```python
from langsmith import Client

client = Client()
feedback = client.create_feedback(
    "<run_id>",
    "<feedback_key>",
    score=True,
    comment="This is a positive feedback from the user."
)
```

In this example, you need to replace `<run_id>` with the ID of the run you want to log feedback for, and `<feedback_key>` with a key that represents the type of feedback you want to log (e.g., "positive", "negative", "accuracy", etc.). You can also provide additional information such as a score, comment, and other optional fields.

Make sure you have the necessary environment variables set up for the LangSmith client to authenticate with the LangSmith API.

If you prefer to use the REST API directly, you can make a POST request to the `/feedback` endpoint with the necessary parameters.

Remember that logging user feedback is an important step in improving your LLM application and ensuring a high-quality user experience.

5 Evaluate the Chain

We will use the off-the-shelf QA evaluator to measure the correctness of the retrieval Q&A responses.

from langchain.smith import RunEvalConfig

eval_config = RunEvalConfig(
    evaluators=["qa"],
    # If you want to configure the eval LLM:
    # eval_llm=ChatAnthropic(model="claude-2", temperature=0)
)

Run the evaluation. This makes predictions over the dataset and then uses the “QA” evaluator to check the correctness on each data point.

_ = await client.arun_on_dataset(
    dataset_name=dataset_name,
    llm_or_chain_factory=lambda: chain,
    evaluation=eval_config,
)
View the evaluation results for project '6ed4213fc4c54b3fbcfdd9cba14e87f0-RunnableSequence' at:
https://smith.langchain.com/projects/p/da05d8be-995f-4ee7-8d1b-ce8943bb085e?eval=true

You can visit the resulting “test run” project to examine the chain’s outputs, evaluator feedback, and links to the evaluation traces as the test run progresses.

From the test project page, you can filter the results by feedback metrics. For example, in the filters section, click on “Correctness==0” to show the examples marked as incorrect.
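If you prefer to inspect results programmatically, a sketch along the following lines can list runs in the test project and surface their correctness feedback. The project-name placeholder and the loop are assumptions for illustration; the UI filter is usually more convenient.

# Illustrative sketch: list runs in a test project and print their "correctness" feedback.
runs = client.list_runs(project_name="<your-test-run-project-name>")
for run in runs:
    for feedback in client.list_feedback(run_ids=[run.id]):
        if feedback.key == "correctness" and feedback.score == 0:
            print(run.id, run.inputs.get("question"), feedback.comment)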

After you’ve filtered the results, you can click on the individual runs to view their traces and triage where the chain failed. Selecting the “Feedback” tab shows the evaluation results for that run.

You can also open the trace of the evaluator run itself. Because LLM-assisted evaluations are imperfect, analysing their traces lets you audit the feedback decisions and decide when and how to adapt the prompt to your individual use case.

This trace was marked as “incorrect”. It looks like the chain is making up information, or “hallucinating.” If you click on the ChatOpenAI run in your own test project, you can open it in the playground to experiment with changes that may address this error.

Let’s try tweaking the prompt to better instruct the model. We’ll add an additional system message to remind the model to only respond based on the retrieved documents. Click “Add Message” and paste in the following text:

Respond as best as you can. If no documents are retrieved or if you do not see an answer in the retrieved documents, admit you do not know or that you don’t see it being supported at the moment.

Click “Submit” to view the results streamed to the message in the right column. If you haven’t already added your OpenAI key, you can do so using the “Secrets & API Keys” button.

That seems to have the desired effect for this data point, but we want to be careful that we’re not overfitting to a single example. We’ll want to re-evaluate to confirm.

6 Iterate

The chain performed reasonably well, and in the previous section we used the playground to prototype a potential fix for the failing example. Let’s re-run the evaluation with the updated prompt to see how it does overall. We’ve duplicated the chain code below and added an additional system message to the chat prompt template:

prompt = ChatPromptTemplate.from_messages(
        [
            ("system", "You are a helpful documentation Q&A assistant, trained to answer"
            " questions from LangSmith's documentation."
            "\nThe current time is {time}.\n\nRelevant documents will be retrieved in the following messages."),
            ("system", "{context}"),
            ("human","{question}"),
            # Add the new system message here:
            ("system", "Respond as best as you can. If no documents are retrieved or if you do not see an answer in the retrieved documents,"
             " admit you do not know or that you don't see it being supported at the moment."),
        ]
    ).partial(time=lambda: str(datetime.now()))
    
model = ChatOpenAI(model="gpt-3.5-turbo-16k", temperature=0)
response_generator_2 = (
    prompt 
    | model 
    | StrOutputParser()
)
chain_2 = (
    {
        "context": itemgetter("question") | retriever | (lambda docs: "\n".join([doc.page_content for doc in docs])),
        "question": itemgetter("question")
    }
    | response_generator_2
)

Rerun the evaluation and check out the results as they become available.

_ = await client.arun_on_dataset(
    dataset_name=dataset_name,
    llm_or_chain_factory=lambda: chain_2,
    evaluation=eval_config,
)
View the evaluation results for project 'e42329ecc0c3492282fe02bead607ce9-RunnableSequence' at:
https://smith.langchain.com/projects/p/e42be46d-a0e3-40f6-ac58-59bcb70c3e6a?eval=true

We can now compare the results. To see the aggregate feedback metrics for each test run, go to the “Retrieval QA Questions” dataset page. Click the datasets & testing icon in the left sidebar to view your datasets.

It appears that the new chain now passes all of the cases. Remember that this illustrative dataset is far too small to give a thorough picture of the chain’s performance; we can keep adding examples to it as we continue to prototype the chain.
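Adding examples later follows the same pattern as before. For instance, a sketch like the following (with placeholder content) appends new pairs to the existing dataset.

# Sketch: append additional question-answer pairs to the existing dataset.
more_examples = [
    ("<new question>", "<reference answer>"),  # placeholder pair
]
dataset = client.read_dataset(dataset_name=dataset_name)
for q, a in more_examples:
    client.create_example(inputs={"question": q}, outputs={"answer": a}, dataset_id=dataset.id)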

In addition to the aggregate feedback metrics, you can view the predictions for each row. To view each row in the dataset, select the “Examples” tab. When you click on a specific example, it will display the results of both test runs for that data point. You can quickly compare predictions across chain versions using the linked runs table to get an idea of the types of outputs to expect, and you can view the full traces by clicking on each linked run.

7 Conclusion

You’ve just completed a quick assessment of the accuracy of your Q&A system. In this post, you used LangSmith to uncover problems in a RAG pipeline and make immediate changes to improve the chain’s performance. You’ve also learnt about evaluator feedback and how to incorporate it into your LLM app development process. This is an excellent place to start when it comes to improving the consistency of your LLM applications.

8 Acknowledgements

I’d like to express my thanks to the wonderful LangSmith Cookbook repo and acknowledge the use of some images and other materials from this project in writing this article.
