Evaluating Question and Answer Systems with Dynamic Data

In many real-world settings, the correct answer to a question may change over time. For example, if you’re building a Q&A system on top of a database or one that connects to an API, the underlying data may be updated regularly. In this post, we’ll overcome this issue using LangSmith and indirection: instead of storing labels directly as values, our labels will be references used to look up the relevant values.

Pranath Fernando


August 18, 2023

In many real-world settings, the correct answer to a question may change over time. For example, if you’re building a Q&A system on top of a database or one that connects to an API, the underlying data may be updated regularly. In such instances, you should still measure the correctness of your system, but you should do so in a way that accounts for these changes.

In an earlier article we introduced LangSmith and how it can help with LLM-based application evaluation.

In the following post, we will handle this issue using LangSmith and the age-old software practice of indirection. Rather than storing labels as values, we will use them as references to look up the correct values. In this example, our labels will be queries that the custom evaluator can use to retrieve the ground-truth answer and compare it to the model’s predictions.
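To make the idea concrete, here is a minimal, self-contained sketch of label indirection, with a toy dataframe standing in for the real data source: instead of storing the answer itself, we store the expression that produces it.

```python
import pandas as pd

# Toy stand-in for the real data source.
df = pd.DataFrame({"Survived": [0, 1, 1, 0, 1]})

# Direct label: frozen at dataset-creation time, goes stale when df changes.
direct_label = 3

# Indirect label: a reference (here, a code snippet) resolved at grading time.
indirect_label = "df['Survived'].sum()"

# The data source changes...
df = pd.concat([df, df], ignore_index=True)

# ...the direct label is now wrong, but dereferencing stays correct.
print(direct_label)          # 3
print(eval(indirect_label))  # 6
```

The stored snippet is only resolved when the evaluator runs, so the ground truth always reflects the current data.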

The article will walk you through the following steps:

  1. Create a dataset of questions and corresponding code snippets to fetch the answers.
  2. Define your Q&A system.
  3. Run evaluation using LangSmith with a custom evaluator.
  4. Re-test the system over time.

Quick note: we are using a CSV file to simulate a real data source. This is not a realistic scenario and is meant to be illustrative.

1 Prerequisites

This post uses OpenAI for the model and LangChain to compose the chain. To make sure the tracing and evals are set up for LangSmith, please configure your API Key appropriately.
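For example, tracing and the LangSmith client can be configured through environment variables (the key values shown here are placeholders to replace with your own):

```python
import os

# Enable LangSmith tracing and point the SDK at your account.
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "<your-langsmith-api-key>"
# OpenAI is used for the model, so its key is needed too.
os.environ["OPENAI_API_KEY"] = "<your-openai-api-key>"
```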


Install the required packages. We will use the latest version of langchain and use pandas as an example of a data store.

# %pip install -U "langchain[openai]" > /dev/null
# %pip install pandas > /dev/null

2 Create a dataset

We will be using the Titanic dataset for our example. This dataset contains information about Titanic passengers and their outcomes.

To begin, create a set of questions and accompanying references that demonstrate how to obtain the correct answer from the data. For the purposes of this tutorial, we will use Python code snippets, but the strategy can be applied to any other type of indirection, such as storing API requests or search arguments.

Our evaluator will consult the sources to determine the correct response.

questions = [
    ("How many passengers were on the Titanic?", "len(df)"),
    ("How many passengers survived?", "df['Survived'].sum()"),
    ("What was the average age of the passengers?", "df['Age'].mean()"),
    ("How many male and female passengers were there?", "df['Sex'].value_counts()"),
    ("What was the average fare paid for the tickets?", "df['Fare'].mean()"),
    ("How many passengers were in each class?", "df['Pclass'].value_counts()"),
    ("What was the survival rate for each gender?", "df.groupby('Sex')['Survived'].mean()"),
    ("What was the survival rate for each class?", "df.groupby('Pclass')['Survived'].mean()"),
    ("Which port had the most passengers embark from?", "df['Embarked'].value_counts().idxmax()"),
    ("How many children under the age of 18 survived?", "df[df['Age'] < 18]['Survived'].sum()"),
]

Next, create the dataset using the LangSmith SDK: create the dataset, then upload each example. Saving the dataset to LangSmith lets us reuse and relate test runs over time.

from langsmith import Client

client = Client()
dataset_name = "Dynamic Titanic CSV"
dataset = client.create_dataset(
    dataset_name=dataset_name, description="Test QA over CSV",
)

for example in questions:
    client.create_example(
        inputs={"question": example[0]},
        outputs={"code": example[1]},
        dataset_id=dataset.id,
    )

3 Define Q&A system

Now that the dataset has been created, we can define our question-answering system. For this walkthrough, we’ll use LangChain’s off-the-shelf pandas dataframe agent.

Load the Titanic data into a dataframe first, and then write a constructor for our agent.

import pandas as pd

titanic_path = "https://raw.githubusercontent.com/jorisvandenbossche/pandas-tutorial/master/data/titanic.csv"
df = pd.read_csv(titanic_path)
from functools import partial

from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.agents import create_pandas_dataframe_agent

llm = ChatOpenAI(model="gpt-4", temperature=0.0)
create_chain = partial(
    create_pandas_dataframe_agent,
    llm=llm,
    df=df,
)

# Example run
create_chain().invoke({"input": "How many passengers were on the Titanic?"})
{'input': 'How many passengers were on the Titanic?',
 'output': 'There were 891 passengers on the Titanic.'}

4 Run Evaluation

Now it’s time to define our custom evaluator. In this case we will inherit from the LabeledCriteriaEvalChain class. This evaluator takes the input, prediction, and reference label and passes them to an LLM to predict whether the prediction satisfies the provided criteria, conditioned on the reference label.

Our custom evaluator will make one small change to this evaluator by dereferencing the label to inject the correct value. We do this by overriding the _get_eval_input method. The LLM will then see the fresh reference value.

Reminder: we are using a CSV file to simulate a real data source here and performing an unsafe eval to query it. In a real scenario it would be better to issue a safe GET request or something similar.
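If executing arbitrary stored code is a concern, one alternative (a sketch, not part of the original walkthrough; all names here are illustrative) is to store named keys in the dataset instead of code snippets and dereference them through a whitelist of query functions, so only code you wrote in the evaluator ever runs:

```python
import pandas as pd

# Toy frame standing in for the real data source.
df = pd.DataFrame({"Survived": [0, 1, 1], "Age": [22.0, 38.0, 26.0]})

# Whitelisted queries: the dataset stores only these keys, never raw code.
QUERIES = {
    "passenger_count": lambda df: len(df),
    "survivor_count": lambda df: int(df["Survived"].sum()),
    "average_age": lambda df: float(df["Age"].mean()),
}

def dereference(label: str, df: pd.DataFrame):
    """Resolve a stored label against the current data, refusing unknown keys."""
    if label not in QUERIES:
        raise KeyError(f"Unknown query label: {label!r}")
    return QUERIES[label](df)

print(dereference("survivor_count", df))  # 2
```

This trades flexibility (every new question needs a registered query) for never running data-supplied code.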

from langsmith import Client
from langchain.smith import RunEvalConfig, run_on_dataset
from typing import Optional
from langchain.evaluation.criteria.eval_chain import LabeledCriteriaEvalChain

class CustomCriteriaEvalChain(LabeledCriteriaEvalChain):
    def _get_eval_input(
        self,
        prediction: str,
        reference: Optional[str],
        input: Optional[str],
    ) -> dict:
        # The parent class validates the reference is present and combines
        # the fields into a dictionary for the LLM chain.
        raw = super()._get_eval_input(prediction, reference, input)
        # Warning: this evaluates the code you've saved as labels in the dataset.
        # Be sure that the code is correct, and refrain from executing it in an
        # untrusted environment or when connected to a production server.
        raw["reference"] = eval(raw["reference"])
        return raw
client = Client()
eval_config = RunEvalConfig(
    custom_evaluators=[
        CustomCriteriaEvalChain.from_llm(
            criteria="correctness",
            llm=ChatOpenAI(model="gpt-4", temperature=0.0),
        ),
    ],
)
chain_results = run_on_dataset(
    client,
    dataset_name=dataset_name,
    llm_or_chain_factory=create_chain,
    evaluation=eval_config,
    # This agent doesn't support concurrent runs yet.
    concurrency_level=1,
)
View the evaluation results for project 'e1a16797963742018b9625ef311371ee-AgentExecutor' at:

With that evaluation running, you can navigate to the linked project and review the agent’s predictions and feedback scores.

5 Re-evaluate later in time

It’s safe to assume that the Titanic dataset hasn’t changed in the last few minutes, but in your case, fresh data is almost certainly arriving all the time. As long as the way of accessing that information hasn’t changed, we can reuse the old dataset.

Let’s simulate additional passengers boarding by duplicating some rows and shuffling some values. The agent will then be re-run on the new dataset.

df_doubled = pd.concat([df, df], ignore_index=True)
df_doubled['Age'] = df_doubled['Age'].sample(frac=1).reset_index(drop=True)
df_doubled['Sex'] = df_doubled['Sex'].sample(frac=1).reset_index(drop=True)
df = df_doubled
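Before re-running the agent, a quick sanity check (self-contained here, with a synthetic frame matching the Titanic sex counts) confirms that dereferencing the same stored label now yields the doubled ground truth:

```python
import pandas as pd

# Synthetic frame with the same Sex counts as the original Titanic data.
df = pd.DataFrame({"Sex": ["male"] * 577 + ["female"] * 314})

label = "df['Sex'].value_counts()"
before = {k: int(v) for k, v in eval(label).items()}

df = pd.concat([df, df], ignore_index=True)  # simulate new passengers boarding
after = {k: int(v) for k, v in eval(label).items()}

print(before)  # {'male': 577, 'female': 314}
print(after)   # {'male': 1154, 'female': 628}
```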
create_chain_2 = partial(
    create_pandas_dataframe_agent,
    llm=llm,
    df=df,
)
chain_results = run_on_dataset(
    client,
    dataset_name=dataset_name,
    llm_or_chain_factory=create_chain_2,
    evaluation=eval_config,
    concurrency_level=1,
)
View the evaluation results for project 'c1d72bd05c6342dba7b9c52d883ae995-AgentExecutor' at:

5.1 Review the results

You can see the results now that we’ve tested twice on the “changing” data source. If you go to the “dataset” page and click on the “examples” tab, you can view the predictions for each test run by clicking through different examples.

The view of the individual dataset rows is shown below. We can edit the example or examine all predictions from different test runs on that example by clicking on a row. Let’s choose one.

In this example, we choose the row with the question “How many male and female passengers were there?” The predictions for each test run are shown in a table of linked rows at the bottom of the page. When you call run_on_dataset, these runs are automatically associated with the example.

If you look closely at the predictions, you’ll notice that they’re all different. The agent initially predicted 577 male and 314 female passengers. For the second test run, it predicted 1154 male and 628 female passengers.

However, both test runs were marked as “correct”. The values in the data source changed, but the mechanism for retrieving the answer did not.

But how can you be certain that the “correct” grade is accurate? Now is a good moment to double-check the run trace of your custom evaluator to ensure that it is functioning properly. You can view the evaluation trace by directly clicking on the arrows on the “correctness” chips in the table. Otherwise, you can navigate to the run, then to the feedback tab, and then to your custom evaluator’s trace for that example. Screenshots of the retrieved values for each of the preceding runs are shown below.

You can see that the “reference” key contains the dereferenced value from the data source, and that it matches the predictions from the runs above! The first one shows 577 male and 314 female passengers.

And, after updating the dataframe, the evaluator retrieved the correct reference of 1154 male and 628 female passengers, which matches the predictions from the second run!

Seems to be working well!

6 Conclusion

In this post, we examined a Q&A system that was linked to an evolving data store. We accomplished this by employing a custom evaluator that dynamically retrieves the answer based on a static reference (in this case, a code snippet).

This is only one solution to the challenge of evaluating Q&A systems when the underlying data source changes. This approach is straightforward and immediately tests the accuracy of your system end-to-end using current data. It can be useful if you wish to monitor your performance on a regular basis.

It is less reliable if your purpose is to compare two different prompts or models, because the underlying data may differ between runs. Depending on how you dereference the labels, caution and correct permissioning are also required.

Other options to evaluate your system in this scenario include:

  • Freezing or mocking the data source(s) used for evaluation. You can then invest in hand-labeling periodically to make sure the data is still representative of the production environment.
  • Testing the query-generation capability of your agent directly and evaluating the equivalence of the queries. This is less “end-to-end”, but depending on how you compare, you’d avoid any potential issues caused by unsafe dereferencing.
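A sketch of that second option (illustrative names, not a LangSmith API): rather than grading final answers, execute both the reference query and the agent-generated query against the same fixed frame and compare their results:

```python
import pandas as pd

# Fixed frame used only for comparing queries.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Sex": ["male", "female", "female", "male"],
})

def queries_equivalent(reference_code: str, generated_code: str, df: pd.DataFrame) -> bool:
    """Compare two pandas queries by the results they produce on the same frame.

    This still uses eval for brevity; a production version should parse or
    sandbox the generated code instead.
    """
    ref, gen = eval(reference_code), eval(generated_code)
    if isinstance(ref, pd.Series):
        return isinstance(gen, pd.Series) and ref.equals(gen)
    return bool(ref == gen)

# Two syntactically different but equivalent queries count as a match.
print(queries_equivalent("df['Survived'].sum()", "sum(df['Survived'])", df))  # True
```

Comparing results rather than query strings avoids penalizing the agent for harmless syntactic differences.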

7 Acknowledgements

I’d like to express my thanks to the wonderful LangSmith Cookbook repo and acknowledge the use of some images and other materials from that project in writing this article.