Comparing Question and Answer LLM System Outputs

In this article we show how to use labeled preference scoring to compare two versions of a system and choose between their outputs.
natural-language-processing
deep-learning
langchain
openai
llm-evaluation
Author

Pranath Fernando

Published

August 19, 2023

The most common way to compare two models is to run them both on the same dataset and compare the aggregate metrics. This is valuable, but it can miss important information about the relative quality of the two system variants. In such cases, it can be useful to run direct pairwise comparisons on the responses and examine the resulting preference scores.

In this post, we’ll show you how to do it in code. As a motivating example, we will employ a retrieval Q&A system over LangSmith’s docs.

The main steps are:

  1. Setup
    • Create a dataset of questions and answers.
    • Define different versions of your chains to evaluate.
    • Evaluate chains directly on a dataset using regular metrics (e.g. correctness).
  2. Evaluate the pairwise preferences over that dataset

In this case, we will test the impact of chunk sizes on our result quality.

1 Prerequisites

This tutorial uses OpenAI for the model, ChromaDB to store documents, and LangChain to compose the chain. To make sure tracing and evals are routed to LangSmith, please configure your API key appropriately.

We will also use pandas to render the results in the notebook.

# %env LANGCHAIN_API_KEY=<YOUR_API_KEY>

Install the required packages. lxml and html2text are used by the document loader.

# %pip install -U "langchain[openai]" --quiet
# %pip install chromadb --quiet
# %pip install lxml --quiet
# %pip install html2text --quiet
# %pip install pandas --quiet
# %pip install nest_asyncio --quiet
# %env OPENAI_API_KEY=<YOUR-API-KEY>
# Used for running in jupyter
import nest_asyncio

nest_asyncio.apply()

2 Setup

2.1 Create a dataset

A development dataset is required for any evaluation procedure. To demonstrate the procedure, we’ve hard-coded a few examples below. In general, statistically significant results require a large number of pairs (>100). Drawing on actual user inquiries helps provide a more accurate portrayal of the domain.

examples = [
    ("What is LangChain?", "LangChain is an open-source framework for building applications using large language models. It is also the name of the company building LangSmith."),
    ("How might I query for all runs in a project?", "client.list_runs(project_name='my-project-name'), or in TypeScript, client.ListRuns({projectName: 'my-project-anme'})"),
    ("What's a langsmith dataset?", "A LangSmith dataset is a collection of examples. Each example contains inputs and optional expected outputs or references for that data point."),
    ("How do I use a traceable decorator?", """The traceable decorator is available in the langsmith python SDK. To use, configure your environment with your API key,\
import the required function, decorate your function, and then call the function. Below is an example:
```python
from langsmith.run_helpers import traceable
@traceable(run_type="chain") # or "llm", etc.
def my_function(input_param):
    # Function logic goes here
    return output
result = my_function(input_param)
```"""),
    ("Can I trace my Llama V2 llm?", "So long as you are using one of LangChain's LLM implementations, all your calls can be traced"),
    ("Why do I have to set environment variables?", "Environment variables can tell your LangChain application to perform tracing and contain the information necessary to authenticate to LangSmith."
     " While there are other ways to connect, environment variables tend to be the simplest way to configure your application."),
    ("How do I move my project between organizations?", "LangSmith doesn't directly support moving projects between organizations.")
]
from langsmith import Client

client = Client()
dataset_name = "Retrieval QA Questions"
dataset = client.create_dataset(dataset_name=dataset_name)
for q, a in examples:
    client.create_example(inputs={"question": q}, outputs={"answer": a}, dataset_id=dataset.id)

2.2 Define RAG Q&A system

Our Q&A system employs a straightforward retriever and an LLM response generator. To further simplify, the chain will consist of:

  1. A VectorStoreRetriever to retrieve documents. This uses:
    • An embedding model to vectorize documents and user queries for retrieval. In this case, the OpenAIEmbeddings model.
    • A vectorstore, in this case we will use Chroma.
  2. A response generator. This uses:
    • A ChatPromptTemplate to combine the query and documents.
    • An LLM, in this case, the 16k token context window version of gpt-3.5-turbo via ChatOpenAI.

We will combine them using LangChain’s expression syntax.

First, load the documents to populate the vectorstore:

from langchain.document_loaders import RecursiveUrlLoader
from langchain.document_transformers import Html2TextTransformer
from langchain.text_splitter import TokenTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

api_loader = RecursiveUrlLoader("https://docs.smith.langchain.com")
doc_transformer = Html2TextTransformer()
raw_documents = api_loader.load()
transformed = doc_transformer.transform_documents(raw_documents)

def create_retriever(transformed_documents, text_splitter):
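    # Split the transformed docs, embed them with OpenAI, and return a retriever that serves the top 4 chunks per query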
    documents = text_splitter.split_documents(transformed_documents)
    embeddings = OpenAIEmbeddings()
    vectorstore = Chroma.from_documents(documents, embeddings)
    return vectorstore.as_retriever(search_kwargs={"k": 4})
/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/bs4/builder/__init__.py:545: XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.
  warnings.warn(

Next up, we’ll define the chain. Since we are going to vary the retriever parameters, our constructor will take the retriever as an argument.

from datetime import datetime
from operator import itemgetter

from langchain.prompts import ChatPromptTemplate
from langchain.chat_models import ChatOpenAI
from langchain.schema.output_parser import StrOutputParser


def create_chain(retriever):
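    # Build the prompt; {time} is pre-filled via .partial, so only {context} and {question} remain at run time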
    prompt = ChatPromptTemplate.from_messages(
            [
                ("system", "You are a helpful documentation Q&A assistant, trained to answer"
                " questions from LangSmith's documentation."
                " LangChain is a framework for building applications using large language models."
                "\nThe current time is {time}.\n\nRelevant documents will be retrieved in the following messages."),
                ("system", "{context}"),
                ("human","{question}")
            ]
        ).partial(time=str(datetime.now()))

    model = ChatOpenAI(model="gpt-3.5-turbo-16k", temperature=0)
    response_generator = (
        prompt 
        | model 
        | StrOutputParser()
    )
    chain = (
        # The runnable map here routes the original inputs to a context and a question dictionary to pass to the response generator
        {
            "context": itemgetter("question") | retriever | (lambda docs: "\n".join([doc.page_content for doc in docs])),
            "question": itemgetter("question")
        }
        | response_generator
    )
    return chain

With the documents prepared and the chain constructor ready, it’s time to create and evaluate our chains. We will vary the chunk size and overlap to evaluate their impact on response quality.

text_splitter = TokenTextSplitter(
    model_name="gpt-3.5-turbo",
    chunk_size=2000,
    chunk_overlap=200,
)
retriever = create_retriever(transformed, text_splitter)

chain_1 = create_chain(retriever)
# We will shrink both the chunk size and overlap
text_splitter_2 = TokenTextSplitter(
    model_name="gpt-3.5-turbo",
    chunk_size=500,
    chunk_overlap=50,
)
retriever_2 = create_retriever(transformed, text_splitter_2)

chain_2 = create_chain(retriever_2)

2.3 Evaluate the chains

At this point, we’re still going through the standard development -> evaluation cycle. We have two candidates and will evaluate them using a LangChain correctness evaluator. We will generate predictions for each question in the dataset and log feedback from the evaluator for each data point by running ‘run_on_dataset’.

from langchain.smith import RunEvalConfig

eval_config = RunEvalConfig(
    # We will use the chain-of-thought Q&A correctness evaluator
    evaluators=["cot_qa"],
)
results = client.run_on_dataset(
    dataset_name=dataset_name,
    llm_or_chain_factory=chain_1,
    evaluation=eval_config
)
project_name = results["project_name"]
View the evaluation results for project '883bc88d0e994280aee27945d7c65496-RunnableSequence' at:
https://smith.langchain.com/projects/p/ab79529c-e05a-4fe3-b290-1b62523b4572?eval=true
results_2 = client.run_on_dataset(
    dataset_name=dataset_name,
    llm_or_chain_factory=chain_2,
    evaluation=eval_config
)
project_name_2 = results_2["project_name"]
View the evaluation results for project '476151a4fdcd468b8f84b3312f61345a-RunnableSequence' at:
https://smith.langchain.com/projects/p/abddba69-614f-4d2c-8dd3-96dbd8942b67?eval=true

Now you should have two test run projects over the same dataset. If you click on one, it should look something like the following:

You can look at the aggregate results here and for the other project to compare them, but let’s move on to the pairwise comparison.

3 Pairwise Evaluation

Suppose that when the two approaches are assessed separately, they yield similar aggregate results.

We can use a pairwise evaluator to predict which of the two outputs would be preferred. To begin, we will construct a couple of helper functions to run the evaluator on each prediction pair. Let’s dissect the main function:

  • The function accepts a dataset example and loads the predictions of each model on that data point.
  • It randomises the order of the predictions before invoking the evaluator. This is done to account for any ordering bias in the evaluator LLM.
  • When the evaluation result is returned, we validate it and then log feedback for both models.

Once this is complete, the values are all returned so we can display them in a table later in the notebook.

import random
import logging

def _get_run_and_prediction(example_id, project_name):
    run = next(client.list_runs(reference_example_id=example_id, project_name=project_name))
    prediction = next(iter(run.outputs.values()))
    return run, prediction

def _log_feedback(run_ids):
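    # Called with (runner_up_id, preferred_id): enumerate assigns score 0 to the runner-up and 1 to the preferred run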
    for score, run_id in enumerate(run_ids):
        client.create_feedback(run_id, key="preference", score=score)

def predict_preference(example, project_a, project_b, eval_chain):
    example_id = example.id
    run_a, pred_a = _get_run_and_prediction(example_id, project_a)
    run_b, pred_b = _get_run_and_prediction(example_id, project_b)
    input_, answer = example.inputs['question'], example.outputs['answer']
    result = {"input": input_, "answer": answer, "A": pred_a, "B": pred_b}

    # Flip a coin to average out persistent positional bias
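    # Note: after a swap, the A/B labels refer to display position rather than a fixed chain; feedback is still logged to the correct run IDs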
    if random.random() < 0.5:
        result['A'], result['B'] = result['B'], result['A']
        run_a, run_b = run_b, run_a
    try:
        eval_res = eval_chain.evaluate_string_pairs(
            prediction=result['A'],
            prediction_b=result['B'],
            input=input_, 
            reference=answer
        )
    except Exception as e:
        logging.warning(e)
        return result

    if eval_res["value"] is None:
        return result

    preferred_run = (run_a.id, "A") if eval_res["value"] == "A" else (run_b.id, "B")
    runner_up_run = (run_b.id, "B") if eval_res["value"] == "A" else (run_a.id, "A")
    _log_feedback((runner_up_run[0], preferred_run[0]))
    result["Preferred"] = preferred_run[1]
    return result

For this example, we will use LangChain’s off-the-shelf labeled_pairwise_string evaluator. By default, it instructs the evaluation LLM to choose a preference based on helpfulness, relevance, correctness, and depth of thought. In your case, you will likely want to customize the criteria used!

For more information on how to configure it, check out the Labeled Pairwise String Evaluator documentation and inspect the resulting traces when running this notebook.

from langchain.evaluation import load_evaluator

pairwise_evaluator = load_evaluator("labeled_pairwise_string")
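
If your application calls for different criteria (say, conciseness matters more than depth of thought), you can pass a criteria argument when loading the evaluator. A minimal sketch, with an illustrative variable name and criterion wording:

custom_pairwise_evaluator = load_evaluator(
    "labeled_pairwise_string",
    criteria={
        "helpfulness": "Does the response directly and fully answer the question?",
        "conciseness": "Is the response free of unnecessary detail?",
    },
)
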
import functools
from langchain.schema.runnable import RunnableLambda


eval_func = functools.partial(
    predict_preference,
    project_a=project_name,
    project_b=project_name_2,
    eval_chain=pairwise_evaluator,
)


# We will wrap in a lambda to take advantage of its default `batch` convenience method
runnable = RunnableLambda(eval_func)
examples = list(client.list_examples(dataset_name="Retrieval QA Questions"))
values = runnable.batch(examples)
WARNING:root:Invalid verdict: Final decision: [[A. Verdict must be one of 'A', 'B', or 'C'.

The “preference” feedback was automatically logged to the test projects you created in section 2.3 by the ‘predict_preference’ calls above. The image below shows the same test run as before, but with the preference scores added. This version appears to be preferred less often than the other!

Because the ‘predict_preference’ function we built above does not log any preference feedback in the event of a tie (or an evaluator error), some of the rows do not have a corresponding preference score. You are free to modify this behaviour as you see fit.
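
For example, to count ties rather than skip them, you could (hypothetically) replace the tie branch inside ‘predict_preference’ with something like the snippet below, logging a score of 0.5 to both runs:

# Hypothetical replacement for the tie branch in predict_preference
# (the off-the-shelf evaluator surfaces a tie as a None value, which is what the existing check tests for)
if eval_res["value"] is None:
    for run_id in (run_a.id, run_b.id):
        client.create_feedback(run_id, key="preference", score=0.5)
    result["Preferred"] = "Tie"
    return result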

In the app, you can also check the feedback results for the other test run to see how well the evaluator’s judgements match your expectations.
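
If you would rather pull the aggregate feedback programmatically instead of through the app, something along the lines of the sketch below should work; it assumes the project summary returned by the LangSmith client exposes a feedback_stats field:

# Sketch: print aggregate feedback (correctness and preference) for each test project
for name in (project_name, project_name_2):
    project = client.read_project(project_name=name)
    print(name, project.feedback_stats)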

import pandas as pd

df = pd.DataFrame(values)
from IPython.core.display import HTML

def word_wrap_on_hover(df):
    """
    Takes a Pandas DataFrame and returns an HTML table with word wrap on hover.
    """
    styles = """
    <style>
        .hover_table td {
            max-width: 200px; /* You can adjust this value */
            overflow: hidden;
            text-overflow: ellipsis;
            white-space: nowrap;
        }
        .hover_table td:hover {
            white-space: normal;
            word-wrap: break-word;
        }
    </style>
    """
    html_table = df.to_html(classes='hover_table')
    return HTML(styles + html_table)

word_wrap_on_hover(df)
input answer A B Preferred
0 How do I move my project between organizations? LangSmith doesn't directly support moving projects between organizations. Currently, LangSmith does not support project migration between organizations. If you want to move your project to a different organization, the recommended approach is to create a new project within the desired organization and manually transfer the necessary data, such as runs and datasets, to the new project.\n\nYou can achieve this by using the LangSmith SDK or REST API to export the data from the original project and import it into the new project. Here are the general steps you can follow:\n\n1. Export the runs and datasets from the original project using the LangSmith API or SDK. You can filter the runs and datasets based on your requirements.\n\n2. Save the exported data to a local file or storage location.\n\n3. Create a new project in the desired organization using the LangSmith web app or API.\n\n4. Import the exported runs and datasets into the new project using the LangSmith API or SDK. Make sure to map the data correctly to the new project's structure.\n\nBy following these steps, you can effectively move your project from one organization to another. Currently, LangSmith does not support project migration between organizations. However, you can manually export and import runs and datasets using the SDK to imitate project migration. Here's how you can do it:\n\n1. Export Runs: Use the LangSmith SDK to export the runs from your current organization. You can find examples of exporting runs in the LangSmith SDK documentation.\n\n2. Export Datasets: Similarly, use the LangSmith SDK to export the datasets associated with your project.\n\n3. Create a New Project: In the new organization, create a new project where you want to move your project to.\n\n4. Import Runs: Use the LangSmith SDK to import the exported runs into the new project. Again, you can find examples of importing runs in the LangSmith SDK documentation.\n\n5. Import Datasets: Finally, use the LangSmith SDK to import the exported datasets into the new project.\n\nBy following these steps, you can effectively move your project from one organization to another. A
1 Why do I have to set environment variables? Environment variables can tell your LangChain application to perform tracing and contain the information necessary to authenticate to LangSmith. While there are other ways to connect, environment variables tend to be the simplest way to configure your application. Setting environment variables is a common practice in software development for configuring and customizing the behavior of applications. In the context of LangSmith and LangChain, environment variables are used to enable and configure tracing, which allows you to log and monitor the execution of your language models.\n\nBy setting the `LANGCHAIN_TRACING_V2` environment variable, you enable tracing for your LangChain applications. This allows you to log the input, output, and other relevant information about each run of your language model. Tracing can be useful for debugging, understanding model behavior, and monitoring the performance of your applications.\n\nSetting environment variables provides a flexible and configurable way to enable tracing without modifying your code. It allows you to control tracing behavior across different environments and deployments without the need for code changes. Setting environment variables is a common practice in software development to configure the behavior of applications or libraries. In the case of LangSmith, setting environment variables is necessary to enable tracing and logging of run traces to LangSmith.\n\nBy setting the `LANGCHAIN_TRACING_V2` environment variable, LangChain components such as LLMs, chains, agents, tools, and retrievers will automatically log their calls and responses to LangSmith. This allows you to easily debug and monitor the behavior of your application, track inputs and outputs, analyze latency and token usage, and more.\n\nUsing environment variables provides a flexible and consistent way to configure tracing across different programming languages and environments. It allows you to enable or disable tracing without modifying your code, making it easier to switch between development, testing, and production environments.\n\nWhile setting environment variables may require some initial setup, it provides a powerful and convenient way to leverage the capabilities of LangSmith for debugging and monitoring your LangChain applications. B
2 Can I trace my Llama V2 llm? So long as you are using one of LangChain's LLM implementations, all your calls can be traced Yes, you can trace your Llama V2 LLM using LangSmith's tracing feature. Tracing allows you to log the runs of your LLM application and inspect the inputs and outputs of each component in the chain. To enable tracing for your Llama V2 LLM, you need to set the `LANGCHAIN_TRACING_V2` environment variable to `true` before running your application. Here's an example of how to enable tracing in Python:\n\n```python\nimport os\nfrom langchain.chat_models import LlamaV2\n\nos.environ["LANGCHAIN_TRACING_V2"] = "true"\n\nllm = LlamaV2()\nllm.invoke("Hello, world!")\n```\n\nMake sure you have also set the `LANGCHAIN_ENDPOINT` and `LANGCHAIN_API_KEY` environment variables to the appropriate values for your LangSmith account. Yes, you can trace your Llama V2 LLM using LangSmith. Tracing allows you to log the inputs, outputs, and other important information of your LLM runs to LangSmith for visualization and debugging purposes.\n\nTo enable tracing for your Llama V2 LLM, you need to set the `LANGCHAIN_TRACING_V2` environment variable to `true` before running your LLM code. Here's an example of how to enable tracing in Python:\n\n```python\nimport os\nfrom langchain.chat_models import LlamaV2\n\nos.environ["LANGCHAIN_TRACING_V2"] = "true"\n\nllm = LlamaV2()\nllm.invoke("Hello, world!")\n```\n\nMake sure you have the LangSmith SDK installed and your API key and project name configured correctly. The traces from your Llama V2 LLM runs will be logged to LangSmith and can be visualized in the LangSmith web app.\n\nNote that tracing is asynchronous, so you may need to wait for the traces to be submitted before exiting your application. You can use the `wait_for_all_tracers()` function from the `langsmith.callbacks.tracers.langchain` module to ensure all traces are logged before exiting. B
3 How do I use a traceable decorator? The traceable decorator is available in the langsmith python SDK. To use, configure your environment with your API key,import the required function, decorate your function, and then call the function. Below is an example:\n```python\nfrom langsmith.run_helpers import traceable\n@traceable(run_type="chain") # or "llm", etc.\ndef my_function(input_param):\n # Function logic goes here\n return output\nresult = my_function(input_param)\n``` To use the `traceable` decorator, you need to import it from the `langsmith.run_helpers` module. The `traceable` decorator allows you to easily log the execution of a function as a run in LangSmith.\n\nHere's an example of how to use the `traceable` decorator:\n\n```python\nfrom langsmith.run_helpers import traceable\n\n@traceable(run_type="llm", name="openai.ChatCompletion.create")\ndef my_llm(*args, **kwargs):\n # Your function logic here\n pass\n```\n\nIn the example above, the `traceable` decorator is applied to the `my_llm` function. The `run_type` parameter specifies the type of the run (e.g., "llm", "tool", "chain"), and the `name` parameter specifies the name of the run. You can customize these parameters based on your specific use case.\n\nOnce the `traceable` decorator is applied, every time the `my_llm` function is called, a run will be logged to LangSmith with the specified run type and name.\n\nNote that the `traceable` decorator works for both synchronous and asynchronous functions. If you're using an asynchronous function, make sure to use the `await` keyword when calling the function.\n\nYou can find more information about the `traceable` decorator in the LangSmith documentation under the "Tracing" section. To use the `traceable` decorator in LangSmith, you can follow these steps:\n\n1. Import the necessary modules:\n```python\nfrom langsmith.run_helpers import traceable\n```\n\n2. Define your function and decorate it with the `traceable` decorator:\n```python\n@traceable(run_type="llm")\ndef my_function(arg1, arg2):\n # Function logic here\n return result\n```\nThe `run_type` argument specifies the type of the run, such as "llm", "chain", or "tool". You can choose the appropriate run type based on your use case.\n\n3. Call your function as usual:\n```python\nresult = my_function(arg1, arg2)\n```\n\nWhen you call the decorated function, it will automatically log a trace of the function execution to LangSmith. The trace will include the inputs, outputs, and other relevant information.\n\nNote that the `traceable` decorator works for both synchronous and asynchronous functions. If you're using an asynchronous function, make sure to use the `await` keyword when calling it.\n\nYou can find more examples and details in the [traceable notebook](https://github.com/langchain-ai/langsmith-cookbook/blob/main/tracing-examples/traceable/traceable.ipynb) in the LangSmith Cookbook. B
4 What's a langsmith dataset? A LangSmith dataset is a collection of examples. Each example contains inputs and optional expected outputs or references for that data point. A LangSmith dataset is a collection of input-output examples that are used for testing, evaluation, and training purposes in the LangSmith framework. It consists of pairs of input prompts and their corresponding model outputs. These examples can be created and curated during the development and debugging process of building applications using large language models (LLMs).\n\nLangSmith datasets are valuable for several reasons:\n\n1. **Testing and Evaluation:** Datasets can be used to test the performance of prompts, chains, agents, and other components of LLM applications. By running the examples in the dataset through the application, developers can observe the outputs and evaluate the effectiveness of their models.\n\n2. **Debugging and Troubleshooting:** When encountering unexpected or incorrect outputs, developers can add these examples to a dataset for further analysis and debugging. By examining the inputs and outputs in the dataset, they can identify issues and improve the application.\n\n3. **Benchmarking and Comparison:** Datasets provide a benchmark for comparing different versions of prompts, chains, or models. Developers can use the dataset to measure the performance of different configurations and make informed decisions about which version to use.\n\n4. **Training and Fine-tuning:** LangSmith datasets can also be exported for use in training or fine-tuning language models. These datasets can be used to improve the performance of the models by incorporating real-world examples and specific use cases.\n\nLangSmith provides features to easily create, edit, and manage datasets within the LangSmith framework. Developers can add examples to datasets during the debugging process, and these datasets can be exported for use in other contexts, such as OpenAI Evals or fine-tuning with external tools like FireworksAI. A LangSmith dataset is a collection of examples that can be used to evaluate or improve a chain, agent, or model. It consists of rows, where each row represents an example and contains the inputs and (optionally) the expected outputs for a given interaction. Datasets in LangSmith can be created from existing runs, uploaded as CSV files, or manually created using the LangSmith client. Datasets are useful for testing and evaluating models, monitoring performance, and exporting data for use in other contexts. NaN
5 How might I query for all runs in a project? client.list_runs(project_name='my-project-name'), or in TypeScript, client.ListRuns({projectName: 'my-project-anme'}) To query for all runs in a project, you can use the LangSmith SDK or the web app. Here's how you can do it using the SDK in Python:\n\n```python\nfrom langsmith import Client\n\nclient = Client()\nruns = list(client.list_runs(project_name="<your_project>"))\n```\n\nIn this example, replace `<your_project>` with the name of your project. The `list_runs` method retrieves all runs in the specified project.\n\nIf you prefer to use the web app, you can navigate to the project and view all the runs in the runs table. You can also use the search bar in the runs table to filter and search for specific runs based on various criteria.\n\nRemember to authenticate with your LangSmith credentials when using the SDK to access your projects and runs. To query for all runs in a project, you can use the LangSmith SDK or the LangSmith REST API. Here's an example of how to do it using the LangSmith SDK in Python:\n\n```python\nfrom langsmith import Client\n\nclient = Client()\nruns = list(client.list_runs(project_name="Your Project Name"))\n```\n\nIn this example, replace "Your Project Name" with the name of your project. The `list_runs` method returns a generator that you can iterate over to get all the runs in the project.\n\nIf you prefer to use the LangSmith REST API directly, you can make a GET request to the `/runs` endpoint with the appropriate project name parameter.\n\nKeep in mind that you may need to paginate through the results if there are a large number of runs in the project. The LangSmith SDK provides convenient methods for pagination, such as `client.list_runs_paginated()`, which returns a paginated generator that you can iterate over.\n\nRemember to authenticate your requests with your API key. B
6 What is LangChain? LangChain is an open-source framework for building applications using large language models. It is also the name of the company building LangSmith. LangChain is a framework for building applications using large language models (LLMs). It aims to simplify the process of building reliable LLM applications by providing tools and features to enhance development, testing, evaluation, and monitoring.\n\nLangChain helps bridge the gap between the initial setup of LLMs and their performance in production. It offers a tracing feature that logs all calls to LLMs, chains, agents, tools, and retrievers, providing visibility into the behavior of the application. This tracing feature can be used for debugging purposes, such as identifying unexpected end results, understanding why an agent is looping, or investigating the latency of a chain.\n\nThe framework also includes a playground feature that allows developers to modify prompts and observe the resulting changes to the output. This helps in experimenting with different inputs and understanding the impact on the application's behavior.\n\nLangChain provides a visualization of the sequence of events in complicated chains and agents, making it easier to understand the flow and interactions between different components. It also tracks the token usage of each step in the chain, helping developers identify potentially costly parts.\n\nAdditionally, LangChain supports collaborative debugging by allowing users to share faulty chains and LLM runs with colleagues for troubleshooting. It also enables the collection of examples to create datasets for testing and evaluation. Developers can add input/output examples to datasets at different steps of the chain, facilitating benchmarking and testing against known issues.\n\nFor evaluation purposes, LangChain allows the uploading of datasets and running chains over the data points. The results can be visualized and evaluated using built-in evaluators or custom evaluators. Feedback can be assigned to runs and marked as correct or incorrect, providing aggregate statistics for each test project.\n\nFinally, LangChain can be used for monitoring applications in production. It provides the ability to log traces, visualize latency and token usage statistics, and troubleshoot specific issues as they arise.\n\nOverall, LangChain aims to simplify the development, testing, evaluation, and monitoring of LLM applications, making it easier to build reliable and performant language-based applications. LangChain is a framework for building applications using large language models (LLMs). It aims to simplify the process of developing reliable LLM applications by providing tools and features to enhance debugging, testing, evaluation, and monitoring.\n\nLangChain helps bridge the gap between the initial setup of LLMs and their performance in production. It offers a tracing feature that logs all calls to LLMs, chains, agents, tools, and retrievers, providing visibility into the inputs and outputs of each call. This tracing feature is enabled by default in LangChain.\n\nDebugging LLMs, chains, and agents can be challenging, but LangChain provides tools to address common pain points. It offers a visualization of the exact inputs and outputs of LLM calls, allowing developers to understand the formatting logic, transformations to user input, and missing input. 
LangChain also provides a playground where developers can modify prompts and observe the resulting changes to the output.\n\nIn complex chains and agents, it can be difficult to understand the sequence of events and the interactions between different calls. LangChain's tracing feature includes a visualization of the sequence of calls, helping developers gain insights into the inner workings of their applications.\n\nLangChain also helps identify performance issues by tracking the latency of each step in a chain. Developers can identify and optimize the slowest components of their applications. Additionally, LangChain tracks the token usage of each step, making it easy to identify potentially costly parts of the chain.\n\nCollecting examples of failures and unexpected outcomes is crucial for testing and improving LLM applications. LangChain allows developers to add input/output examples to datasets, which can be used for testing changes to prompts or chains. These datasets can be evaluated using LangChain's evaluators, providing guidance on examples that require further investigation.\n\nLangChain can also be used for monitoring LLM applications in production. It allows developers to log traces, visualize latency and token usage statistics, and troubleshoot specific issues as they arise. Feedback can be associated with runs, enabling performance tracking over time.\n\nOverall, LangChain aims to simplify the development, testing, evaluation, and monitoring of LLM applications, providing tools and features to enhance reliability and performance. NaN

4 Conclusion

In this post we compared two variations of a RAG Q&A chain by predicting preference scores for each pair of predictions. This is one way to automatically compare two versions of a chain, and it can provide context beyond ordinary benchmark metrics.

There are numerous ways to evaluate preferences. We compared the two models using binary choices and evaluated each pair only once, but you might get better results with one of the following approaches:

  • Evaluate each pair multiple times (in both orderings) and report a win rate, as sketched after this list
  • Use an ensemble of evaluators
  • Instruct the model to produce continuous scores
  • Instruct the model to use a different prompting technique than chain of thought
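
As a concrete illustration of the first idea, a hypothetical helper could score a pair in both orderings and return a win rate for the first chain, counting a tie as half a win. The function name and structure below are illustrative, not part of the original notebook:

def predict_win_rate(pred_a, pred_b, question, answer, eval_chain):
    """Hypothetical helper: return pred_a's win rate over both orderings (a tie counts as half a win)."""
    wins = 0.0
    for flip in (False, True):
        first, second = (pred_b, pred_a) if flip else (pred_a, pred_b)
        res = eval_chain.evaluate_string_pairs(
            prediction=first, prediction_b=second, input=question, reference=answer
        )
        if res["value"] is None:  # the evaluator called it a tie
            wins += 0.5
        elif (res["value"] == "A") != flip:  # pred_a's output was preferred
            wins += 1.0
    return wins / 2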

For more information on measuring the reliability of this and other approaches, you can check out the evaluations examples in the LangChain repo.

5 Acknowledgements

I’d like to express my thanks to the wonderful Langsmith Cookbook Repo and acknowledge the use of some images and other materials from this project in writing this article.
