Langsmith for LLM Application Evaluation & Monitoring

Developing LLM-based applications is now possible using libraries such as LangChain, but taking these applications into production involves challenges such as evaluation and monitoring. LangSmith is a new tool that can help with these challenges.

Pranath Fernando


August 16, 2023

1 Introduction

LangChain simplifies the development of LLM apps and Agents. However, getting LLM applications into production can be tricky. To produce a high-quality result, you will most likely need to heavily customise and iterate on your prompts, chains, and other components.

LangSmith is a unified platform that supports debugging, testing, and monitoring of your LLM applications.

In particular it can help with the following:

  • Debug a new chain, agent, or collection of tools quickly.
  • Visualise how components (chains, LLMs, retrievers, etc.) relate and are used.
  • Evaluate multiple prompts and LLMs for a single component.
  • Run a given chain several times over a dataset to ensure it consistently meets a quality bar.
  • Capture usage traces and generate insights using LLMs or analytics pipelines.

2 Langsmith Overview

This graphic by Harry Zhang gives a good overview of where Langsmith fits into the overall LLM application eco-system at date of writing this article:

Let’s now examine in a bit more detail each of these use-cases for Langsmith.

2.1 Debugging

It can be difficult to debug LLMs, chains, and agents. LangSmith assists in resolving the following issues:

What was the exact input to the LLM?

LLM calls are often tricky and non-deterministic. The inputs and outputs may appear simple because they are technically string → string (or chat messages → chat message), but this can be misleading because the input string is typically produced from a combination of user input and auxiliary functions.

The majority of inputs to an LLM call are a combination of a set template and input variables. These input variables could be generated directly by user input or by an auxiliary function (such as retrieval). These input variables will have been transformed to a string format by the time they enter the LLM, but they are not always naturally expressed as a string. It is critical to have visibility into the final string entering the LLM.
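
To make this concrete, here is a minimal sketch in plain Python (not LangChain's own templating) of how a fixed template and input variables combine into the final prompt string. The template text, the `retrieve` helper and its placeholder documents are all hypothetical and only there for illustration.

```python
# Hypothetical template: fixed text with slots for the input variables.
TEMPLATE = (
    "Answer the question using only the context.\n"
    "Context:\n{context}\n"
    "Question: {question}"
)


def retrieve(question):
    """Stand-in for an auxiliary retrieval step (placeholder documents)."""
    return ["Document A text", "Document B text"]


def build_prompt(question):
    # The retrieved documents are a list, not a string. They have to be
    # flattened before they reach the LLM, and that step is easy to get wrong.
    docs = retrieve(question)
    context = "\n".join(f"- {doc}" for doc in docs)
    return TEMPLATE.format(context=context, question=question)


final_prompt = build_prompt("How many people live in Canada?")
print(final_prompt)
```

The printed string is what the model actually receives, which is the view a tracing tool needs to expose.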

This is also true, to a lesser extent, of the output of an LLM. The output of an LLM is frequently a string, but that string may have some structure (json, yaml) that is meant to be processed into a structured form. Understanding the actual result can help determine whether different parsing is required.
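
As a rough illustration of why the raw output matters, the sketch below tries to parse an LLM response as JSON and keeps the error when parsing fails. The `parse_llm_output` helper and the two sample responses are made up for this example.

```python
import json


def parse_llm_output(raw):
    """Try to parse the raw LLM string as JSON; return (parsed, error)."""
    try:
        return json.loads(raw), None
    except json.JSONDecodeError as exc:
        # Seeing the raw string next to this error is what tells you
        # whether the parser or the prompt needs to change.
        return None, f"{exc.msg} at position {exc.pos}"


# A clean response parses; a chatty one with a preamble does not.
good, good_err = parse_llm_output('{"answer": 42}')
bad, bad_err = parse_llm_output('Sure! Here is the JSON: {"answer": 42}')
```

In the second case the structured data is present but wrapped in prose, so a more lenient parser (or a stricter prompt) would be needed.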

LangSmith visualises the specific inputs and outputs of all LLM calls so that you can readily comprehend them.

If I edit the prompt, how does that affect the output?

So you see a poor output and go into LangSmith to investigate. You locate the erroneous LLM call and are now inspecting the actual input. What if you want to experiment with modifying a word or phrase to see what happens?

When inspecting an LLM call, you can open the playground by clicking the Open in Playground button. There you can adjust the prompt and re-run it to see how the output changes, as many times as you need.

Currently, this feature supports only OpenAI and Anthropic models, and works for LLM and Chat Model calls.

What is the exact sequence of events?

It can be difficult to understand what is going on under the hood of complex chains and agents. What kinds of calls are being made? What is the order? What are each call’s inputs and outputs?

LangSmith’s built-in tracing tool provides a visual representation of these sequences. It is particularly useful for understanding complex and lengthy chains and agents. For chains, it sheds light on the order of calls and how they interact. For agents, where the sequence of calls is non-deterministic, it shows the precise sequence for a given run, something that cannot be known ahead of time.

Why did my chain take much longer than expected?

If a chain takes longer than expected, you need to find out why. LangSmith tracks the latency of each step, so you can identify the slowest components and possibly eliminate them.
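
The idea of per-step latency tracking can be sketched in a few lines of plain Python. This is not how LangSmith records latency internally; the step names are hypothetical and the steps are simulated with sleeps.

```python
import time


def timed_steps(steps):
    """Run (name, callable) steps in order and record each latency in seconds."""
    latencies = {}
    for name, step in steps:
        start = time.perf_counter()
        step()
        latencies[name] = time.perf_counter() - start
    return latencies


# Hypothetical chain steps, simulated with sleeps of different lengths.
steps = [
    ("retrieve", lambda: time.sleep(0.01)),
    ("llm_call", lambda: time.sleep(0.08)),
    ("parse_output", lambda: time.sleep(0.001)),
]

latencies = timed_steps(steps)
slowest = max(latencies, key=latencies.get)
print(f"Slowest step: {slowest}")
```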

What was the total number of tokens used?

Developing and prototyping LLM apps can be costly. LangSmith keeps track of a chain’s total token usage as well as the token usage of each step. This makes identifying potentially costly elements of the chain simple.
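
As a small illustration, the sketch below totals token usage across the steps of one chain run and picks out the most expensive step. The step names and counts are invented for the example.

```python
# Hypothetical per-step token counts for one chain run.
step_usage = [
    {"step": "plan", "prompt_tokens": 120, "completion_tokens": 30},
    {"step": "search_summary", "prompt_tokens": 451, "completion_tokens": 61},
    {"step": "final_answer", "prompt_tokens": 200, "completion_tokens": 45},
]


def total_tokens(usage):
    """Total tokens for a single step (prompt plus completion)."""
    return usage["prompt_tokens"] + usage["completion_tokens"]


chain_total = sum(total_tokens(usage) for usage in step_usage)
most_expensive = max(step_usage, key=total_tokens)["step"]
```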

Debugging collaboratively

Sharing a faulty chain with a colleague for debugging used to be difficult when everything ran locally. LangSmith has a “Share” button that makes the chain and LLM runs available to anyone with the shared URL.

2.2 Collecting examples

Most of the time, we start debugging because something bad or unexpected has happened in our application. These failures provide valuable information. By identifying how our chain can fail and keeping a record of these failures, we can test future versions of the chain against these known issues.

Why is this so powerful? When developing LLM apps, it is typical to begin without any form of dataset. This is one of the benefits of LLMs. They are fantastic zero-shot learners, allowing you to get started as quickly as possible. But this can also be a burden, as you’re blindly adjusting the prompt. You don’t have any examples to compare your modifications against.

LangSmith addresses this with an “Add to Dataset” button on each run, making it simple to add its input/output pair as an example to a chosen dataset. Before adding the example, you can edit it to include the desired result, which is particularly useful for bad examples.

This functionality is available at every level of a layered chain, allowing you to add examples for an end-to-end chain, an intermediary chain (such as an LLM Chain), or the LLM or Chat Model.

End-to-end chain examples are good for evaluating the overall flow, whereas single, modular LLM Chain or LLM/Chat Model examples are good for testing the simplest and most directly modifiable components.
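
Conceptually, collecting examples looks something like the sketch below. The `Example` and `LocalDataset` classes are hypothetical stand-ins written for this article, not LangSmith's data model. The point is that a bad run is stored with the output you wanted rather than the one you got.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Example:
    inputs: dict
    outputs: dict


@dataclass
class LocalDataset:
    name: str
    examples: list = field(default_factory=list)

    def add_run(self, run_inputs: dict, run_outputs: dict,
                corrected: Optional[dict] = None) -> Example:
        # For a bad run, store the corrected output instead of the actual one.
        outputs = corrected if corrected is not None else run_outputs
        example = Example(inputs=run_inputs, outputs=outputs)
        self.examples.append(example)
        return example


local_dataset = LocalDataset("regression-cases")
# A good run is added unchanged.
local_dataset.add_run({"input": "what is 1213 divided by 4345?"},
                      {"output": "about 0.2792"})
# A bad run is corrected before it becomes a reference example.
local_dataset.add_run({"input": "what is 2 + 2?"}, {"output": "5"},
                      corrected={"output": "4"})
```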

2.3 Testing & evaluation

Initially, we perform most of our evaluations manually and ad hoc. We experiment with various inputs to see what happens. At some point, though, our application will be working well, and we will want to be more rigorous about testing changes. We can use a dataset that we built up along the way, or spend some time manually constructing a small one. For these cases, LangSmith makes it easy to upload a dataset.

Once we have a dataset, how can we use it to test changes to a prompt or chain? The simplest method is to run the chain over the data points and visualise the results. Despite technological advances, there is still no substitute for inspecting outputs by eye. Currently, the chain must be run client-side over the data points. The LangSmith client makes it simple to pull down a dataset and run a chain over it, logging the results to a new project associated with the dataset. You can then review them. LangSmith makes it simple to assign feedback to runs and mark them as correct or incorrect directly in the web app, and it displays aggregate statistics for each test project.

To make evaluating these runs easier, a set of evaluators has been added to the open-source LangChain library. These evaluators are specified when a test run is started, and they evaluate the results once the test run has completed. Most of these evaluators are far from perfect and should not be trusted blindly. They are, however, useful for directing your attention to the examples you should look at. This becomes especially valuable as the number of data points grows and it becomes infeasible to inspect each one manually.
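
A toy example shows both why automated evaluators help and why they should not be trusted blindly. The `exact_match` function and the sample results are hypothetical; the evaluators shipped with LangChain are more sophisticated than this.

```python
def exact_match(prediction, reference):
    """Score 1 if the normalised strings match, else 0."""
    return int(prediction.strip().lower() == reference.strip().lower())


# Hypothetical test-run results: (example id, prediction, reference answer).
run_results = [
    ("ex-1", "0.2792", "0.2792"),
    ("ex-2", "Paris", "paris "),
    ("ex-3", "roughly 3,400 miles", "3,435 miles"),
]

scores = {ex_id: exact_match(pred, ref) for ex_id, pred, ref in run_results}
# Only the zero-scored examples need a human look.
needs_review = [ex_id for ex_id, score in scores.items() if score == 0]
```

Here `ex-3` is flagged even though a human might well accept the answer. The evaluator is crude, but it narrows three results down to the one worth reading.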

2.4 Monitoring

After all of this, your app may be ready to go into production. LangSmith can be used to monitor your application in the same way it is used to debug it. You can log all traces, view latency and token usage statistics, and troubleshoot specific issues as they arise. Each run can also be given string tags or key-value metadata, allowing you to attach correlation IDs or A/B test variants and filter runs accordingly.

Langsmith has made it easy to programmatically associate feedback with runs. This implies that if your app has a thumbs up/down button, you may use it to send feedback to LangSmith. This can be used to track performance over time and identify underperforming data points, which can then be added to a dataset for future testing – similar to how debug mode works.
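
To illustrate how tags, metadata and user feedback fit together, here is a sketch that works on plain dictionaries rather than the LangSmith API. The run records, the `variant` metadata key and the scoring convention are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical logged runs: each has tags, metadata and an optional user score
# (1 = thumbs up, 0 = thumbs down, None = no feedback given).
logged_runs = [
    {"id": "r1", "tags": ["prod"], "metadata": {"variant": "A"}, "score": 1},
    {"id": "r2", "tags": ["prod"], "metadata": {"variant": "B"}, "score": 0},
    {"id": "r3", "tags": ["prod"], "metadata": {"variant": "A"}, "score": 1},
    {"id": "r4", "tags": ["staging"], "metadata": {"variant": "B"}, "score": 1},
    {"id": "r5", "tags": ["prod"], "metadata": {"variant": "B"}, "score": None},
]


def feedback_by_variant(runs, tag):
    """Average user score per A/B variant, restricted to runs carrying `tag`."""
    scores = defaultdict(list)
    for run in runs:
        if tag in run["tags"] and run["score"] is not None:
            scores[run["metadata"]["variant"]].append(run["score"])
    return {variant: sum(vals) / len(vals) for variant, vals in scores.items()}


prod_feedback = feedback_by_variant(logged_runs, "prod")
# Thumbs-down runs are candidates for adding to a test dataset.
thumbs_down = [run["id"] for run in logged_runs if run["score"] == 0]
```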

The LangSmith documentation provides several examples of extracting insights from logged runs. As well as walking through how to do this yourself, it includes examples of integrating with third-party services for this purpose.

2.5 Exporting datasets

LangSmith simplifies the process of curating datasets. These are not only valuable within LangSmith, however; they can also be exported for use elsewhere. Notable applications include exporting for use in OpenAI Evals or for fine-tuning, for example with FireworksAI. This also ensures you don’t become too locked in to the platform, as you can always export the datasets you have created.
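
Once examples are out of the platform they are just data. The sketch below writes a couple of hypothetical input/output pairs as JSONL using only the standard library. The field names are an assumption for illustration, not a format required by any particular tool.

```python
import io
import json

# Hypothetical curated examples (input/output pairs).
examples = [
    {"inputs": {"input": "what is 1213 divided by 4345?"},
     "outputs": {"output": "1213 divided by 4345 is approximately 0.2792."}},
    {"inputs": {"input": "what is 2 + 2?"}, "outputs": {"output": "4"}},
]


def to_jsonl(rows, fp):
    """Write one JSON object per line; return the number of rows written."""
    count = 0
    for row in rows:
        fp.write(json.dumps(row, ensure_ascii=False) + "\n")
        count += 1
    return count


# An in-memory buffer keeps the sketch self-contained; use
# open("dataset.jsonl", "w") instead to write a real file.
buffer = io.StringIO()
written = to_jsonl(examples, buffer)
```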

3 Import Libs & Setup

To use LangSmith you first need to create an account, which is free. At the time of writing it is in beta, so it might take some time to get access (I had to wait a couple of weeks).

To begin, set your environment variables to instruct LangChain to log traces. Set the LANGCHAIN_TRACING_V2 environment variable to true to accomplish this. Set the LANGCHAIN_PROJECT environment variable to instruct LangChain which project to log to (if not set, runs will be recorded to the default project). If the project does not exist, this will create it for you. The LANGCHAIN_ENDPOINT and LANGCHAIN_API_KEY environment variables must also be configured.

Please see the LangSmith documentation for more information on different methods of configuring tracing.

To run this project, you must set the OPENAI_API_KEY and SERPAPI_API_KEY environment variables.
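
Before running anything, it can help to check that the required variables are actually set. This is an optional sketch; `missing_env_vars` is a hypothetical helper, not part of LangChain or LangSmith.

```python
import os

REQUIRED_VARS = [
    "LANGCHAIN_API_KEY",
    "LANGCHAIN_ENDPOINT",
    "OPENAI_API_KEY",
    "SERPAPI_API_KEY",
]


def missing_env_vars(names, environ=os.environ):
    """Return the names that are unset or empty in `environ`."""
    return [name for name in names if not environ.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Failing early like this is friendlier than an authentication error halfway through an agent run.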

import os
from uuid import uuid4
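from dotenv import load_dotenv

# Load API keys from a local .env file into the process environment
load_dotenv()

unique_id = uuid4().hex[0:8]
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = f"Tracing Test Walkthrough - {unique_id}"
os.environ["LANGCHAIN_ENDPOINT"] = ""  # Set to your LangSmith API endpoint URL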
os.environ["LANGCHAIN_API_KEY"] = os.getenv("LANGCHAIN_API_KEY")  # Update to your API key

# Used by the agent in this tutorial
os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY")
os.environ["SERPAPI_API_KEY"] = os.getenv("SERPAPI_API_KEY")

4 Using Langsmith to Log Run Information

We are going to use a simple example of using an Agent to answer a few questions, and we want to log the outputs in Langsmith.

First we need to create the LangSmith client to interact with the API:

from langsmith import Client

client = Client()

Following that, we build a LangChain component and log runs to the platform. In this example, we will develop a ReAct-style agent that has access to the tools Search and Calculator. LangSmith works with any LangChain component (LLMs, Chat Models, Tools, Retrievers, and Agents are all supported).

from langchain.chat_models import ChatOpenAI
from langchain.agents import AgentType, initialize_agent, load_tools

llm = ChatOpenAI(temperature=0)
tools = load_tools(["serpapi", "llm-math"], llm=llm)
agent = initialize_agent(
    tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=False
)

We are running the agent concurrently on multiple inputs to reduce latency. Runs get logged to LangSmith in the background so execution latency is unaffected.

import asyncio

inputs = [
    "How many people live in canada as of 2023?",
    "who is dua lipa's boyfriend? what is his age raised to the .43 power?",
    "what is dua lipa's boyfriend age raised to the .43 power?",
    "how far is it from paris to boston in miles",
    "what was the total number of points scored in the 2023 super bowl? what is that number raised to the .23 power?",
    "what was the total number of points scored in the 2023 super bowl raised to the .23 power?",
    "how many more points were scored in the 2023 super bowl than in the 2022 super bowl?",
    "what is 153 raised to .1312 power?",
    "who is kendall jenner's boyfriend? what is his height (in inches) raised to .13 power?",
    "what is 1213 divided by 4345?",
]
results = []

async def arun(agent, input_example):
    try:
        return await agent.arun(input_example)
    except Exception as e:
        # The agent sometimes makes mistakes! These will be captured by the tracing.
        return e

for input_example in inputs:
    results.append(arun(agent, input_example))
results = await asyncio.gather(*results)

from langchain.callbacks.tracers.langchain import wait_for_all_tracers

# Logs are submitted in a background thread to avoid blocking execution.
# For the sake of this tutorial, we want to make sure
# they've been submitted before moving on. This is also
# useful for serverless deployments.
wait_for_all_tracers()
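
The concurrency pattern used here can be tried without LangChain or any API keys. In the sketch below, `fake_agent` is a hypothetical stand-in for the agent. It also uses `return_exceptions=True` on `asyncio.gather`, which is an alternative to the explicit try/except wrapper above: exceptions come back as results instead of aborting the batch.

```python
import asyncio


async def fake_agent(question):
    """Hypothetical stand-in for agent.arun: fails on one kind of input."""
    await asyncio.sleep(0.01)
    if "fail" in question:
        raise ValueError(f"could not answer: {question}")
    return f"answer to: {question}"


async def run_all(questions):
    # return_exceptions=True returns raised exceptions as results instead of
    # propagating the first failure to the caller.
    return await asyncio.gather(
        *(fake_agent(q) for q in questions), return_exceptions=True
    )


# In a notebook, use `await run_all([...])` instead of asyncio.run.
outcomes = asyncio.run(run_all(["q1", "please fail", "q3"]))
```

Results come back in the same order as the inputs, so each outcome can be matched to its question.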

Our agent traces should show up in the Projects section of Langsmith, as we can see below in the drill through screenshots.

5 Evaluate another agent implementation

LangSmith allows you to test and assess your LLM apps in addition to logging runs.

In this section we will use LangSmith to construct a benchmark dataset and run AI-assisted evaluators on an agent. We will do this in a few steps:

  1. Make a dataset from previously run inputs and outputs.
  2. Create a new agent for benchmarking.
  3. Set up evaluators to grade the output of an agent.
  4. Run the agent over the dataset and assess the outcomes.

5.1 Create a LangSmith dataset

The LangSmith client is used below to construct a dataset from the agent runs you just logged. These will be used later to measure the performance of a new agent. In essence, this saves the inputs and outputs of the runs to a dataset as examples. A dataset is a collection of examples, which are simply input-output pairs that you can use to test your application.

Please keep in mind that this is a simple walkthrough example. In practice, you should verify the outputs before adding them to a benchmark dataset that will be used to evaluate other agents.

Please see the LangSmith documentation for additional information about datasets.

dataset_name = f"calculator-example-dataset-{unique_id}"

dataset = client.create_dataset(
    dataset_name, description="A calculator example dataset"
)

runs = client.list_runs(
    project_name=os.environ["LANGCHAIN_PROJECT"],
    execution_order=1,  # Only return the top-level runs
    error=False,  # Only runs that succeed
)
for run in runs:
    client.create_example(inputs=run.inputs, outputs=run.outputs, dataset_id=dataset.id)

5.2 Initialize a new agent to benchmark

Any LLM, chain, or agent can be evaluated. Because chains can have memory, we pass in a chain_factory (i.e. a constructor) function that initialises a fresh chain for each call. In this case, we will test an agent that uses OpenAI’s function calling endpoints.

from langchain.chat_models import ChatOpenAI
from langchain.agents import AgentType, initialize_agent, load_tools

llm = ChatOpenAI(model="gpt-3.5-turbo-0613", temperature=0)
tools = load_tools(["serpapi", "llm-math"], llm=llm)

# Since chains can be stateful (e.g. they can have memory), we provide
# a way to initialize a new chain for each row in the dataset. This is done
# by passing in a factory function that returns a new chain for each row.
def agent_factory():
    return initialize_agent(tools, llm, agent=AgentType.OPENAI_FUNCTIONS, verbose=False)

# If your chain is NOT stateful, your factory can return the object directly
# to improve runtime performance. For example:
# chain_factory = lambda: agent

5.3 Configure evaluation

Manually comparing chain results in the UI is effective, but time consuming. To analyse the performance of your component, you can use automated metrics and AI-assisted feedback.

In the following section, we will configure several pre-implemented run evaluators that do the following:

  • Compare results against ground truth labels. (The debug outputs logged above are used for this.)
  • Measure semantic (dis)similarity using embedding distance.
  • Evaluate ‘aspects’ of the agent’s response in a reference-free manner, using custom criteria.

Please see the LangSmith documentation for a more in-depth discussion of how to choose an acceptable evaluator for your use case and how to develop your own custom evaluators.

from langchain.evaluation import EvaluatorType
from langchain.smith import RunEvalConfig

evaluation_config = RunEvalConfig(
    # Evaluators can either be an evaluator type (e.g., "qa", "criteria", "embedding_distance", etc.) or a configuration for that evaluator
    evaluators=[
        # Measures whether a QA response is "Correct", based on a reference answer
        # You can also select via the raw string "qa"
        EvaluatorType.QA,
        # Measure the embedding distance between the output and the reference answer
        # Equivalent to: EvalConfig.EmbeddingDistance(embeddings=OpenAIEmbeddings())
        EvaluatorType.EMBEDDING_DISTANCE,
        # Grade whether the output satisfies the stated criteria. You can select a default one such as "helpfulness" or provide your own.
        RunEvalConfig.LabeledCriteria("helpfulness"),
        # Both the Criteria and LabeledCriteria evaluators can be configured with a dictionary of custom criteria.
        RunEvalConfig.Criteria(
            {
                "fifth-grader-score": "Do you have to be smarter than a fifth grader to answer this question?"
            }
        ),
    ],
    # You can add custom StringEvaluator or RunEvaluator objects here as well, which will automatically be
    # applied to each prediction. Check out the docs for examples.
    custom_evaluators=[],
)

5.4 Run the agent and evaluators

To evaluate your model, use the arun_on_dataset (or synchronous run_on_dataset) function. This will:

  1. Fetch the example rows from the specified dataset.
  2. Run your LLM or chain on each example.
  3. Apply evaluators to the resulting run traces and the corresponding reference examples to generate automated feedback.
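
The three steps can be sketched as a simple loop. This is a simplified illustration and not how `run_on_dataset` is implemented; `run_on_examples`, the integer-division "chain" and the sample data are all hypothetical.

```python
def run_on_examples(examples, chain_factory, evaluators):
    """Run a fresh chain on each example and score the successful runs."""
    report = []
    for example in examples:
        chain = chain_factory()
        try:
            output = chain(example["input"])
        except Exception as exc:
            # Failed runs are recorded but receive no feedback.
            report.append({"input": example["input"], "error": str(exc), "feedback": {}})
            continue
        feedback = {name: fn(output, example["reference"])
                    for name, fn in evaluators.items()}
        report.append({"input": example["input"], "output": output, "feedback": feedback})
    return report


def division_chain_factory():
    """Hypothetical chain that only understands integer division."""
    def chain(text):
        left, right = text.split("/")
        return str(int(left) // int(right))
    return chain


examples = [
    {"input": "8/2", "reference": "4"},
    {"input": "9/2", "reference": "4.5"},
    {"input": "age/2", "reference": "unknown"},
]
evaluators = {"correctness": lambda out, ref: int(out == ref)}
report = run_on_examples(examples, division_chain_factory, evaluators)
```

The last example fails because `age` is not a number, so the run is logged with an error and gets no feedback.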

The outcomes will be displayed in the LangSmith app.

from langchain.smith import (
    arun_on_dataset,
    run_on_dataset,  # Available if your chain doesn't support async calls.
)

chain_results = await arun_on_dataset(
    client=client,
    dataset_name=dataset_name,
    llm_or_chain_factory=agent_factory,
    evaluation=evaluation_config,
    verbose=True,
    tags=["testing-notebook"],  # Optional, adds a tag to the resulting chain runs
)

# Sometimes, the agent will error due to parsing issues, incompatible tool inputs, etc.
# These are logged as warnings here and captured as errors in the tracing UI.
View the evaluation results for project '0eadab4e465c49ad874acc2bce4007e5-AgentExecutor' at:
Processed examples: 1
Processed examples: 2
Processed examples: 6
Processed examples: 9
Chain failed for example 05d50196-3735-4510-a4b0-ef91248e32cd. Error: LLMMathChain._evaluate("
age_of_Dua_Lipa_boyfriend ** 0.43
") raised error: 'age_of_Dua_Lipa_boyfriend'. Please try again with a valid numerical expression
Chain failed for example bd1d5389-b01f-4cac-9600-260c78f99e47. Error: LLMMathChain._evaluate("
age ** 0.43
") raised error: 'age'. Please try again with a valid numerical expression
Chain failed for example 5262e85a-3ee9-442d-86e8-dc4cce55267c. Error: Too many arguments to single-input tool Calculator. Args: ['height ^ 0.13', {'height': 70}]

5.5 Review the test results

You can review the test results in the tracing UI by going to the “Datasets & Testing” page, selecting the “calculator-example-dataset-*” dataset, clicking on the Test Runs tab, and inspecting the runs in the corresponding project.

This will display the fresh runs as well as the feedback logged from the specified evaluators. Runs that fail will not receive feedback.

6 Exporting datasets and runs

LangSmith’s web interface lets you export data directly to common formats such as CSV or JSONL. The client can also be used to fetch runs for further analysis, for storage in your own database, or to share with others. Let’s fetch the run traces from the evaluation run.

runs = list(client.list_runs(dataset_name=dataset_name))
runs[0]
Run(id=UUID('5a09233f-8a51-4ec8-a1ea-096caf45469b'), name='AgentExecutor', start_time=datetime.datetime(2023, 8, 21, 17, 48, 49, 539515), run_type='chain', end_time=datetime.datetime(2023, 8, 21, 17, 48, 54, 191896), extra={'runtime': {'library': 'langchain', 'runtime': 'python', 'platform': 'macOS-10.16-x86_64-i386-64bit', 'sdk_version': '0.0.25', 'library_version': '0.0.270', 'runtime_version': '3.9.13', 'langchain_version': '0.0.270'}, 'total_tokens': 512, 'prompt_tokens': 451, 'completion_tokens': 61}, error=None, serialized=None, events=[{'name': 'start', 'time': '2023-08-21T17:48:49.539515'}, {'name': 'end', 'time': '2023-08-21T17:48:54.191896'}], inputs={'input': 'what is 1213 divided by 4345?'}, outputs={'output': '1213 divided by 4345 is approximately 0.2792.'}, reference_example_id=UUID('8459f7e6-55e4-43cf-8999-86d48dea2635'), parent_run_id=None, tags=['testing-notebook', 'openai-functions'], execution_order=1, session_id=UUID('5cbabd18-19f5-4153-a2aa-b329b3ca3da6'), child_run_ids=[UUID('74380efc-3a56-48f8-bd07-d21e5cde23cb'), UUID('d4799bc4-47ae-4474-b524-ff202239befe'), UUID('0b30cbd9-2c61-4302-bf52-5a2762158a23'), UUID('25c15cf9-354e-411c-898c-8ee99c249e8d'), UUID('0fdae1e0-919e-45db-8d01-6e22348e1f8c'), UUID('71627b18-91c9-4655-a832-978fbc775b0a')], child_runs=None, feedback_stats={'correctness': {'n': 1, 'avg': 1.0, 'mode': 1}, 'helpfulness': {'n': 1, 'avg': 1.0, 'mode': 1}, 'fifth-grader-score': {'n': 1, 'avg': 0.0, 'mode': 0}, 'embedding_cosine_distance': {'n': 1, 'avg': 0.1442128576232, 'mode': 0.1442128576232}}, app_path='/projects/p/5cbabd18-19f5-4153-a2aa-b329b3ca3da6/r/5a09233f-8a51-4ec8-a1ea-096caf45469b')
client.read_project(project_id=runs[0].session_id).feedback_stats
{'correctness': {'n': 6, 'avg': 0.8333333333333334, 'mode': 1},
 'helpfulness': {'n': 6, 'avg': 1.0, 'mode': 1},
 'fifth-grader-score': {'n': 6, 'avg': 0.6666666666666666, 'mode': 1},
 'embedding_cosine_distance': {'n': 6, 'avg': 0.09463849470261249, 'mode': 0}}
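
One plausible reading of these aggregate fields is count, mean and most common value per feedback key. The sketch below reproduces that shape from a list of per-run scores. The scores are hypothetical, and this is not LangSmith's actual aggregation code.

```python
from statistics import mean, mode

# Hypothetical per-run correctness scores for six evaluated runs (1 = correct).
per_run_scores = [1, 1, 0, 1, 1, 1]


def summarise(scores):
    """Aggregate per-run scores into the n / avg / mode shape shown above."""
    return {"n": len(scores), "avg": mean(scores), "mode": mode(scores)}


summary = summarise(per_run_scores)
print(summary)
```

Five correct answers out of six give an average of about 0.83.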

So in this post, we have introduced LangSmith and successfully traced and evaluated an agent using it.

7 Langsmith Hub

The Hub was added to Langsmith on September 5th 2023 and is described well in this article.

In essence, Langsmith Hub lets you discover, share, and version control prompts for LangChain and LLMs in general. It’s a great place to find inspiration for your own prompts, or to share your own prompts with the world. More documentation can be found here.

In the article the Langchain team explain the reasons for developing the hub and what they hope it will provide:

As LangChain and the broader ecosystem has evolved, the role of prompting has only become more important to the LLM development process. As Ethan Mollick recently wrote in a FANTASTIC article on the topic, “now is the time for grimoires.” By “grimoires” he means “prompt libraries that encode the expertise of their best practices into forms that anyone can use.” We whole-heartedly agree–the value of a Hub extends beyond individual applications. It’s about advancing our collective wisdom and translating that into knowledge we can all put to use now. We want to help make this easier on an individual, team, and organization scale, across any use-case and every industry. Our goal for LangChain Hub is that it becomes the go-to place for developers to discover new use cases and polished prompts.

8 Acknowledgements

I’d like to express my thanks to the wonderful Langsmith Documentation and acknowledge the use of some images and other materials from the documentation in this article.