Question Answering over Documents using LangChain

In this article we look at how LangChain can perform question answering over documents using embeddings and vector stores.
natural-language-processing
deep-learning
langchain
openai
Author

Pranath Fernando

Published

June 4, 2023

1 Introduction

Large language models (LLMs) are emerging as a transformative technology, enabling developers to build applications that they previously could not. But using LLMs in isolation is often not enough in practice to create a truly powerful or useful business application - the real power comes when you can combine them with other sources of computation, services or knowledge. LangChain is an intuitive open-source Python framework created to simplify the development of useful applications using large language models (LLMs) such as those from OpenAI or Hugging Face.

In earlier articles we introduced the LangChain library and key components.

In this article, we look at how to use LangChain to perform question answering over documents. This allows LLMs to use more data than they were trained on, which makes them much more useful and specific for a given use case. We will also look at the concepts that make this possible: embeddings and vector stores. An example application might be a tool that lets you query a product catalog for items of interest.

2 Setup

We will use OpenAI’s ChatGPT LLM for our examples, so let’s load in the required libraries.

import os

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

We will also import some LangChain objects.

from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import CSVLoader
from langchain.vectorstores import DocArrayInMemorySearch
from IPython.display import display, Markdown

3 Creating a Q&A Chain and Vector Index Quickly

We will load some sample data to work with: a catalog of outdoor clothing. We can use the CSVLoader object to load this.

file = 'OutdoorClothingCatalog_1000.csv'
loader = CSVLoader(file_path=file)

We will also import VectorstoreIndexCreator which will help us create an index really easily.

from langchain.indexes import VectorstoreIndexCreator

To create the vector store we specify the vector store class, then call the from_loaders method to load the data.

index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch
).from_loaders([loader])

Now we are ready to query our data using text prompts! Let’s make an example query and submit it to the index (our data store).

query ="Please list all your shirts with sun protection \
in a table in markdown and summarize each one."
response = index.query(query)
display(Markdown(response))
| Name | Description |
| --- | --- |
| Men’s Tropical Plaid Short-Sleeve Shirt | UPF 50+ rated, 100% polyester, wrinkle-resistant, front and back cape venting, two front bellows pockets |
| Men’s Plaid Tropic Shirt, Short-Sleeve | UPF 50+ rated, 52% polyester and 48% nylon, machine washable and dryable, front and back cape venting, two front bellows pockets |
| Men’s TropicVibe Shirt, Short-Sleeve | UPF 50+ rated, 71% Nylon, 29% Polyester, 100% Polyester knit mesh, wrinkle resistant, front and back cape venting, two front bellows pockets |
| Sun Shield Shirt by | UPF 50+ rated, 78% nylon, 22% Lycra Xtra Life fiber, wicks moisture, fits comfortably over swimsuit, abrasion resistant |

All four shirts provide UPF 50+ sun protection, blocking 98% of the sun’s harmful rays. The Men’s Tropical Plaid Short-Sleeve Shirt is made of 100% polyester and is wrinkle-resistant. The Men’s Plaid Trop

So we can see this has given us a nice table of results, formatted in Markdown, in response to our question.

We also have a nice summary underneath.

4 LLMs on Documents

LLMs can only look at a few thousand words at a time. So if we have really large documents, how do we get the language model to respond appropriately to everything in them?

Embeddings and vector storage can help with this issue.

4.1 Embeddings

Embeddings create numerical representations of text. These numerical representations capture the semantic meaning of that text, so text with similar meaning will have similar numerical representations - or vectors.

For example, take three sentences where the first two are about pets and the third is about a car. If we look at their numerical representations, the first two will have very similar numbers compared to the last one. This helps us figure out which bits of text are similar, which will be very useful when deciding which pieces of text to pass to the language model, say, to answer a question.
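
To make this concrete, here is a minimal sketch that embeds three such sentences and compares them with cosine similarity. The sentences are made up for illustration, and an OpenAI API key is assumed to be configured.

import numpy as np
from langchain.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

sentences = [
    "My dog loves to play fetch",    # about a pet
    "My cat naps on the sofa",       # about a pet
    "The car needs an oil change",   # about a car
]
vectors = [np.array(embeddings.embed_query(s)) for s in sentences]

# Cosine similarity: the dot product of two vectors divided by the
# product of their lengths - values closer to 1 mean more similar.
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(vectors[0], vectors[1]))  # pet vs pet - higher
print(cosine_similarity(vectors[0], vectors[2]))  # pet vs car - lower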

4.2 Vector Databases

A vector database is a way to store these numerical representations (or vectors) for each of the text pieces from our document or documents. When we get the text of a document, we first break it up into smaller chunks. This is useful because the document may be too large to pass to the language model in its entirety; by creating these smaller chunks, we can pass only the most relevant pieces of text to the language model. We then create embeddings for each of these chunks.
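
As an aside, LangChain provides text splitters for this chunking step. Here is a minimal sketch using the RecursiveCharacterTextSplitter class, assuming docs is a list of loaded Document objects; the chunk sizes are arbitrary choices for illustration.

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split documents into overlapping chunks; the overlap helps preserve
# context that would otherwise be lost at a chunk boundary.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # maximum characters per chunk
    chunk_overlap=100,  # characters shared between consecutive chunks
)
chunks = text_splitter.split_documents(docs)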

Once we have created this index, we can use it at run time to find the chunks of text most relevant to an incoming query: we create an embedding vector for the query and find the most similar embedding vectors to it in our index. Cosine similarity, for example, is one method of finding the vectors nearest to a given vector.

These most relevant chunks can then be passed to the LLM in the prompt, to help provide the most useful and relevant context from the document for answering the query.

5 Creating a Q&A Chain and Vector Index Step by Step

5.1 Create the Vector Store

We will now create a question and answer chain using a vector index as we did previously, but now step by step to go over more of the details.

So as before we will use the CSVLoader to load the documents we want to do question and answering over.

loader = CSVLoader(file_path=file)
docs = loader.load()

If we look at one of the individual documents loaded, we can see that it corresponds to one of the product rows in the CSV.

docs[0]
Document(page_content=": 0\nname: Women's Campside Oxfords\ndescription: This ultracomfortable lace-to-toe Oxford boasts a super-soft canvas, thick cushioning, and quality construction for a broken-in feel from the first time you put them on. \n\nSize & Fit: Order regular shoe size. For half sizes not offered, order up to next whole size. \n\nSpecs: Approx. weight: 1 lb.1 oz. per pair. \n\nConstruction: Soft canvas material for a broken-in feel and look. Comfortable EVA innersole with Cleansport NXT® antimicrobial odor control. Vintage hunt, fish and camping motif on innersole. Moderate arch contour of innersole. EVA foam midsole for cushioning and support. Chain-tread-inspired molded rubber outsole with modified chain-tread pattern. Imported. \n\nQuestions? Please contact us for any inquiries.", metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 0})

Previously we talked about how useful it is to create document chunks. Because these particular documents are so small, we don’t actually need to do any chunking in this case, so we can create embeddings directly for each document. To create these embeddings we are going to use LangChain’s wrapper class for OpenAI embeddings, OpenAIEmbeddings.

So if we want to see what these embeddings look like, let’s take an example text and convert it.

from langchain.embeddings import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()
embed = embeddings.embed_query("Hi my name is Harrison")
print(len(embed))
1536
print(embed[:5])
[-0.021900920197367668, 0.006746490020304918, -0.018175246194005013, -0.039119575172662735, -0.014097143895924091]

We can see that these embeddings have 1536 numbers; above are the first 5 numbers of our example embedding.

So we want to create embeddings for all the text documents we loaded, and then store them in a vector store. We can do this using the from_documents method of the vector store object.

This takes a set of documents and an embedding object, and creates a vector store.

db = DocArrayInMemorySearch.from_documents(
    docs,
    embeddings
)

We can now use this vector store to find the most similar document texts for an incoming query.

query = "Please suggest a shirt with sunblocking"
docs = db.similarity_search(query)
len(docs)
4
docs[0]
Document(page_content=': 255\nname: Sun Shield Shirt by\ndescription: "Block the sun, not the fun – our high-performance sun shirt is guaranteed to protect from harmful UV rays. \n\nSize & Fit: Slightly Fitted: Softly shapes the body. Falls at hip.\n\nFabric & Care: 78% nylon, 22% Lycra Xtra Life fiber. UPF 50+ rated – the highest rated sun protection possible. Handwash, line dry.\n\nAdditional Features: Wicks moisture for quick-drying comfort. Fits comfortably over your favorite swimsuit. Abrasion resistant for season after season of wear. Imported.\n\nSun Protection That Won\'t Wear Off\nOur high-performance fabric provides SPF 50+ sun protection, blocking 98% of the sun\'s harmful rays. This fabric is recommended by The Skin Cancer Foundation as an effective UV protectant.', metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 255})

5.2 Using the Vector Store to Answer Questions

So how can we use this vector store to do question answering over all our documents?

First we need to create a retriever from the vector store. A retriever is a generic interface that takes a query and returns documents. Vector stores and embeddings are one way of implementing this, but there are other methods.

Next, because we want to do text generation and create a natural language response to our query, we need a language model - for our example we will use OpenAI’s ChatGPT.

retriever = db.as_retriever()
llm = ChatOpenAI(temperature = 0.0)
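
As a quick sanity check, we can call the retriever directly. Here is a minimal sketch; the query text is just an example.

# Fetch the documents most relevant to a query from the vector store.
relevant_docs = retriever.get_relevant_documents(
    "Please suggest a shirt with sunblocking"
)
print(len(relevant_docs))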

If we were doing this by hand, we would combine the documents into a single piece of text, then pass this text into a prompt as context for answering a question.

qdocs = "".join([docs[i].page_content for i in range(len(docs))])
response = llm.call_as_llm(f"{qdocs} Question: Please list all your \
shirts with sun protection in a table in markdown and summarize each one.") 
display(Markdown(response))
| Name | Description |
| --- | --- |
| Sun Shield Shirt | High-performance sun shirt with UPF 50+ sun protection, moisture-wicking, and abrasion-resistant fabric. Recommended by The Skin Cancer Foundation. |
| Men’s Plaid Tropic Shirt | Ultracomfortable shirt with UPF 50+ sun protection, wrinkle-free fabric, and front/back cape venting. Made with 52% polyester and 48% nylon. |
| Men’s TropicVibe Shirt | Men’s sun-protection shirt with built-in UPF 50+ and front/back cape venting. Made with 71% nylon and 29% polyester. |
| Men’s Tropical Plaid Short-Sleeve Shirt | Lightest hot-weather shirt with UPF 50+ sun protection, front/back cape venting, and two front bellows pockets. Made with 100% polyester and is wrinkle-resistant. |

All of these shirts provide UPF 50+ sun protection, blocking 98% of the sun’s harmful rays. They are made with high-performance fabrics that are moisture-wicking, wrinkle-resistant, and abrasion-resistant. The Men’s Plaid Tropic Shirt and Men’s Tropical Plaid Short-Sleeve Shirt both have front/back cape venting for added breathability. The Sun Shield Shirt is recommended by The Skin Cancer Foundation.

Alternatively, we can incorporate all of these steps into a single LangChain chain that does retrieval and then question answering: RetrievalQA. We pass in a language model to do text generation at the end, then we specify the chain type 'stuff', which simply stuffs all of the documents into the context for the prompt. Finally we pass in a retriever object, which is just the object we used before for fetching the most relevant documents to pass to the language model.

Now we can create a query, and run this chain on that query.

qa_stuff = RetrievalQA.from_chain_type(
    llm=llm, 
    chain_type="stuff", 
    retriever=retriever, 
    verbose=True
)
query =  "Please list all your shirts with sun protection in a table \
in markdown and summarize each one."
response = qa_stuff.run(query)


> Entering new RetrievalQA chain...

> Finished chain.
display(Markdown(response))
| Shirt Number | Name | Description |
| --- | --- | --- |
| 618 | Men’s Tropical Plaid Short-Sleeve Shirt | This shirt is made of 100% polyester and is wrinkle-resistant. It has front and back cape venting that lets in cool breezes and two front bellows pockets. It is rated UPF 50+ for superior protection from the sun’s UV rays. |
| 374 | Men’s Plaid Tropic Shirt, Short-Sleeve | This shirt is made with 52% polyester and 48% nylon. It is machine washable and dryable. It has front and back cape venting, two front bellows pockets, and is rated to UPF 50+. |
| 535 | Men’s TropicVibe Shirt, Short-Sleeve | This shirt is made of 71% Nylon and 29% Polyester. It has front and back cape venting that lets in cool breezes and two front bellows pockets. It is rated UPF 50+ for superior protection from the sun’s UV rays. |
| 255 | Sun Shield Shirt | This shirt is made of 78% nylon and 22% Lycra Xtra Life fiber. It is handwashable and line dry. It is rated UPF 50+ for superior protection from the sun’s UV rays. It is abrasion-resistant and wicks moisture for quick-drying comfort. |

The Men’s Tropical Plaid Short-Sleeve Shirt is made of 100% polyester and is wrinkle-resistant. It has front and back cape venting that lets in cool breezes and two front bellows pockets. It is rated UPF 50+ for superior protection from the sun’s UV rays.

The Men’s Plaid Tropic Shirt, Short-Sleeve is made with 52% polyester and 48% nylon. It has front and back cape venting, two front bellows pockets, and is rated to UPF 50+.

The Men’s TropicVibe Shirt, Short-Sleeve is made of 71% Nylon and 29% Polyester. It has front and back cape venting that lets in cool breezes and two front bellows pockets. It is rated UPF 50+ for superior protection from the sun’s UV rays.

The Sun Shield Shirt is made of 78% nylon and 22% Lycra Xtra Life fiber. It is abrasion-resistant and wicks moisture for quick-drying comfort. It is rated UPF 50+ for superior protection from the sun’s UV rays.

So that’s how you might do it in detail, but we can of course use the one-line method as before. That’s the great thing about LangChain: you can use either a more concise or a more detailed call to specify your chain. The more detailed calls allow you to customise more of the specifics of what is going on.

response = index.query(query, llm=llm)

We can also customise the index when we create it. When we created it by hand we specified the OpenAI embeddings, which gives us flexibility over how the embeddings are created and also allows us to use different types of vector store.

index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch,
    embedding=embeddings,
).from_loaders([loader])

6 Alternative Methods to Populate Prompt Context

We used the stuff method previously to populate the prompt. It is the simplest method, but it has pros and cons and is not always the best solution. In our case each fetched document was relatively small, so stuffing worked well, but this might not work so well for larger documents or many documents, since they may not all fit in the model’s context window.

But what if you want to do the same kind of question answering over many chunks? There are a few other methods we can use.

Map_reduce takes all the chunks, passes each one with the question to a language model, and gets a response for each. It then uses another LLM call to combine all the individual responses into a final answer. This is really powerful because it can operate over any number of documents, and because the individual calls can run in parallel. But it takes more calls, so it can be more expensive with a paid service like OpenAI, and it treats all the documents as independent, which may not be the most desirable approach for every use case.

Refine also runs over all the chunks, but it does so iteratively, building on the answer from the previous document. This makes it really good for combining information and building up an answer over time. It takes as many calls as Map_reduce, but since each call depends on the previous one they cannot run in parallel, so it will generally take longer to execute and lead to longer answers.

Map_rerank is a more experimental method, where you make a single call to the language model for each document and also ask it to return a score in the same call; you then select the answer with the highest score. This relies on the model knowing what the score should be. Like Map_reduce, the calls are independent so it is relatively fast, but you are making a lot of calls, so it will be more expensive.

The most commonly used method is actually the simple stuff method; the second most common is Map_reduce. These methods can be used for many other chains beyond question answering - a common use case for Map_reduce is text summarisation, where you have a really long document and want to recursively summarise pieces of it.
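
Switching between these methods is just a matter of the chain_type argument. Here is a minimal sketch reusing the llm and retriever objects from earlier; note that map_reduce will make several LLM calls, so it is slower and more expensive than stuff.

# Same chain as before, but answers are produced per-document and
# then combined in a final LLM call.
qa_map_reduce = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="map_reduce",  # alternatives: "refine", "map_rerank"
    retriever=retriever,
)
response = qa_map_reduce.run(query)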

7 Acknowledgements

I’d like to express my thanks to the wonderful LangChain for LLM Application Development course by DeepLearning.ai, which I completed, and to acknowledge the use of some images and other materials from the course in this article.
