Exploring The Role of LangChain’s Indexes and Retrievers

With an emphasis on the function of indexes and retrievers, here we examine some of the benefits and drawbacks of employing document-based LLMs that use them
natural-language-processing
deep-learning
langchain
activeloop
openai
retrievers
Author

Pranath Fernando

Published

August 7, 2023

1 Introduction

In LangChain, retrievers and indexes are essential for organising documents and obtaining relevant data for LLMs. With an emphasis on the function of indexes and retrievers, we will examine some of the benefits and drawbacks of employing document-based LLMs (i.e., LLMs that incorporate pertinent documents inside their prompts).

A retriever uses an index to find and return documents relevant to a user's query. An index is a data structure that organises and stores documents so they can be searched efficiently. The main index types in LangChain are based on vector stores, with embeddings-based indexes being the most common.

2 Import Libs & Setup

Here, we load a text file using the TextLoader class. Remember to install the necessary packages first with: pip install deeplake openai tiktoken langchain==0.0.208.

from langchain.document_loaders import TextLoader
import os
import openai
import sys
sys.path.append('../..')
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

openai.api_key  = os.environ['OPENAI_API_KEY']

3 Retrievers

Retrievers focus on fetching pertinent documents to combine with prompts for language models. A retriever exposes a method called get_relevant_documents that takes a query string as input and returns a list of documents relevant to it.

# text to write to a local file
# taken from https://www.theverge.com/2023/3/14/23639313/google-ai-language-model-palm-api-challenge-openai
text = """Google opens up its AI language model PaLM to challenge OpenAI and GPT-3
Google is offering developers access to one of its most advanced AI language models: PaLM.
The search giant is launching an API for PaLM alongside a number of AI enterprise tools
it says will help businesses “generate text, images, code, videos, audio, and more from
simple natural language prompts.”

PaLM is a large language model, or LLM, similar to the GPT series created by OpenAI or
Meta’s LLaMA family of models. Google first announced PaLM in April 2022. Like other LLMs,
PaLM is a flexible system that can potentially carry out all sorts of text generation and
editing tasks. You could train PaLM to be a conversational chatbot like ChatGPT, for
example, or you could use it for tasks like summarizing text or even writing code.
(It’s similar to features Google also announced today for its Workspace apps like Google
Docs and Gmail.)
"""

# make sure the target directory exists, then write the text to a local file
os.makedirs("docs", exist_ok=True)
with open("docs/my_file.txt", "w") as file:
    file.write(text)

# use TextLoader to load text from local file
loader = TextLoader("docs/my_file.txt")
docs_from_file = loader.load()

print(len(docs_from_file))
1

Then, we use CharacterTextSplitter to split the loaded document into smaller chunks.

from langchain.text_splitter import CharacterTextSplitter

# create a text splitter
text_splitter = CharacterTextSplitter(chunk_size=200, chunk_overlap=20)

# split documents into chunks
docs = text_splitter.split_documents(docs_from_file)

print(len(docs))
Created a chunk of size 373, which is longer than the specified 200
2

Next, we create embeddings for our chunks with OpenAI's embedding model. These embeddings allow us to search effectively for documents, or portions of documents, that relate to our query by examining their semantic similarity.

from langchain.embeddings import OpenAIEmbeddings

# Before executing the following code, make sure to have
# your OpenAI key saved in the “OPENAI_API_KEY” environment variable.
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
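To get a feel for what these embeddings capture, here is a minimal sketch (not part of the original pipeline; the two strings are arbitrary examples) that embeds two short texts and compares them with cosine similarity:

import numpy as np

# embed two short strings with the same model (illustrative only)
vec_a = embeddings.embed_query("Google launches an API for its PaLM language model")
vec_b = embeddings.embed_query("OpenAI's GPT series of large language models")

# cosine similarity: values closer to 1 mean the texts are semantically closer
similarity = np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))
print(f"cosine similarity: {similarity:.3f}")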

4 DeepLake Vector Store

With our embeddings in place, we'll use the Deep Lake vector store.

Deep Lake provides several advantages over the typical vector store:

  • It’s multimodal, which means that it can be used to store items of diverse modalities, such as texts, images, audio, and video, along with their vector representations.
  • It’s serverless, which means that we can create and manage cloud datasets without having to create and manage a database instance. This gives a great speedup to new projects.
  • It’s possible to easily create a streaming data loader out of the data loaded into a Deep Lake dataset, which is convenient for fine-tuning machine learning models using common frameworks like PyTorch and TensorFlow.
  • Data can be queried and visualized easily from the web.

Thanks to these properties, Deep Lake is well suited to serve as the serverless memory that LLM chains and agents need for a variety of tasks, such as storing pertinent documents for question answering or images for guided image-generation tasks.

Let’s create an instance of a Deep Lake dataset.

from langchain.vectorstores import DeepLake

# Before executing the following code, make sure to have your
# Activeloop key saved in the “ACTIVELOOP_TOKEN” environment variable.

# create Deep Lake dataset
# TODO: use your organization id here. (by default, org id is your username)
my_activeloop_org_id = "pranath"
my_activeloop_dataset_name = "langchain_course_indexers_retrievers"
dataset_path = f"hub://{my_activeloop_org_id}/{my_activeloop_dataset_name}"
db = DeepLake(dataset_path=dataset_path, embedding_function=embeddings)

# add documents to our Deep Lake dataset
db.add_documents(docs)
Your Deep Lake dataset has been successfully created!
Dataset(path='hub://pranath/langchain_course_indexers_retrievers', tensors=['embedding', 'id', 'metadata', 'text'])

  tensor      htype      shape     dtype  compression
  -------    -------    -------   -------  ------- 
 embedding  embedding  (2, 1536)  float32   None   
    id        text      (2, 1)      str     None   
 metadata     json      (2, 1)      str     None   
   text       text      (2, 1)      str     None   
['bd8b9dd6-39c8-11ee-8a93-acde48001122',
 'bd8b9f52-39c8-11ee-8a93-acde48001122']

In this example, we added text documents to the dataset. Because Deep Lake is multimodal, we could also have added images, specifying an image embedding model instead of a text one. This could be helpful when looking for images that match a text search query, or when using an image as a query.

Storing larger datasets in local memory becomes increasingly difficult as they grow. Given that we are only uploading two documents in this instance, a local vector store would have worked just as well. However, a typical production setting might involve thousands or millions of documents that need to be accessed from many programs, which is where a centralised cloud dataset becomes necessary.
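As a minimal sketch of that local alternative (the path below is an arbitrary choice, not from the original), Deep Lake also accepts a local filesystem path:

# store the dataset on local disk instead of the Activeloop cloud (illustrative only)
local_db = DeepLake(dataset_path="./my_local_deeplake", embedding_function=embeddings)
local_db.add_documents(docs)

For the rest of this walkthrough we continue with the cloud-hosted db created above.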

We then make a retriever.

# create retriever from db
retriever = db.as_retriever()
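Before wiring the retriever into a chain, we can call its get_relevant_documents method directly, as described earlier (the query string here is just an illustrative example):

# fetch the chunks most relevant to an example query straight from the retriever
relevant_docs = retriever.get_relevant_documents("What is PaLM?")
print(len(relevant_docs))
print(relevant_docs[0].page_content[:100])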

Once we have the retriever, we can start with question-answering.

from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

# create a retrieval chain
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(model="text-davinci-003"),
    chain_type="stuff",
    retriever=retriever
)

We can now query the chain about a specific topic covered in the documents.

query = "How Google plans to challenge OpenAI?"
response = qa_chain.run(query)
print(response)
 Google is offering developers access to its advanced AI language model, PaLM, via an API, along with a number of AI enterprise tools that can generate text, images, code, videos, audio, and more from simple natural language prompts. PaLM is a large language model, similar to the GPT series created by OpenAI, which Google hopes will help businesses carry out text generation and editing tasks.

5 What occurred behind the scenes?

To begin with, we used a “stuff” chain (see CombineDocuments Chains). Stuffing is one method of providing information to the LLM: we “stuff” all the retrieved information into the LLM’s prompt. However, because most LLMs have a context length restriction, this approach is only useful with shorter documents.
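For longer documents, LangChain also supports other combine-documents strategies; here is a hedged sketch of swapping in the map_reduce chain type (all other settings kept as above, purely illustrative):

# map_reduce processes each retrieved chunk separately and then combines the
# intermediate answers, avoiding a single oversized prompt
qa_chain_mr = RetrievalQA.from_chain_type(
    llm=OpenAI(model="text-davinci-003"),
    chain_type="map_reduce",
    retriever=retriever
)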

The embeddings are also used in a similarity search to find matching documents that can be used as context for the LLM. Even though this might not seem especially beneficial with only one source document, since we “chunked” our text we are actually working with numerous documents. Pre-selecting the most appropriate documents based on semantic similarity lets us stay inside the permitted context size while still providing the model with useful knowledge through the prompt.
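As a small sketch of that pre-selection step (the query and k value are illustrative choices, not from the original), the vector store can be searched directly for the single most similar chunk:

# retrieve only the single chunk most semantically similar to the query
top_chunk = db.similarity_search("How Google plans to challenge OpenAI?", k=1)
print(top_chunk[0].page_content[:100])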

Thus, via this investigation, we have learned how crucial indexes and retrievers are in enhancing the efficiency of large language models while processing document-based data.

By transforming documents and user queries into numerical vectors (embeddings) and storing them in specialised databases like Deep Lake, which serves as our vector store database, the system becomes more effective at discovering and presenting pertinent information.

The usefulness of this strategy in improving the general language understanding capabilities of LLMs is demonstrated by the retriever’s ability to locate documents in the embedding space that are closely connected to a user’s query.

6 A Potential Problem

The disadvantage of this approach is that, at the time the data is stored, you might not know which chunks will turn out to be the appropriate ones to retrieve. In the Q&A example, we divided the content into equal-sized chunks, so when a user asks a question both helpful and irrelevant text may be retrieved.

It is bad to include irrelevant information in the LLM prompt because:

  1. It may cause the LLM to lose sight of important information.
  2. It takes up valuable space that may be used for information that is more pertinent.

7 Possible Solution

To solve this problem, a DocumentCompressor abstraction has been developed, enabling the use of compress_documents on the obtained documents.

In LangChain, the ContextualCompressionRetriever is a wrapper around another retriever: it takes a base retriever and a DocumentCompressor, and automatically compresses the documents the base retriever returns. This means that, in response to a given query, only the most pertinent portions of the documents are delivered.

The LLMChainExtractor is a popular compressor option that employs an LLMChain to extract from the documents only the statements pertinent to the query. Wrapping the base retriever in a ContextualCompressionRetriever with an LLMChainExtractor therefore improves the retrieval process: the extractor loops through the documents that were initially returned and keeps only the information that is relevant to the query.

Here is an illustration of how to utilise LLMChainExtractor with ContextualCompressionRetriever:

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

# create GPT3 wrapper
llm = OpenAI(model="text-davinci-003", temperature=0)

# create compressor for the retriever
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=retriever
)

Once we have created the compression_retriever, we can use it to retrieve the compressed relevant documents to a query.

# retrieving compressed documents
retrieved_docs = compression_retriever.get_relevant_documents(
    "How Google plans to challenge OpenAI?"
)
print(retrieved_docs[0].page_content)
/Users/pranathfernando/opt/anaconda3/lib/python3.9/site-packages/langchain/chains/llm.py:275: UserWarning: The predict_and_parse method is deprecated, instead pass an output parser directly to LLMChain.
  warnings.warn(
Google is offering developers access to one of its most advanced AI language models: PaLM. The search giant is launching an API for PaLM alongside a number of AI enterprise tools it says will help businesses “generate text, images, code, videos, audio, and more from simple natural language prompts.”

Compressors are designed to make it simple to pass only the pertinent data to the LLM. This also lets you feed the LLM more information overall, because during the first retrieval stage you can concentrate on recall (by, for example, increasing the number of documents returned) and leave precision to the compressors.
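As an illustrative sketch (the k value is an arbitrary choice, not from the original, and our tiny example dataset only holds two chunks), the first-stage retriever could be widened and the compressor left to trim the result:

# first stage: cast a wider net by returning more chunks (higher recall)
wide_retriever = db.as_retriever(search_kwargs={"k": 4})

# second stage: the compressor keeps only the query-relevant passages (precision)
wide_compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=wide_retriever
)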

8 Conclusion

For working with unstructured data and language models, LangChain's indexes and retrievers provide modular, adaptable, and configurable solutions. They primarily concentrate on vector databases, though, and offer only limited support for structured data.

Further Reading:

https://python.langchain.com/docs/modules/data_connection/retrievers/contextual_compression/

https://blog.langchain.dev/improving-document-retrieval-with-contextual-compression/

9 Acknowledgements

I’d like to express my thanks to the wonderful LangChain & Vector Databases in Production Course by Activeloop, which I completed, and to acknowledge the use of some images and other materials from the course in this article.
