Advanced Vectorstore Retrieval using LangChain

In this article we look at how you can retrieve content from a vectorstore using state-of-the-art methods to ensure only the most relevant content is made available for Large Language Models.
natural-language-processing
deep-learning
langchain
Author

Pranath Fernando

Published

July 23, 2023

1 Introduction

In this article we look at how you can retrieve content from a vectorstore using state-of-the-art methods to ensure only the most relevant content is made available for Large Language Models.

These methods include:

  • Maximum Marginal Relevance
  • Metadata
  • Metadata using a Self-Query retriever
  • Compression
  • Combining all the above

Retrieval is the centerpiece of our retrieval augmented generation (RAG) flow.

In previous articles we covered the basics of semantic search and saw that it works well for a good number of use cases. But we also saw some edge cases where things could go a little wrong. In this article, we dive deeper into retrieval and cover a few more advanced methods for handling those edge cases.

2 Load Libs & Setup

import os
import openai
import sys
sys.path.append('../..')

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

openai.api_key  = os.environ['OPENAI_API_KEY']
!pip install lark
Requirement already satisfied: lark in /Users/pranathfernando/opt/anaconda3/lib/python3.9/site-packages (1.1.7)
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings
persist_directory = 'docs/chroma/'

3 Addressing Diversity: Maximum marginal relevance

Semantic similarity search was discussed in a previous article; here we'll cover a few other, more sophisticated approaches. The first is Maximum Marginal Relevance, or MMR.

In my previous articles we introduced one problem: how do we enforce diversity in the search results?

Maximum marginal relevance strives to achieve both relevance to the query and diversity among the results.

The reasoning is that, as we observed in one of the edge cases, if you always choose the documents in the embedding space that are most similar to the query, you may miss out on other relevant information.

Let's set up our embeddings and vector database.

embedding = OpenAIEmbeddings()
vectordb = Chroma(
    persist_directory=persist_directory,
    embedding_function=embedding
)
print(vectordb._collection.count())
209

In this example, imagine a chef asking about all-white mushrooms. A plain similarity search would return the first two documents, which contain a lot of material similar to the query about a fruiting body and being all-white. But we also want to learn other things, such as how dangerous the mushroom is.

texts = [
    """The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).""",
    """A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.""",
    """A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.""",
]

We'll build a small toy database for this example. With our question in hand, we can perform a similarity search, setting k=2 to return only the top two documents. Notice that the fact that the mushroom is poisonous is not mentioned. Now let's try the same query with MMR: we still pass k=2 because we want two documents back, but we set fetch_k=3 so that all three documents are fetched initially.

smalldb = Chroma.from_texts(texts, embedding=embedding)
question = "Tell me about all-white mushrooms with large fruiting bodies"
smalldb.similarity_search(question, k=2)
[Document(page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.', metadata={}),
 Document(page_content='The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).', metadata={})]
smalldb.max_marginal_relevance_search(question,k=2, fetch_k=3)
[Document(page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.', metadata={}),
 Document(page_content='A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.', metadata={})]

Let's revisit an example from a previous article in which we asked about MATLAB and got back documents containing duplicate information. To refresh your memory, we can check that the first two documents are identical by looking at just their first few characters (the full documents are lengthy). When we run MMR on the same query, the first result is still the same, since it is the most relevant, but the second result is different.

question = "what did they say about matlab?"
docs_ss = vectordb.similarity_search(question,k=3)
docs_ss[0].page_content[:100]
'those homeworks will be done in either MATLA B or in Octave, which is sort of — I \nknow some people '
docs_ss[1].page_content[:100]
'those homeworks will be done in either MATLA B or in Octave, which is sort of — I \nknow some people '

Note the difference in results with MMR.

docs_mmr = vectordb.max_marginal_relevance_search(question,k=3)
docs_mmr[0].page_content[:100]
'those homeworks will be done in either MATLA B or in Octave, which is sort of — I \nknow some people '
docs_mmr[1].page_content[:100]
"mathematical work, he feels like he's disc overing truth and beauty in the universe. And \nhe says it"

This is where MMR is useful, because it chooses from a wider variety of documents. The principle behind MMR is that after sending a query we first fetch a set of candidate responses, with the number of candidates controlled by the fetch_k parameter; this initial step is based purely on semantic similarity. Then, from that candidate set, we optimise for both relevance and diversity, and select a final k documents to return to the user.
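
For reference, the MMR search in LangChain's Chroma integration also exposes (if I recall the signature correctly) a lambda_mult parameter that controls the relevance/diversity trade-off, where values near 1 favour pure relevance and values near 0 favour maximum diversity. A minimal sketch against the toy database above, assuming that parameter is available:

# fetch_k candidates are retrieved by pure similarity, then k of them are
# re-ranked to balance relevance against diversity.
smalldb.max_marginal_relevance_search(
    "Tell me about all-white mushrooms with large fruiting bodies",
    k=2,
    fetch_k=3,
    lambda_mult=0.5,
)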

You can read more about MMR in this langchain documentation and this medium article.

4 Addressing Specificity: Working with Metadata

In a previous article we showed that a question about the third lecture can include results from other lectures as well. To address this, many vectorstores support operations on metadata, which provides context for each embedded chunk.

One approach is to split the original query into two components: a search term and a metadata filter. Most vectorstores support metadata filters, so you can restrict results by a metadata field, such as a year or a source file. In our case, we will filter by source, since we only want to include context from the third lecture PDF; here we specify the filter by hand, and in the next section we will infer it from the query with a language model.

question = "what did they say about regression in the third lecture?"
docs = vectordb.similarity_search(
    question,
    k=3,
    filter={"source":"docs/MachineLearning-Lecture03.pdf"}
)
for d in docs:
    print(d.metadata)
{'source': 'docs/MachineLearning-Lecture03.pdf', 'page': 0}
{'source': 'docs/MachineLearning-Lecture03.pdf', 'page': 14}
{'source': 'docs/MachineLearning-Lecture03.pdf', 'page': 4}
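
As an aside, the filter doesn't have to be a single equality check. Chroma's where-filter syntax, which as far as I know LangChain passes through unchanged, also supports operators such as $and, $eq and $gte. A rough sketch, assuming that operator syntax is available in your Chroma version, restricting results to later pages of the same lecture:

# Sketch: combine a source filter with a page-number condition using
# Chroma's operator syntax ($and / $eq / $gte).
docs = vectordb.similarity_search(
    question,
    k=3,
    filter={
        "$and": [
            {"source": {"$eq": "docs/MachineLearning-Lecture03.pdf"}},
            {"page": {"$gte": 5}},
        ]
    },
)
for d in docs:
    print(d.metadata)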

5 Addressing Specificity: Working with Metadata using a Self-Query retriever

But we have an interesting challenge: we often want to infer the metadata from the query itself.

To address this, we can use SelfQueryRetriever, which uses an LLM to extract:

  1. The query string to use for vector search
  2. A metadata filter to pass in as well

Most vector databases support metadata filters, so this doesn’t require any new databases or indexes.

Now let's move on to the self-query example. Earlier, our concern was what was said about regression in the third lecture, and the results also included chunks from the first and second lectures. If we were fixing this manually, we would specify a metadata filter requiring the source to be the third lecture's PDF, and every document returned would then come from that particular lecture.

We don't need to specify that filter manually, because a language model can infer it for us. To do this, we import the OpenAI language model, the self-query retriever, and AttributeInfo, which lets us describe specific fields in the metadata and what they correspond to. The metadata here contains only the fields source and page. For each attribute we fill in a name, a description, and a type; making these descriptions as informative as possible is important because they are passed directly to the language model. Next, we provide a short description of the contents of the document store.

from langchain.llms import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo
metadata_field_info = [
    AttributeInfo(
        name="source",
        description="The lecture the chunk is from, should be one of `docs/MachineLearning-Lecture01.pdf`, `docs/MachineLearning-Lecture02.pdf`, or `docs/MachineLearning-Lecture03.pdf`",
        type="string",
    ),
    AttributeInfo(
        name="page",
        description="The page from the lecture",
        type="integer",
    ),
]

We initialise the language model, then create the self-query retriever with the from_llm method, passing in the language model, the underlying vector database we're going to query, the document description, the metadata field info, and finally verbose=True.

With verbose=True set, we can see the inner workings of how the LLM decides which query string to pass along and which metadata filters to apply.

In the printed output, the semantic part of the query is simply "regression", followed by a filter with an equality comparator between the source attribute and the path of the third machine learning lecture PDF. In other words, the retriever is told to do a semantic lookup for regression and then keep only documents whose source matches that value. If we loop through the returned documents and print their metadata, we should see that they all come from the third lecture.

document_content_description = "Lecture notes"
llm = OpenAI(temperature=0)
retriever = SelfQueryRetriever.from_llm(
    llm,
    vectordb,
    document_content_description,
    metadata_field_info,
    verbose=True
)
question = "what did they say about regression in the third lecture?"

You will receive a warning about predict_and_parse being deprecated the first time you execute the next line. This can be safely ignored.

docs = retriever.get_relevant_documents(question)
/Users/pranathfernando/opt/anaconda3/lib/python3.9/site-packages/langchain/chains/llm.py:275: UserWarning: The predict_and_parse method is deprecated, instead pass an output parser directly to LLMChain.
  warnings.warn(
query='regression' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='source', value='docs/MachineLearning-Lecture03.pdf') limit=None
for d in docs:
    print(d.metadata)
{'source': 'docs/MachineLearning-Lecture03.pdf', 'page': 14}
{'source': 'docs/MachineLearning-Lecture03.pdf', 'page': 0}
{'source': 'docs/MachineLearning-Lecture03.pdf', 'page': 10}
{'source': 'docs/MachineLearning-Lecture03.pdf', 'page': 10}

You can read more about using metadata and a self-query retriever in this langchain documentation that uses Pinecone, as well as this medium article that uses PaLM.

This is also a good article from Pinecone describing the filtering issue with vector stores and metadata.

6 Additional tricks: Compression

Contextual compression is the final retrieval technique we will discuss. Let's import the contextual compression retriever and an LLM chain extractor.

The idea is to extract only the key information from each retrieved document and return just that. Since the documents are often long and hard to scan, we'll also write a small helper function to pretty-print them so it's easier to see what's going on.

Information most relevant to a query may be buried in a document with a lot of irrelevant text. Passing that full document through your application can lead to more expensive LLM calls and poorer responses. Contextual compression is meant to fix this.

With compression, you run the retrieved documents through a language model, extract the most relevant segments, and pass only those segments into the final language model call. This comes at the cost of extra calls to the language model, but it is very effective at focusing the final answer on the most important content, so it's a bit of a trade-off. Let's see it in action.

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
def pretty_print_docs(docs):
    print(f"\n{'-' * 100}\n".join([f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]))
# Wrap our vectorstore
llm = OpenAI(temperature=0)
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever()
)

Using the LLM chain extractor, we create a compressor, and then combine the compressor with the vectorstore's base retriever to build a contextual compression retriever.

If we pass in the question "what did they say about MATLAB?" and look at the compressed documents we receive back, we can observe two things. First, they are much shorter than the original documents. However, because plain semantic search is still being used behind the scenes, some duplicated content remains.

question = "what did they say about matlab?"
compressed_docs = compression_retriever.get_relevant_documents(question)
pretty_print_docs(compressed_docs)
Document 1:

"MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to plot data. And it's sort of an extremely easy to learn tool to use for implementing a lot of learning algorithms."
----------------------------------------------------------------------------------------------------
Document 2:

"MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to plot data. And it's sort of an extremely easy to learn tool to use for implementing a lot of learning algorithms."
----------------------------------------------------------------------------------------------------
Document 3:

"And the student said, "Oh, it was the MATLAB." So for those of you that don't know MATLAB yet, I hope you do learn it. It's not hard, and we'll actually have a short MATLAB tutorial in one of the discussion sections for those of you that don't know it."
----------------------------------------------------------------------------------------------------
Document 4:

"And the student said, "Oh, it was the MATLAB." So for those of you that don't know MATLAB yet, I hope you do learn it. It's not hard, and we'll actually have a short MATLAB tutorial in one of the discussion sections for those of you that don't know it."

You can read more about contextual compression in this langchain documentation.

7 Combining various techniques

Here is an illustration of how you might combine several of these strategies to get the best results. When building the retriever from the vector database, we set the search type to MMR, and we keep contextual compression on top. Running this, we get a filtered set of results with no duplicated content. So far, a vector database has been the foundation for every retrieval method we have covered, but it's worth noting that there are other retrieval methods that use more conventional NLP approaches and no vector database at all.

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever(search_type = "mmr")
)
question = "what did they say about matlab?"
compressed_docs = compression_retriever.get_relevant_documents(question)
pretty_print_docs(compressed_docs)
Document 1:

"MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to plot data. And it's sort of an extremely easy to learn tool to use for implementing a lot of learning algorithms."
----------------------------------------------------------------------------------------------------
Document 2:

"And the student said, "Oh, it was the MATLAB." So for those of you that don't know MATLAB yet, I hope you do learn it. It's not hard, and we'll actually have a short MATLAB tutorial in one of the discussion sections for those of you that don't know it."

8 Other Types of Retrieval

It's worth noting that a vectordb is not the only kind of tool for retrieving documents. The LangChain retriever abstraction includes other ways to retrieve documents, such as TF-IDF or an SVM.

Here we'll recreate a retrieval pipeline using an SVM retriever and a TF-IDF retriever. You may recognise these terms from conventional NLP and machine learning; they are just two of the many methods available.

from langchain.retrievers import SVMRetriever
from langchain.retrievers import TFIDFRetriever
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Load PDF
loader = PyPDFLoader("docs/MachineLearning-Lecture01.pdf")
pages = loader.load()
all_page_text=[p.page_content for p in pages]
joined_page_text=" ".join(all_page_text)

# Split
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 1500,chunk_overlap = 150)
splits = text_splitter.split_text(joined_page_text)
# Retrieve
svm_retriever = SVMRetriever.from_texts(splits,embedding)
tfidf_retriever = TFIDFRetriever.from_texts(splits)
question = "What are major topics for this class?"
docs_svm=svm_retriever.get_relevant_documents(question)
docs_svm[0]
Document(page_content="let me just check what questions you have righ t now. So if there are no questions, I'll just \nclose with two reminders, which are after class today or as you start to talk with other \npeople in this class, I just encourage you again to start to form project partners, to try to \nfind project partners to do your project with. And also, this is a good time to start forming \nstudy groups, so either talk to your friends  or post in the newsgroup, but we just \nencourage you to try to star t to do both of those today, okay? Form study groups, and try \nto find two other project partners.  \nSo thank you. I'm looking forward to teaching this class, and I'll see you in a couple of \ndays.   [End of Audio]  \nDuration: 69 minutes", metadata={})

The standard pipeline of loading and splitting completes quickly. The TF-IDF retriever accepts the splits directly, while the SVM retriever also takes an embedding module. We can then use these retrievers like any other. Above, we asked the SVM retriever about the major topics of the class; below, we ask the TF-IDF retriever what was said about MATLAB, and looking at the top document it returns, the results are a little less favourable than what the vectorstore gave us.

question = "what did they say about matlab?"
docs_tfidf=tfidf_retriever.get_relevant_documents(question)
docs_tfidf[0]
Document(page_content="Saxena and Min Sun here did, wh ich is given an image like this, right? This is actually a \npicture taken of the Stanford campus. You can apply that sort of cl ustering algorithm and \ngroup the picture into regions. Let me actually blow that up so that you can see it more \nclearly. Okay. So in the middle, you see the lines sort of groupi ng the image together, \ngrouping the image into [inaudible] regions.  \nAnd what Ashutosh and Min did was they then  applied the learning algorithm to say can \nwe take this clustering and us e it to build a 3D model of the world? And so using the \nclustering, they then had a lear ning algorithm try to learn what the 3D structure of the \nworld looks like so that they could come up with a 3D model that you can sort of fly \nthrough, okay? Although many people used to th ink it's not possible to take a single \nimage and build a 3D model, but using a lear ning algorithm and that sort of clustering \nalgorithm is the first step. They were able to.  \nI'll just show you one more example. I like this  because it's a picture of Stanford with our \nbeautiful Stanford campus. So again, taking th e same sort of clustering algorithms, taking \nthe same sort of unsupervised learning algor ithm, you can group the pixels into different \nregions. And using that as a pre-processing step, they eventually built this sort of 3D model of Stanford campus in a single picture.  You can sort of walk  into the ceiling, look", metadata={})

You'll likely find that some of these approaches work better than others, depending on the situation.

The self-query retriever is especially neat, so I would advise experimenting with it using increasingly complicated metadata filters. You could even make up some metadata and see whether the LLM can handle nested metadata structures, as sketched below.
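
As a starting point, here is a hedged sketch of what a self-query retriever with a richer, partly made-up metadata schema might look like; the year and topic fields below are hypothetical and are not present in the lecture PDFs used in this article:

# Hypothetical richer metadata schema for experimentation -- the extra
# fields are made up for illustration only.
metadata_field_info = [
    AttributeInfo(
        name="source",
        description="The lecture PDF the chunk is from",
        type="string",
    ),
    AttributeInfo(
        name="page",
        description="The page number within the lecture",
        type="integer",
    ),
    AttributeInfo(
        name="year",
        description="The year the lecture was recorded",
        type="integer",
    ),
    AttributeInfo(
        name="topic",
        description="The main topic of the chunk, e.g. `regression` or `clustering`",
        type="string",
    ),
]
retriever = SelfQueryRetriever.from_llm(
    llm,
    vectordb,
    "Lecture notes",
    metadata_field_info,
    verbose=True,
)
# A query like this should make the LLM produce a compound filter,
# e.g. and(eq(topic, 'regression'), gte(year, 2010)).
docs = retriever.get_relevant_documents(
    "what did lectures recorded after 2010 say about regression?"
)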

9 Acknowledgements

I'd like to express my thanks to the wonderful LangChain: Chat with your data course by DeepLearning.ai and LangChain, which I completed, and acknowledge the use of some images and other materials from the course in this article.
