Vectorstores and Embeddings with LangChain

In this article we look at how to convert documents into vector stores and embeddings, an important step in making content available to Large Language Models.
natural-language-processing
deep-learning
langchain
Author

Pranath Fernando

Published

July 22, 2023

1 Introduction

In this article we look at how to convert documents into vector stores and embeddings, an important step in making content available to Large Language Models.

2 Vectorstores and Embeddings

Now that our document has been divided into manageable, semantically meaningful chunks, we need to index those chunks so we can quickly retrieve them when responding to questions about this corpus of information. To accomplish this, we'll use vector stores and embeddings. Let's find out what they are.

First off, these are crucial for creating chatbots over your data. Second, we'll delve a little deeper and discuss edge cases where this general approach can fall short.

Recall the overall workflow for retrieval augmented generation (RAG): documents are loaded, split into chunks, embedded, and stored in a vector store, ready for retrieval at query time.

3 Load Libs & Setup

We start by loading a few documents. Once loaded, they are split into chunks using the recursive character text splitter, which produces just over 200 distinct chunks. The embeddings for these chunks will be produced using OpenAI.

import os
import openai
import sys
sys.path.append('../..')

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

openai.api_key = os.environ['OPENAI_API_KEY']

We discussed Document Loading and Splitting in a previous article.

from langchain.document_loaders import PyPDFLoader

# Load PDF
loaders = [
    # Duplicate documents on purpose - messy data
    PyPDFLoader("docs/MachineLearning-Lecture01.pdf"),
    PyPDFLoader("docs/MachineLearning-Lecture01.pdf"),
    PyPDFLoader("docs/MachineLearning-Lecture02.pdf"),
    PyPDFLoader("docs/MachineLearning-Lecture03.pdf")
]
docs = []
for loader in loaders:
    docs.extend(loader.load())
# Split
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,
    chunk_overlap=150
)
splits = text_splitter.split_documents(docs)
len(splits)
209
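
Before embedding anything, it's worth a quick look at what one of these chunks contains. A minimal inspection sketch (output not shown here):

# Peek at the first chunk's metadata and the start of its text
print(splits[0].metadata)
print(splits[0].page_content[:300])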

4 Embeddings

What exactly are embeddings? An embedding is a numerical vector representation of a piece of text. Texts with similar content end up with similar vectors in this numerical space, so by comparing vectors we can identify passages that are alike. In the example below, the two sentences about pets are quite similar to each other, but a sentence about a pet and a sentence about the weather are much less so.

from langchain.embeddings.openai import OpenAIEmbeddings
embedding = OpenAIEmbeddings()
sentence1 = "i like dogs"
sentence2 = "i like canines"
sentence3 = "the weather is ugly outside"
embedding1 = embedding.embed_query(sentence1)
embedding2 = embedding.embed_query(sentence2)
embedding3 = embedding.embed_query(sentence3)
import numpy as np
np.dot(embedding1, embedding2)
0.9631853877103518
np.dot(embedding1, embedding3)
0.7709997651294672
np.dot(embedding2, embedding3)
0.7596334120325523
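
A note on the similarity measure: OpenAI embeddings come back (approximately) unit-normalised, which is why the plain dot product above behaves like cosine similarity. A small sketch to check that assumption:

# If the vectors have length ~1.0, the dot product equals cosine similarity
norm1 = np.linalg.norm(embedding1)
norm2 = np.linalg.norm(embedding2)
print(norm1, norm2)  # both should be close to 1.0
print(np.dot(embedding1, embedding2) / (norm1 * norm2))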

Recalling the entire end-to-end workflow: we begin with documents, split them into smaller chunks, create embeddings of those chunks, and store everything in a vector store. A vector store is a database in which you can quickly look up similar vectors later on. This is helpful when we want to find documents relevant to a question at hand: we embed the question, compare that embedding to every vector in the vector store, and pick the most similar ones.

Then, after selecting the n chunks that are most similar, we pass the question and those chunks to an LLM to receive an answer. We'll talk more about all of that later; for the time being, let's focus on vector stores and embeddings themselves.

Comparing the dot products computed above: the first two embeddings have a relatively high similarity of 0.96, the first and third are substantially lower at 0.77, and the second and third come out roughly the same at 0.76.
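
To make the retrieval step described above concrete, here is a minimal brute-force sketch of "compare the question embedding to every stored vector and take the top n", using plain NumPy on a small sample of the splits (illustrative only; a vector store does this for us, at scale):

# Brute-force top-n retrieval over a small sample of chunks
# (illustrative; embedding all 209 chunks this way would be slow)
sample = splits[:20]
chunk_vectors = embedding.embed_documents([s.page_content for s in sample])
question_vector = embedding.embed_query("what did they say about matlab?")
scores = [np.dot(question_vector, v) for v in chunk_vectors]
top_n = np.argsort(scores)[::-1][:3]  # indices of the 3 most similar chunks
print(top_n)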

5 Vectorstores

It's time to build embeddings for every chunk of our example PDFs and store them all together in a vector store.

We'll utilise Chroma as our vector store for this, so let's import it. LangChain has integrations with a large number of vector stores, more than 30 in total. We choose Chroma because it is lightweight and in-memory, making it simple to set up and use. For persisting large volumes of data, or persisting to cloud storage, other vector stores offer hosted solutions.

Let's create a variable named persist_directory, which we'll point at docs/chroma/ and use again later. We should also check whether anything is already present there: leftover material can throw things off, and we don't want that. To make sure the directory is empty, we run rm -rf ./docs/chroma. Now we can create the vector store by calling Chroma.from_documents, passing in the splits we created earlier along with the embedding model.

# ! pip install chromadb
from langchain.vectorstores import Chroma
persist_directory = 'docs/chroma/'
!rm -rf ./docs/chroma  # remove old database files if any
vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embedding,
    persist_directory=persist_directory
)
print(vectordb._collection.count())
209

Here, embedding is the OpenAI embedding model we defined earlier. By supplying the persist_directory keyword argument, which is specific to Chroma, we can save the database to disk. After doing this, we can see that the collection count is 209, exactly the same as the number of splits we had before.
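
Because we passed persist_directory, the database can be saved to disk and reloaded in a later session. A short sketch of this, assuming the same embedding function is supplied when reloading:

# Persist the database to disk so it can be reused later
vectordb.persist()

# In a later session, reload it from the same directory
# (the same embedding function must be passed in)
vectordb = Chroma(
    persist_directory=persist_directory,
    embedding_function=embedding
)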

6 Failure modes

This seems great, and basic similarity search will get you 80% of the way there very easily. But there are some failure modes that can creep in; here are some edge cases that can arise.
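
To set a baseline before looking at the failure cases, a basic similarity search against the vector store looks like this (a sketch; the question here is illustrative):

# Basic similarity search: return the k chunks closest to the question
question = "is there an email i can ask for help"
docs = vectordb.similarity_search(question, k=3)
print(len(docs))                    # 3
print(docs[0].page_content[:200])   # start of the closest chunk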

Let's try a different query: what did they say about MATLAB? We'll run it with k=5 and see what happens. Looking at the first two results, we can see they are identical. This is because, as you may recall, we purposely included a duplicate document when loading the PDFs. That's problematic, since we will later pass both of these chunks to the language model, and the same information appears twice. The second chunk adds no real value; it would be much better if the language model received a different, more distinct chunk instead.

question = "what did they say about matlab?"
docs = vectordb.similarity_search(question, k=5)

Notice that we’re getting duplicate chunks (because of the duplicate MachineLearning-Lecture01.pdf in the index).

Semantic search fetches all similar documents, but does not enforce diversity.

docs[0] and docs[1] are identical.

docs[0]
Document(page_content='those homeworks will be done in either MATLA B or in Octave, which is sort of — I \nknow some people call it a free ve rsion of MATLAB, which it sort  of is, sort of isn\'t.  \nSo I guess for those of you that haven\'t s een MATLAB before, and I know most of you \nhave, MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to \nplot data. And it\'s sort of an extremely easy to  learn tool to use for implementing a lot of \nlearning algorithms.  \nAnd in case some of you want to work on your  own home computer or something if you \ndon\'t have a MATLAB license, for the purposes of  this class, there\'s also — [inaudible] \nwrite that down [inaudible] MATLAB — there\' s also a software package called Octave \nthat you can download for free off the Internet. And it has somewhat fewer features than MATLAB, but it\'s free, and for the purposes of  this class, it will work for just about \neverything.  \nSo actually I, well, so yeah, just a side comment for those of you that haven\'t seen \nMATLAB before I guess, once a colleague of mine at a different university, not at \nStanford, actually teaches another machine l earning course. He\'s taught it for many years. \nSo one day, he was in his office, and an old student of his from, lik e, ten years ago came \ninto his office and he said, "Oh, professo r, professor, thank you so much for your', metadata={'source': 'docs/MachineLearning-Lecture01.pdf', 'page': 8})
docs[1]
Document(page_content='those homeworks will be done in either MATLA B or in Octave, which is sort of — I \nknow some people call it a free ve rsion of MATLAB, which it sort  of is, sort of isn\'t.  \nSo I guess for those of you that haven\'t s een MATLAB before, and I know most of you \nhave, MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to \nplot data. And it\'s sort of an extremely easy to  learn tool to use for implementing a lot of \nlearning algorithms.  \nAnd in case some of you want to work on your  own home computer or something if you \ndon\'t have a MATLAB license, for the purposes of  this class, there\'s also — [inaudible] \nwrite that down [inaudible] MATLAB — there\' s also a software package called Octave \nthat you can download for free off the Internet. And it has somewhat fewer features than MATLAB, but it\'s free, and for the purposes of  this class, it will work for just about \neverything.  \nSo actually I, well, so yeah, just a side comment for those of you that haven\'t seen \nMATLAB before I guess, once a colleague of mine at a different university, not at \nStanford, actually teaches another machine l earning course. He\'s taught it for many years. \nSo one day, he was in his office, and an old student of his from, lik e, ten years ago came \ninto his office and he said, "Oh, professo r, professor, thank you so much for your', metadata={'source': 'docs/MachineLearning-Lecture01.pdf', 'page': 8})
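
We can confirm the duplication programmatically; the two chunks compare equal:

# The first two results contain exactly the same text
print(docs[0].page_content == docs[1].page_content)  # True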

Another type of failure mode is also possible. Our new query is: what did they say about regression in the third lecture? Intuitively, we would expect all of the retrieved documents to come from the third lecture.

We can verify this using the metadata recording which lecture each chunk was taken from. Let's iterate over the results and print the metadata. The results are actually a mix of chunks from the first, second, and third lectures.

question = "what did they say about regression in the third lecture?"
docs = vectordb.similarity_search(question, k=5)
for doc in docs:
    print(doc.metadata)
{'source': 'docs/MachineLearning-Lecture03.pdf', 'page': 0}
{'source': 'docs/MachineLearning-Lecture03.pdf', 'page': 14}
{'source': 'docs/MachineLearning-Lecture02.pdf', 'page': 0}
{'source': 'docs/MachineLearning-Lecture03.pdf', 'page': 6}
{'source': 'docs/MachineLearning-Lecture01.pdf', 'page': 8}
print(docs[4].page_content)
into his office and he said, "Oh, professo r, professor, thank you so much for your 
machine learning class. I learned so much from it. There's this stuff that I learned in your 
class, and I now use every day. And it's help ed me make lots of money, and here's a 
picture of my big house."  
So my friend was very excited. He said, "W ow. That's great. I'm glad to hear this 
machine learning stuff was actually useful. So what was it that you learned? Was it 
logistic regression? Was it the PCA? Was it the data ne tworks? What was it that you 
learned that was so helpful?" And the student said, "Oh, it was the MATLAB."  
So for those of you that don't know MATLAB yet, I hope you do learn it. It's not hard, 
and we'll actually have a short MATLAB tutori al in one of the discussion sections for 
those of you that don't know it.  
Okay. The very last piece of logistical th ing is the discussion s ections. So discussion 
sections will be taught by the TAs, and atte ndance at discussion sections is optional, 
although they'll also be recorded and televi sed. And we'll use the discussion sections 
mainly for two things. For the next two or th ree weeks, we'll use the discussion sections 
to go over the prerequisites to this class or if some of you haven't seen probability or 
statistics for a while or maybe algebra, we'll go over those in the discussion sections as a 
refresher for those of you that want one.

The phrase "in the third lecture" is a piece of structured information: we only want documents from the third lecture. But we're performing a purely semantic lookup, which embeds the whole question, and that embedding is probably weighted more towards regression. As a result, we get results that are presumably quite relevant to regression; indeed, looking at the fifth document, the one from the first lecture, regression is in fact mentioned there.

So the search is picking up on the mention of regression, but the constraint that we should only be querying documents from the third lecture is structured information that isn't really represented in our semantic embedding, and it gets ignored.

This is a case where we might actually want to do some kind of pre-filtering on our embeddings, for example to restrict the search to chunks from the third lecture document. This is possible using richer metadata, and indexes over that metadata, which I will look at in the next article.
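
As a preview, Chroma supports metadata filtering via a filter argument on similarity_search, so a minimal pre-filter might look like the sketch below (the next article covers this properly):

# Restrict the search to chunks whose metadata marks them as
# coming from the third lecture
docs = vectordb.similarity_search(
    question,
    k=3,
    filter={"source": "docs/MachineLearning-Lecture03.pdf"}
)
for doc in docs:
    print(doc.metadata)  # now every result should be from Lecture03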

7 Acknowledgements

I'd like to express my thanks to the wonderful LangChain: Chat with your data course by DeepLearning.ai and LangChain, which I completed, and acknowledge the use of some images and other materials from the course in this article.
