The Activeloop Deep Lake Vector Store for Agents & Large Language Models

Activeloop Deep Lake provides storage for embeddings and their corresponding metadata in the context of LLM apps. It enables hybrid searches on these embeddings and their attributes for efficient data retrieval, and integrates with LangChain and agents.
natural-language-processing
agents
langchain
activeloop
openai
Author

Pranath Fernando

Published

July 27, 2023

1 Introduction

Activeloop Deep Lake provides storage for embeddings and their corresponding metadata in the context of LLM apps. It enables hybrid searches on these embeddings and their attributes for efficient data retrieval. It also integrates with LangChain and agents, facilitating the development and deployment of applications.
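For instance, a hybrid search can combine vector similarity with a filter on document attributes. The snippet below is a minimal sketch, assuming a DeepLake vector store db built as in the sections that follow, with documents carrying a hypothetical “source” metadata field; the dict form of the filter argument reflects LangChain’s DeepLake wrapper at the time of writing and should be treated as an assumption.

# hybrid search sketch: vector similarity plus a metadata attribute filter
# (the "source" field is hypothetical, for illustration only)
results = db.similarity_search(
    "When was Napoleon born?",
    k=2,
    filter={"metadata": {"source": "biographies"}},
)
for doc in results:
    print(doc.page_content)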

2 Deep Lake vs Other Vector Stores

Deep Lake provides several advantages over the typical vector store:

  • It’s multimodal, which means that it can be used to store items of diverse modalities, such as texts, images, audio, and video, along with their vector representations.
  • It’s serverless, which means that we can create and manage cloud datasets without creating and managing a database instance. This significantly speeds up starting new projects.
  • Last, it’s easy to create a data loader from the data stored in a Deep Lake dataset, which is convenient for fine-tuning machine learning models using common frameworks like PyTorch and TensorFlow, as sketched below.
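As a quick illustration of that last point, here is a minimal sketch of wrapping a Deep Lake dataset as a PyTorch data loader via the deeplake library’s pytorch() helper; the dataset path and tensor names are hypothetical.

import deeplake

# load an existing Deep Lake dataset (hypothetical path)
ds = deeplake.load("hub://my_org/my_dataset")

# wrap it as a PyTorch-compatible data loader; `tensors` selects the
# fields returned in each batch (hypothetical tensor names)
dataloader = ds.pytorch(batch_size=32, shuffle=True, tensors=["embedding", "text"])

for batch in dataloader:
    pass  # feed batches into a training loop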

In order to use Deep Lake, you first have to register on the Activeloop website and create an API token. Here are the steps for doing so:

  1. Sign up for an account on Activeloop’s platform. You can sign up at Activeloop’s website. After specifying your username, click on the “Sign up” button. You should now see your homepage.
  2. You should now see a “Create API token” button at the top of your homepage. Click on it, and you’ll get redirected to the “API tokens” page. This is where you can generate, manage, and revoke your API keys for accessing Deep Lake.
  3. Click on the “Create API token” button. Then, you should see a popup asking for a token name and an expiration date. By default, the token expiration date is set so that the token expires after one day from its creation, but you can set it further in the future if you want to keep using the same token for the whole duration of the course. Once you’ve set the token name and its expiration date, click on the “Create API token” button.
  4. You should now see a green banner saying that the token has been successfully generated, along with your new API token, on the “API tokens” page. To copy your token to your clipboard, click on the square icon on its right.

Now that you have your API token, you can conveniently store it under the ACTIVELOOP_TOKEN key as an environment variable, so that the Deep Lake libraries can retrieve it automatically whenever needed.
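If you prefer not to use a .env file (the approach taken in the setup code below), a minimal sketch for setting the token directly in the process environment, prompting for the value rather than hard-coding it:

import os
from getpass import getpass

# set the token in the environment if it isn't already there
if "ACTIVELOOP_TOKEN" not in os.environ:
    os.environ["ACTIVELOOP_TOKEN"] = getpass("Activeloop API token: ")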

Let’s demonstrate how it can be used.

3 Import Libs & Setup

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import DeepLake
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA
from langchain.agents import initialize_agent, Tool
from langchain.agents import AgentType
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file
# We have loaded the environment vars using a .env file and have assigned os.environ["ACTIVELOOP_TOKEN"]

4 Basic Deep Lake Demo

Let’s demonstrate how we can use the Deep Lake vector store. We will use LangChain and an OpenAI GPT-3.5 model (text-davinci-003) as our LLM stack. We will set up a simple vector store containing some birthdays, create an LLM-based agent, then ask a question about one of the birthdays, which will require the agent to find the details in Deep Lake.

Let’s first set up the Deep Lake vector store and the LLM.

# instantiate the LLM and embeddings models
llm = OpenAI(model="text-davinci-003", temperature=0)
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

# create our documents
texts = [
    "Napoleon Bonaparte was born in 15 August 1769",
    "Louis XIV was born in 5 September 1638"
]
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.create_documents(texts)

# Create Deep Lake dataset
# Use your organization id here. (by default, org id is your username)
my_activeloop_org_id = "pranath" 
my_activeloop_dataset_name = "langchain_course_from_zero_to_hero"
dataset_path = f"hub://{my_activeloop_org_id}/{my_activeloop_dataset_name}"
db = DeepLake(dataset_path=dataset_path, embedding_function=embeddings)

# add documents to our Deep Lake dataset
db.add_documents(docs)
Your Deep Lake dataset has been successfully created!
Dataset(path='hub://pranath/langchain_course_from_zero_to_hero', tensors=['embedding', 'id', 'metadata', 'text'])

  tensor      htype      shape     dtype  compression
  -------    -------    -------   -------  ------- 
 embedding  embedding  (2, 1536)  float32   None   
    id        text      (2, 1)      str     None   
 metadata     json      (2, 1)      str     None   
   text       text      (2, 1)      str     None   
['d9f49eb8-354b-11ee-9eb0-acde48001122',
 'd9f4a034-354b-11ee-9eb0-acde48001122']

Now, let’s create a LangChain RetrievalQA chain:

retrieval_qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=db.as_retriever()
)

Next, let’s create an agent that uses the RetrievalQA chain as a tool:

tools = [
    Tool(
        name="Retrieval QA System",
        func=retrieval_qa.run,
        description="Useful for answering questions."
    ),
]

agent = initialize_agent(
    tools,
    llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True
)

Finally, we can use the agent to ask a question:

response = agent.run("When was Napoleone born?")
print(response)


> Entering new  chain...
 I need to find out when Napoleone was born.
Action: Retrieval QA System
Action Input: When was Napoleone born?
Observation:  Napoleon Bonaparte was born on 15 August 1769.
Thought: I now know the final answer.
Final Answer: Napoleon Bonaparte was born on 15 August 1769.

> Finished chain.
Napoleon Bonaparte was born on 15 August 1769.

Here, the agent used the “Retrieval QA System” tool with the query “When was Napoleone born?”, which is run against our new Deep Lake dataset, returning the most similar document (i.e., the document containing Napoleon’s date of birth). The retrieved document is then used to generate the final answer.
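To see what the tool retrieves under the hood, we can also query the vector store directly; a minimal sketch using the db object created above:

# inspect the raw retrieval step: embed the query and return the top match
docs = db.similarity_search("When was Napoleone born?", k=1)
print(docs[0].page_content)  # expected: the Napoleon birthday sentence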

Note that the agent also made use of the ReAct framework to structure the LLM prompt.

This example shows how to use Deep Lake as a vector database and develop an agent that uses a RetrievalQA chain as a tool to answer questions based on the stored content.

5 Adding More Data and Reloading Deep Lake

Let’s now look at a case where more data is added and an existing vector store is reloaded.

We first reload the existing Deep Lake vector store located at the given dataset path. We then create fresh text data and split it into manageable chunks. Finally, we add these chunks to the existing dataset, generating and storing a matching embedding for each new text segment:

# load the existing Deep Lake dataset and specify the embedding function
db = DeepLake(dataset_path=dataset_path, embedding_function=embeddings)

# create new documents
texts = [
    "Lady Gaga was born in 28 March 1986",
    "Michael Jeffrey Jordan was born in 17 February 1963"
]
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.create_documents(texts)

# add documents to our Deep Lake dataset
db.add_documents(docs)
Deep Lake Dataset in hub://pranath/langchain_course_from_zero_to_hero already exists, loading from the storage
Dataset(path='hub://pranath/langchain_course_from_zero_to_hero', tensors=['embedding', 'id', 'metadata', 'text'])

  tensor      htype      shape     dtype  compression
  -------    -------    -------   -------  ------- 
 embedding  embedding  (4, 1536)  float32   None   
    id        text      (4, 1)      str     None   
 metadata     json      (4, 1)      str     None   
   text       text      (4, 1)      str     None   
['b7931762-354d-11ee-9eb0-acde48001122',
 'b79318e8-354d-11ee-9eb0-acde48001122']

Then, we recreate our prior agent and ask a question that can only be answered by the most recently added documents.

# instantiate the wrapper class for GPT3
llm = OpenAI(model="text-davinci-003", temperature=0)

# create a retriever from the db
retrieval_qa = RetrievalQA.from_chain_type(
    llm=llm, chain_type="stuff", retriever=db.as_retriever()
)

# instantiate a tool that uses the retriever
tools = [
    Tool(
        name="Retrieval QA System",
        func=retrieval_qa.run,
        description="Useful for answering questions."
    ),
]

# create an agent that uses the tool
agent = initialize_agent(
    tools,
    llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True
)

Let’s now test our agent with a new question.

response = agent.run("When was Michael Jordan born?")
print(response)


> Entering new  chain...
 I need to find out when Michael Jordan was born.
Action: Retrieval QA System
Action Input: When was Michael Jordan born?
Observation:  Michael Jordan was born on 17 February 1963.
Thought: I now know the final answer.
Final Answer: Michael Jordan was born on 17 February 1963.

> Finished chain.
Michael Jordan was born on 17 February 1963.

6 Acknowledgements

I’d like to express my thanks to the wonderful LangChain & Vector Databases in Production course by Activeloop, which I completed, and to acknowledge the use of some images and other materials from the course in this article.
