Exploring Embeddings for Large Language Models

Embeddings are high-dimensional vectors that capture semantic information. Large language models can transform textual data into this embedding space, enabling flexible representations across languages, and the resulting embeddings are useful tools for identifying relevant information.
natural-language-processing
deep-learning
langchain
activeloop
openai
retrievers
vectordb
Author

Pranath Fernando

Published

August 10, 2023

1 Introduction

Vector embeddings are among the most fascinating and useful components of machine learning, and they are essential to many natural language processing, recommendation, and search algorithms. If you have used voice assistants, recommendation engines, or translators, you have already interacted with systems that rely on embeddings.

Embeddings are semantically rich, dense vector representations of data that are well suited to a variety of machine learning applications, including clustering, recommendation, and classification. They can be generated for many kinds of data, including text, images, and audio, and they convert human-perceived semantic similarity into closeness in vector space.

For text data, vector embeddings of words, phrases, or paragraphs are produced by models such as those in the GPT family and Llama. Convolutional neural networks (CNNs), such as VGG and Inception, can produce embeddings for images. Audio recordings can be turned into vectors by applying image embedding methods to spectrograms and other visual representations of audio frequencies. It is common practice to use deep neural networks to train models that transform objects into vectors, and the resulting embeddings are typically dense and high-dimensional.

Embeddings are widely utilised in similarity search applications such as KNN and ANN, where distances between vectors must be computed to identify similar items. Nearest neighbour search can then be applied to tasks like de-duplication, recommendations, anomaly detection, and reverse image search.
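As a minimal illustration of nearest neighbour search, the sketch below ranks a handful of made-up vectors by their Euclidean distance to a query vector using plain NumPy; a production system would typically use an ANN index over real embeddings instead.

import numpy as np

# toy "embeddings" (made up for illustration)
vectors = np.array([
    [0.1, 0.3, 0.5, 0.7],
    [0.2, 0.1, 0.9, 0.4],
    [0.9, 0.8, 0.1, 0.2],
])
query = np.array([0.15, 0.25, 0.55, 0.65])

# brute-force nearest neighbour: Euclidean distance from the query to every stored vector
distances = np.linalg.norm(vectors - query, axis=1)
nearest_index = int(np.argmin(distances))
print(f"Nearest vector: index {nearest_index}, distance {distances[nearest_index]:.3f}")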

2 Similarity search and vector embeddings

OpenAI provides powerful language and embedding models that can be used for a variety of tasks, including creating embeddings and running similarity searches. In this example, we’ll create embeddings for a collection of documents using the OpenAI API and then perform a similarity search using cosine similarity.

import os
import openai
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from langchain.embeddings import OpenAIEmbeddings
import sys
sys.path.append('../..')
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

openai.api_key = os.environ['OPENAI_API_KEY']

# Define the documents
documents = [
    "The cat is on the mat.",
    "There is a cat on the mat.",
    "The dog is in the yard.",
    "There is a dog in the yard.",
]

# Initialize the OpenAIEmbeddings instance
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

# Generate embeddings for the documents
document_embeddings = embeddings.embed_documents(documents)

# Perform a similarity search for a given query
query = "A cat is sitting on a mat."
query_embedding = embeddings.embed_query(query)

# Calculate similarity scores
similarity_scores = cosine_similarity([query_embedding], document_embeddings)[0]

# Find the most similar document
most_similar_index = np.argmax(similarity_scores)
most_similar_document = documents[most_similar_index]

print(f"Most similar document to the query '{query}':")
print(most_similar_document)
Most similar document to the query 'A cat is sitting on a mat.':
The cat is on the mat.

By setting the OpenAI API key, we initialise the OpenAI API client. This enables us to generate embeddings using OpenAI’s services.

A list of documents is then defined as strings; these are the texts we want to compare for semantic similarity.

To conduct this analysis, we must convert our texts into a format that our similarity computation can work with. This is where the OpenAIEmbeddings class comes in: we use it to create an embedding for each document, producing vectors that represent each document’s semantic content.

In a similar manner, we create an embedding from our query string, which contains the text for which we want to find the most similar document.

Now that our documents and query have been transformed into embeddings, we calculate the cosine similarity between the query embedding and each document embedding. Cosine similarity is a metric for comparing how similar two vectors are; in our case, it gives us one similarity score per document with respect to the query.
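Concretely, cosine similarity is the dot product of two vectors divided by the product of their norms, which is what scikit-learn’s cosine_similarity computes for us. A minimal sketch with made-up low-dimensional vectors:

import numpy as np

def cosine_sim(a, b):
    # cosine similarity = (a · b) / (||a|| * ||b||)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# made-up low-dimensional vectors for illustration
a = np.array([1.0, 0.5, 0.0])
print(cosine_sim(a, np.array([0.9, 0.6, 0.1])))   # close to 1 -> very similar
print(cosine_sim(a, np.array([-1.0, 0.2, 0.9])))  # much lower -> less similar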

We then use these similarity scores to determine the document that most closely matches our query, locating the document with the highest score in our list and retrieving it.

Embedding vectors that are close to one another are considered similar. Sometimes they are used directly, for example to display related products in online stores. At other times they are fed into other models so that insights can be shared across related items rather than treating them as completely separate entities. This makes embeddings useful for representing things like online browsing habits, text data, and e-commerce transactions in downstream models.

3 Embedding Models

An embedding model is a machine learning model that turns discrete input into continuous vectors. In the context of natural language processing, these discrete data points can be words, sentences, or even whole documents. The resulting vectors, often referred to as embeddings, are designed to preserve the semantic meaning of the original data.

For instance, words with similar semantic meanings, such as “cat” and “kitten,” would have comparable embeddings. These embeddings are dense, which means they capture subtle meaning variations by using numerous dimensions—often hundreds.

The main advantage of embeddings is that they make it possible to reason about semantic meaning using mathematical operations. For instance, to determine how semantically similar two pieces of text are, we can compute the cosine similarity between their embeddings.

We selected the pre-trained “sentence-transformers/all-mpnet-base-v2” model for this task. This model turns sentences into embeddings: vectors that capture the semantic content of the sentences. Here, we use the model_kwargs parameter to specify that we want to run our calculations on the CPU.

from langchain.embeddings import HuggingFaceEmbeddings

# load the sentence-transformers model and run it on the CPU
model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {'device': 'cpu'}
hf = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)

documents = ["Document 1", "Document 2", "Document 3"]
doc_embeddings = hf.embed_documents(documents)

Once the model is created, we define the list of documents for which we want to generate semantic embeddings.

With the model and documents prepared, we generate the embeddings by calling the embed_documents method on our HuggingFaceEmbeddings object, passing our list of documents as the argument. The method processes each document and returns a list with one embedding per document.

These embeddings can now be used in any subsequent task, such as classification, clustering, or similarity analysis. They represent our source documents in a form that computers can work with, which lets us carry out complex semantic tasks.
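As a quick illustration, and assuming scikit-learn is available, we could reuse the hf object and documents defined above to embed a query and rank the documents by cosine similarity:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# embed a query with the same model and rank the documents by cosine similarity
query_embedding = hf.embed_query("Which document is most relevant?")
scores = cosine_similarity([query_embedding], doc_embeddings)[0]
print(documents[int(np.argmax(scores))])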

4 Cohere embeddings

Cohere is committed to democratising cutting-edge NLP technology around the world by making its multilingual language models widely available. Its Multilingual Model substantially improves multilingual applications such as search by mapping text into a semantic vector space, enabling better text-similarity comparisons. Unlike its English-language counterpart, the multilingual model uses dot product computations for similarity, which contributes to its stronger performance.

These multilingual embeddings are represented in a 768-dimensional vector space.

To use the Cohere API, you first need to obtain an API key. Here’s how to do that, step by step:

  1. Visit the Cohere Dashboard
  2. If you haven’t already, you must either log in or sign up for a Cohere account. Please note that you agree to adhere to the Terms of Use and Privacy Policy by signing up.
  3. When you’re logged in, the dashboard provides an intuitive interface to create and manage your API keys.

After obtaining the API key, we initialise an instance of LangChain’s CohereEmbeddings class with the “embed-multilingual-v2.0” model.

Next, a collection of documents in various languages is specified, and the embed_documents() method is called to create a distinct embedding for each text in the list.

We print each text together with its corresponding embedding to show the results, displaying only the first 5 dimensions of each embedding to keep things simple. Note that you need to install the cohere package first, for example with pip install cohere.

import os
from langchain.embeddings import CohereEmbeddings

# Initialize the CohereEmbeddings object (requires the cohere package: pip install cohere)
cohere_embeddings = CohereEmbeddings(
    model="embed-multilingual-v2.0",
    cohere_api_key=os.environ['COHERE_API_KEY']
)

# Define a list of texts
texts = [
    "Hello from Cohere!", 
    "مرحبًا من كوهير!", 
    "Hallo von Cohere!",  
    "Bonjour de Cohere!", 
    "¡Hola desde Cohere!", 
    "Olá do Cohere!",  
    "Ciao da Cohere!", 
    "您好,来自 Cohere!", 
    "कोहेरे से नमस्ते!"
]

# Generate embeddings for the texts
document_embeddings = cohere_embeddings.embed_documents(texts)

# Print the embeddings
for text, embedding in zip(texts, document_embeddings):
    print(f"Text: {text}")
    print(f"Embedding: {embedding[:5]}")  # print first 5 dimensions of each embedding
Text: Hello from Cohere!
Embedding: [0.23449707, 0.50097656, -0.04876709, 0.14001465, -0.1796875]
Text: مرحبًا من كوهير!
Embedding: [0.25341797, 0.30004883, 0.01083374, 0.12573242, -0.1821289]
Text: Hallo von Cohere!
Embedding: [0.10205078, 0.28320312, -0.0496521, 0.2364502, -0.0715332]
Text: Bonjour de Cohere!
Embedding: [0.15161133, 0.28222656, -0.057281494, 0.11743164, -0.044189453]
Text: ¡Hola desde Cohere!
Embedding: [0.25146484, 0.43139648, -0.08642578, 0.24682617, -0.117004395]
Text: Olá do Cohere!
Embedding: [0.18676758, 0.390625, -0.04550171, 0.14562988, -0.11230469]
Text: Ciao da Cohere!
Embedding: [0.11590576, 0.4333496, -0.025772095, 0.14538574, 0.0703125]
Text: 您好,来自 Cohere!
Embedding: [0.24645996, 0.3083496, -0.111816406, 0.26586914, -0.05102539]
Text: कोहेरे से नमस्ते!
Embedding: [0.19274902, 0.6352539, 0.031951904, 0.117370605, -0.26098633]

Cohere’s sophisticated language models pair well with LangChain, a comprehensive library for language understanding and processing. Together they enable a wider range of applications across many languages, from semantic search to customer feedback analysis and content moderation, and they make it easier to integrate Cohere’s multilingual embeddings into a developer’s workflow.

When combined with Cohere, LangChain removes the need for intricate custom pipelines, making the generation and manipulation of high-dimensional embeddings simple and effective. Once connected to Cohere’s embedding endpoint and given a list of multilingual texts, the embed_documents() method of LangChain’s CohereEmbeddings class quickly creates a distinct semantic embedding for each text.

5 Deep Lake Vector Store

High-dimensional vectors can be efficiently stored and managed using vector stores, which are data structures or databases. They make it possible to do nearest neighbour searches, similarity searches, and other vector-related operations quickly. Different data structures, such as KD trees, Vantage Point trees, or approximate nearest neighbour (ANN) methods, can be used to build vector stores.

Deep Lake functions both as a multi-modal vector store and as a data lake for deep learning. As a multi-modal vector store, it allows users to store images, audio, videos, text, and metadata in a format optimised for deep learning, and its hybrid search lets users query both the embeddings and their attributes.

Users can save data locally, on Activeloop storage, or on their personal cloud. With little boilerplate code, Deep Lake allows training TensorFlow and PyTorch models while streaming data. Additionally, it offers capabilities like distributed workloads, dataset queries, and version control utilising a straightforward Python API.
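As a rough sketch of those storage options, the dataset_path argument passed to LangChain’s DeepLake wrapper controls where the vectors live: a plain directory path keeps the dataset on local disk, while a hub:// path stores it on Activeloop. The paths and names below are placeholders.

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import DeepLake

embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

# local Deep Lake vector store, kept on disk (path is a placeholder)
local_db = DeepLake(dataset_path="./my_local_deeplake", embedding_function=embeddings)

# cloud-hosted Deep Lake vector store on Activeloop (org id and dataset name are placeholders)
cloud_db = DeepLake(dataset_path="hub://<org_id>/<dataset_name>", embedding_function=embeddings)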

Furthermore, as datasets grow in size it becomes harder and harder to keep them in local memory. With only a handful of documents being uploaded here, a local vector store would have sufficed. However, in a typical production environment, where thousands or millions of documents may be involved and accessed by numerous programs, the need for a centralised cloud dataset becomes apparent.

Let’s use Deep Lake as an example to understand how this works.

6 Creating Deep Lake Vector Store embeddings example

Deep Lake’s well-written documentation provides Jupyter notebooks for a number of examples, including the one we follow here for creating a vector store.

This example uses NLP technologies, specifically OpenAI and Deep Lake, to create and manage high-dimensional embeddings. These embeddings can be used for a number of tasks, including searching for relevant documents, moderating content, and answering questions. In this instance, we will build a retrieval-based question-answering system on top of a Deep Lake database.

First, we must import the necessary packages and make sure that the environment variables ACTIVELOOP_TOKEN and OPENAI_API_KEY have the Activeloop and OpenAI keys. Obtaining an ACTIVELOOP_TOKEN is simple; just generate one on the Activeloop page.
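A minimal way to load those keys, assuming they live in a local .env file as in the earlier example, might look like this:

import os
from dotenv import load_dotenv, find_dotenv

# read OPENAI_API_KEY and ACTIVELOOP_TOKEN from a local .env file
load_dotenv(find_dotenv())
assert os.environ.get('OPENAI_API_KEY') and os.environ.get('ACTIVELOOP_TOKEN')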

Next, the necessary modules from the langchain package are imported.

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import DeepLake
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

We then create some documents using the RecursiveCharacterTextSplitter class.

# create our documents
texts = [
    "Napoleon Bonaparte was born in 15 August 1769",
    "Louis XIV was born in 5 September 1638",
    "Lady Gaga was born in 28 March 1986",
    "Michael Jeffrey Jordan was born in 17 February 1963"
]
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.create_documents(texts)
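Because each input text is far shorter than the 1000-character chunk_size, every text becomes a single chunk. Each element of docs is a LangChain Document object, which we can inspect directly:

# each element of docs is a LangChain Document wrapping one chunk of text
print(len(docs))             # 4, one chunk per input text
print(docs[0].page_content)  # "Napoleon Bonaparte was born in 15 August 1769"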

The next step is to create a Deep Lake database and load our documents into it.

# initialize embeddings model
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

# create Deep Lake dataset
# TODO: use your organization id here. (by default, org id is your username)
my_activeloop_org_id = "pranath"
my_activeloop_dataset_name = "langchain_course_embeddings"
dataset_path = f"hub://{my_activeloop_org_id}/{my_activeloop_dataset_name}"
db = DeepLake(dataset_path=dataset_path, embedding_function=embeddings)

# add documents to our Deep Lake dataset
db.add_documents(docs)
Your Deep Lake dataset has been successfully created!
Dataset(path='hub://pranath/langchain_course_embeddings', tensors=['embedding', 'id', 'metadata', 'text'])

  tensor      htype      shape     dtype  compression
  -------    -------    -------   -------  ------- 
 embedding  embedding  (4, 1536)  float32   None   
    id        text      (4, 1)      str     None   
 metadata     json      (4, 1)      str     None   
   text       text      (4, 1)      str     None   
 
['71f0cb00-3c5a-11ee-beac-acde48001122',
 '71f0cc86-3c5a-11ee-beac-acde48001122',
 '71f0cd08-3c5a-11ee-beac-acde48001122',
 '71f0cd6c-3c5a-11ee-beac-acde48001122']

We now create a retriever from the database.

# create retriever from db
retriever = db.as_retriever()
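As an optional sanity check, the retriever can be queried directly: get_relevant_documents returns the document chunks whose embeddings are closest to the query.

# fetch the chunks whose embeddings are closest to a sample question
relevant_docs = retriever.get_relevant_documents("When was Napoleon born?")
print(relevant_docs[0].page_content)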

Finally, we create a RetrievalQA chain in LangChain and run it.

# instantiate the llm wrapper
model = ChatOpenAI(model='gpt-3.5-turbo')

# create the question-answering chain
qa_chain = RetrievalQA.from_llm(model, retriever=retriever)

# ask a question to the chain
qa_chain.run("When was Michael Jordan born?")
'Michael Jordan was born on 17 February 1963.'

This pipeline demonstrates how to leverage the power of the LangChain, OpenAI, and Deep Lake libraries and products to create a conversational AI model capable of retrieving and answering questions based on the content of a given repository.

Let’s break down each step to understand how these technologies work together.

  1. OpenAI and LangChain Integration: LangChain, a library built for chaining NLP models, is designed to work seamlessly with OpenAI’s GPT-3.5-turbo model for language understanding and generation. You’ve initialized OpenAI embeddings using OpenAIEmbeddings(), and these embeddings are later used to transform the text into a high-dimensional vector representation. This vector representation captures the semantic essence of the text and is essential for information retrieval tasks.
  2. Deep Lake: Deep Lake is a Vector Store for creating, storing, and querying vector representations (also known as embeddings) of data.
  3. Text Retrieval: Using the db.as_retriever() function, you’ve transformed the Deep Lake dataset into a retriever object. This object is designed to fetch the most relevant pieces of text from the dataset based on the semantic similarity of their embeddings.
  4. Question Answering: The final step involves setting up a RetrievalQA chain from LangChain. This chain is designed to accept a natural language question, transform it into an embedding, retrieve the most relevant document chunks from the Deep Lake dataset, and generate a natural language answer. The ChatOpenAI model, which underlies this chain, generates the answer, while the question embedding used for retrieval is produced by the same OpenAI embeddings model that indexed the documents; a sketch of a variant that also returns the supporting chunks follows this list.
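If we also want to inspect which chunks supported the answer, a variant of the chain can be built with return_source_documents=True. A minimal sketch using RetrievalQA.from_chain_type with the model and retriever defined above:

# build a QA chain that also returns the retrieved source chunks
qa_with_sources = RetrievalQA.from_chain_type(
    llm=model,
    retriever=retriever,
    return_source_documents=True,
)
result = qa_with_sources({"query": "When was Michael Jordan born?"})
print(result["result"])
print(result["source_documents"][0].page_content)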

7 Conclusion

Vector embeddings play a large part in capturing and understanding the rich contextual information in our textual data. This representation becomes all the more important when working with language models that have limited token capacity, such as GPT-3.5-turbo.

In this post we used embeddings from OpenAI as well as from Hugging Face and Cohere. Hugging Face, a renowned AI research organisation, offers transformer-based models that are very flexible and widely used, while Cohere’s multilingual language models are a great benefit in an increasingly connected world.

We’ve walked through the process of developing a conversational AI application, specifically a Q&A system built on Deep Lake, using these technologies. This application shows the potential of combining them: Hugging Face, Cohere, and OpenAI for producing high-quality embeddings; Deep Lake for keeping these embeddings in a vector store; and LangChain for chaining together complex NLP operations.

8 Acknowledgements

I’d like to express my thanks to the wonderful LangChain & Vector Databases in Production Course by Activeloop, which I completed, and acknowledge the use of some images and other materials from the course in this article.
