Creating a Voice Assistant for your Knowledge Base

In this article we are going to create a voice assistant for your knowledge base, showing how you can develop your very own assistant using state-of-the-art artificial intelligence tools.
natural-language-processing
deep-learning
langchain
activeloop
openai
Author

Pranath Fernando

Published

August 14, 2023

1 Introduction

Here we plan to build a voice assistant for a knowledge base. This post will explain how to create your own voice assistant using cutting-edge artificial intelligence tools. The voice assistant uses OpenAI’s Whisper, a sophisticated automatic speech recognition (ASR) model, to efficiently convert our spoken inputs to text. Once the speech inputs have been transcribed into text, we turn to generating voice outputs. For this we use Eleven Labs, which allows the voice assistant to reply to users in an engaging and natural voice.

The heart of the project is a reliable question-answering mechanism. The process begins by loading the vector database, a repository containing documents relevant to our likely queries. When a question is posed, the system retrieves the most relevant documents from this database and feeds them to the LLM along with the question. The LLM then generates a response based on the retrieved documents.

We want to build a voice assistant that can rapidly navigate a knowledge base and provide precise and timely answers to a user’s questions. We’re going to use the ‘JarvisBase’ repository on GitHub for this experiment.

2 Import Libs & Setup

We begin by installing the prerequisites. These are the libraries that will be required. While we strongly advise installing the most recent versions of these packages, please keep in mind that the code has only been tested with the versions listed below.

langchain==0.0.208
deeplake==3.6.5
openai==0.27.8
tiktoken==0.4.0
elevenlabs==0.2.18
streamlit==1.23.1
beautifulsoup4==4.11.2
audio-recorder-streamlit==0.0.8
streamlit-chat==0.0.2.2

For this experiment, you need to obtain several API keys and tokens. They need to be set as environment variables as described below.

import os

os.environ['OPENAI_API_KEY']='<your-openai-api-key>'
os.environ['ELEVEN_API_KEY']='<your-eleven-api-key>'
os.environ['ACTIVELOOP_TOKEN']='<your-activeloop-token>'
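
If you’d rather not hard-code the keys, a minimal sketch using the python-dotenv package (an extra dependency, not part of the requirements listed above) reads them from a local .env file and checks that all three are present:

import os
from dotenv import load_dotenv  # assumes `pip install python-dotenv`

load_dotenv()  # reads KEY=value pairs from a local .env file into os.environ

# Fail early if any of the required keys is missing
for key in ("OPENAI_API_KEY", "ELEVEN_API_KEY", "ACTIVELOOP_TOKEN"):
    if not os.environ.get(key):
        raise EnvironmentError(f"Missing environment variable: {key}")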

To access OpenAI’s services, you must first obtain credentials by signing up on their website, completing the registration process, and creating an API key from your dashboard. This enables you to leverage OpenAI’s powerful capabilities in your projects.

  1. If you don’t have an account yet, create one by going to https://platform.openai.com/. If you already have an account, skip to step 5.
  2. Fill out the registration form with your name, email address, and desired password.
  3. OpenAI will send you a confirmation email with a link. Click on the link to confirm your account.
  4. Please note that you’ll need to verify your email account and provide a phone number for verification.
  5. Log in to https://platform.openai.com/.
  6. Navigate to the API key section at https://platform.openai.com/account/api-keys.
  7. Click “Create new secret key” and give the key a recognizable name or ID.

To get the ELEVEN_API_KEY, follow these steps:

  1. Go to https://elevenlabs.io/ and click on “Sign Up” to create an account.
  2. Once you have created an account, log in and navigate to the “API” section.
  3. Click the “Create API key” button and follow the prompts to generate a new API key.
  4. Copy the API key and paste it into your code where it says “your-eleven-api-key” in the ELEVEN_API_KEY variable.

For ACTIVELOOP_TOKEN, follow these easy steps:

  1. Go to https://www.activeloop.ai/ and click on “Sign Up” to create an account.
  2. Once you have an Activeloop account, you can create tokens in the Deep Lake App (Organization Details -> API Tokens).
  3. Click the “Create API key” button and generate a new API token.
  4. Copy the token and paste it as your environment variable: ACTIVELOOP_TOKEN='your-activeloop-token'

3 Sourcing Content from Hugging Face Hub

Now that everything is in place, let’s start by gathering all the Python library articles from the Hugging Face Hub, an open platform for sharing and collaborating on machine learning. These articles will serve as our voice assistant’s knowledge base. We’ll do some web scraping to gather the knowledge documents.

Let’s look at and run the scrape.py file (python scrape.py). This script contains all of the code included in the “Sourcing Content from Hugging Face Hub” and “Embedding and Storing in Deep Lake” sections of this tutorial. You can run the files by forking or downloading the given repository.

The script begins by importing the required modules, loading environment variables, and setting the dataset path for Deep Lake, a vector database. It also creates an instance of OpenAIEmbeddings, which will be used later to embed the scraped articles:

import os
import requests
from bs4 import BeautifulSoup
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import DeepLake
from langchain.text_splitter import CharacterTextSplitter
from langchain.document_loaders import TextLoader
import re

# TODO: use your organization id here (by default, the org id is your username)
my_activeloop_org_id = "<YOUR-ACTIVELOOP-ORG-ID>"
my_activeloop_dataset_name = "langchain_course_jarvis_assistant"
dataset_path = f'hub://{my_activeloop_org_id}/{my_activeloop_dataset_name}'

embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

We begin by compiling a list of relative URLs that lead to knowledge documents hosted on the Hugging Face Hub. To accomplish this, we define the function get_documentation_urls(). We then append these relative URLs to the base URL of the Hugging Face Hub using another function, construct_full_url(), to create full URLs that we can access directly.

def get_documentation_urls():
    # List of relative URLs for Hugging Face documentation pages (only a subset, to keep scraping time reasonable)
    return [
            '/docs/huggingface_hub/guides/overview',
            '/docs/huggingface_hub/guides/download',
            '/docs/huggingface_hub/guides/upload',
            '/docs/huggingface_hub/guides/hf_file_system',
            '/docs/huggingface_hub/guides/repository',
            '/docs/huggingface_hub/guides/search',
            # You may add additional URLs here or replace all of them
    ]

def construct_full_url(base_url, relative_url):
    # Construct the full URL by appending the relative URL to the base URL
    return base_url + relative_url
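
As a quick illustration of how these two helpers fit together (using the same base URL that main() passes in later), the following sketch builds the list of full URLs:

# Build the full list of documentation URLs from the relative paths
base_url = 'https://huggingface.co'
full_urls = [construct_full_url(base_url, rel) for rel in get_documentation_urls()]
print(full_urls[0])
# -> https://huggingface.co/docs/huggingface_hub/guides/overview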

The script then aggregates the content collected from all of these URLs. The scrape_all_content() function accomplishes this by calling scrape_page_content() for each URL and extracting its text. The gathered text is then saved to a file.

def scrape_page_content(url):
    # Send a GET request to the URL and parse the HTML response using BeautifulSoup
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    # Extract the desired content from the page (in this case, the body text)
    text=soup.body.text.strip()
    # Remove non-ASCII characters
    text = re.sub(r'[\x00-\x08\x0b-\x0c\x0e-\x1f\x7f-\xff]', '', text)
    # Remove extra whitespace and newlines
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

def scrape_all_content(base_url, relative_urls, filename):
    # Loop through the list of URLs, scrape content and add it to the content list
    content = []
    for relative_url in relative_urls:
        full_url = construct_full_url(base_url, relative_url)
        scraped_content = scrape_page_content(full_url)
        content.append(scraped_content.rstrip('\n'))

    # Write the scraped content to a file
    with open(filename, 'w', encoding='utf-8') as file:
        for item in content:
            file.write("%s\n" % item)
    
    return content

4 Loading and splitting texts

We load the information from the file and break it into separate documents using the load_docs() function to prepare the collected text for embedding into our vector database. We then break those documents into smaller chunks with split_docs(). Here we see a TextLoader and a text splitter in action.

CharacterTextSplitter(chunk_size=1000, chunk_overlap=0) produces a text splitter object that divides the text into chunks based on characters. Each document in docs is split into pieces of about 1000 characters, with no overlap between them.

# Define a function to load documents from a file
def load_docs(root_dir,filename):
    # Create an empty list to hold the documents
    docs = []
    try:
        # Load the file using the TextLoader class and UTF-8 encoding
        loader = TextLoader(os.path.join(
            root_dir, filename), encoding='utf-8')
        # Split the loaded file into separate documents and add them to the list of documents
        docs.extend(loader.load_and_split())
    except Exception as e:
        # If an error occurs during loading, ignore it and return an empty list of documents
        pass
    # Return the list of documents
    return docs
  
def split_docs(docs):
    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    return text_splitter.split_documents(docs)
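
As a quick sanity check, a small sketch like the one below (assuming content.txt has already been written to the working directory by the scraper) loads and splits the file and reports how many chunks were produced:

# Load the scraped file and split it into ~1000-character chunks
docs = load_docs('./', 'content.txt')
texts = split_docs(docs)
print(f"Loaded {len(docs)} documents and split them into {len(texts)} chunks")
print(texts[0].page_content[:200])  # preview the first chunk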

5 Embedding and storing in Deep Lake

After gathering the required articles, the next step is to embed them using Deep Lake. Deep Lake is an effective tool for developing searchable vector databases. In this context, it will allow us to efficiently index and retrieve the information contained in our Python library articles.

Finally, we can begin populating our vector database.

The Deep Lake integration creates a database instance using the specified dataset path and the OpenAIEmbeddings function. OpenAIEmbeddings transforms the text chunks into embedding vectors, a format the vector database can index. The .add_documents() method then embeds the texts and saves them in the database.

# Define the main function
def main():
    base_url = 'https://huggingface.co'
    # Set the name of the file to which the scraped content will be saved
    filename='content.txt'
    # Set the root directory where the content file will be saved
    root_dir ='./'
    relative_urls = get_documentation_urls()
    # Scrape all the content from the relative URLs and save it to the content file
    content = scrape_all_content(base_url, relative_urls,filename)
    # Load the content from the file
    docs = load_docs(root_dir,filename)
    # Split the content into individual documents
    texts = split_docs(docs)
    # Create a DeepLake database with the given dataset path and embedding function
    db = DeepLake(dataset_path=dataset_path, embedding_function=embeddings)
    # Add the individual documents to the database
    db.add_documents(texts)
    # Clean up by deleting the content file
    os.remove(filename)

# Call the main function if this script is being run as the main program
if __name__ == '__main__':
    main()

All of these stages are conveniently bundled into our main function. It defines the parameters, calls the functions we’ve written, and manages the entire process of scraping content from the web and loading it into the Deep Lake database. It then deletes the content file as a final clean-up step.

6 Voice Assistant

We’re ready to use this data in our chatbot now that we’ve successfully put all of the essential data in the vector database, in this case Deep Lake by Activeloop.

Without further ado, let’s get started on the coding portion of our chatbot. The following code can be found in the directory’s chat.py file. Run streamlit run chat.py to give it a shot.

These libraries will assist us in developing web apps with Streamlit, processing audio input, generating text responses, and efficiently collecting information from Deep Lake:

import os
import openai
import streamlit as st
from audio_recorder_streamlit import audio_recorder
from elevenlabs import generate
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import DeepLake
from streamlit_chat import message

# Constants
TEMP_AUDIO_PATH = "temp_audio.wav"
AUDIO_FORMAT = "audio/wav"

# Read the API keys from environment variables
openai.api_key = os.environ.get('OPENAI_API_KEY')
eleven_api_key = os.environ.get('ELEVEN_API_KEY')

We then create an instance that points to our Deep Lake vector database.

def load_embeddings_and_database(active_loop_data_set_path):
    embeddings = OpenAIEmbeddings()
    db = DeepLake(
        dataset_path=active_loop_data_set_path,
        read_only=True,
        embedding_function=embeddings
    )
    return db
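
Before wiring the database into the question-answering chain, a quick sketch like this one (the query string is just an example, and it assumes dataset_path is set to the same 'hub://...' path used when populating the database) can confirm that the vector store loads and returns relevant chunks:

# Sanity check: load the database and run a raw similarity search
db = load_embeddings_and_database(dataset_path)
results = db.similarity_search("How do I search for models in the Hugging Face Hub?", k=2)
for doc in results:
    print(doc.page_content[:150], "...")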

Next, we prepare the code for transcribing audio.

# Transcribe audio using OpenAI Whisper API
def transcribe_audio(audio_file_path, openai_key):
    openai.api_key = openai_key
    try:
        with open(audio_file_path, "rb") as audio_file:
            response = openai.Audio.transcribe("whisper-1", audio_file)
        return response["text"]
    except Exception as e:
        print(f"Error calling Whisper API: {str(e)}")
        return None

This transcribes an audio file into text using the OpenAI Whisper API, requiring the path of the audio file and the OpenAI key as input parameters.
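
Called on its own, with a hypothetical recording.wav file, it returns the plain transcription text, or None if the call fails:

# Transcribe a local audio file and print the result
text = transcribe_audio("recording.wav", os.environ.get("OPENAI_API_KEY"))
if text:
    print(f"Transcription: {text}")
else:
    print("Transcription failed")

The record_and_transcribe_audio() function below does the same thing, but with audio captured directly in the browser.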

# Record audio using audio_recorder and transcribe using transcribe_audio
def record_and_transcribe_audio():
    audio_bytes = audio_recorder()
    transcription = None
    if audio_bytes:
        st.audio(audio_bytes, format=AUDIO_FORMAT)

        with open(TEMP_AUDIO_PATH, "wb") as f:
            f.write(audio_bytes)

        if st.button("Transcribe"):
            transcription = transcribe_audio(TEMP_AUDIO_PATH, openai.api_key)
            os.remove(TEMP_AUDIO_PATH)
            display_transcription(transcription)

    return transcription

# Display the transcription of the audio on the app
def display_transcription(transcription):
    if transcription:
        st.write(f"Transcription: {transcription}")
        with open("audio_transcription.txt", "w+") as f:
            f.write(transcription)
    else:
        st.write("Error transcribing audio.")

# Get user input from Streamlit text input field
def get_user_input(transcription):
    return st.text_input("", value=transcription if transcription else "", key="input")

This part of the code allows users to record audio directly within the application. The recorded audio is then transcribed into text using the Whisper API, and the transcribed text is displayed on the application. If any issues occur during the transcription process, an error message will be shown to the user.

# Search the database for a response based on the user's query
def search_db(user_input, db):
    print(user_input)
    retriever = db.as_retriever()
    retriever.search_kwargs['distance_metric'] = 'cos'
    retriever.search_kwargs['fetch_k'] = 100
    retriever.search_kwargs['maximal_marginal_relevance'] = True
    retriever.search_kwargs['k'] = 4
    model = ChatOpenAI(model_name='gpt-3.5-turbo')
    qa = RetrievalQA.from_llm(model, retriever=retriever, return_source_documents=True)
    return qa({'query': user_input})

This section of code searches the vector database for the most appropriate answers to the user’s inquiry. It initially turns the database into a retriever, which is a tool that searches the vector space for the closest embeddings. It then configures various search parameters, such as the metric to use when measuring distance in the embedding space, the number of documents to fetch initially, whether to use maximal marginal relevance to balance the diversity and relevance of the results, and how many results to return. The obtained results are then processed by the language model, which in this case is GPT-3.5 Turbo, to provide the most relevant response to the user’s inquiry.
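
Calling the function directly (the query here is just an example) shows the shape of what it returns: a dictionary with the generated 'result' and the 'source_documents' that were retrieved:

# Query the database and inspect the answer plus its supporting documents
output = search_db("How do I upload files to the Hugging Face Hub?", db)
print(output["result"])  # the generated answer
for doc in output["source_documents"]:
    print(doc.page_content[:100])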

7 Streamlit

Streamlit, which I have used in earlier projects, is a Python framework for developing web-based data visualisation applications. It allows you to easily construct interactive web apps for machine learning and data science projects.

Next we display the conversation history between the user and the chatbot using Streamlit’s messaging feature. The function loops through the previous messages in the conversation, displaying each user message and the matching chatbot response. It uses the Eleven Labs API to convert the chatbot’s text response to speech, giving it a voice. This MP3-formatted speech output is then played back in the Streamlit interface, adding an audio component to the conversation:

# Display conversation history using Streamlit messages
def display_conversation(history):
    for i in range(len(history["generated"])):
        message(history["past"][i], is_user=True, key=str(i) + "_user")
        message(history["generated"][i],key=str(i))
        #Voice using Eleven API
        voice= "Bella"
        text= history["generated"][i]
        audio = generate(text=text, voice=voice,api_key=eleven_api_key)
        st.audio(audio, format='audio/mp3')
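
If you want to test the ElevenLabs voice in isolation, a short sketch using the same generate helper (plus the optional play helper, which requires a local audio backend such as ffplay to be installed) might look like this:

from elevenlabs import generate, play

# Generate speech for a sample sentence with the same voice used in the app
audio = generate(
    text="Hello! I am your knowledge base assistant.",
    voice="Bella",
    api_key=eleven_api_key,
)
play(audio)  # or pass the bytes to st.audio(audio, format='audio/mp3') inside Streamlit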

8 User Interaction

After the knowledge base is set up, the next stage is user interaction. The voice assistant is designed to accept queries either in the form of voice recordings or typed text.

# Main function to run the app
def main():
    # Initialize Streamlit app with a title
    st.write("# JarvisBase 🧙")
   
    # Load embeddings and the DeepLake database
    db = load_embeddings_and_database(dataset_path)

    # Record and transcribe audio
    transcription = record_and_transcribe_audio()

    # Get user input from text input or audio transcription
    user_input = get_user_input(transcription)

    # Initialize session state for generated responses and past messages
    if "generated" not in st.session_state:
        st.session_state["generated"] = ["I am ready to help you"]
    if "past" not in st.session_state:
        st.session_state["past"] = ["Hey there!"]
        
    # Search the database for a response based on user input and update the session state
    if user_input:
        output = search_db(user_input, db)
        print(output['source_documents'])
        st.session_state.past.append(user_input)
        response = str(output["result"])
        st.session_state.generated.append(response)

    #Display conversation history using Streamlit messages
    if st.session_state["generated"]:
        display_conversation(st.session_state)

# Run the main function when the script is executed
if __name__ == "__main__":
    main()

This is the main driver of the entire application. It first sets up the Streamlit app and loads the Deep Lake vector database along with its embeddings. It then offers two means of user input: typed text or an audio recording, which is later transcribed.

The programme stores a record of previous user inputs and generated responses in the session state. When a new user input is received, it searches the database for the best possible response, which is then saved to the session state.

Finally, the software displays the whole conversation history, including both user inputs and chatbot responses, and each chatbot response is also rendered as audio using the Eleven Labs API.

You should now run the following command in your terminal:

streamlit run chat.py

When you execute your programme with the Streamlit command, it will launch a local web server and provide you with URLs where your application can be viewed in a web browser: a Network URL and an External URL.

Your application will run as long as the command in your terminal is active, and it will terminate when you stop the command (CTRL+C) or close the terminal.

9 Trying Out the UI

We have now described the key code components and are ready to test the Streamlit app!

This is how it looks.

By clicking on the microphone icon, you will activate your microphone for a few seconds and be able to ask a question. Consider the question “How do I search for models in the Hugging Face Hub?”

After a few seconds, the app will display an audio player where you can listen to your recorded audio. Then, click the “Transcribe” button.

This button will send a request to the Whisper API, which will transcribe your audio. The transcribed text will then appear beneath the chat text input.

Here we see that the Whisper API didn’t do a perfect job of transcribing “Hugging Face” and instead wrote “Huggy Face”. This isn’t ideal, but let’s see if ChatGPT can still understand the query and give an appropriate answer by leveraging the knowledge documents stored in Deep Lake.

After a few more seconds, the underlying chat will be populated with your audio transcription, along with the chatbot’s textual response and its audio version, generated by calling the ElevenLabs API. As we can see, ChatGPT was smart enough to understand that “Huggy Face” was a misspelling of “Hugging Face” and was still able to give an appropriate answer.

10 Conclusion

In this post we combined several popular generative AI tools and models, including OpenAI Whisper for speech recognition, ElevenLabs for text-to-speech, and LangChain with Deep Lake for retrieval over our knowledge base.

11 Acknowledgements

I’d like to express my thanks to the wonderful LangChain & Vector Databases in Production Course by Activeloop, which I completed, and acknowledge the use of some images and other materials from the course in this article.
