LivingDataLab - Streamlined Data Ingestion for LLMs

1 Introduction

The LangChain library provides a number of helper classes that make it easier to load and extract data from various sources. These classes simplify handling different data formats, whether the information comes from a PDF file or from online content.

The TextLoader handles plain text files, whereas the PyPDFLoader handles PDF files and offers quick access to both content and metadata. SeleniumURLLoader is designed to load HTML documents from URLs that require JavaScript rendering. Last but not least, the Google Drive Loader integrates with Google Drive, enabling the import of data from Google Docs or folders.

2 Import Libs & Setup

import os
import sys

import openai
from dotenv import load_dotenv, find_dotenv
from langchain.document_loaders import TextLoader

sys.path.append('../..')  # make modules two directories up importable
_ = load_dotenv(find_dotenv())  # read local .env file

openai.api_key = os.environ['OPENAI_API_KEY']

3 TextLoader

Import the required loaders from langchain.document_loaders. Remember to install the necessary packages first with the following command: pip install deeplake openai langchain==0.0.208 tiktoken.

You can use the encoding argument to change the encoding type (for example, encoding="ISO-8859-1").

loader = TextLoader('docs/my_file.txt')
documents = loader.load()
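
To make the encoding argument a bit more concrete, here is a minimal sketch that writes a small ISO-8859-1 file, shows that reading it as UTF-8 fails, and then reads it with the matching encoding. The file name and sample text are made up for illustration. The TextLoader call only runs if LangChain can be imported; otherwise the plain file read is used as a fallback.

```python
import tempfile
from pathlib import Path

# Hypothetical sample file, written in ISO-8859-1 (Latin-1) rather than UTF-8.
sample_path = Path(tempfile.mkdtemp()) / "latin1_sample.txt"
sample_path.write_bytes("café crème".encode("ISO-8859-1"))

# Reading with the wrong encoding fails: the accented bytes are not valid UTF-8.
try:
    sample_path.read_text(encoding="utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Reading with the matching encoding recovers the original text.
plain_text = sample_path.read_text(encoding="ISO-8859-1")

# Use TextLoader when LangChain is installed; fall back to the plain read otherwise.
try:
    from langchain.document_loaders import TextLoader
except ImportError:
    TextLoader = None

if TextLoader is not None:
    documents = TextLoader(str(sample_path), encoding="ISO-8859-1").load()
    loaded_text = documents[0].page_content
else:
    loaded_text = plain_text

print(utf8_ok, loaded_text)
```

If a file's encoding is unknown, trying the load inside a try/except like this is a cheap way to detect a mismatch before the text reaches the rest of the pipeline.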

4 PyPDFLoader (PDF)

PyPDFLoader and PDFMinerLoader are two loaders offered by the LangChain library for loading and processing PDF files. We mostly concentrate on the former, which loads a PDF file into an array of documents. Each document in the array comprises the page content and metadata with the page number. Install the required package (pypdf) first using pip, the Python package manager.

Here is some code that uses PyPDFLoader to load and split a PDF file:

from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("docs/MachineLearning-Lecture01.pdf")
pages = loader.load_and_split()

print(pages[0])

page_content='MachineLearning-Lecture01  \nInstructor (Andrew Ng):  Okay. Good morning. Welcome to CS229, the machine \nlearning class. So what I wanna do today is ju st spend a little time going over the logistics \nof the class, and then we\'ll start to  talk a bit about machine learning.  \nBy way of introduction, my name\'s  Andrew Ng and I\'ll be instru ctor for this class. And so \nI personally work in machine learning, and I\' ve worked on it for about 15 years now, and \nI actually think that machine learning is th e most exciting field of all the computer \nsciences. So I\'m actually always excited about  teaching this class. Sometimes I actually \nthink that machine learning is not only the most exciting thin g in computer science, but \nthe most exciting thing in all of human e ndeavor, so maybe a little bias there.  \nI also want to introduce the TAs, who are all graduate students doing research in or \nrelated to the machine learni ng and all aspects of machin e learning. Paul Baumstarck \nworks in machine learning and computer vision.  Catie Chang is actually a neuroscientist \nwho applies machine learning algorithms to try to understand the human brain. Tom Do \nis another PhD student, works in computa tional biology and in sort of the basic \nfundamentals of human learning. Zico Kolter is  the head TA — he\'s head TA two years \nin a row now — works in machine learning a nd applies them to a bunch of robots. And \nDaniel Ramage is — I guess he\'s not here  — Daniel applies l earning algorithms to \nproblems in natural language processing.  \nSo you\'ll get to know the TAs and me much be tter throughout this quarter, but just from \nthe sorts of things the TA\'s do, I hope you can  already tell that machine learning is a \nhighly interdisciplinary topic in which just the TAs find l earning algorithms to problems \nin computer vision and biology and robots a nd language. 
And machine learning is one of \nthose things that has and is having a large impact on many applications.  \nSo just in my own daily work, I actually frequently end up talking to people like \nhelicopter pilots to biologists to people in  computer systems or databases to economists \nand sort of also an unending stream of  people from industry coming to Stanford \ninterested in applying machine learni ng methods to their own problems.  \nSo yeah, this is fun. A couple of weeks ago, a student actually forwar ded to me an article \nin "Computer World" about the 12 IT skills th at employers can\'t say no to. So it\'s about \nsort of the 12 most desirabl e skills in all of IT and all of information technology, and \ntopping the list was actually machine lear ning. So I think this is a good time to be \nlearning this stuff and learning algorithms and having a large impact on many segments \nof science and industry.  \nI\'m actually curious about something. Learni ng algorithms is one of the things that \ntouches many areas of science and industrie s, and I\'m just kind of curious. How many \npeople here are computer science majors, are in the computer science department? Okay. \nAbout half of you. How many people are from  EE? Oh, okay, maybe about a fifth. How' metadata={'source': 'docs/MachineLearning-Lecture01.pdf', 'page': 0}

Benefits of using PyPDFLoader include simple, clear usage and easy access to page content and structured metadata, such as page numbers. Its main drawback is that its text extraction can be less effective than PDFMinerLoader's.
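
Since each document carries both page_content and metadata, a small helper can turn the loaded pages into a quick overview. The sketch below is only an illustration: summarize_pages and the stand-in objects are hypothetical names, and the stand-ins simply mimic the two attributes seen in the printed output above so the code runs without a PDF.

```python
from types import SimpleNamespace


def summarize_pages(pages, preview_chars=30):
    """Return one (page_number, char_count, preview) tuple per document."""
    summary = []
    for doc in pages:
        page_number = doc.metadata.get("page")  # 0 for the first page in the output above
        text = doc.page_content
        # Collapse newlines and repeated spaces so the preview fits on one line.
        preview = " ".join(text.split())[:preview_chars]
        summary.append((page_number, len(text), preview))
    return summary


# Stand-in objects (assumption): real input would be the `pages` list
# returned by loader.load_and_split().
fake_pages = [
    SimpleNamespace(
        page_content="MachineLearning-Lecture01  \nInstructor (Andrew Ng)",
        metadata={"source": "docs/MachineLearning-Lecture01.pdf", "page": 0},
    ),
    SimpleNamespace(
        page_content="Second chunk of text",
        metadata={"source": "docs/MachineLearning-Lecture01.pdf", "page": 1},
    ),
]

rows = summarize_pages(fake_pages)
for row in rows:
    print(row)
```

With a real PDF, calling summarize_pages(pages) on the loader output should give a compact view of which page each chunk came from.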

5 SeleniumURLLoader (URL)

The SeleniumURLLoader class provides a powerful yet simple way to load HTML documents from a list of URLs that require JavaScript rendering. The required packages are installed with pip, the Python package manager, and the example below shows how to use the class. The code was tested with versions 0.7.7 and 4.10.0 of the unstructured and selenium libraries, respectively. However, you're welcome to install the most recent versions.

Instantiate the SeleniumURLLoader class by providing a list of URLs to load, for example:

from langchain.document_loaders import SeleniumURLLoader

urls = [
    "https://www.youtube.com/watch?v=TFa539R09EQ&t=139s",
    "https://www.youtube.com/watch?v=6Zv6A_9urh4&t=112s"
]

loader = SeleniumURLLoader(urls=urls)
data = loader.load()

print(data[0])

page_content="Make this year more efficient with monday.com\n\nInfo\n\nShopping\n\nWatch Later\n\nShare\n\nCopy link\n\nTap to unmute\n\nIf playback doesn't begin shortly, try restarting your device.\n\nYou're signed out\n\nVideos that you watch may be added to the TV's watch history and influence TV recommendations. To avoid this, cancel and sign in to YouTube on your computer.\n\nSwitch camera\n\nShare\n\nAn error occurred while retrieving sharing information. Please try again later.\n\n0:00\n\n0:00 / 0:47\n\nWatch full video\n\n•\n\nScroll for details\n\nAbout\n\nPress\n\nCopyright\n\nContact us\n\nCreator\n\nAdvertise\n\nDevelopers\n\nTerms\n\nPrivacy\n\nPolicy & Safety\n\nHow YouTube works\n\nTest new features\n\n© 2023 Google LLC" metadata={'source': 'https://www.youtube.com/watch?v=TFa539R09EQ&t=139s'}

The SeleniumURLLoader class includes the following attributes:

urls (List[str]): List of URLs to load.
continue_on_failure (bool, default=True): Continues loading other URLs on failure if True.
browser (str, default="chrome"): Browser selection, either "chrome" or "firefox".
executable_path (Optional[str], default=None): Browser executable path.
headless (bool, default=True): Browser runs in headless mode if True.
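
To illustrate what a continue_on_failure flag means in practice, here is a minimal sketch of the general pattern. It is not the library's implementation: load_all, fetch, and fake_fetch are hypothetical names, and the fetch function is a stand-in for whatever actually retrieves a page.

```python
def load_all(urls, fetch, continue_on_failure=True):
    """Fetch every URL; skip failures if continue_on_failure is True, else re-raise."""
    results, failed = [], []
    for url in urls:
        try:
            results.append((url, fetch(url)))
        except Exception:
            if not continue_on_failure:
                raise  # stop at the first failing URL
            failed.append(url)  # remember the failure and move on
    return results, failed


def fake_fetch(url):
    """Hypothetical fetcher that fails for URLs containing 'bad'."""
    if "bad" in url:
        raise RuntimeError(f"could not load {url}")
    return f"content of {url}"


urls = ["https://example.com/ok", "https://example.com/bad"]
results, failed = load_all(urls, fake_fetch)
print(results)
print(failed)
```

With the flag set to False, the same call raises on the failing URL instead of skipping it, which is usually preferable when a missing page should halt the whole ingestion run.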

Customize these attributes when initializing the SeleniumURLLoader instance. For example, use Firefox instead of Chrome by setting browser to "firefox":

loader = SeleniumURLLoader(urls=urls, browser="firefox")

Upon invoking the load() method, a list of Document instances containing the loaded content is returned. Each Document instance includes a page_content attribute with the extracted text from the HTML and a metadata attribute containing the source URL.
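
Because every returned Document records its source URL in metadata, the results can be indexed by URL. The sketch below assumes the Document shape described above; load_rendered_pages and index_by_source are hypothetical helper names, and the sample objects are stand-ins so the indexing step can run without a browser.

```python
from types import SimpleNamespace


def load_rendered_pages(urls, browser="chrome"):
    """Load JavaScript-rendered pages.

    Requires langchain, selenium, unstructured, and a browser driver,
    so the import is kept inside the function.
    """
    from langchain.document_loaders import SeleniumURLLoader

    loader = SeleniumURLLoader(urls=urls, browser=browser, headless=True)
    return loader.load()


def index_by_source(docs):
    """Map each document's source URL to its extracted text."""
    return {doc.metadata["source"]: doc.page_content for doc in docs}


# Stand-in documents (assumption): real input would come from load_rendered_pages(urls).
sample_docs = [
    SimpleNamespace(page_content="First page text", metadata={"source": "https://example.com/a"}),
    SimpleNamespace(page_content="Second page text", metadata={"source": "https://example.com/b"}),
]

pages_by_url = index_by_source(sample_docs)
print(pages_by_url["https://example.com/a"])
```

Keeping the loader call in its own function means the indexing logic can be checked separately from the slower browser-based loading.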

Bear in mind that SeleniumURLLoader may be slower than other loaders, since it has to start a browser and render each page. Nevertheless, it is the right choice for pages that require JavaScript rendering.

6 Conclusion

In conclusion, loaders such as TextLoader, PyPDFLoader, and SeleniumURLLoader make data ingestion considerably more efficient. Each of these tools is designed to work with particular file types and data sources, which keeps data handling effective and thorough.

7 Acknowledgements

I'd like to express my thanks to the wonderful LangChain & Vector Databases in Production Course by Activeloop, which I completed, and acknowledge the use of some images and other materials from the course in this article.
