Text Splitters for Retrieval and Large Language Models

Giving documents to the LLM as information sources and asking it to produce an answer based on the information it extracts from them is one strategy for reducing hallucinations. In this article we will look at how text splitters can help with this.
natural-language-processing
deep-learning
langchain
activeloop
openai
retrievers
Author

Pranath Fernando

Published

August 9, 2023

1 Introduction

Large Language Models are known for producing text that looks and reads as if it were written by a human, but they may also “hallucinate” and produce information that is inaccurate or illogical. Interestingly, this inclination can be helpful in creative work, because it produces a variety of original and inventive ideas, opening up fresh viewpoints and supporting the creative process. It becomes a problem, though, in circumstances where accuracy is crucial, such as code reviews, insurance-related tasks, or answers to research questions.

Giving documents to the LLM as information sources, and asking it to produce an answer based on the information it extracts from them, is one strategy for reducing hallucinations. Users can then check the generated answer against the source document, which lessens the risk of hallucinations.

Let’s discuss the pros and cons of this approach:

Pros:

  1. Reduced hallucination: By providing a source document, the LLM is more likely to generate content based on the given information, reducing the chances of creating false or irrelevant information.
  2. Increased accuracy: With a reliable source document, the LLM can generate more accurate answers, especially in use cases where accuracy is crucial.
  3. Verifiable information: Users can cross-check the generated content with the source document to ensure the information is accurate and reliable.

Cons:

  1. Limited scope: Relying on a single document may limit the scope of the generated content, as the LLM will only have access to the information provided in the document.
  2. Dependence on document quality: The accuracy of the generated content heavily depends on the quality and reliability of the source document. The LLM will likely generate incorrect or misleading content if the document contains inaccurate or biased information.
  3. Inability to eliminate hallucination completely: Although providing a document as a base reduces the chances of hallucination, it does not guarantee that the LLM will never generate false or irrelevant information.

Another issue is that complete documents usually cannot be fed to LLMs because of limited prompt sizes. It is therefore essential to break documents into smaller pieces, and Text Splitters are quite helpful in doing so. Text splitters make it easier for language models to process huge text documents by dividing them into smaller, more manageable portions.

As smaller segments may be more likely to match a query, using a Text Splitter can also enhance the performance of vector store searches. It can be helpful to experiment with various chunk sizes and overlaps in order to adapt the results to your particular requirements.

2 Import Libs & Setup

from langchain.document_loaders import PyPDFLoader

3 Customizing a Text Splitter

It’s critical to divide large texts into manageable sections when processing them. Because semantically connected text fragments should be kept together, this initially straightforward task can quickly become complicated. Depending on the type of text, “semantically related” may mean several things. In this post we’ll look at a number of approaches to doing this.

Text splitters often go through the following steps:

  1. Divide the text into small, semantically meaningful chunks (often sentences).
  2. Combine these small chunks into a larger one until a specific size is reached (determined by a particular function).
  3. Once the desired size is attained, separate that chunk as an individual piece of text, then start forming a new chunk with some overlap to maintain context between segments.

Consequently, there are two primary dimensions to consider when customizing your text splitter:

  • The method used to split the text
  • The approach for measuring chunk size
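As a rough sketch of these two dimensions, the snippet below (using the CharacterTextSplitter covered in the next section, plus a hypothetical word-count length function) controls both how the text is split and how chunk size is measured:

from langchain.text_splitter import CharacterTextSplitter

# Hypothetical length function: measure chunk size in words instead of characters
def word_count(text: str) -> int:
    return len(text.split())

text_splitter = CharacterTextSplitter(
    separator="\n\n",            # the method used to split the text
    chunk_size=200,              # maximum chunk size, as measured by length_function
    chunk_overlap=20,            # overlap to maintain context between segments
    length_function=word_count,  # the approach for measuring chunk size
)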

4 Character Text Splitter

This kind of splitter is useful in a variety of situations where lengthy passages of text need to be broken up into more manageable, semantically sound portions; for example, you might utilise it to divide a long article into digestible pieces for simpler processing or analysis. To balance the trade-off between producing manageable chunks and maintaining semantic context between them, the splitter lets you customise the chunking process along two axes: chunk size and chunk overlap.

Use the PyPDFLoader class to load the PDF file.

loader = PyPDFLoader("The One Page Linux Manual.pdf")
pages = loader.load_and_split()

By loading the document, we can ask more specific questions related to its subject, which helps minimize the likelihood of LLM hallucinations and ensures more accurate, context-driven responses.

from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=20)
texts = text_splitter.split_documents(pages)

print(texts[0])
page_content='THE ONE     PAGE LINUX MANUALA summary of useful Linux commands\nVersion 3.0 May 1999 squadron@powerup.com.au\nStarting & Stopping\nshutdown -h now Shutdown the system now and do not\nreboot\nhalt Stop all processes - same as above\nshutdown -r 5 Shutdown the system in 5 minutes and\nreboot\nshutdown -r now Shutdown the system now and reboot\nreboot Stop all processes and then reboot - same\nas above\nstartx Start the X system\nAccessing & mounting file systems\nmount -t iso9660 /dev/cdrom\n/mnt/cdromMount the device cdrom\nand call it cdrom under the\n/mnt directory\nmount -t msdos /dev/hdd\n/mnt/ddriveMount hard disk “d” as a\nmsdos file system and call\nit ddrive under the /mnt\ndirectory\nmount -t vfat /dev/hda1\n/mnt/cdriveMount hard disk “a” as a\nVFAT file system and call it\ncdrive under the /mnt\ndirectory\numount /mnt/cdrom Unmount the cdrom\nFinding files and text within files\nfind / -name  fname Starting with the root directory, look\nfor the file called fname\nfind / -name ”*fname* ” Starting with the root directory, look\nfor the file containing the string fname\nlocate missingfilename Find a file called missingfilename\nusing the locate command - this\nassumes you have already used the\ncommand updatedb (see next)\nupdatedb Create or update the database of files\non all file systems attached to the linux\nroot directory\nwhich missingfilename Show the subdirectory containing the\nexecutable file  called missingfilename\ngrep textstringtofind\n/dirStarting with the directory called dir ,\nlook for and list all files containing\ntextstringtofind\nThe X Window System\nxvidtune Run the X graphics tuning utility\nXF86Setup Run the X configuration menu with\nautomatic probing of graphics cards\nXconfigurator Run another X configuration menu with\nautomatic probing of graphics cards\nxf86config Run a text based X configuration menu\nMoving, copying, deleting & viewing files\nls -l List files in current directory using\nlong format\nls -F List files in current directory and\nindicate the file type\nls -laC List all files in current directory in\nlong format and display in columnsrm name Remove a file or directory called\nname\nrm -rf name Kill off an entire directory and all it’s\nincludes files and subdirectories\ncp filename\n/home/dirnameCopy the file called filename to the\n/home/dirname directory\nmv filename\n/home/dirnameMove the file called filename to the\n/home/dirname directory\ncat filetoview Display the file called filetoview\nman -k keyword Display man pages containing\nkeyword\nmore filetoview Display the file called filetoview one\npage at a time, proceed to next page\nusing the spacebar\nhead filetoview Display the first 10 lines of the file\ncalled filetoview\nhead -20 filetoview Display the first 20 lines of the file\ncalled filetoview\ntail filetoview Display the last 10 lines of the file\ncalled filetoview\ntail -20 filetoview Display the last 20 lines of the file\ncalled filetoview\nInstalling software for Linux\nrpm -ihv name.rpm Install the rpm package called name\nrpm -Uhv name.rpm Upgrade the rpm package called\nname\nrpm -e package Delete the rpm package called\npackage\nrpm -l package List the files in the package called\npackage\nrpm -ql package List the files and state the installed\nversion of the package called\npackage\nrpm -i --force package Reinstall the rpm package called\nname having deleted parts of it (not\ndeleting using rpm -e)\ntar -zxvf archive.tar.gz or\ntar -zxvf archive.tgzDecompress the files contained in\nthe zipped and 
tarred archive called\narchive\n./configure Execute the script preparing the\ninstalled files for compiling\nUser Administration\nadduser accountname Create a new user call accountname\npasswd accountname Give accountname a new password\nsu Log in as superuser from current login\nexit Stop being superuser and revert to\nnormal user\nLittle known tips and tricks\nifconfig List ip addresses for all devices on\nthe machine\napropos subject List manual pages for subject\nusermount Executes graphical application for\nmounting and unmounting file\nsystems' metadata={'source': 'The One Page Linux Manual.pdf', 'page': 0}

There is no single method for chunking text that works in all situations; what works in one case may not work in another. Determining the ideal chunk size for your project takes a few steps. First, purge any unnecessary information from your data, such as HTML elements scraped from websites. Then select a few candidate chunk sizes to test; the type of data you’re working with and the model you’re using will determine the optimal size. Finally, run various queries and compare the results to see how well each size performs. You may have to try several sizes before settling on the best one, and even if it takes some time, the best outcomes are worthwhile.
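As a minimal sketch of that experiment (the candidate sizes below are arbitrary), you could split the same pages with several chunk sizes and compare how the document is divided before running your test queries:

# Compare a few arbitrary candidate chunk sizes before running test queries
for size in [500, 1000, 2000]:
    trial_splitter = CharacterTextSplitter(chunk_size=size, chunk_overlap=20)
    trial_chunks = trial_splitter.split_documents(pages)
    print(f"chunk_size={size} -> {len(trial_chunks)} chunks")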

print (f"You have {len(texts)} documents")
You have 2 documents
print ("Preview:")
print (texts[0].page_content)
Preview:
THE ONE     PAGE LINUX MANUALA summary of useful Linux commands
Version 3.0 May 1999 squadron@powerup.com.au
Starting & Stopping
shutdown -h now Shutdown the system now and do not
reboot
halt Stop all processes - same as above
shutdown -r 5 Shutdown the system in 5 minutes and
reboot
shutdown -r now Shutdown the system now and reboot
reboot Stop all processes and then reboot - same
as above
startx Start the X system
Accessing & mounting file systems
mount -t iso9660 /dev/cdrom
/mnt/cdromMount the device cdrom
and call it cdrom under the
/mnt directory
mount -t msdos /dev/hdd
/mnt/ddriveMount hard disk “d” as a
msdos file system and call
it ddrive under the /mnt
directory
mount -t vfat /dev/hda1
/mnt/cdriveMount hard disk “a” as a
VFAT file system and call it
cdrive under the /mnt
directory
umount /mnt/cdrom Unmount the cdrom
Finding files and text within files
find / -name  fname Starting with the root directory, look
for the file called fname
find / -name ”*fname* ” Starting with the root directory, look
for the file containing the string fname
locate missingfilename Find a file called missingfilename
using the locate command - this
assumes you have already used the
command updatedb (see next)
updatedb Create or update the database of files
on all file systems attached to the linux
root directory
which missingfilename Show the subdirectory containing the
executable file  called missingfilename
grep textstringtofind
/dirStarting with the directory called dir ,
look for and list all files containing
textstringtofind
The X Window System
xvidtune Run the X graphics tuning utility
XF86Setup Run the X configuration menu with
automatic probing of graphics cards
Xconfigurator Run another X configuration menu with
automatic probing of graphics cards
xf86config Run a text based X configuration menu
Moving, copying, deleting & viewing files
ls -l List files in current directory using
long format
ls -F List files in current directory and
indicate the file type
ls -laC List all files in current directory in
long format and display in columnsrm name Remove a file or directory called
name
rm -rf name Kill off an entire directory and all it’s
includes files and subdirectories
cp filename
/home/dirnameCopy the file called filename to the
/home/dirname directory
mv filename
/home/dirnameMove the file called filename to the
/home/dirname directory
cat filetoview Display the file called filetoview
man -k keyword Display man pages containing
keyword
more filetoview Display the file called filetoview one
page at a time, proceed to next page
using the spacebar
head filetoview Display the first 10 lines of the file
called filetoview
head -20 filetoview Display the first 20 lines of the file
called filetoview
tail filetoview Display the last 10 lines of the file
called filetoview
tail -20 filetoview Display the last 20 lines of the file
called filetoview
Installing software for Linux
rpm -ihv name.rpm Install the rpm package called name
rpm -Uhv name.rpm Upgrade the rpm package called
name
rpm -e package Delete the rpm package called
package
rpm -l package List the files in the package called
package
rpm -ql package List the files and state the installed
version of the package called
package
rpm -i --force package Reinstall the rpm package called
name having deleted parts of it (not
deleting using rpm -e)
tar -zxvf archive.tar.gz or
tar -zxvf archive.tgzDecompress the files contained in
the zipped and tarred archive called
archive
./configure Execute the script preparing the
installed files for compiling
User Administration
adduser accountname Create a new user call accountname
passwd accountname Give accountname a new password
su Log in as superuser from current login
exit Stop being superuser and revert to
normal user
Little known tips and tricks
ifconfig List ip addresses for all devices on
the machine
apropos subject List manual pages for subject
usermount Executes graphical application for
mounting and unmounting file
systems


5 Recursive Character Text Splitter

The Recursive Character Text Splitter divides text into sections based on a provided list of separators. It attempts to split the text on each separator in the list, in order, until the resulting chunks are small enough. Since paragraphs, sentences, and words are typically the most semantically linked units of text, the default list of separators is ["\n\n", "\n", " ", ""], which aims to keep them together for as long as possible. This means the class first splits the text on two newline characters.

If the resulting chunks are still larger than the desired chunk size, the output is then split on a single newline character, followed by a space character, and so on, until the desired chunk size is reached.

You can make an instance of the RecursiveCharacterTextSplitter and supply the following parameters to use it:

  • chunk_size: The maximum size of the chunks, as measured by the length_function.
  • chunk_overlap: The maximum overlap between chunks, used to maintain continuity between them.
  • length_function: The function used to calculate the length of a chunk. By default it is set to len, which counts the number of characters in a chunk, but you can also pass a token counter or any other function that measures chunk length according to your specific requirements.

In some circumstances, such as when working with language models that have token limits, it may be advantageous to use a token counter rather than the default len function. For example, OpenAI models such as gpt-3.5-turbo accept a maximum of 4,096 tokens per request, so counting tokens rather than characters gives you better control over the size of your requests.

Here is a demonstration of RecursiveCharacterTextSplitter in use.

!echo "Helllo, my name is Ala\n Hello again\n\ntesting newline." > LLM.txt
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load a long document
with open('LLM.txt', encoding= 'unicode_escape') as f:
    sample_text = f.read()
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=50,
    chunk_overlap=10,
    length_function=len,
)
texts = text_splitter.create_documents([sample_text])
print(texts)
[Document(page_content='Helllo, my name is Ala\n Hello again', metadata={}), Document(page_content='testing newline.', metadata={})]

We constructed a RecursiveCharacterTextSplitter instance with the required parameters. The default list of separators is ["\n\n", "\n", " ", ""].

The text is first split on two newline characters (\n\n). Then, because the chunks are still bigger than the required chunk size (50), the class tries to split the output on a single newline character (\n).

In this example, the RecursiveCharacterTextSplitter splits the text into chunks with a maximum size of 50 characters and an overlap of 10 characters. The output is a list of documents containing the split text.

To use a token counter, you can build a custom function that determines the number of tokens in a given text and supply it as the length_function parameter. This ensures that the text splitter measures chunk length by the number of tokens rather than the number of characters.
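For instance, here is a minimal sketch of such a token counter built with the tiktoken package, assuming the cl100k_base encoding (used by gpt-3.5-turbo) as an example:

import tiktoken
from langchain.text_splitter import RecursiveCharacterTextSplitter

# cl100k_base is the encoding used by gpt-3.5-turbo and gpt-4
encoding = tiktoken.get_encoding("cl100k_base")

def token_count(text: str) -> int:
    return len(encoding.encode(text))

token_splitter = RecursiveCharacterTextSplitter(
    chunk_size=50,        # now measured in tokens rather than characters
    chunk_overlap=10,
    length_function=token_count,
)
token_texts = token_splitter.create_documents([sample_text])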

6 NLTK Text Splitter

The NLTKTextSplitter in LangChain is an implementation of a text splitter that uses the Natural Language Toolkit (NLTK) library’s sentence tokenizer to split text. The objective is to break up lengthy texts into manageable pieces while maintaining the natural order of sentences and paragraphs.

import nltk
nltk.download('punkt')
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
True
# Load a long document
with open('LLM.txt', encoding= 'unicode_escape') as f:
    sample_text = f.read()

from langchain.text_splitter import NLTKTextSplitter
text_splitter = NLTKTextSplitter(chunk_size=500)


texts = text_splitter.split_text(sample_text)
print(texts)
['Helllo, my name is Ala\n Hello again\n\ntesting newline.']

Note that the NLTKTextSplitter is not specifically designed to handle word segmentation in English sentences written without spaces. Alternative libraries such as pyenchant or wordsegment can be used for that, as sketched below.
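As a brief illustration, the wordsegment package (a separate library, installed with pip install wordsegment, and not part of LangChain) can recover word boundaries from unspaced text:

from wordsegment import load, segment

load()                        # load the word-frequency data the segmenter relies on
print(segment("helloworld"))  # ['hello', 'world']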

7 SpacyTextSplitter

The SpacyTextSplitter helps break large text documents into chunks of a predefined size, which is useful for controlling massive text inputs. It is worth highlighting that the SpacyTextSplitter is an alternative to NLTK-based sentence splitting. You create a SpacyTextSplitter object by specifying the chunk_size option, which is measured by a length function passed to it, defaulting to the number of characters.
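Note that SpacyTextSplitter relies on a spaCy pipeline (by default the small English model), so you may need to download that model first:

# Download spaCy's small English model if it is not already installed
!python -m spacy download en_core_web_sm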

from langchain.text_splitter import SpacyTextSplitter


# Load a long document
with open('LLM.txt', encoding= 'unicode_escape') as f:
    sample_text = f.read()

# Instantiate the SpacyTextSplitter with the desired chunk size
text_splitter = SpacyTextSplitter(chunk_size=500, chunk_overlap=20)


# Split the text using SpacyTextSplitter
texts = text_splitter.split_text(sample_text)

# Print the first chunk
print(texts)
['Helllo, my name is Ala\n \n\nHello again\n\ntesting newline.']

8 MarkdownTextSplitter

The MarkdownTextSplitter is intended to split Markdown-formatted text along its headers, code blocks, and dividers. It is implemented as a straightforward subclass of RecursiveCharacterTextSplitter with separators specific to Markdown syntax. The chunk size is calculated by the supplied length function, which defaults to the number of characters, and you specify it by passing an integer value when initialising an instance.

from langchain.text_splitter import MarkdownTextSplitter
markdown_text = """
#

# Welcome to My Blog!

## Introduction
Hello everyone! My name is **John Doe** and I am a _software developer_. I specialize in Python, Java, and JavaScript.

Here's a list of my favorite programming languages:

1. Python
2. JavaScript
3. Java

You can check out some of my projects on [GitHub](https://github.com).

## About this Blog
In this blog, I will share my journey as a software developer. I'll post tutorials, my thoughts on the latest technology trends, and occasional book reviews.

Here's a small piece of Python code to say hello:

\``` python
def say_hello(name):
    print(f"Hello, {name}!")

say_hello("John")
\```

Stay tuned for more updates!

## Contact Me
Feel free to reach out to me on [Twitter](https://twitter.com) or send me an email at johndoe@email.com.

"""
markdown_splitter = MarkdownTextSplitter(chunk_size=100, chunk_overlap=0)
docs = markdown_splitter.create_documents([markdown_text])
print(docs)
[Document(page_content='# \n\n# Welcome to My Blog!', metadata={}), Document(page_content='Introduction', metadata={}), Document(page_content='Hello everyone! My name is **John Doe** and I am a _software developer_. I specialize in Python,', metadata={}), Document(page_content='Java, and JavaScript.', metadata={}), Document(page_content="Here's a list of my favorite programming languages:\n\n1. Python\n2. JavaScript\n3. Java", metadata={}), Document(page_content='You can check out some of my projects on [GitHub](https://github.com).', metadata={}), Document(page_content='About this Blog', metadata={}), Document(page_content="In this blog, I will share my journey as a software developer. I'll post tutorials, my thoughts on", metadata={}), Document(page_content='the latest technology trends, and occasional book reviews.', metadata={}), Document(page_content="Here's a small piece of Python code to say hello:", metadata={}), Document(page_content='\\``` python\ndef say_hello(name):\n    print(f"Hello, {name}!")\n\nsay_hello("John")\n\\', metadata={}), Document(page_content='Stay tuned for more updates!', metadata={}), Document(page_content='Contact Me', metadata={}), Document(page_content='Feel free to reach out to me on [Twitter](https://twitter.com) or send me an email at', metadata={}), Document(page_content='johndoe@email.com.', metadata={})]

The MarkdownTextSplitter provides a useful way to split text while keeping the organisation and meaning that Markdown formatting affords. By recognising Markdown syntax (such as headings, lists, and code blocks), it can intelligently divide the content into parts that are more semantically cohesive. This splitter is especially helpful when handling lengthy Markdown documents.

9 TokenTextSplitter

The key benefit of utilising TokenTextSplitter over other text splitters, such as CharacterTextSplitter, is that it respects token boundaries, so a chunk never splits a token in the middle. When working with language models and embeddings, this helps preserve the text’s semantic integrity.

This kind of splitter breaks raw text strings into manageable portions by first encoding the text as BPE (Byte Pair Encoding) tokens and then splitting those tokens into chunks; the tokens in each chunk are then reassembled into text. Using this class requires the tiktoken Python package:

!pip install -q tiktoken

from langchain.text_splitter import TokenTextSplitter

# Load a long document
with open('LLM.txt', encoding= 'unicode_escape') as f:
    sample_text = f.read()

# Initialize the TokenTextSplitter with desired chunk size and overlap
text_splitter = TokenTextSplitter(chunk_size=100, chunk_overlap=50)

# Split into smaller chunks
texts = text_splitter.split_text(sample_text)
print(texts[0])
Helllo, my name is Ala
 Hello again

testing newline.

The chunk_overlap parameter specifies the number of overlapping tokens between consecutive chunks, and chunk_size specifies the maximum number of BPE tokens in each chunk. You can fine-tune the granularity of the text pieces by changing these parameters.
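To verify that the chunks respect these limits, a small sketch using tiktoken’s gpt2 encoding (the TokenTextSplitter default) can count the tokens in each chunk:

import tiktoken

# gpt2 is the default encoding used by TokenTextSplitter
enc = tiktoken.get_encoding("gpt2")
for chunk in texts:
    print(len(enc.encode(chunk)), "tokens")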

A potential disadvantage of TokenTextSplitter is the additional computation required when converting text to BPE tokens and back. If you need a quicker and simpler segmentation method, CharacterTextSplitter, which splits text based on character count, remains an option.

10 Recap

Text splitters are crucial for handling lengthy text, increasing the efficiency of language model processing, and improving vector store search results. Customising a text splitter involves choosing the splitting technique and determining the chunk size.

CharacterTextSplitter helps strike a balance between producing digestible chunks and preserving semantic context; experimenting with various chunk sizes and overlaps tailors the results to particular use cases.

RecursiveCharacterTextSplitter places a strong emphasis on maintaining semantic relationships while offering adjustable chunk sizes and overlaps.

NLTKTextSplitter uses the Natural Language Toolkit library to split text more precisely. SpacyTextSplitter uses the well-known spaCy library to divide texts based on linguistic features. MarkdownTextSplitter is designed specifically for Markdown-formatted texts, ensuring content is divided in a way that respects the syntax. TokenTextSplitter splits on BPE tokens, offering a fine-grained method of text segmentation.

11 Conclusion

To get the best results for your text processing jobs, choose the right text splitter based on your unique needs and the type of text you are dealing with.

Further Reading:

https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/code_splitter

https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/character_text_splitter

https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter

12 Acknowledgements

I’d like to express my thanks to the wonderful LangChain & Vector Databases in Production Course by Activeloop, which I completed, and to acknowledge the use of some images and other materials from the course in this article.
