Document Splitting with LangChain

In this article we look at how you can split documents as an important step in making content available for Large Language Models
natural-language-processing
deep-learning
langchain
Author

Pranath Fernando

Published

July 21, 2023

1 Introduction

In this article we look at how you can split documents as an important step in making content available for Large Language Models.

2 Document Splitting

Splitting documents may sound easy, but there are many small details that make a great difference in the long run.

Document splitting happens after your data is loaded into the document format, but before it enters the vector store, and despite how straightforward it may seem there is a lot of nuance to it. You might naively divide the text into equal-sized chunks based on character length. However, let's look at the following scenario as an illustration of why this is both more difficult and more important than it appears: a sentence mentions the Toyota Camry along with some of its specifications.

If we simply split that sentence in the middle, we can wind up with one part of it in one chunk and the other part in another. Later on, when we try to answer a question about the Camry's specifications, neither chunk contains the necessary details because the relevant information has been split apart, so we would be unable to give a suitable answer. Dividing text so that semantically related content ends up in the same chunk therefore requires a lot of care.

The foundation of every text splitter in LangChain is splitting text into chunks of a certain size, with some overlap between adjacent chunks. We've included a small diagram below to illustrate what that looks like. There are various ways to determine the chunk size.

To determine a chunk's size, the splitters accept a length function, which typically counts characters or tokens. When moving from one chunk to the next, we usually keep a small overlap between the two, much like a sliding window. This helps establish some consistency, because the same contextual information can appear at the end of one chunk and the beginning of the next. Each of LangChain's text splitters has methods for splitting raw text and for creating and splitting documents. Under the hood the logic is the same; the interfaces just differ slightly in whether they accept lists of text or lists of documents.
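As a rough sketch of those two interfaces (the chunk sizes, example text and metadata here are made up for illustration and are not from the course):

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Illustrative values only.
splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20)

# split_text works on a raw string and returns a list of strings.
chunks = splitter.split_text("some long piece of raw text ...")

# create_documents / split_documents work with Document objects and carry metadata along.
docs = splitter.create_documents(
    ["some long piece of raw text ..."],
    metadatas=[{"source": "example.txt"}],
)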

These text splitters come in a wide variety. They can differ in how the chunks are delimited and in which characters go into each one, as well as in how chunk size is measured: by characters, or by tokens. Some even split on sentence boundaries, using other, more precise models to identify where a sentence ends, as sketched below. Metadata is yet another crucial component of chunking: certain text splitters focus on keeping the same metadata across all chunks while also introducing new pieces of metadata when necessary.
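As a rough illustration of a sentence-aware splitter, LangChain provides an NLTKTextSplitter that uses the NLTK sentence tokenizer to find sentence boundaries before merging sentences into chunks. A minimal sketch, assuming the nltk package and its punkt data are installed (the chunk size and text are arbitrary):

from langchain.text_splitter import NLTKTextSplitter

# Split on sentence boundaries detected by NLTK, then merge sentences into chunks.
nltk_splitter = NLTKTextSplitter(chunk_size=200, chunk_overlap=0)
nltk_splitter.split_text("This is one sentence. This is another. And here is a third.")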

When splitting code, it becomes very clear that how we split can depend heavily on the sort of document we're working with. LangChain therefore provides a language-aware text splitter with appropriate separators for languages like Python, Ruby, and C, so that each language's code is split along its natural boundaries.
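As an example, here is a sketch of language-aware splitting for Python source (the code snippet and chunk size are made up for illustration):

from langchain.text_splitter import RecursiveCharacterTextSplitter, Language

python_code = """
def hello_world():
    print("Hello, World!")

class Greeter:
    def greet(self):
        hello_world()
"""

# from_language picks Python-specific separators (class and def boundaries, blank lines, ...).
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=60,
    chunk_overlap=0,
)
python_splitter.split_text(python_code)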

3 Load Libs and Setup

The environment is first set up as before by loading the OpenAI API key. Next, we import two of LangChain's most commonly used text splitters: the character text splitter and the recursive character text splitter. We'll first experiment with a few toy examples to get a better feel for what these actually do, so just to illustrate, we set a deliberately small chunk size of 26 and an even smaller chunk overlap of 4.

Let's initialise these two text splitters as r_splitter and c_splitter respectively, and then look at a few different scenarios.

import os
import openai
import sys
sys.path.append('../..')

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

openai.api_key = os.environ['OPENAI_API_KEY']
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter
chunk_size = 26
chunk_overlap = 4
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

Why doesn’t this split the string below?

text1 = 'abcdefghijklmnopqrstuvwxyz'
r_splitter.split_text(text1)
['abcdefghijklmnopqrstuvwxyz']

Let's start with the first string: A, B, C, D and so on, all the way to Z, and see what happens with the different splitters. With the recursive character text splitter it remains one string, because we specified a chunk size of 26 and the string is exactly 26 characters long, so there is nothing to split.

text2 = 'abcdefghijklmnopqrstuvwxyzabcdefg'
r_splitter.split_text(text2)
['abcdefghijklmnopqrstuvwxyz', 'wxyzabcdefg']

Now let's apply it to a string that is slightly longer than the 26 characters we set as the chunk size. Here two distinct chunks are formed: the first stops at Z, giving 26 characters, and the second begins with W, X, Y, Z (the four characters of overlap), followed by the remainder of the string. Next let's look at a slightly more complicated string that has spaces between the characters. Because of the spaces it takes up more room, and the recursive splitter now divides it into three chunks.

Ok, this splits the string, but we specified a chunk overlap of 4 characters, and the shared text between the first two chunks looks like just the two letters l and m?

text3 = "a b c d e f g h i j k l m n o p q r s t u v w x y z"
r_splitter.split_text(text3)
['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']
c_splitter.split_text(text3)
['a b c d e f g h i j k l m n o p q r s t u v w x y z']

If we examine the overlap, we can see that l and m appear at the end of the first chunk and again at the start of the second. That looks like only two characters, but the space between the l and the m, together with the space immediately adjacent to them, also counts, which is how we arrive at the four-character chunk overlap we specified.

Now let's test the character text splitter, and we can see that it doesn't even attempt to split the string. So what exactly is going on here? The character text splitter only splits on a single separator, which by default is a newline, and there are no newlines in this text. If we change the separator to a space, it splits in the same way as the recursive splitter did.

c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separator = ' '
)
c_splitter.split_text(text3)
['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']

Before moving on to more realistic examples, it is worth playing around with the chunk size and chunk overlap in a few toy cases like these, just to build a sense of what is happening; one such experiment is sketched below.
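For instance, one simple experiment (the values here are arbitrary) is to re-run the spaced-out string with a larger chunk size and overlap and compare the pieces with the earlier result:

# Arbitrary values, just to see how the chunks change relative to the run above.
r_splitter_bigger = RecursiveCharacterTextSplitter(
    chunk_size=40,
    chunk_overlap=10
)
r_splitter_bigger.split_text(text3)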

4 Recursive splitting details

RecursiveCharacterTextSplitter is recommended for generic text.

Below is a longer piece of text, with a double newline, the marker used to demarcate paragraphs, located roughly in the middle. Looking at the text's length, it is just under 500 characters. Next, let's define the two text splitters we'll be using: the character text splitter with a space as a separator, as before, and then the recursive character text splitter. The separators we pass in to the recursive splitter are its defaults, but we're writing them out in this notebook to make it clearer what's happening.

some_text = """When writing documents, writers will use document structure to group content. \
This can convey to the reader, which idea's are related. For example, closely related ideas \
are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n  \
Paragraphs are often delimited with a carriage return or two carriage returns. \
Carriage returns are the "backslash n" you see embedded in this string. \
Sentences have a period at the end, but also, have a space.\
and words are separated by space."""
len(some_text)
496
c_splitter = CharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0,
    separator = ' '
)
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0, 
    separators=["\n\n", "\n", " ", ""]
)

The separator list consists of a double newline, a single newline, a space, and finally an empty string. This means that when the recursive splitter splits a block of text, it will first attempt to split on double newlines; if the individual pieces still need further splitting, it switches to single newlines; if further work is required, it moves on to spaces; and if truly necessary, it falls back to going character by character. Running the two splitters on the text above, we can see that the character text splitter splits on spaces.

c_splitter.split_text(some_text)
['When writing documents, writers will use document structure to group content. This can convey to the reader, which idea\'s are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also,',
 'have a space.and words are separated by space.']
r_splitter.split_text(some_text)
["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.",
 'Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also, have a space.and words are separated by space.']

That is why the character splitter produces the strange break in the middle of a sentence. The recursive text splitter, by contrast, first attempts to split on double newlines and so divides the text into its two paragraphs.

This is probably a better split, even though the first chunk is shorter than the 450-character limit, because each paragraph now sits in its own chunk instead of being cut off in the middle of a sentence.

Now let's break it down even further to make what is happening clearer, adding a period as a separator with the intention of splitting on sentence boundaries. The text splitter does split the material into sentences, but the periods end up in the wrong place, at the start of the following chunk. This is due to the regex that operates behind the scenes; we can fix it by specifying a slightly more complex regex with a lookbehind.

r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    separators=["\n\n", "\n", "\. ", " ", ""]
)
r_splitter.split_text(some_text)
["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related",
 '. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.',
 'Paragraphs are often delimited with a carriage return or two carriage returns',
 '. Carriage returns are the "backslash n" you see embedded in this string',
 '. Sentences have a period at the end, but also, have a space.and words are separated by space.']
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    separators=["\n\n", "\n", "(?<=\. )", " ", ""]
)
r_splitter.split_text(some_text)
["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related.",
 'For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.',
 'Paragraphs are often delimited with a carriage return or two carriage returns.',
 'Carriage returns are the "backslash n" you see embedded in this string.',
 'Sentences have a period at the end, but also, have a space.and words are separated by space.']

Running this, we can see that the text is now split appropriately into sentences, with the periods in the right place.

Let's try it out on a real-world example using a PDF that we worked with before in the document loading stage. Once it has been loaded, we define our text splitter and supply the length function: the built-in Python function len. This is the default, but we're specifying it to make clearer what's happening behind the scenes; it counts the number of characters.

Because we want to work with documents rather than raw strings, we use the split_documents method and pass in the list of pages.

from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("docs/MachineLearning-Lecture01.pdf")
pages = loader.load()
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=150,
    length_function=len
)
docs = text_splitter.split_documents(pages)
len(docs)
77
len(pages)
22

Comparing the number of split documents to the number of original pages shows how many additional documents the splitting has produced: after splitting we have 77 chunks, compared with the original 22 pages.

5 Token splitting

So far all of our splitting has been character-based, but there is another approach, based on tokens: we can also split on token count explicitly if we want. Let's import the token text splitter.

The reason this is useful is that LLMs often have context windows that are measured in tokens, so it's important to know what the tokens are and where they appear; splitting on them gives a slightly more representative idea of how the LLM will see the text. To get a real sense of the difference between tokens and characters, let's initialise the token text splitter with a chunk size of 1 and a chunk overlap of 0.

from langchain.text_splitter import TokenTextSplitter
text_splitter = TokenTextSplitter(chunk_size=1, chunk_overlap=0)

The result will be a list of the individual tokens in the given text. Let's make up a silly string for entertainment purposes. When we split it, we can see that it has been divided into a number of distinct tokens, each varying somewhat in length and character count: 'foo', then a space plus 'bar', a space plus 'b', then 'az', 'zy', and finally 'foo' once more. This illustrates the difference between splitting on characters and splitting on tokens.

text1 = "foo bar bazzyfoo"
text_splitter.split_text(text1)
['foo', ' bar', ' b', 'az', 'zy', 'foo']

Applying this to the documents we loaded earlier, we can likewise call split_documents on the pages. If we look at the first split document, its page content is roughly just the title, and its metadata records the source and the page it came from. Checking against page 0 of the original, we can see that the source and page metadata are identical in the chunk and in the original document.

text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)
docs = text_splitter.split_documents(pages)
docs[0]
Document(page_content='MachineLearning-Lecture01  \n', metadata={'source': 'docs/MachineLearning-Lecture01.pdf', 'page': 0})
pages[0].metadata
{'source': 'docs/MachineLearning-Lecture01.pdf', 'page': 0}

This is good: the metadata is carried through to each chunk appropriately. But there are also cases where you actually want to add more metadata to the chunks as you split them. This can include information such as where in the document the chunk came from, or where it sits relative to other things or concepts in the document, and this information can then be used when answering questions to provide more context about what the chunk actually is; a small sketch of the idea follows.
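A hypothetical sketch of that idea (the chunk_index field is something we add ourselves here; it is not produced by the splitter):

# Hypothetical: tag each chunk with its position in the split output,
# so answers can later indicate roughly where in the document a chunk sits.
for i, doc in enumerate(docs):
    doc.metadata["chunk_index"] = i

docs[0].metadata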

6 Context aware splitting

To see an actual illustration of this in action, let's look at another kind of text splitter that adds data to the metadata of each piece. This text splitter, called the markdown header text splitter, works by splitting a markdown file based on its headers and subheaders. It then adds those headers to the metadata fields, and any chunks that result from those splits receive that information as well.

Chunking aims to keep text with common context together.

Text splitting often uses sentences or other delimiters to keep related text together, but many documents (such as Markdown) have structure (headers) that can be used explicitly in splitting.

We can use MarkdownHeaderTextSplitter to preserve header metadata in our chunks, as shown below.

from langchain.document_loaders import NotionDirectoryLoader
from langchain.text_splitter import MarkdownHeaderTextSplitter
markdown_document = """# Title\n\n \
## Chapter 1\n\n \
Hi this is Jim\n\n Hi this is Joe\n\n \
### Section \n\n \
Hi this is Lance \n\n \
## Chapter 2\n\n \
Hi this is Molly"""

The toy document above has a title, a Chapter 1 heading followed by a few sentences, an even smaller subsection, and then a Chapter 2 heading with a few more sentences. Let's create a list of the headers we wish to split on, along with names for them: a single hash, which we'll refer to as Header 1; two hashes, Header 2; three hashes, Header 3; and so forth. We then initialise the markdown header text splitter with those headers and split the toy example above.

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)
md_header_splits = markdown_splitter.split_text(markdown_document)

If we take a look at a few of these splits, we can see that the first one has the content "Hi this is Jim" and "Hi this is Joe", and in the metadata we have Header 1 set to Title and Header 2 set to Chapter 1, both coming from the example document above.

Looking at the next one, we have jumped down into the even smaller subsection: the content is "Hi this is Lance", and the metadata now has not only Header 1 but also Header 2 and Header 3, again coming from the content and names in the markdown document above.

md_header_splits[0]
Document(page_content='Hi this is Jim  \nHi this is Joe', metadata={'Header 1': 'Title', 'Header 2': 'Chapter 1'})
md_header_splits[1]
Document(page_content='Hi this is Lance', metadata={'Header 1': 'Title', 'Header 2': 'Chapter 1', 'Header 3': 'Section'})

Now let's apply this to a real-world example: a Notion directory export, loaded with the NotionDirectoryLoader imported above. Once those documents have loaded, we define the markdown splitter so that a single hash is Header 1 and a double hash is Header 2, split the text, and obtain the splits.
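The code for this step isn't shown in the notebook excerpt above, but a sketch of it might look like the following (the directory path is an assumption; point it at your own Notion export):

# Assumed path to a Notion export directory; adjust to your own data.
loader = NotionDirectoryLoader("docs/Notion_DB")
notion_docs = loader.load()

# Join the pages into one markdown string and split on headers.
txt = ' '.join([d.page_content for d in notion_docs])
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
]
markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
md_header_splits = markdown_splitter.split_text(txt)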

If we examine the splits, the first one contains the text of a page, and scrolling down to its metadata we can see that Header 1 has been set to Blendle's Employee Handbook. We have now covered how to obtain chunks with relevant semantic metadata. Moving those chunks into a vector store is the next stage, which will be covered in the next article.

7 Acknowledgements

I'd like to express my thanks to the wonderful LangChain: Chat with your data course by DeepLearning.ai and LangChain, which I completed, and to acknowledge the use of some images and other materials from the course in this article.
