Enhancing Data Structuring through Tagging and Extraction with OpenAI and LangChain

Structured data extraction is increasingly becoming an essential tool for developers who wish to harness the power of OpenAI’s capabilities. This blog post aims to provide an understanding of how developers use OpenAI functions for tagging and extraction - two primary use cases central to transforming unstructured text into structured, actionable data.
natural-language-processing
langchain
openai
Author

Pranath Fernando

Published

November 6, 2023

1 Introduction

Structured data extraction is fast becoming an essential tool for developers who wish to harness the power of large language models. This blog post aims to provide a comprehensive understanding of how developers can use OpenAI functions for tagging and extraction, two primary use cases central to transforming unstructured text into structured, actionable data.

2 Understanding Tagging

2.1 The Concept of Tagging

Tagging involves submitting unstructured text to an OpenAI language model along with structured instructions. The language model then generates a structured output, creating a response that aligns with the provided description. This output typically contains tags corresponding to the sentiment and language of the input text.

In this example, we want to generate an object that captures the text’s sentiment as well as a tag for its language. So, when we pass in a piece of unstructured text, we also pass in a structured description that says “extract the sentiment, extract the language,” and the LLM reasons over that text and returns an object with sentiment and language tags. This is comparable to, but distinct from, the second use case, extraction.

2.2 Practical Example of Tagging

For instance, if we want to determine the sentiment of a text and its language, we structure a request that specifies these requirements. The model processes the text and returns an object tagged with both sentiment and language. This allows for a nuanced understanding of the content, which is crucial for various applications.
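As a rough illustration only (the real schema is built with Pydantic later in this post), the structured description and the tagged result might look something like this:

# Illustrative shapes only - not the exact schema used later in this post.
structured_description = {
    "sentiment": "sentiment of the text, should be `pos`, `neg`, or `neutral`",
    "language": "language of the text as an ISO 639-1 code",
}

# For an input such as "I love langchain", the model would be expected to
# return a single structured object along these lines:
expected_output = {"sentiment": "pos", "language": "en"}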

3 The Process of Extraction

3.1 Distinguishing Extraction from Tagging

Extraction differs from tagging in that it involves identifying and retrieving specific entities from text. Unlike tagging, where a single structured output is generated, extraction yields a list of elements, such as the names of academic papers mentioned in an article.
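To make the distinction concrete, here is a small illustrative sketch (with hypothetical values) of the two output shapes:

# Tagging: one structured object describing the whole input text.
tagging_result = {"sentiment": "neg", "language": "it"}

# Extraction: a list of entities found in the input text, e.g. papers
# mentioned in an article (titles and authors here are placeholders).
extraction_result = [
    {"title": "Paper A", "author": "Author A"},
    {"title": "Paper B", "author": "Author B"},
]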

3.2 Implementing Extraction in Code

We begin by importing necessary functions and classes, creating models that define the structured output we aim to extract. With these models, the OpenAI function can parse the text and return the requested entities in a structured format.

import os
import openai
import warnings
warnings.filterwarnings('ignore')

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file
openai.api_key = os.environ['OPENAI_API_KEY']
from typing import List
from pydantic import BaseModel, Field
from langchain.utils.openai_functions import convert_pydantic_to_openai_function

4 Tagging and Extraction in Practice

4.1 Creating a Tagging Model

Let’s develop a tagging model using Pydantic. We define a schema instructing the model to tag sentiment and language, outlining the possible values for each tag. This schema shows the language model what shape the data we’re extracting should take.

We’ll call the class Tagging, give it a docstring that says “tag the piece of text with particular info”, and then list the fields we want to tag the text with. First comes sentiment, with a description stating that its value should be pos, neg, or neutral. We’re defining the allowed values for the sentiment field, and remember, this whole schema gets passed to the language model; this is how we inform it what shape the returned data should have. We then add a language field to capture the language of the text.

Because we know we always want tagging, we bind the function call so the model is forced to call this Tagging function every time. We can then establish a tagging chain by combining the prompt with this bound model and invoking it.

class Tagging(BaseModel):
    """Tag the piece of text with particular info."""
    sentiment: str = Field(description="sentiment of text, should be `pos`, `neg`, or `neutral`")
    language: str = Field(description="language of text (should be ISO 639-1 code)")
convert_pydantic_to_openai_function(Tagging)
{'name': 'Tagging',
 'description': 'Tag the piece of text with particular info.',
 'parameters': {'title': 'Tagging',
  'description': 'Tag the piece of text with particular info.',
  'type': 'object',
  'properties': {'sentiment': {'title': 'Sentiment',
    'description': 'sentiment of text, should be `pos`, `neg`, or `neutral`',
    'type': 'string'},
   'language': {'title': 'Language',
    'description': 'language of text (should be ISO 639-1 code)',
    'type': 'string'}},
  'required': ['sentiment', 'language']}}
from langchain.prompts import ChatPromptTemplate
from langchain.chat_models import ChatOpenAI
model = ChatOpenAI(temperature=0)
tagging_functions = [convert_pydantic_to_openai_function(Tagging)]
prompt = ChatPromptTemplate.from_messages([
    ("system", "Think carefully, and then tag the text as instructed"),
    ("user", "{input}")
])
model_with_functions = model.bind(
    functions=tagging_functions,
    function_call={"name": "Tagging"}
)
tagging_chain = prompt | model_with_functions
tagging_chain.invoke({"input": "I love langchain"})
AIMessage(content='', additional_kwargs={'function_call': {'name': 'Tagging', 'arguments': '{\n  "sentiment": "pos",\n  "language": "en"\n}'}})
tagging_chain.invoke({"input": "non mi piace questo cibo"})
AIMessage(content='', additional_kwargs={'function_call': {'name': 'Tagging', 'arguments': '{\n  "sentiment": "neg",\n  "language": "it"\n}'}})

We can invoke the chain on a piece of text and get back that response. The model calls the Tagging function, and in the function-call arguments we can see that the sentiment is positive and the language is English. We can do the same with another piece of text, this time in a different language and with a different sentiment.

5 Improving Output with Helper Parsers

Since we know we’ll always be extracting this structure, what we actually want is an output parser that takes the AI message, parses the JSON in the function-call arguments, and returns just that, because it’s the only interesting part of the response. We already know we’re forcing the model to call the function, so the fact that the content is empty is irrelevant, and the fact that it is calling the Tagging function is likewise uninteresting, because we know it will call it if we compel it to.

from langchain.output_parsers.openai_functions import JsonOutputFunctionsParser
tagging_chain = prompt | model_with_functions | JsonOutputFunctionsParser()
tagging_chain.invoke({"input": "non mi piace questo cibo"})
{'sentiment': 'neg', 'language': 'it'}

5.1 Extracting Information

Moving on to extraction, we define another model to capture multiple pieces of information, such as names and ages from a text. We then instruct the language model to parse these elements into a list of structured objects.

Extraction is similar to tagging, except that it pulls out multiple pieces of information. We’ll start by defining the piece of information we want to extract; here the theme is people, so we’ll have a Person class with a name field and an age field, which we mark as an optional integer. Because we want to extract a list of these objects, we’ll construct another class called Information that simply wraps the information we want to extract: it has a people attribute, which is a list of Person.

from typing import Optional
class Person(BaseModel):
    """Information about a person."""
    name: str = Field(description="person's name")
    age: Optional[int] = Field(description="person's age")
class Information(BaseModel):
    """Information to extract."""
    people: List[Person] = Field(description="List of info about people")

We can convert that class to an OpenAI function here, and we can see that the main property of Information is people, and that the nested Person schema, with its description, appears inside the items of people. Converting the Pydantic class with convert_pydantic_to_openai_function takes care of resolving the nested references and putting all of the essential information into the JSON schema block. So now we’ll set up an extraction chain.

convert_pydantic_to_openai_function(Information)
{'name': 'Information',
 'description': 'Information to extract.',
 'parameters': {'title': 'Information',
  'description': 'Information to extract.',
  'type': 'object',
  'properties': {'people': {'title': 'People',
    'description': 'List of info about people',
    'type': 'array',
    'items': {'title': 'Person',
     'description': 'Information about a person.',
     'type': 'object',
     'properties': {'name': {'title': 'Name',
       'description': "person's name",
       'type': 'string'},
      'age': {'title': 'Age',
       'description': "person's age",
       'type': 'integer'}},
     'required': ['name']}}},
  'required': ['people']}}

First, let’s define the extraction functions by calling convert_pydantic_to_openai_function on the Information class. Next we build the extraction model: we bind functions to the extraction functions so the model can use them, and we bind function_call to {"name": "Information"}, since Information is the function we want to force the model to call.

extraction_functions = [convert_pydantic_to_openai_function(Information)]
extraction_model = model.bind(functions=extraction_functions, function_call={"name": "Information"})
extraction_model.invoke("Joe is 30, his mom is Martha")
AIMessage(content='', additional_kwargs={'function_call': {'name': 'Information', 'arguments': '{\n  "people": [\n    {\n      "name": "Joe",\n      "age": 30\n    },\n    {\n      "name": "Martha",\n      "age": 0\n    }\n  ]\n}'}})

Let’s put it to the test with a simple statement. We can see that it extracts the name Joe and his age of 30. We also get the second person, Martha, but her age comes back as zero.

The text obviously mentions Martha but never gives her age, and the model appears to fall back on zero when it doesn’t know a value, which is something we can improve. We can nudge the model to respond more sensibly by adding a prompt that instructs it how to behave. So we add a system message that says: extract the relevant information; if something is not explicitly provided, do not guess, just extract partial information.

prompt = ChatPromptTemplate.from_messages([
    ("system", "Extract the relevant information, if not explicitly provided do not guess. Extract partial info"),
    ("human", "{input}")
])
extraction_chain = prompt | extraction_model
extraction_chain.invoke({"input": "Joe is 30, his mom is Martha"})
AIMessage(content='', additional_kwargs={'function_call': {'name': 'Information', 'arguments': '{\n  "people": [\n    {\n      "name": "Joe",\n      "age": 30\n    },\n    {\n      "name": "Martha"\n    }\n  ]\n}'}})

Hopefully this allows the language model to answer with just the name and not make up a value of zero for the age. The chain is now the prompt followed by the extraction model. If we invoke this extraction chain on the same input, we can see that in the arguments we get Martha’s name with no age field at all: the model correctly decides it doesn’t need to provide information about her age. Again, we can probably do better than this raw AI message.

extraction_chain = prompt | extraction_model | JsonOutputFunctionsParser()
extraction_chain.invoke({"input": "Joe is 30, his mom is Martha"})
{'people': [{'name': 'Joe', 'age': 30}, {'name': 'Martha'}]}

5.2 Streamlining the Extraction Process

By employing a JSON key output function parser, we can extract specific data points without extraneous information. This refined output is not only cleaner but also more functional for subsequent data handling.

In this case, all we actually want is the value of the arguments, which is a JSON block, and it’s convenient to parse that JSON so we can use its elements individually. LangChain includes a handy little output parser for this, JsonOutputFunctionsParser, which we import from langchain.output_parsers.openai_functions. We then recreate the chain as the prompt, piped into the function-calling model, with this output parser added as the final element.

If we bring back the JSON output parser from before and call the chain again, we can see the output parsed into a dictionary with a people key containing the list of extracted people. However, that wrapper is information we don’t actually need: as we defined it, Information is merely a vehicle that lets us extract several Person objects, and what we truly care about is the list itself.

For that, we can use a different helper parser, JsonKeyOutputFunctionsParser. What this essentially does is look for a specific key and return only the value of that key. So we slightly alter our extraction chain, passing in this new parser with key_name="people" to specify the field we wish to extract. Now, if we invoke it, we get back only the list. It’s a small enhancement, but it makes the output easier to use downstream when extraction is really what we’re after.

from langchain.output_parsers.openai_functions import JsonKeyOutputFunctionsParser
extraction_chain = prompt | extraction_model | JsonKeyOutputFunctionsParser(key_name="people")
extraction_chain.invoke({"input": "Joe is 30, his mom is Martha"})
[{'name': 'Joe', 'age': 30}, {'name': 'Martha'}]
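As an optional extra step (my own addition, not part of the course), the parsed dictionaries can be validated back into the Pydantic models we defined, which gives typed objects for downstream code:

results = extraction_chain.invoke({"input": "Joe is 30, his mom is Martha"})
# Validate the parsed dictionaries into Person objects; with Pydantic v1 the
# missing `age` for Martha simply defaults to None.
people = [Person(**p) for p in results]
people[0].name, people[0].age, people[1].age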

6 Applying Tagging and Extraction to Real-World Data

6.1 Loading and Analyzing an Article

We will demonstrate the application of our tagging and extraction models on an actual article, showcasing how to load the document, create the models, and execute the functions to retrieve structured data.

First, we’ll load a real article from the internet. We’re going to use the WebBaseLoader from LangChain’s document loaders, which was covered in a previous article. We pass in a URL that points to a fantastic blog post about autonomous agents and call load, which returns a list of documents.

Because the loader only returns one document, we grab that single document object. Then, since this is a lengthy document, we won’t print it all: we first take the first 10,000 characters of its page content and then print just the first 1,000 of those. From the introduction we can see that it’s an article about LLM-powered autonomous agents.

from langchain.document_loaders import WebBaseLoader
loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/")
documents = loader.load()
doc = documents[0]
page_content = doc.page_content[:10000]
print(page_content[:1000])

LLM Powered Autonomous Agents | Lil'Log

Lil'Log

Posts
Archive
Search
Tags
FAQ
emojisearch.app

      LLM Powered Autonomous Agents
    
Date: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng

Table of Contents

Agent System Overview
Component One: Planning
Task Decomposition
Self-Reflection
Component Two: Memory
Types of Memory
Maximum Inner Product Search (MIPS)
Component Three: Tool Use
Case Studies
Scientific Discovery Agent
Generative Agents Simulation
Proof-of-Concept Examples
Challenges
Citation
References

Building agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general

6.2 Extracting Information from Large Texts

For longer documents, we introduce text splitting to manage the size limitation of the language model. By dividing the text into smaller sections, we can apply our extraction model sequentially and then collate the results for a comprehensive output.

First, we’ll make a class that describes what we wish to tag. We’d like a high-level overview of this post: a summary, the language used, and some keywords, so we build an Overview model describing exactly that. Then we’ll build a chain. We create the overview tagging function by converting the Overview base model into an OpenAI function, and we set up our tagging model by binding the function we just built and compelling the model to call Overview.

class Overview(BaseModel):
    """Overview of a section of text."""
    summary: str = Field(description="Provide a concise summary of the content.")
    language: str = Field(description="Provide the language that the content is written in.")
    keywords: str = Field(description="Provide keywords related to the content.")

After that, we’ll make a tagging chain: the prompt from earlier, piped into the tagging model, piped into the JSON output functions parser.

overview_tagging_function = [
    convert_pydantic_to_openai_function(Overview)
]
tagging_model = model.bind(
    functions=overview_tagging_function,
    function_call={"name":"Overview"}
)
tagging_chain = prompt | tagging_model | JsonOutputFunctionsParser()
tagging_chain.invoke({"input": page_content})
{'summary': 'This article discusses the concept of building autonomous agents powered by LLM (large language model) as their core controller. It explores the key components of such agent systems, including planning, memory, and tool use. It also covers various techniques for task decomposition and self-reflection in autonomous agents. The article provides examples of case studies and challenges in implementing LLM-powered agents.',
 'language': 'English',
 'keywords': 'LLM, autonomous agents, planning, memory, tool use, task decomposition, self-reflection, case studies, challenges'}

Now we’ll try to extract all of the papers referenced in this article. The piece is quite academic, so it mentions a lot of papers, and we’re curious about what those papers are. We’ll start with some basic models: first a Paper class with the title of each paper and an optional author, as we have done before, and then another class called Info with a papers field holding a list of Paper objects, so we can extract many of them.

class Paper(BaseModel):
    """Information about papers mentioned."""
    title: str
    author: Optional[str]


class Info(BaseModel):
    """Information to extract"""
    papers: List[Paper]

Then we’ll set up our extraction chain. We generate the function to pass in, which is basically Info converted to an OpenAI function. We bind that to the functions parameter and bind function_call so the model is forced to call this Info function. Then we compose the chain: the prompt, the extraction model, and the JsonKeyOutputFunctionsParser, this time with key_name set to "papers". Finally we run it on the page content again.

paper_extraction_function = [
    convert_pydantic_to_openai_function(Info)
]
extraction_model = model.bind(
    functions=paper_extraction_function, 
    function_call={"name":"Info"}
)
extraction_chain = prompt | extraction_model | JsonKeyOutputFunctionsParser(key_name="papers")
extraction_chain.invoke({"input": page_content})
[{'title': 'LLM Powered Autonomous Agents', 'author': 'Lilian Weng'}]

So far, we’ve gotten just one result, a single title and author, and it’s a little perplexing: these are the title and author of the article we’re passing in, not of the papers mentioned inside it.

The language model is probably getting confused because, if you remember the start of the page, there’s a lot of “this is the article title, this is the author” text, and we haven’t instructed it clearly that it should extract the papers mentioned within the article rather than information about the article itself. To correct this, we’ll give it an improved system message.

We’ll say it more explicitly: an article will be passed to you, extract all the papers mentioned in it, do not extract the name of the article itself, if no papers are mentioned that’s fine, just return an empty list, and do not make up or guess any extra information, only extract exactly what is in the text. We’re expanding the prompt so it’s much more descriptive of how the language model should act. We’ll use this prompt in our new chain; everything else stays the same, the same extraction model and output parser.

And then we’ll call this new chain on that page content, and we’ll get back a list of articles with titles and authors, which will function much better.

template = """A article will be passed to you. Extract from it all papers that are mentioned by this article. 

Do not extract the name of the article itself. If no papers are mentioned that's fine - you don't need to extract any! Just return an empty list.

Do not make up or guess ANY extra information. Only extract what exactly is in the text."""

prompt = ChatPromptTemplate.from_messages([
    ("system", template),
    ("human", "{input}")
])
extraction_chain = prompt | extraction_model | JsonKeyOutputFunctionsParser(key_name="papers")
extraction_chain.invoke({"input": page_content})
[{'title': 'Chain of thought (CoT; Wei et al. 2022)', 'author': 'Wei et al.'},
 {'title': 'Tree of Thoughts (Yao et al. 2023)', 'author': 'Yao et al.'},
 {'title': 'LLM+P (Liu et al. 2023)', 'author': 'Liu et al.'},
 {'title': 'ReAct (Yao et al. 2023)', 'author': 'Yao et al.'},
 {'title': 'Reflexion (Shinn & Labash 2023)', 'author': 'Shinn & Labash'},
 {'title': 'Chain of Hindsight (CoH; Liu et al. 2023)',
  'author': 'Liu et al.'},
 {'title': 'Algorithm Distillation (AD; Laskin et al. 2023)',
  'author': 'Laskin et al.'}]

These are all excellent papers in this field that the author cites to make her points, not the article itself. We can also do a sanity check to ensure things are working: if we pass in a trivial message like “hi”, we expect an empty list back, and indeed we get one, so the instruction to return an empty list when no papers are mentioned is working properly.

extraction_chain.invoke({"input": "hi"})
[]

So it appears to be doing a good job here. But keep in mind that this is only the first 10,000 characters of the article. What if we want to run it on the entire article and catch all of the papers cited in it? To do this, we’ll employ another concept, text splitting, using the RecursiveCharacterTextSplitter that we studied in a previous post.

We need to split the text because this article is extremely long; if we try to pass the whole article to the language model directly, it will exceed the model’s token limit. So we’re going to split it into smaller pieces of text, give those pieces to the language model individually, and then combine all the results at the end.

from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_overlap=0)

Let’s make some splits by calling split_text on the document’s page content; counting them, we see we have 14 separate splits. What we’re going to do now is build an end-to-end chain with LangChain Expression Language (LCEL). We’ll start with the page content and divide it into splits, pass each of those splits through the extraction chain we defined earlier, and then merge all of the results together.

splits = text_splitter.split_text(doc.page_content)
len(splits)
14

One thing we’ll certainly need is a function that can concatenate lists of lists, so we write a flatten function that simply takes a list of lists and flattens it. This is useful because we’ll be extracting a list of papers for each split and then combining them together.

def flatten(matrix):
    flat_list = []
    for row in matrix:
        flat_list += row
    return flat_list
flatten([[1, 2], [3, 4]])
[1, 2, 3, 4]
print(splits[0])
LLM Powered Autonomous Agents | Lil'Log

Lil'Log

Posts
Archive
Search
Tags
FAQ
emojisearch.app

      LLM Powered Autonomous Agents
    
Date: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng

Table of Contents

Agent System Overview
Component One: Planning
Task Decomposition
Self-Reflection
Component Two: Memory
Types of Memory
Maximum Inner Product Search (MIPS)
Component Three: Tool Use
Case Studies
Scientific Discovery Agent
Generative Agents Simulation
Proof-of-Concept Examples
Challenges
Citation
References

Building agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.
Agent System Overview#
In a LLM-powered autonomous agent system, LLM functions as the agent’s brain, complemented by several key components:

Planning

Subgoal and decomposition: The agent breaks down large tasks into smaller, manageable subgoals, enabling efficient handling of complex tasks.
Reflection and refinement: The agent can do self-criticism and self-reflection over past actions, learn from mistakes and refine them for future steps, thereby improving the quality of final results.

Memory

Short-term memory: I would consider all the in-context learning (See Prompt Engineering) as utilizing short-term memory of the model to learn.
Long-term memory: This provides the agent with the capability to retain and recall (infinite) information over extended periods, often by leveraging an external vector store and fast retrieval.

Tool use

The agent learns to call external APIs for extra information that is missing from the model weights (often hard to change after pre-training), including current information, code execution capability, access to proprietary information sources and more.

Another thing we need is a way to prepare the splits before passing them into the chain. Keep in mind that the extraction chain expects an input variable, namely a dictionary with an input key, while each split in this list is just plain text. So we need a way to turn this list of strings into a list of dictionaries where each string becomes the value of the input key. We’ll do this by declaring a function for it, since this will be the first step in the chain.

We’ll wrap it in a RunnableLambda. A RunnableLambda is simply a LangChain wrapper that takes a function or a lambda and converts it into a runnable object; when a plain function is the first member of a chain, you need this so that it can be composed correctly. So we define the preprocessing function right here: a runnable that takes an input string, splits it, and wraps each split as an input dictionary.

from langchain.schema.runnable import RunnableLambda
prep = RunnableLambda(
    lambda x: [{"input": doc} for doc in text_splitter.split_text(x)]
)

The x here is a string. The function takes the string, splits it, and creates a list of dictionaries, each with an input key matching one split. If we play around with it and call it on a short string, we get back a list of dictionaries; because the text is so short the splitter doesn’t break it up, so there’s only one dictionary here.

prep.invoke("hi")
[{'input': 'hi'}]

The reason this shape is required is that its output becomes the input to the next part, the extraction chain, and we want to produce one input per split. Now we can start building our chain. We begin with this preparation function and then pass its output along to the extraction chain. Remember that the extraction chain operates on a single element, while we have a list of items to pass in, so we call .map() on the extraction chain.

chain = prep | extraction_chain.map() | flatten

This basically means: take the previous output, which in this case is a list of elements, and map the extraction chain over each of them. Since the extraction chain itself returns a list, the result is a list of lists, so we finish with the flatten function we defined earlier. Because it isn’t the first element in the chain, we don’t need to wrap it in a RunnableLambda; we could if we wanted to, but we don’t have to.

So we have the chain, and if we call chain.invoke on the document’s full page content, it will work through the entire article. It takes some time, but it eventually returns the answer. It does two things: first it splits the content into the 14 pieces, then it passes them to the extraction chain. When it maps over the splits, it automatically runs many of those calls in parallel, with a default concurrency of five.

chain.invoke(doc.page_content)
[{'title': 'AutoGPT', 'author': ''},
 {'title': 'GPT-Engineer', 'author': ''},
 {'title': 'BabyAGI', 'author': ''},
 {'title': 'Chain of thought (CoT; Wei et al. 2022)', 'author': ''},
 {'title': 'Tree of Thoughts (Yao et al. 2023)', 'author': ''},
 {'title': 'LLM+P (Liu et al. 2023)', 'author': ''},
 {'title': 'ReAct (Yao et al. 2023)', 'author': ''},
 {'title': 'Reflexion (Shinn & Labash 2023)', 'author': ''},
 {'title': 'Reflexion framework', 'author': 'Shinn & Labash'},
 {'title': 'Chain of Hindsight', 'author': 'Liu et al.'},
 {'title': 'Algorithm Distillation', 'author': 'Laskin et al.'},
 {'title': 'Algorithm Distillation', 'author': 'Laskin et al. 2023'},
 {'title': 'ED (expert distillation)', 'author': ''},
 {'title': 'RL^2', 'author': 'Duan et al. 2017'},
 {'title': 'LSH: Locality-Sensitive Hashing', 'author': ''},
 {'title': 'ANNOY: Approximate Nearest Neighbors Oh Yeah', 'author': ''},
 {'title': 'HNSW: Hierarchical Navigable Small World', 'author': ''},
 {'title': 'FAISS: Facebook AI Similarity Search', 'author': ''},
 {'title': 'ScaNN: Scalable Nearest Neighbors', 'author': ''},
 {'title': 'MRKL: Modular Reasoning, Knowledge and Language',
  'author': 'Karpas et al. 2022'},
 {'title': 'TALM: Tool Augmented Language Models',
  'author': 'Parisi et al. 2022'},
 {'title': 'Toolformer', 'author': 'Schick et al. 2023'},
 {'title': 'HuggingGPT', 'author': 'Shen et al. 2023'},
 {'title': 'API-Bank: A Benchmark for Evaluating Tool-Augmented Language Models',
  'author': 'Li et al. 2023'},
 {'title': 'ChemCrow: Augmenting Language Models with Expert-Designed Tools for Scientific Discovery',
  'author': 'Bran et al. 2023'},
 {'title': 'Boiko et al. (2023)', 'author': 'Boiko et al.'},
 {'title': 'Generative Agents Simulation', 'author': 'Park, et al. 2023'},
 {'title': 'Park et al. 2023', 'author': ''},
 {'title': 'Super Mario: How Nintendo Conquered America',
  'author': 'Jeff Ryan'},
 {'title': 'Model-View-Controller (MVC) Explained', 'author': 'Techopedia'},
 {'title': 'Python Game Development: Creating a Snake Game',
  'author': 'Real Python'},
 {'title': 'Paper A', 'author': 'Author A'},
 {'title': 'Paper B', 'author': 'Author B'},
 {'title': 'Paper C', 'author': 'Author C'},
 {'title': 'Chain of thought prompting elicits reasoning in large language models.',
  'author': 'Wei et al.'},
 {'title': 'Tree of Thoughts: Deliberate Problem Solving with Large Language Models.',
  'author': 'Yao et al.'},
 {'title': 'Chain of Hindsight Aligns Language Models with Feedback',
  'author': 'Liu et al.'},
 {'title': 'LLM+P: Empowering Large Language Models with Optimal Planning Proficiency',
  'author': 'Liu et al.'},
 {'title': 'ReAct: Synergizing reasoning and acting in language models.',
  'author': 'Yao et al.'},
 {'title': 'Reflexion: an autonomous agent with dynamic memory and self-reflection',
  'author': 'Shinn & Labash'},
 {'title': 'In-context Reinforcement Learning with Algorithm Distillation',
  'author': 'Laskin et al.'},
 {'title': 'MRKL Systems A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning.',
  'author': 'Karpas et al.'},
 {'title': 'API-Bank: A Benchmark for Tool-Augmented LLMs',
  'author': 'Li et al.'},
 {'title': 'HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace',
  'author': 'Shen et al.'},
 {'title': 'ChemCrow: Augmenting large-language models with chemistry tools.',
  'author': 'Bran et al.'},
 {'title': 'Emergent autonomous scientific research capabilities of large language models.',
  'author': 'Boiko et al.'},
 {'title': 'Generative Agents: Interactive Simulacra of Human Behavior.',
  'author': 'Joon Sung Park, et al.'}]

So it isn’t fully parallel, but it does speed things up considerably. When all of those calls have completed, the results are handed to the final flatten function, and there we have it: a list of extracted papers, each with a title and an author. For some of them, we can see that it leaves the author empty.
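If you want to control how many of those per-split calls run at once, my understanding is that the max_concurrency setting in the runnable config applies to the batched calls that .map() makes under the hood; treat this as an assumption to verify against your LangChain version:

# Assumption: `max_concurrency` in the runnable config caps how many split
# extractions run in parallel when `.map()` batches them; verify this against
# your LangChain version.
results = chain.invoke(doc.page_content, config={"max_concurrency": 3})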

If you look at these results you will see entries like Paper A with Author A. This looks incorrect, but if you look at the source article you will see that it contains example prompts and code that themselves include placeholder citations, among other things around extraction and generation. So there’s text in the article that mimics fake papers, and the model is picking it up accordingly.

7 Conclusion: The Power of Structured Data Extraction

Tagging and extraction are powerful methods for turning unstructured text into structured data, opening up numerous possibilities for data analysis and insight generation. By understanding and applying these techniques through OpenAI’s functions and Langchain, developers can efficiently address common use cases and unlock the full potential of language models for data structuring tasks.

8 Acknowledgements

I’d like to express my thanks to the wonderful Functions, Tools and Agents with LangChain course by DeepLearning.ai, which I completed, and to acknowledge the use of some images and other materials from the course in this article.
