Creating Knowledge Graphs from Textual Data and LLMs

Here we walk through a simple workflow for creating a knowledge graph from textual data, making complex information more accessible and easier to understand.
natural-language-processing
deep-learning
langchain
activeloop
openai
prompt-engineering
network-analysis
Author

Pranath Fernando

Published

August 6, 2023

1 Introduction

Understanding the connections between different pieces of information is essential in today’s data-driven environment. Knowledge graphs have emerged as a powerful tool for visualising and exploring these connections: they turn unstructured text into a structured network of entities and their relationships. In this post we walk through a straightforward method for converting textual data into a knowledge graph, making complex content more approachable and understandable.

2 Import Libs & Setup

import os
import openai
import sys
sys.path.append('../..')
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

openai.api_key  = os.environ['OPENAI_API_KEY']

3 Workflow for Creating Knowledge Graphs from Textual Data

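Here’s what we are going to do in this post: use an LLM prompt to extract relation triplets from text, parse the response into a list of triplets, build a NetworkX graph from them, and visualise the graph with PyVis.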

4 Knowledge Graphs and Knowledge Bases: know the difference

It’s crucial to understand the distinction between knowledge bases and knowledge graphs before continuing.

Although the terms “knowledge base” and “knowledge graph” are sometimes used interchangeably, they differ slightly. A knowledge base (KB) is a collection of organised data about a certain topic. A knowledge graph, on the other hand, is a knowledge base organised as a graph, with nodes denoting entities and edges denoting the relationships between those entities. For instance, from the sentence “Fabio lives in Italy” we can extract the relation triplet (Fabio, lives in, Italy), where “Fabio” and “Italy” are entities and “lives in” is their relationship.

A knowledge graph is a specific kind of knowledge base, but a knowledge base is not necessarily a knowledge graph.
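To make the node-and-edge idea concrete, here is a minimal sketch in plain Python. It assumes nothing beyond the standard library; the second triple (Italy, is in, Europe) and the helper name build_adjacency are made up purely for illustration.

```python
from collections import defaultdict

# Each fact is a (subject, predicate, object) triple.
# The second triple is a hypothetical addition for illustration only.
triples = [
    ("Fabio", "lives in", "Italy"),
    ("Italy", "is in", "Europe"),
]

def build_adjacency(triples):
    """Map each subject node to its outgoing (predicate, object) edges."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return dict(graph)

graph = build_adjacency(triples)

# Nodes are all entities that appear as a subject or an object.
nodes = {s for s, _, _ in triples} | {o for _, _, o in triples}

print(graph["Fabio"])  # [('lives in', 'Italy')]
print(sorted(nodes))   # ['Europe', 'Fabio', 'Italy']
```

The entities become nodes and each predicate becomes a labelled, directed edge, which is the same structure the rest of this post builds with NetworkX.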

5 Building a Knowledge Graph

The process of building a knowledge graph usually consists of two sequential steps:

  1. Named Entity Recognition (NER): This step involves extracting entities from the text, which will eventually become the nodes of the knowledge graph.
  2. Relation Classification (RC): In this step, relations between entities are extracted, forming the edges of the knowledge graph.

The resulting knowledge graph is then often visualised with a tool such as pyvis.

Adding extra steps to this pipeline can often improve the resulting knowledge base. For example:

  • Entity Linking: This involves normalizing different mentions to the same entity, such as “Napoleon” and “Napoleon Bonaparte.” This is usually done by linking them to a canonical source, like a Wikipedia page.
  • Source Tracking: Keeping track of the origin of each relation, such as the article URL and text span. Tracking sources lets us gather insights into the reliability of the extracted information (e.g., a relation is more likely to be accurate if it can be extracted from several sources considered accurate).
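These two optional steps can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the alias table, the example URLs, the sample triples, and the helper names link_entity and merge_with_sources are all hypothetical, and a real system would link against a canonical source such as Wikipedia rather than a hand-written dictionary.

```python
from collections import defaultdict

# Hypothetical alias table: surface form -> canonical entity name.
ALIASES = {
    "Napoleon": "Napoleon Bonaparte",
    "Napoleon Bonaparte": "Napoleon Bonaparte",
}

def link_entity(name):
    """Return the canonical name for an entity, or the name itself if unknown."""
    name = name.strip()
    return ALIASES.get(name, name)

def merge_with_sources(extractions):
    """Group normalised triples and record the set of sources for each one."""
    sources = defaultdict(set)
    for (subject, predicate, obj), url in extractions:
        key = (link_entity(subject), predicate, link_entity(obj))
        sources[key].add(url)
    return dict(sources)

# Hypothetical extractions from two made-up article URLs.
extractions = [
    (("Napoleon", "was born in", "Corsica"), "https://example.com/a"),
    (("Napoleon Bonaparte", "was born in", "Corsica"), "https://example.com/b"),
]

merged = merge_with_sources(extractions)
for triple, urls in merged.items():
    print(triple, "supported by", len(urls), "source(s)")
```

After linking, both mentions collapse into a single relation backed by two sources, which is the kind of signal that can feed a simple reliability score.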

In this project we perform Named Entity Recognition and Relation Classification together in a single step, using a suitable prompt. This combined task is commonly known as Relation Extraction (RE).

6 Building a Knowledge Graph with LangChain

As an example of using a prompt to extract relations from text in LangChain, we can take the KNOWLEDGE_TRIPLE_EXTRACTION_PROMPT prompt as a starting point. This prompt is designed to extract knowledge triples (subject, predicate, and object) from a given text input.

This prompt is used by the ConversationKGMemory class from the LangChain library, which lets chatbots remember earlier messages in a conversation by storing the relations extracted from them. Memory classes will be explained separately. In this example, we don’t use a memory class; we only use the prompt to extract relations from text.

Let’s examine the structure of KNOWLEDGE_TRIPLE_EXTRACTION_PROMPT. It is an instance of the PromptTemplate class with a single input variable, text. The template is a string that gives the language model instructions for extracting knowledge triples from the input text, along with a few examples. The following code requires your OpenAI API key to be stored in the OPENAI_API_KEY environment variable. Remember to install the necessary packages first with: pip install deeplake openai tiktoken langchain==0.0.208.

from langchain.prompts import PromptTemplate
from langchain.llms import OpenAI
from langchain.chains import LLMChain
from langchain.graphs.networkx_graph import KG_TRIPLE_DELIMITER

# Prompt template for knowledge triple extraction
_DEFAULT_KNOWLEDGE_TRIPLE_EXTRACTION_TEMPLATE = (
    "You are a networked intelligence helping a human track knowledge triples"
    " about all relevant people, things, concepts, etc. and integrating"
    " them with your knowledge stored within your weights"
    " as well as that stored in a knowledge graph."
    " Extract all of the knowledge triples from the text."
    " A knowledge triple is a clause that contains a subject, a predicate,"
    " and an object. The subject is the entity being described,"
    " the predicate is the property of the subject that is being"
    " described, and the object is the value of the property.\n\n"
    "EXAMPLE\n"
    "It's a state in the US. It's also the number 1 producer of gold in the US.\n\n"
    f"Output: (Nevada, is a, state){KG_TRIPLE_DELIMITER}(Nevada, is in, US)"
    f"{KG_TRIPLE_DELIMITER}(Nevada, is the number 1 producer of, gold)\n"
    "END OF EXAMPLE\n\n"
    "EXAMPLE\n"
    "I'm going to the store.\n\n"
    "Output: NONE\n"
    "END OF EXAMPLE\n\n"
    "EXAMPLE\n"
    "Oh huh. I know Descartes likes to drive antique scooters and play the mandolin.\n"
    f"Output: (Descartes, likes to drive, antique scooters){KG_TRIPLE_DELIMITER}(Descartes, plays, mandolin)\n"
    "END OF EXAMPLE\n\n"
    "EXAMPLE\n"
    "{text}"
    "Output:"
)

KNOWLEDGE_TRIPLE_EXTRACTION_PROMPT = PromptTemplate(
    input_variables=["text"],
    template=_DEFAULT_KNOWLEDGE_TRIPLE_EXTRACTION_TEMPLATE,
)

# Instantiate the OpenAI model
llm = OpenAI(model_name="text-davinci-003", temperature=0.9)

# Create an LLMChain using the knowledge triple extraction prompt
chain = LLMChain(llm=llm, prompt=KNOWLEDGE_TRIPLE_EXTRACTION_PROMPT)

# Run the chain with the specified text
text = "The city of Paris is the capital and most populous city of France. The Eiffel Tower is a famous landmark in Paris."
triples = chain.run(text)

print(triples)
 (Paris, is the capital of, France)<|>(Paris, is the most populous city of, France)<|>(Paris, has, Eiffel Tower)<|>(Eiffel Tower, is a, landmark)<|>(Eiffel Tower, is in, Paris)

Using few-shot examples, the prompt in the preceding code extracted relation triplets from the text. Next, we parse the generated triplets and collect them into a list.

At this point, the triples variable holds the knowledge triplets extracted from the text as a single delimited string. To parse the response and collect the triplets into a list, do the following:

def parse_triples(response, delimiter=KG_TRIPLE_DELIMITER):
    if not response:
        return []
    return response.split(delimiter)

triples_list = parse_triples(triples)

# Print the extracted relation triplets
print(triples_list)
[' (Paris, is the capital of, France)', '(Paris, is the most populous city of, France)', '(Paris, has, Eiffel Tower)', '(Eiffel Tower, is a, landmark)', '(Eiffel Tower, is in, Paris)']
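The list returned by parse_triples still holds raw strings, including the surrounding parentheses and any leading whitespace. As a hedged sketch, one way to turn them into clean (subject, predicate, object) tuples is a small regular expression. The helper name parse_triple_strings and the pattern are assumptions made for illustration and are not part of LangChain.

```python
import re

# Matches "(subject, predicate, object)". The subject and predicate must not
# contain commas; the object is allowed to.
TRIPLE_PATTERN = re.compile(r"^\(\s*([^,]+?)\s*,\s*([^,]+?)\s*,\s*(.+?)\s*\)$")

def parse_triple_strings(raw_triples):
    """Convert strings like '(Paris, is in, France)' into 3-tuples."""
    parsed = []
    for raw in raw_triples:
        match = TRIPLE_PATTERN.match(raw.strip())
        if match:  # skip 'NONE' and anything that is not a well-formed triple
            parsed.append(match.groups())
    return parsed

raw = [
    " (Paris, is the capital of, France)",
    "(Eiffel Tower, is in, Paris)",
    "NONE",
]
parsed = parse_triple_strings(raw)
print(parsed)
```

Skipping unparseable items quietly is a design choice suited to a demo; in a larger pipeline you may prefer to log them so that prompt problems stay visible.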

To build and visualise a knowledge graph from the list of relation triplets, we define two functions: one creates a NetworkX graph from the triplets, and the other converts that graph into a PyVis network. We first clean triples_list to obtain a list of non-empty triplets, then pass it through both functions. Finally, we adjust the graph’s visual appearance by hiding edges on drag, disabling physics, and setting edge smoothing to “discrete”.

The result is an interactive HTML file called knowledge_graph.html containing the knowledge graph visualisation built from the extracted relation triplets:

from pyvis.network import Network
import networkx as nx

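# Create a NetworkX graph from the extracted relation triplets
def create_graph_from_triplets(triplets):
    G = nx.DiGraph()
    for triplet in triplets:
        # Drop the surrounding parentheses before splitting into parts
        parts = [p.strip() for p in triplet.strip().strip('()').split(',')]
        if len(parts) != 3:
            continue  # skip malformed triplets (e.g. commas inside an entity)
        subject, predicate, obj = parts
        G.add_edge(subject, obj, label=predicate)
    return G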

# Convert the NetworkX graph to a PyVis network
def nx_to_pyvis(networkx_graph):
    pyvis_graph = Network(notebook=True, cdn_resources='remote')
    for node in networkx_graph.nodes():
        pyvis_graph.add_node(node)
    for edge in networkx_graph.edges(data=True):
        pyvis_graph.add_edge(edge[0], edge[1], label=edge[2]["label"])
    return pyvis_graph

triplets = [t.strip() for t in triples_list if t.strip()]
graph = create_graph_from_triplets(triplets)
pyvis_network = nx_to_pyvis(graph)

# Customize the appearance of the graph
pyvis_network.toggle_hide_edges_on_drag(True)
pyvis_network.toggle_physics(False)
pyvis_network.set_edge_smooth('discrete')

# Show the interactive knowledge graph visualization
pyvis_network.show("knowledge_graph.html")
knowledge_graph.html
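Beyond visualisation, the same triplets can be explored programmatically. The following is a minimal sketch, using only the standard library, of a breadth-first search that finds how two entities are connected. The helper name find_path is hypothetical, and the triples are copied by hand from the output shown earlier.

```python
from collections import deque

# Triples copied from the extraction output shown earlier in the post.
triples = [
    ("Paris", "is the capital of", "France"),
    ("Paris", "is the most populous city of", "France"),
    ("Paris", "has", "Eiffel Tower"),
    ("Eiffel Tower", "is a", "landmark"),
    ("Eiffel Tower", "is in", "Paris"),
]

def find_path(triples, start, goal):
    """Breadth-first search over directed edges.

    Returns the list of triples linking start to goal, or None if no
    directed path exists.
    """
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for subject, predicate, obj in triples:
            if subject == node and obj not in visited:
                visited.add(obj)
                queue.append((obj, path + [(subject, predicate, obj)]))
    return None

path = find_path(triples, "Eiffel Tower", "France")
print(path)
```

Here the search links the Eiffel Tower to France through Paris, chaining two facts that were extracted separately.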

7 Conclusion

In this post, we’ve shown a simple yet effective method for generating knowledge graphs from textual input. To make complex information more accessible and understandable, we turned unstructured text into a structured network of entities and their relationships.

It is worth pointing out that LangChain provides the GraphIndexCreator class, which automates the extraction of relation triplets and integrates neatly with question-answering chains. Future articles will go into greater detail about this useful feature and demonstrate how it can improve your ability to create and analyse knowledge graphs.

As well as being a useful tool for visualising intricate relationships, the knowledge graph produced by this approach opens the door to further investigation, pattern identification, and data-driven decision-making.

Further reading:

https://medium.com/nlplanet/building-a-knowledge-base-from-texts-a-full-practical-example-8dbbffb912fa

https://apex974.com/articles/explore-langchain-support-for-knowledge-graph

8 Acknowledgements

I’d like to express my thanks to the wonderful LangChain & Vector Databases in Production Course by Activeloop, which I completed, and acknowledge the use of some images and other materials from the course in this article.
