An Improved News Articles Summarizer

Our goal in this post is to improve a news summariser’s ability to extract the most important information from lengthy news items and display it in an easy-to-read bulleted list format.
natural-language-processing
deep-learning
langchain
activeloop
openai
prompt-engineering
Author

Pranath Fernando

Published

August 5, 2023

1 Introduction

This article aims to improve our earlier News Article Summarizer implementation. Our goal is to improve our tool’s ability to extract the most important information from lengthy news items and display it in an easy-to-read, bulleted list format. With this improvement, consumers will be able to quickly and clearly understand the essential ideas of an article, saving time and improving the reading experience.

To do this, we will change our current summarizer so that it instructs the underlying language model to produce summaries as bulleted lists. This requires a few adjustments to the prompt we give the model, and the workflow below walks through them.

2 Workflow for Building a News Articles Summarizer with Bulleted Lists

This is what we are going to do in this project.

First, we set up the environment and retrieve the news article.

  1. Install required libraries: The first step is to ensure that the necessary libraries, namely requests, newspaper3k, and LangChain, are installed.
  2. Scrape articles: We will use the requests library to scrape the content of the target news articles from their respective URLs.
  3. Extract titles and text: The newspaper library will be used to parse the scraped HTML, extracting the titles and text of the articles.
  4. Preprocess the text: The extracted texts need to be cleaned and preprocessed to make them suitable as input to an LLM.

The rest of the post will explore new possibilities to enhance the application’s performance further.

  1. Use Few-Shot Learning Technique: We use the few-shot learning technique in this step. The prompt template provides the language model with a few examples that guide it in generating summaries in the desired format: a bulleted list.
  2. Generate summaries: With the modified prompt, we utilize the model to generate concise summaries of the extracted articles’ text in the desired format.
  3. Use the Output Parsers: We employ the Output Parsers to interpret the output from the language model, ensuring it aligns with our desired structure and format.
  4. Output the results: Finally, we present the bulleted summaries along with the original titles, enabling users to quickly grasp the main points of each article in a structured manner.

By following these instructions, you may create a robust programme that can summarise news items into digestible, bulleted summaries while also using OutputParsers to arrange the output according to a specified data structure and the FewShotLearning technique for increased precision. Let’s get started!

Technically, the first phases of the procedure are the same as in part 1 of this tutorial. Remember to install the necessary packages with the following command: pip install deeplake openai tiktoken langchain==0.0.208. Also install the newspaper3k package, which was tested here with version 0.2.8.

3 Import Libs & Setup

from dotenv import load_dotenv

!echo "OPENAI_API_KEY='<OPENAI_API_KEY>'" > .env

load_dotenv()
True

To create a summary, we use the URL of a news story. The code that follows uses a customised User-Agent header together with the requests library to fetch the article from its URL. The title and content of the article are then extracted using the newspaper library.

import requests
from newspaper import Article

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36'
}

article_url = "https://www.artificialintelligence-news.com/2022/01/25/meta-claims-new-ai-supercomputer-will-set-records/"

session = requests.Session()


try:
    # fetch first with our custom headers to check the page is reachable
    response = session.get(article_url, headers=headers, timeout=10)

    if response.status_code == 200:
        # newspaper downloads and parses the article itself
        article = Article(article_url)
        article.download()
        article.parse()

        print(f"Title: {article.title}")
        print(f"Text: {article.text}")
    else:
        print(f"Failed to fetch article at {article_url}")
except Exception as e:
    print(f"Error occurred while fetching article at {article_url}: {e}")
Title: Meta claims its new AI supercomputer will set records
Text: Ryan is a senior editor at TechForge Media with over a decade of experience covering the latest technology and interviewing leading industry figures. He can often be sighted at tech conferences with a strong coffee in one hand and a laptop in the other. If it's geeky, he’s probably into it. Find him on Twitter (@Gadget_Ry) or Mastodon (@gadgetry@techhub.social)

Meta (formerly Facebook) has unveiled an AI supercomputer that it claims will be the world’s fastest.

The supercomputer is called the AI Research SuperCluster (RSC) and is yet to be fully complete. However, Meta’s researchers have already begun using it for training large natural language processing (NLP) and computer vision models.

RSC is set to be fully built in mid-2022. Meta says that it will be the fastest in the world once complete and the aim is for it to be capable of training models with trillions of parameters.

“We hope RSC will help us build entirely new AI systems that can, for example, power real-time voice translations to large groups of people, each speaking a different language, so they can seamlessly collaborate on a research project or play an AR game together,” wrote Meta in a blog post.

“Ultimately, the work done with RSC will pave the way toward building technologies for the next major computing platform — the metaverse, where AI-driven applications and products will play an important role.”

For production, Meta expects RSC will be 20x faster than Meta’s current V100-based clusters. RSC is also estimated to be 9x faster at running the NVIDIA Collective Communication Library (NCCL) and 3x faster at training large-scale NLP workflows.

A model with tens of billions of parameters can finish training in three weeks compared with nine weeks prior to RSC.

Meta says that its previous AI research infrastructure only leveraged open source and other publicly-available datasets. RSC was designed with the security and privacy controls in mind to allow Meta to use real-world examples from its production systems in production training.

What this means in practice is that Meta can use RSC to advance research for vital tasks such as identifying harmful content on its platforms—using real data from them.

“We believe this is the first time performance, reliability, security, and privacy have been tackled at such a scale,” says Meta.

(Image Credit: Meta)

Want to learn more about AI and big data from industry leaders? Check out AI & Big Data Expo. The next events in the series will be held in Santa Clara on 11-12 May 2022, Amsterdam on 20-21 September 2022, and London on 1-2 December 2022.

Explore other upcoming enterprise technology events and webinars powered by TechForge here.
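The workflow lists a preprocessing step, but the code above passes article.text along unchanged. The following is a minimal sketch of what that step could look like, assuming that whitespace normalisation and a length cap are enough for this use case. The names clean_article_text and max_chars are hypothetical; they are not part of the original code or of newspaper3k.

```python
import re


def clean_article_text(text: str, max_chars: int = 12000) -> str:
    """Collapse whitespace and cap the length (hypothetical helper)."""
    # Collapse runs of whitespace inside each line and trim the ends
    paragraphs = [re.sub(r"\s+", " ", p).strip() for p in text.split("\n")]
    # Drop lines that are empty after trimming
    paragraphs = [p for p in paragraphs if p]
    cleaned = "\n\n".join(paragraphs)
    # If the text is too long, cut it and fall back to the last paragraph break
    if len(cleaned) > max_chars:
        cleaned = cleaned[:max_chars].rsplit("\n\n", 1)[0]
    return cleaned
```

With this in place, article_text could be set to clean_article_text(article.text) before it is inserted into a prompt. Removing site boilerplate such as the author bio or event adverts would need site-specific rules and is left out here.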

4 Few Shot Prompting

We learned how to utilise FewShotPromptTemplate in the previous posts; now, let’s explore another way of adding examples to a prompt that is slightly different but achieves the same effect. In this experiment, we provide a number of examples that steer the model towards summarising as bullet lists. As a result, we expect the model to return a bulleted list that summarises the provided article.

from langchain.schema import (
    HumanMessage
)

# we get the article data from the scraping part
article_title = article.title
article_text = article.text

# prepare template for prompt
template = """
As an advanced AI, you've been tasked to summarize online articles into bulleted points. Here are a few examples of how you've done this in the past:

Example 1:
Original Article: 'The Effects of Climate Change'
Summary:
- Climate change is causing a rise in global temperatures.
- This leads to melting ice caps and rising sea levels.
- Resulting in more frequent and severe weather conditions.

Example 2:
Original Article: 'The Evolution of Artificial Intelligence'
Summary:
- Artificial Intelligence (AI) has developed significantly over the past decade.
- AI is now used in multiple fields such as healthcare, finance, and transportation.
- The future of AI is promising but requires careful regulation.

Now, here's the article you need to summarize:

==================
Title: {article_title}

{article_text}
==================

Please provide a summarized version of the article in a bulleted list format.
"""

# format prompt
prompt = template.format(article_title=article.title, article_text=article.text)

messages = [HumanMessage(content=prompt)]

These examples help the model comprehend the type of responses we desire from it. Here are a few significant elements:

  • Article data: The title and text of the article are obtained, which will be used as inputs to the model.

  • Template preparation: A template is prepared for the prompt. The template follows a few-shot style: the model is shown examples of how articles have previously been converted into a bulleted list format. The template also includes placeholders for the actual article title and text that will be summarized. These placeholders ({article_title} and {article_text}) are then replaced with the actual title and text of the article using the .format() method.
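When the number of examples grows, writing them into the template by hand becomes tedious. The sketch below assembles the same kind of prompt from a list of examples with plain Python string handling. It is an illustration only: EXAMPLES and build_few_shot_prompt are hypothetical names, and the wording is copied from the template above.

```python
# Hypothetical example store; the entries mirror the examples in the template above
EXAMPLES = [
    {
        "title": "The Effects of Climate Change",
        "bullets": [
            "Climate change is causing a rise in global temperatures.",
            "This leads to melting ice caps and rising sea levels.",
            "Resulting in more frequent and severe weather conditions.",
        ],
    },
    {
        "title": "The Evolution of Artificial Intelligence",
        "bullets": [
            "Artificial Intelligence (AI) has developed significantly over the past decade.",
            "AI is now used in multiple fields such as healthcare, finance, and transportation.",
            "The future of AI is promising but requires careful regulation.",
        ],
    },
]


def build_few_shot_prompt(examples, article_title, article_text):
    """Assemble a few-shot summarisation prompt from a list of examples."""
    parts = [
        "As an advanced AI, you've been tasked to summarize online articles "
        "into bulleted points. Here are a few examples of how you've done "
        "this in the past:",
        "",
    ]
    # One numbered block per example: title, then its bullet points
    for i, ex in enumerate(examples, start=1):
        parts.append(f"Example {i}:")
        parts.append(f"Original Article: '{ex['title']}'")
        parts.append("Summary:")
        parts.extend(f"- {bullet}" for bullet in ex["bullets"])
        parts.append("")
    # The article to summarise goes last, followed by the instruction
    parts += [
        "Now, here's the article you need to summarize:",
        "",
        "==================",
        f"Title: {article_title}",
        "",
        article_text,
        "==================",
        "",
        "Please provide a summarized version of the article in a bulleted list format.",
    ]
    return "\n".join(parts)


demo_prompt = build_few_shot_prompt(EXAMPLES, "Some title", "Some article text.")
print(demo_prompt)
```

The resulting string can be wrapped in a HumanMessage in the same way as the hand-written prompt.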

The GPT-4 model is then loaded using the ChatOpenAI class to generate the summary. The prepared prompt is passed to the model wrapped in a HumanMessage, because the chat instance of the ChatOpenAI class accepts a list of messages, such as HumanMessage objects, as its input.

from langchain.chat_models import ChatOpenAI

# load the model
chat = ChatOpenAI(model_name="gpt-4", temperature=0)
# generate summary
summary = chat(messages)
print(summary.content)
- Meta (formerly Facebook) has unveiled an AI supercomputer called the AI Research SuperCluster (RSC).
- The RSC is yet to be fully complete but is already being used for training large natural language processing (NLP) and computer vision models.
- Meta claims that the RSC will be the fastest in the world once complete and capable of training models with trillions of parameters.
- The aim is for the RSC to help build entirely new AI systems that can power real-time voice translations to large groups of people.
- Meta expects the RSC to be 20x faster than its current V100-based clusters for production.
- The RSC is estimated to be 9x faster at running the NVIDIA Collective Communication Library (NCCL) and 3x faster at training large-scale NLP workflows.
- Meta says that its previous AI research infrastructure only leveraged open source and other publicly-available datasets.
- RSC was designed with security and privacy controls in mind to allow Meta to use real-world examples from its production systems in production training.
- Meta can use RSC to advance research for vital tasks such as identifying harmful content on its platforms using real data from them.

The utilisation of a few-shot learning approach in the prompt is the main takeaway from this attempt. This gives the model examples of how to carry out the task, directing it to create a bulleted list that summarises the article. You can alter the output of the model to satisfy different needs and make sure it adheres to a specific format, tone, style, etc. by changing the prompt and the examples.

5 Output Parsers

Let’s now advance by utilising output parsers. The LangChain Pydantic output parser provides a flexible mechanism to shape language model outputs in accordance with pre-defined schemas. It allows for more structured interactions with language models and makes it simpler to extract and use the data the model provides when used in conjunction with prompt templates.

Our parser’s format instructions are included in the prompt template, which directs the language model to generate the output in the appropriate format. The goal is to show how, rather than receiving the output as a string, you can use the PydanticOutputParser class to receive it as a List that contains each bullet point. The benefit of a list is that you can loop through the results or index a particular item.
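To see what the parser saves us from, here is a hand-rolled sketch that turns a bulleted string, like the one returned in the few-shot section, into a Python list. The helper bullets_from_text and the sample string are hypothetical, and the regular expression assumes each bullet starts with "-", "•" or "*".

```python
import re
from typing import List

# A bullet line: optional indent, a bullet marker, whitespace, then the text
BULLET = re.compile(r"^\s*[-•*]\s+(.*\S)\s*$")


def bullets_from_text(text: str) -> List[str]:
    """Return the text of every bullet line, ignoring all other lines."""
    items = []
    for line in text.splitlines():
        match = BULLET.match(line)
        if match:
            items.append(match.group(1))
    return items


raw = "- First point.\n- Second point.\n- Third point."
points = bullets_from_text(raw)

# With a list we can loop over the points or pick one by index
for point in points:
    print(point)
print(points[0])
```

This works only as long as the model keeps to the bullet format, which is the kind of fragility that a schema-based parser is meant to reduce.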

As previously indicated, the PydanticOutputParser wrapper is used to create a parser that converts the output from a string into a data structure. The model’s output will be parsed into the custom ArticleSummary class, which derives from BaseModel in the Pydantic package.

Using the Field object, we define a schema with a title variable and a summary variable that holds a list of strings. The description argument explains what each variable represents, which helps the model fill it in correctly. Additionally, our class includes a validator function to guarantee that the output contains at least three bullet points.

from langchain.output_parsers import PydanticOutputParser
from pydantic import validator
from pydantic import BaseModel, Field
from typing import List


# create output parser class
class ArticleSummary(BaseModel):
    title: str = Field(description="Title of the article")
    summary: List[str] = Field(description="Bulleted list summary of the article")

    # validating whether the generated summary has at least three lines
    @validator('summary')
    def has_three_or_more_lines(cls, list_of_lines):
        if len(list_of_lines) < 3:
            raise ValueError("Generated summary has less than three bullet points!")
        return list_of_lines

# set up output parser
parser = PydanticOutputParser(pydantic_object=ArticleSummary)

The next step is to design a template for the input prompt that tells the language model how to summarise the news story. This template is used to create a PromptTemplate object, which takes care of formatting the prompts sent to the language model. The parser’s .get_format_instructions() method supplies extra instructions on how the output should be structured; we pass them to the PromptTemplate through its partial_variables parameter so that they are included in every formatted prompt.

from langchain.prompts import PromptTemplate


# create prompt template
# notice that we are specifying the "partial_variables" parameter
template = """
You are a very good assistant that summarizes online articles.

Here's the article you want to summarize.

==================
Title: {article_title}

{article_text}
==================

{format_instructions}
"""

prompt = PromptTemplate(
    template=template,
    input_variables=["article_title", "article_text"],
    partial_variables={"format_instructions": parser.get_format_instructions()}
)

# Format the prompt using the article title and text obtained from scraping
formatted_prompt = prompt.format_prompt(article_title=article_title, article_text=article_text)
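The idea behind partial_variables can be illustrated without LangChain: one placeholder is filled in ahead of time, and the remaining ones are supplied later. The sketch below uses functools.partial with str.format as a stand-in. It is not how PromptTemplate is implemented, and the names demo_template, fill_prompt and the instruction text are assumptions made for illustration.

```python
from functools import partial

demo_template = """Title: {article_title}

{article_text}

{format_instructions}"""

# Stand-in text; in the post the real value comes from parser.get_format_instructions()
format_instructions = 'Reply with JSON such as {"title": "...", "summary": ["..."]}'

# Pre-fill one placeholder now; the others are provided at call time
fill_prompt = partial(demo_template.format, format_instructions=format_instructions)

demo = fill_prompt(article_title="A title", article_text="Some text.")
print(demo)
```

Braces inside the substituted value are left untouched by str.format, so JSON-like instructions pass through intact.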

Last but not least, the text-davinci-003 model is initialised with the temperature set to 0.0, which makes the output close to deterministic: the model favours the most likely tokens over variety or creativity. The parser object’s .parse() method then converts the model’s output text into the specified schema.

from langchain.llms import OpenAI


# instantiate model class
model = OpenAI(model_name="text-davinci-003", temperature=0.0)

# Use the model to generate a summary
output = model(formatted_prompt.to_string())

# Parse the output into the Pydantic model
parsed_output = parser.parse(output)
print(parsed_output)
parsed_output
ArticleSummary(title='Meta claims its new AI supercomputer will set records', summary=['Meta (formerly Facebook) has unveiled an AI supercomputer that it claims will be the world’s fastest.', 'The supercomputer is called the AI Research SuperCluster (RSC) and is yet to be fully complete.', 'Meta says that it will be the fastest in the world once complete and the aim is for it to be capable of training models with trillions of parameters.', 'For production, Meta expects RSC will be 20x faster than Meta’s current V100-based clusters.', 'Meta says that its previous AI research infrastructure only leveraged open source and other publicly-available datasets.', 'What this means in practice is that Meta can use RSC to advance research for vital tasks such as identifying harmful content on its platforms—using real data from them.'])
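Conceptually, the parsing step amounts to reading JSON out of the model’s reply and checking it against the schema. The sketch below shows that idea with the standard library only. It is a simplified stand-in, not LangChain’s or Pydantic’s actual implementation; SummaryRecord, parse_summary and the sample reply are hypothetical.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class SummaryRecord:
    """Hypothetical stand-in for the ArticleSummary model."""
    title: str
    summary: List[str]


def parse_summary(text: str) -> SummaryRecord:
    # Take the outermost {...} span in case the model adds text around the JSON
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("No JSON object found in model output")
    data = json.loads(text[start:end + 1])
    title, summary = data["title"], data["summary"]
    if not isinstance(title, str) or not isinstance(summary, list):
        raise ValueError("Unexpected field types in model output")
    # Same rule as the validator in ArticleSummary
    if len(summary) < 3:
        raise ValueError("Generated summary has less than three bullet points!")
    return SummaryRecord(title=title, summary=[str(item) for item in summary])


raw_output = 'Here you go: {"title": "T", "summary": ["a", "b", "c"]}'
record = parse_summary(raw_output)
print(record.title)
for bullet in record.summary:
    print("-", bullet)
```

A reply with fewer than three bullet points raises ValueError here, which mirrors the intent of the has_three_or_more_lines validator.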

The Pydantic output parser is a powerful way to shape and organise the output from language models. It defines and enforces data schemas for the model’s output using the Pydantic library, which is well known for its data validation capabilities.

Here’s an overview of what we did:

  • We created the ArticleSummary Pydantic data structure. This model acts as a guide for the structure that the generated article summary should have. It has a title field, which holds a string, and a summary field, which holds a list of strings that correspond to bullet points. To preserve a certain amount of depth in the summarization, the model also includes a validator that makes sure the summary has at least three points.
  • Next, we use our ArticleSummary class to create a parser object. This parser is essential in ensuring that the language model’s output adheres to the specified structures of our unique schema.
  • We develop the prompt template to control the language model’s output. The template tells the model to act as an assistant that summarises online articles, and the parser’s format instructions are injected into it so the model knows which structure to return.
  • So, output parsers make it simpler to extract useful information from model replies by allowing us to specify the intended format of the model’s output.
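The last step of the workflow, presenting the bulleted summary together with the original title, could be sketched as follows. The function render_summary is a hypothetical helper; it takes a title string and a list of bullet strings, such as the title and summary fields of the parsed output.

```python
from typing import List


def render_summary(title: str, bullets: List[str]) -> str:
    """Format a title and its bullet points as plain text (hypothetical helper)."""
    # Title, an underline of matching length, then one line per bullet
    lines = [title, "=" * len(title)]
    lines += [f"- {bullet}" for bullet in bullets]
    return "\n".join(lines)


print(render_summary(
    "Meta claims its new AI supercomputer will set records",
    [
        "Meta has unveiled an AI supercomputer called the AI Research SuperCluster (RSC).",
        "RSC is not yet fully built but is already used for training large models.",
        "Meta expects RSC to be 20x faster than its current V100-based clusters.",
    ],
))
```

With the parsed output from the previous section, the call would be render_summary(parsed_output.title, parsed_output.summary).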

6 Conclusion

In this post, we’ve demonstrated the possibilities of prompt handling in LangChain by building our News Articles Summarizer with PromptTemplates and OutputParsers. The Pydantic output parser is a powerful way to shape and organise the output from language models: it defines and enforces data schemas for the model’s output using the Pydantic library, which is well known for its data validation capabilities.

We defined the Pydantic model ArticleSummary as a guide for the structure of the generated summary: a title string and a summary list of strings, one per bullet point, with a validator that requires at least three points to preserve some depth in the summarization.

We then passed the ArticleSummary model to a PydanticOutputParser, which ensures that the language model’s output adheres to the structure that ArticleSummary describes.

If you have a solid grasp of the subtleties involved in prompt and output design, you can adapt the model to deliver outcomes that address your unique needs.

7 Acknowledgements

I’d like to express my thanks to the wonderful LangChain & Vector Databases in Production Course by Activeloop, which I completed, and acknowledge the use of some images and other materials from the course in this article.
