A News Article Summariser with OpenAI and Langchain

In this project we create a News Articles Summarizer application utilising ChatGPT and LangChain to help save time staying current on news and information in the fast-paced world of today
natural-language-processing
deep-learning
langchain
activeloop
openai
prompt-engineering
Author

Pranath Fernando

Published

July 29, 2023

1 Introduction

It’s critical to stay current on news and information in the fast-paced world of today. However, reading through numerous news pieces might take time. Let’s create a News Articles Summarizer application utilising ChatGPT and LangChain to help you save time and receive a brief overview of the key points. We can scrape articles from the internet, extract their titles and text, and produce succinct summaries using this robust programme. We will walk you through the process of creating a summarizer in this lecture. We will apply the ideas we covered in past sessions, showing how they would be used in a practical situation.

We will do this in the following steps:

  1. Installing necessary libraries: Before beginning, make sure you have requests, newspaper3k, and langchain installed.
  2. Scrape content: Utilise the requests library to scrape the targeted news articles’ content from the corresponding URLs.
  3. Titles and text of excerpts: Use the newspaper library to parse the HTML that was scraped and extract the article titles and text.
  4. Pre-process the text: To prepare the retrieved texts for input into ChatGPT, clean and preprocess them.
  5. Produce summaries: Use ChatGPT to condense the text of the extracted articles.
  6. Publish the results: Present the summaries with the original article titles to help readers immediately understand each piece’s important ideas.

By using this process, we may develop a useful News Articles Summarizer that makes use of ChatGPT to deliver insightful information in a time-saving manner. Stay informed and benefit from AI-powered summaries without having to spend hours going through large articles.

2 Import Libs & Setup

from dotenv import load_dotenv

!echo "OPENAI_API_KEY='<OPENAI_API_KEY>'" > .env

load_dotenv()
True

3 Get Articles

Using the requests library and a unique User-Agent header, the following code retrieves articles from a list of URLs. Using the newspaper collection, it then extracts the title and text of each article.

import requests
from newspaper import Article

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36'
}

article_urls = "https://www.artificialintelligence-news.com/2022/01/25/meta-claims-new-ai-supercomputer-will-set-records/"

session = requests.Session()

try:
    response = session.get(article_urls, headers=headers, timeout=10)

    if response.status_code == 200:
        article = Article(article_urls)
        article.download()
        article.parse()

        print(f"Title: {article.title}")
        print(f"Text: {article.text}")

    else:
        print(f"Failed to fetch article at {article_urls}")
except Exception as e:
    print(f"Error occurred while fetching article at {article_urls}: {e}")
Title: Meta claims its new AI supercomputer will set records
Text: Ryan is a senior editor at TechForge Media with over a decade of experience covering the latest technology and interviewing leading industry figures. He can often be sighted at tech conferences with a strong coffee in one hand and a laptop in the other. If it's geeky, he’s probably into it. Find him on Twitter (@Gadget_Ry) or Mastodon (@gadgetry@techhub.social)

Meta (formerly Facebook) has unveiled an AI supercomputer that it claims will be the world’s fastest.

The supercomputer is called the AI Research SuperCluster (RSC) and is yet to be fully complete. However, Meta’s researchers have already begun using it for training large natural language processing (NLP) and computer vision models.

RSC is set to be fully built in mid-2022. Meta says that it will be the fastest in the world once complete and the aim is for it to be capable of training models with trillions of parameters.

“We hope RSC will help us build entirely new AI systems that can, for example, power real-time voice translations to large groups of people, each speaking a different language, so they can seamlessly collaborate on a research project or play an AR game together,” wrote Meta in a blog post.

“Ultimately, the work done with RSC will pave the way toward building technologies for the next major computing platform — the metaverse, where AI-driven applications and products will play an important role.”

For production, Meta expects RSC will be 20x faster than Meta’s current V100-based clusters. RSC is also estimated to be 9x faster at running the NVIDIA Collective Communication Library (NCCL) and 3x faster at training large-scale NLP workflows.

A model with tens of billions of parameters can finish training in three weeks compared with nine weeks prior to RSC.

Meta says that its previous AI research infrastructure only leveraged open source and other publicly-available datasets. RSC was designed with the security and privacy controls in mind to allow Meta to use real-world examples from its production systems in production training.

What this means in practice is that Meta can use RSC to advance research for vital tasks such as identifying harmful content on its platforms—using real data from them.

“We believe this is the first time performance, reliability, security, and privacy have been tackled at such a scale,” says Meta.

(Image Credit: Meta)

Want to learn more about AI and big data from industry leaders? Check out AI & Big Data Expo. The next events in the series will be held in Santa Clara on 11-12 May 2022, Amsterdam on 20-21 September 2022, and London on 1-2 December 2022.

Explore other upcoming enterprise technology events and webinars powered by TechForge here.

4 Set up OpenAI and Create Summaries

The next code creates a ChatOpenAI instance with a temperature of 0 for regulated response generation and imports crucial classes and functions from the LangChain. Additionally, it imports message schema classes for chat that facilitate the efficient handling of chat-based tasks. The prompt will be set and filled with the content of the article at the beginning of the next block of code.

from langchain.schema import (
    HumanMessage
)

# we get the article data from the scraping part
article_title = article.title
article_text = article.text

# prepare template for prompt
template = """You are a very good assistant that summarizes online articles.

Here's the article you want to summarize.

==================
Title: {article_title}

{article_text}
==================

Write a summary of the previous article.
"""

prompt = template.format(article_title=article.title, article_text=article.text)

messages = [HumanMessage(content=prompt)]

Within the chat-based interaction architecture, the user messages are represented by the HumanMessage, a structured data format. The AI model is communicated with using the ChatOpenAI class, and user communications are represented uniformly by the HumanMessage schema. The article’s title and content are placeholders in the template; the actual article_title and article_text will be used in their stead. By enabling you to establish a template with placeholders and then replace them with actual data as needed, this technique speeds and simplifies the construction of dynamic prompts.

from langchain.chat_models import ChatOpenAI

# load the model
chat = ChatOpenAI(model_name="gpt-4", temperature=0)

As soon as the model was loaded, the temperature was set to 0. The formatted prompt would be passed as a single HumanMessage object to the chat() instance, which would then provide a summary. This prompt is processed by the AI model, which provides a succinct summary:

# generate summary
summary = chat(messages)
print(summary.content)
Meta, formerly known as Facebook, has announced the development of an AI supercomputer called the AI Research SuperCluster (RSC). The supercomputer is expected to be completed by mid-2022 and aims to be the world's fastest, capable of training models with trillions of parameters. Meta's researchers are already using RSC for training large natural language processing and computer vision models. The company hopes that RSC will help build new AI systems for real-time voice translations and contribute to the development of the metaverse. RSC is designed with security and privacy controls, allowing Meta to use real-world examples from its production systems for training.

If we want a bulleted list, we can modify a prompt and get the result.

# prepare template for prompt
template = """You are an advanced AI assistant that summarizes online articles into bulleted lists.

Here's the article you need to summarize.

==================
Title: {article_title}

{article_text}
==================

Now, provide a summarized version of the article in a bulleted list format.
"""

# format prompt
prompt = template.format(article_title=article.title, article_text=article.text)

# generate summary
summary = chat([HumanMessage(content=prompt)])
print(summary.content)
- Meta (formerly Facebook) unveils AI Research SuperCluster (RSC), an AI supercomputer.
- RSC is claimed to be the world's fastest once fully built in mid-2022.
- Researchers have already started using RSC for training large NLP and computer vision models.
- The supercomputer aims to train models with trillions of parameters.
- RSC will help build AI systems for real-time voice translations and metaverse applications.
- Meta expects RSC to be 20x faster than its current V100-based clusters.
- RSC is designed with security and privacy controls to use real-world examples from Meta's production systems.
- The supercomputer will advance research for tasks like identifying harmful content on Meta's platforms.

You can tell the model to create the summary in French if you wish to get it in that language. Please be aware, nevertheless, that while GPT-4 is multilingual, English is its primary training language, and the quality may differ for languages other than English. The prompt can be changed as seen above.

# prepare template for prompt
template = """You are an advanced AI assistant that summarizes online articles into bulleted lists in French.

Here's the article you need to summarize.

==================
Title: {article_title}

{article_text}
==================

Now, provide a summarized version of the article in a bulleted list format, in French.
"""

# format prompt
prompt = template.format(article_title=article.title, article_text=article.text)

# generate summary
summary = chat([HumanMessage(content=prompt)])
print(summary.content)
- Meta (anciennement Facebook) dévoile un superordinateur IA qu'elle prétend être le plus rapide au monde.
- Le superordinateur s'appelle AI Research SuperCluster (RSC) et n'est pas encore totalement achevé.
- Les chercheurs de Meta l'utilisent déjà pour entraîner de grands modèles de traitement du langage naturel (NLP) et de vision par ordinateur.
- RSC devrait être entièrement construit d'ici mi-2022 et visera à entraîner des modèles avec des billions de paramètres.
- Meta espère que RSC permettra de créer de nouveaux systèmes d'IA pour des applications telles que la traduction vocale en temps réel pour des groupes de personnes parlant différentes langues.
- Pour la production, RSC devrait être 20 fois plus rapide que les clusters actuels de Meta basés sur V100.
- RSC est également estimé être 9 fois plus rapide pour exécuter la bibliothèque de communication collective NVIDIA (NCCL) et 3 fois plus rapide pour entraîner des flux de travail NLP à grande échelle.
- Un modèle avec des dizaines de milliards de paramètres peut terminer sa formation en trois semaines avec RSC, contre neuf semaines auparavant.
- Meta affirme que RSC a été conçu avec la sécurité et la confidentialité à l'esprit pour permettre d'utiliser des exemples réels de ses systèmes de production dans la formation.
- Cela signifie que Meta peut utiliser RSC pour faire progresser la recherche sur des tâches essentielles, telles que l'identification de contenus nuisibles sur ses plateformes, en utilisant des données réelles provenant de celles-ci.

5 Conclusion

In conclusion, we’ve shown how to use ChatGPT and LangChain’s features to build a powerful News Articles Summarizer. By collecting and condensing important information from a variety of publications into understandable, AI-generated summaries, this powerful tool makes it easier to stay informed. By turning these summaries into bulleted lists, the process has been improved and made easier to read and understand.

We’ve also broadened the scope of our summarizer to deliver summaries in many languages, with French serving as our model in response to the demands of a multilingual audience. This demonstrates the capability of our product to serve a varied, international audience.

The approach we’ve detailed is the core of this post; it’s a step-by-step tutorial that gives you the power to create your own summarizer. By doing this, you may speed up the process of consuming information, save a lot of time, and keep up with current events.

Additionally, we have studied the subtleties of prompt creation. The model must comprehend the objective, which in our instance included summarising an article into a list of bullet points and doing so in a different language, so a well-written prompt is essential. You can further modify the model to provide outcomes that meet your specific needs by understanding the subtleties of prompt design.

6 Acknowledgements

I’d like to express my thanks to the wonderful LangChain & Vector Databases in Production Course by Activeloop - which i completed, and acknowledge the use of some images and other materials from the course in this article.

Subscribe