US Patent Phrase to Phrase Matching

In this project I will create a model that can score how similar two short text phrases are in the context of a US patent classification.
fastai
fastai-2022
hugging-face
deep-learning
natural-language-processing
Author

Pranath Fernando

Published

December 10, 2022

1 Introduction

In this series of articles I will be revisiting the fastai Practical Deep Learning for Coders course for 2022, which I have completed in previous years. This article covers lesson 4 of this year’s course, which I will use to create a model that can score how similar two short phrases are in the context of a US patent classification.

While this is based on a fastai training course, in this particular project we will not actually be using the fastai library. Instead we will use the Hugging Face Transformers library, a Python library of state-of-the-art deep learning models, including the very powerful transformer architecture behind so many of the recent advances in AI. (fastai does also integrate transformer models.)

First we will import the required libraries.

2 Import Libraries

import pandas as pd
import numpy as np
from datasets import Dataset,DatasetDict
import datasets
from transformers import AutoModelForSequenceClassification,AutoTokenizer,TrainingArguments,Trainer

3 The Project: US Patent Phrase to Phrase Matching

The U.S. Patent and Trademark Office (USPTO) offers one of the largest repositories of scientific, technical, and commercial information in the world through its Open Data Portal. Patents are a form of intellectual property granted in exchange for the public disclosure of new and useful inventions. Because patents undergo an intensive vetting process prior to grant, and because the history of U.S. innovation spans over two centuries and 11 million patents, the U.S. patent archives stand as a rare combination of data volume, quality, and diversity.

In this project, I will train a model on a novel semantic similarity dataset to extract relevant information by matching key phrases in patent documents. Determining the semantic similarity between phrases is critically important during the patent search and examination process to determine if an invention has been described before.

For example, if one invention claims “television set” and a prior publication describes “TV set”, a model would ideally recognize these are the same and assist a patent attorney or examiner in retrieving relevant documents. This extends beyond paraphrase identification; if one invention claims a “strong material” and another uses “steel”, that may also be a match. What counts as a “strong material” varies per domain (it may be steel in one domain and ripstop fabric in another, but you wouldn’t want your parachute made of steel).

We will seek to build a model to match phrases in order to extract contextual information, which could help the patent community connect the dots between millions of patent documents.

Specifically, we will be comparing two words or short phrases and scoring how similar they are, taking into account which patent class they were used in. A score of 1 means the two inputs have identical meaning, and 0 means they have totally different meanings. For instance, abatement and eliminating process have a score of 0.5, meaning they’re somewhat similar but not identical.

It turns out that this can be represented as a classification problem. How? By representing the question like this:

For the following text…: “TEXT1: abatement; TEXT2: eliminating process” …choose a category of meaning similarity: “Different; Similar; Identical”.

In this project we’ll see how to solve the Patent Phrase Matching problem by treating it as a classification task, by representing it in a very similar way to that shown above.
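
To make this concrete, here is a purely illustrative sketch of how the continuous scores could be bucketed into those three categories with pandas (the actual pipeline below will feed the raw scores to the model instead):

# Purely illustrative: bucket the 0-1 similarity scores into three coarse categories
scores = pd.Series([0.0, 0.25, 0.5, 0.75, 1.0])
pd.cut(scores, bins=[-0.01, 0.33, 0.66, 1.0], labels=['Different', 'Similar', 'Identical'])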

The dataset comes from this Kaggle competition.

4 Get Data

Let’s first download and extract our data.

!unzip us-patent-phrase-to-phrase-matching.zip
!ls
Archive:  us-patent-phrase-to-phrase-matching.zip
  inflating: sample_submission.csv   
  inflating: test.csv                
  inflating: train.csv               
drive        sample_submission.csv  train.csv
sample_data  test.csv           us-patent-phrase-to-phrase-matching.zip
df = pd.read_csv('train.csv')
df.head()
id anchor target context score
0 37d61fd2272659b1 abatement abatement of pollution A47 0.50
1 7b9652b17b68b7a4 abatement act of abating A47 0.75
2 36d72442aefd8232 abatement active catalyst A47 0.25
3 5296b0c19e1ce60e abatement eliminating process A47 0.50
4 54c1e3b9184cb5b6 abatement forest region A47 0.00

The dataset description gives a clearer idea of what these different fields mean.

For example:

  • id - a unique identifier for a pair of phrases
  • anchor - the first phrase
  • target - the second phrase
  • context - the CPC classification (version 2021.05), which indicates the subject within which the similarity is to be scored
  • score - the similarity. This is sourced from a combination of one or more manual expert ratings.

Let’s generate some basic summary stats for each field.

df.describe(include='object')
id anchor target context
count 36473 36473 36473 36473
unique 36473 733 29340 106
top 37d61fd2272659b1 component composite coating composition H01
freq 1 152 24 2186

We can see that we have far fewer anchors than targets, and that some of these anchors are very common; for example, ‘component composite coating’ appears as the anchor in 152 rows.

It was suggested earlier that we could represent the input to the model as something like “TEXT1: abatement; TEXT2: eliminating process”. We’ll need to add the context to this too. In Pandas, we just use + to concatenate, like so:

df['input'] = 'TEXT1: ' + df.context + '; TEXT2: ' + df.target + '; ANC1: ' + df.anchor
df['input'].head()
0    TEXT1: A47; TEXT2: abatement of pollution; ANC...
1    TEXT1: A47; TEXT2: act of abating; ANC1: abate...
2    TEXT1: A47; TEXT2: active catalyst; ANC1: abat...
3    TEXT1: A47; TEXT2: eliminating process; ANC1: ...
4    TEXT1: A47; TEXT2: forest region; ANC1: abatement
Name: input, dtype: object

5 Text Data Transformation

Hugging Face uses the Dataset object (from its datasets library) to store data, so let’s create one for our data.

ds = Dataset.from_pandas(df)
ds
Dataset({
    features: ['id', 'anchor', 'target', 'context', 'score', 'input'],
    num_rows: 36473
})

So we have our text data, but there is a problem. Machine learning and AI models don’t actually understand text! They can only understand numbers. So we need a way to convert our text data into a numerical representation.

The branch of machine learning and AI concerned with understanding language is called Natural Language Processing, or NLP. In NLP we prepare text data for machine learning by converting it into numbers; two common steps are followed:

  • Tokenization: Split each text up into words (or actually, as we’ll see, into tokens)
  • Numericalization: Convert each word (or token) into a number.

The details of how this is done depend on the particular model we use, so first we’ll need to pick a model. There are thousands of models available, but a reasonable starting point for nearly any NLP problem is to use a smaller model first, then work up to a bigger model later.

Why? It’s true that in deep learning a larger model generally does better than a smaller one. However, a smaller model is quicker to train and experiment with repeatedly, which is better when we are just trying things out at the start and need to iterate rapidly, and it gives us a baseline we can expect to improve on with a bigger model.

We will use this small model.

model_nm = 'microsoft/deberta-v3-small'

AutoTokenizer will create a tokenizer appropriate for a given model:

tokz = AutoTokenizer.from_pretrained(model_nm)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
/usr/local/lib/python3.8/dist-packages/transformers/convert_slow_tokenizer.py:446: UserWarning: The sentencepiece tokenizer that you are converting to a fast tokenizer uses the byte fallback option which is not implemented in the fast tokenizers. In practice this means that the fast version of the tokenizer can produce unknown tokens whereas the sentencepiece version would have converted these unknown tokens into a sequence of byte tokens matching the original piece of text.
  warnings.warn(
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

Here’s an example of how the tokenizer splits a text into “tokens” (which are like words, but can be sub-word pieces, as you see below):

tokz.tokenize("Hi my name is Pranath !")
['▁Hi', '▁my', '▁name', '▁is', '▁Prana', 'th', '▁!']

Uncommon words will be split into pieces. The start of a new word is represented by ▁:

tokz.tokenize("A platypus is an ornithorhynchus anatinus.")
['▁A',
 '▁platypus',
 '▁is',
 '▁an',
 '▁or',
 'ni',
 'tho',
 'rhynch',
 'us',
 '▁an',
 'at',
 'inus',
 '.']

Here’s a simple function which tokenizes our inputs:

def tok_func(x): return tokz(x["input"])

tok_ds = ds.map(tok_func, batched=True)

This adds a new item to our dataset called input_ids. For instance, here is the input and IDs for the first row of our data:

row = tok_ds[0]
row['input'], row['input_ids']
('TEXT1: A47; TEXT2: abatement of pollution; ANC1: abatement',
 [1,
  54453,
  435,
  294,
  336,
  5753,
  346,
  54453,
  445,
  294,
  47284,
  265,
  6435,
  346,
  23702,
  435,
  294,
  47284,
  2])

So, what are those IDs and where do they come from? The secret is that the tokenizer has a vocabulary, vocab, which maps every possible token string to a unique integer. We can look tokens up like this, for instance to find the token for the word “of”:

tokz.vocab['▁of']
265

Looking above at our input IDs, we see that 265 appears as expected.
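
As a quick sanity check we can also go the other way, mapping the IDs back to their token strings with the tokenizer’s convert_ids_to_tokens method (a minimal sketch):

# Round-trip check: convert the numeric IDs back into token strings
tokz.convert_ids_to_tokens(row['input_ids'])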

Finally, we need to prepare our labels. Transformers expects your labels to be in a column named labels, but in our dataset it’s currently called score. Therefore, we need to rename it:

tok_ds = tok_ds.rename_columns({'score':'labels'})

Now that we’ve prepared our tokens and labels, we need to create our validation set.

6 Test and Validation Sets

You may have noticed that our directory contained another file for our test set.

eval_df = pd.read_csv('test.csv')
eval_df.describe()
id anchor target context
count 36 36 36 36
unique 36 34 36 29
top 4112d61851461f60 el display inorganic photoconductor drum G02
freq 1 2 1 3

Hugging Face uses a DatasetDict for holding your training and validation sets. To create one that contains 25% of our data for the validation set, and 75% for the training set, we use train_test_split:

dds = tok_ds.train_test_split(0.25, seed=42)
dds
DatasetDict({
    train: Dataset({
        features: ['id', 'anchor', 'target', 'context', 'labels', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 27354
    })
    test: Dataset({
        features: ['id', 'anchor', 'target', 'context', 'labels', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 9119
    })
})

As you see above, the validation set here is called test and not validate, so we need to be careful we don’t confuse ourselves with terminology!

We will use the separate test set at the end to check our predictions, whereas the validation set will be used during the model training to check our progress.
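
If the naming bothers you, a tiny optional sketch gives the validation split an explicit name (dds_named is our own variable, purely illustrative, and we won’t use it below):

# Optional: give the validation split a clearer name
dds_named = DatasetDict({'train': dds['train'], 'validation': dds['test']})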

We’ll use eval as our name for the test set, to avoid confusion with the test dataset that was created above.

eval_df['input'] = 'TEXT1: ' + eval_df.context + '; TEXT2: ' + eval_df.target + '; ANC1: ' + eval_df.anchor
eval_ds = Dataset.from_pandas(eval_df).map(tok_func, batched=True)

7 Model Training

To train our model we need to pick a batch size that fits our GPU, and a small number of epochs so we can run experiments quickly.

bs = 128
epochs = 4
lr = 8e-5

The most important hyperparameter for model training is the learning rate. fastai provides a learning rate finder to help you figure this out, but Hugging Face Transformers doesn’t, so we just have to use trial and error. The idea is to find the largest value you can that doesn’t cause training to fail.

We will also need to define some functions for our model metric, which is how we measure how well our model is performing. For this we will use Pearson’s correlation coefficient, which measures how well our predicted similarity scores correlate with the expert-labelled scores.

def corr(x,y): return np.corrcoef(x,y)[0][1]

def corr_d(eval_pred): return {'pearson': corr(*eval_pred)}

Transformers uses the TrainingArguments class to set up model training hyper-parameter arguments.

args = TrainingArguments('outputs', learning_rate=lr, warmup_ratio=0.1, lr_scheduler_type='cosine', fp16=True,
    evaluation_strategy="epoch", per_device_train_batch_size=bs, per_device_eval_batch_size=bs*2,
    num_train_epochs=epochs, weight_decay=0.01, report_to='none')

We can now create our model, and Trainer, which is a class which combines the data and model together (just like Learner in fastai):

model = AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=1)
trainer = Trainer(model, args, train_dataset=dds['train'], eval_dataset=dds['test'],
                  tokenizer=tokz, compute_metrics=corr_d)
Some weights of the model checkpoint at microsoft/deberta-v3-small were not used when initializing DebertaV2ForSequenceClassification: ['mask_predictions.dense.bias', 'mask_predictions.LayerNorm.bias', 'lm_predictions.lm_head.dense.bias', 'lm_predictions.lm_head.bias', 'lm_predictions.lm_head.LayerNorm.weight', 'lm_predictions.lm_head.dense.weight', 'mask_predictions.dense.weight', 'mask_predictions.classifier.bias', 'mask_predictions.LayerNorm.weight', 'mask_predictions.classifier.weight', 'lm_predictions.lm_head.LayerNorm.bias']
- This IS expected if you are initializing DebertaV2ForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaV2ForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-small and are newly initialized: ['pooler.dense.bias', 'classifier.weight', 'pooler.dense.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Using cuda_amp half precision backend
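
Before committing to a full run, the trial-and-error learning-rate search mentioned earlier could be done with a short sweep. This is a purely illustrative sketch (try_lr is our own helper, not part of the Transformers API), training for a single epoch per candidate rate and comparing the validation Pearson scores:

# Purely illustrative: a crude manual learning-rate sweep (not run in this notebook)
def try_lr(candidate_lr):
    sweep_args = TrainingArguments('outputs', learning_rate=candidate_lr, num_train_epochs=1,
        fp16=True, evaluation_strategy='epoch', per_device_train_batch_size=bs,
        per_device_eval_batch_size=bs*2, report_to='none')
    sweep_model = AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=1)
    sweep_trainer = Trainer(sweep_model, sweep_args, train_dataset=dds['train'],
        eval_dataset=dds['test'], tokenizer=tokz, compute_metrics=corr_d)
    sweep_trainer.train()
    return sweep_trainer.evaluate()['eval_pearson']

# for candidate in (1e-5, 3e-5, 8e-5, 2e-4): print(candidate, try_lr(candidate))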

Let’s train our model!

trainer.train();
The following columns in the training set don't have a corresponding argument in `DebertaV2ForSequenceClassification.forward` and have been ignored: input, context, anchor, target, id. If input, context, anchor, target, id are not expected by `DebertaV2ForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 27354
  Num Epochs = 4
  Instantaneous batch size per device = 128
  Total train batch size (w. parallel, distributed & accumulation) = 128
  Gradient Accumulation steps = 1
  Total optimization steps = 856
  Number of trainable parameters = 141895681
The following columns in the evaluation set don't have a corresponding argument in `DebertaV2ForSequenceClassification.forward` and have been ignored: input, context, anchor, target, id. If input, context, anchor, target, id are not expected by `DebertaV2ForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 9119
  Batch size = 256
The following columns in the evaluation set don't have a corresponding argument in `DebertaV2ForSequenceClassification.forward` and have been ignored: input, context, anchor, target, id. If input, context, anchor, target, id are not expected by `DebertaV2ForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 9119
  Batch size = 256
Saving model checkpoint to outputs/checkpoint-500
Configuration saved in outputs/checkpoint-500/config.json
Model weights saved in outputs/checkpoint-500/pytorch_model.bin
tokenizer config file saved in outputs/checkpoint-500/tokenizer_config.json
Special tokens file saved in outputs/checkpoint-500/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `DebertaV2ForSequenceClassification.forward` and have been ignored: input, context, anchor, target, id. If input, context, anchor, target, id are not expected by `DebertaV2ForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 9119
  Batch size = 256
The following columns in the evaluation set don't have a corresponding argument in `DebertaV2ForSequenceClassification.forward` and have been ignored: input, context, anchor, target, id. If input, context, anchor, target, id are not expected by `DebertaV2ForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 9119
  Batch size = 256


Training completed. Do not forget to share your model on huggingface.co/models =)

[856/856 03:39, Epoch 4/4]
Epoch Training Loss Validation Loss Pearson
1 No log 0.023299 0.827306
2 No log 0.022970 0.831413
3 0.014000 0.022094 0.831611
4 0.014000 0.022278 0.831688

Lots of warning messages from Transformers – we can ignore these.

The key thing to look at is the “Pearson” value in the table above. As we can see, it’s increasing and is already above 0.8. It looks like we have a model whose predicted similarity scores correlate strongly with the expert labels for these patent text phrases.

8 Generate Predictions for US Patent Phrases

Let’s get some predictions on the test set.

preds = trainer.predict(eval_ds).predictions.astype(float)
preds
The following columns in the test set don't have a corresponding argument in `DebertaV2ForSequenceClassification.forward` and have been ignored: input, context, anchor, target, id. If input, context, anchor, target, id are not expected by `DebertaV2ForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 36
  Batch size = 256
array([[ 5.01464844e-01],
       [ 6.09863281e-01],
       [ 6.35742188e-01],
       [ 2.67578125e-01],
       [-2.59160995e-04],
       [ 5.31738281e-01],
       [ 4.78515625e-01],
       [-4.77981567e-03],
       [ 2.24121094e-01],
       [ 1.07910156e+00],
       [ 2.25463867e-01],
       [ 2.15087891e-01],
       [ 7.56347656e-01],
       [ 8.77929688e-01],
       [ 7.44628906e-01],
       [ 3.58642578e-01],
       [ 2.76855469e-01],
       [-7.08770752e-03],
       [ 6.49414062e-01],
       [ 3.75488281e-01],
       [ 4.80468750e-01],
       [ 2.20336914e-01],
       [ 2.38159180e-01],
       [ 1.93481445e-01],
       [ 5.60546875e-01],
       [ 1.14746094e-02],
       [-7.29751587e-03],
       [-9.97924805e-03],
       [-8.94165039e-03],
       [ 6.04492188e-01],
       [ 3.15673828e-01],
       [ 1.96685791e-02],
       [ 7.78808594e-01],
       [ 4.83886719e-01],
       [ 4.22363281e-01],
       [ 1.96655273e-01]])

Looking at these predictions, something is not quite right. Our similarity scores should (for this task) lie between 0 and 1, but some of the predictions are less than 0 or greater than 1.

This once again shows the value of remembering to actually look at your data. Let’s fix those out-of-bounds predictions:

preds = np.clip(preds, 0, 1)
preds
array([[0.50146484],
       [0.60986328],
       [0.63574219],
       [0.26757812],
       [0.        ],
       [0.53173828],
       [0.47851562],
       [0.        ],
       [0.22412109],
       [1.        ],
       [0.22546387],
       [0.21508789],
       [0.75634766],
       [0.87792969],
       [0.74462891],
       [0.35864258],
       [0.27685547],
       [0.        ],
       [0.64941406],
       [0.37548828],
       [0.48046875],
       [0.22033691],
       [0.23815918],
       [0.19348145],
       [0.56054688],
       [0.01147461],
       [0.        ],
       [0.        ],
       [0.        ],
       [0.60449219],
       [0.31567383],
       [0.01966858],
       [0.77880859],
       [0.48388672],
       [0.42236328],
       [0.19665527]])

We now have our predictions for the patent phrase pairs which, based on our validation results, should correlate well with the true similarity scores.
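
As a final step, these predictions could be written out in the format expected by the Kaggle competition (a minimal sketch, assuming the submission file simply pairs each test id with a score, as in sample_submission.csv):

# Minimal sketch: write predictions in the submission format (id, score)
submission = pd.DataFrame({'id': eval_df['id'], 'score': preds.squeeze()})
submission.to_csv('submission.csv', index=False)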
