Creating a custom text classifier for movie reviews

In this article we are going to create a deep learning text classifier using the fastai library and the ULMFiT approach.
projects
natural-language-processing
deep-learning
fastai
Author

Pranath Fernando

Published

May 29, 2021

1 Introduction

In this article we are going to train a deep learning text classifier using the fastai library, applied to the IMDB movie reviews dataset. In particular, we will look at fastai’s ULMFiT approach, which involves fine-tuning a pretrained language model on domain-specific text before using that language model as the basis for a classification model.

2 Text Pre-processing

So how might we proceed with building a language model that we can then use for classification? Consider one of the simplest neural networks: a collaborative filtering model. This uses embedding matrices to encode different items (such as films) and users, combines these using a dot product to calculate a predicted rating, compares that prediction against the known ratings, and uses gradient descent to learn the embedding matrices that best predict those ratings.

Optionally, we can instead create a deep learning model from this by concatenating the embedding matrices rather than taking the dot product, then passing the result through an activation function, further layers, and so on.
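To make this concrete, here is a minimal PyTorch sketch of both ideas (the names and sizes are illustrative, not fastai’s actual implementation): one model combines the user and item embeddings with a dot product, the other concatenates them and passes the result through further layers.

import torch
from torch import nn

# Sketch 1: combine user and item embeddings with a dot product
class DotProductCF(nn.Module):
    def __init__(self, n_users, n_items, n_factors=50):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, n_factors)
        self.item_emb = nn.Embedding(n_items, n_factors)

    def forward(self, user, item):
        # Dot product of the two embedding vectors gives the predicted rating
        return (self.user_emb(user) * self.item_emb(item)).sum(dim=1)

# Sketch 2: concatenate the embeddings and pass them through further layers
class ConcatCF(nn.Module):
    def __init__(self, n_users, n_items, n_factors=50):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, n_factors)
        self.item_emb = nn.Embedding(n_items, n_factors)
        self.layers = nn.Sequential(
            nn.Linear(n_factors * 2, 100), nn.ReLU(), nn.Linear(100, 1))

    def forward(self, user, item):
        x = torch.cat([self.user_emb(user), self.item_emb(item)], dim=1)
        return self.layers(x).squeeze(1)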

So we could use a similar approach, where we put a sequence of words through a neural network by encoding them in an embedding matrix for words. However, a significant difference from the collaborative filtering approach here is the idea of a sequence.

We can proceed with these 5 steps:

  1. Tokenisation: convert words to recognised units
  2. Numericalisation: convert tokens to numbers
  3. Create data loader: Create a data loader to train the language model, where the target variable is the input variable offset by one word (see the sketch after this list)
  4. Train language model: We need to train a model that can take an amount of text data of variable length, and be able to predict the next word for any word in the sequence.
  5. Train classifier model: Using what the language model has learned about the text as a basis, we can build on top of this to create and train a classifier model.
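As a toy illustration of step 3 (plain Python with made-up tokens, not fastai code): the target sequence for a language model is simply the input sequence offset by one token.

# Toy example: language model targets are the inputs shifted by one token
tokens = ['xxbos', 'i', 'caught', 'up', 'with', 'this', 'movie']
x = tokens[:-1]   # input:  ['xxbos', 'i', 'caught', 'up', 'with', 'this']
y = tokens[1:]    # target: ['i', 'caught', 'up', 'with', 'this', 'movie']
for inp, tgt in zip(x, y):
    print(f'given {inp!r} -> predict {tgt!r}')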

This is an approach pioneered by fastai, called Universal Language Model Fine-tuning (ULMFiT).

2.1 Tokenisation

Let's get the data and tokenise it using the fastai library tools.

from fastai.text.all import *

# Download data
path = untar_data(URLs.IMDB)

files = get_text_files(path, folders = ['train', 'test', 'unsup'])
# Show example text data
txt = files[0].open().read(); txt[:75]
'I caught up with this movie on TV after 30 years or more. Several aspects o'

Fastai has an English word tokeniser (it uses spaCy by default); let's see how it works.


# Test word tokeniser function
spacy = WordTokenizer()
toks = first(spacy([txt]))
print(coll_repr(toks, 30))
(#626) ['I','caught','up','with','this','movie','on','TV','after','30','years','or','more','.','Several','aspects','of','the','film','stood','out','even','when','viewing','it','so','many','years','after','it'...]

# Test word tokeniser class
tkn = Tokenizer(spacy)
print(coll_repr(tkn(txt), 31))
(#699) ['xxbos','i','caught','up','with','this','movie','on','xxup','tv','after','30','years','or','more','.','xxmaj','several','aspects','of','the','film','stood','out','even','when','viewing','it','so','many','years'...]

The class goes beyond just converting the text into word tokens; for example, it creates special tokens like ‘xxbos’, which indicates the beginning of a new text sequence (‘beginning of stream’, a standard NLP concept).

The class applies a series of rules and transformations to the text; here is a list of them.

defaults.text_proc_rules
[<function fastai.text.core.fix_html>,
 <function fastai.text.core.replace_rep>,
 <function fastai.text.core.replace_wrep>,
 <function fastai.text.core.spec_add_spaces>,
 <function fastai.text.core.rm_useless_spaces>,
 <function fastai.text.core.replace_all_caps>,
 <function fastai.text.core.replace_maj>,
 <function fastai.text.core.lowercase>]
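We can also try a couple of these rules in isolation (a quick sketch using the replace_rep and replace_all_caps functions from the list above; the exact output formatting may differ between fastai versions):

# Try two of the pre-processing rules on their own
from fastai.text.core import replace_rep, replace_all_caps
print(replace_rep('This is sooooo good'))   # character repeats become an xxrep marker
print(replace_all_caps('I AM SHOUTING'))    # all-caps words are lowercased and marked with xxup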

2.2 Numericalisation


# Get first 2000 reviews to test
txts = L(o.open().read() for o in files[:2000])
# Tokenise the first review
toks = tkn(txt)
# Tokenise a subset of 200 reviews
toks200 = txts[:200].map(tkn)
num = Numericalize()
# Build the vocab from the tokenised reviews
num.setup(toks200)
# Show first 20 tokens of vocab
coll_repr(num.vocab,20)
"(#2096) ['xxunk','xxpad','xxbos','xxeos','xxfld','xxrep','xxwrep','xxup','xxmaj','the','.',',','and','a','of','to','is','in','it','i'...]"

# Now we can convert tokens to numbers for example
nums = num(toks)[:20]; nums
TensorText([   2,   19,  726,   79,   29,   21,   32,   31,    7,  314,  112, 1195,  138,   63,   71,   10,    8,  393, 1524,   14])
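As a sanity check, we can map these numbers back through the vocab to recover the original tokens:

# Decode the numbers back into tokens via the vocab
' '.join(num.vocab[o] for o in nums)
'xxbos i caught up with this movie on xxup tv after 30 years or more . xxmaj several aspects of'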

2.3 Create data loader

So we need to join all the texts together and then divide the result into batches of multiple fixed-length lines of text, maintaining the correct order of the text within each batch. At every epoch the order of the reviews is shuffled, but they are then joined together again and the mini-batches are constructed in order, so our model can process and learn from them. All of this is done automatically by the fastai library tools.
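Here is a toy sketch of the batching idea (plain PyTorch, with a range of integers standing in for the concatenated token stream; a simplification of what LMDataLoader actually does): each batch row is a fixed-length slice of the stream, and the targets are the same slices offset by one position.

# Toy sketch: reshape one long token stream into batch rows,
# with targets offset by one position
import torch
stream = torch.arange(100)   # stands in for the concatenated corpus
bs, seq_len = 4, 5
n = (len(stream) - 1) // bs * bs
x = stream[:n].reshape(bs, -1)
y = stream[1:n+1].reshape(bs, -1)
print(x[:, :seq_len])        # first mini-batch of inputs
print(y[:, :seq_len])        # same positions shifted by one token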


# Get some example numericalised tokens
nums200 = toks200.map(num)
# Pass to dataloader
dl = LMDataLoader(nums200)
# Get first batch of data and check sizes
x,y = first(dl)
x.shape,y.shape
(torch.Size([64, 72]), torch.Size([64, 72]))

# Examine an example input variable - it should be the start of a text
' '.join(num.vocab[o] for o in x[0][:20])
'xxbos i caught up with this movie on xxup tv after 30 years or more . xxmaj several aspects of'

# Examine the corresponding target variable - the same sequence offset by one token; this is what we want to predict
' '.join(num.vocab[o] for o in y[0][:20])
'i caught up with this movie on xxup tv after 30 years or more . xxmaj several aspects of the'

3 Training a text classifier

3.1 Fine-tune language model

We can further simplify the text preparation for training our language model by combining the tokenisation, numericalisation and dataloader creation into a single step, using a TextBlock inside a DataBlock.


# Create text dataloader for language model training
# Define get_imdb to collect the text files from the train, test and unsup folders
get_imdb = partial(get_text_files, folders=['train', 'test', 'unsup'])

dls_lm = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True),
    get_items=get_imdb, splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)

# Create a language model learner - by default it will use cross-entropy loss
learn = language_model_learner(
    dls_lm, AWD_LSTM, drop_mult=0.3, 
    metrics=[accuracy, Perplexity()]).to_fp16()
# Train model
learn.fit_one_cycle(1, 2e-2)
# Save model encoder
learn.save_encoder('finetuned')
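As a quick check that the fine-tuning has worked, we can sample some text from the language model (the prompt here is a hypothetical example, and the generated text will vary from run to run):

# Generate some text from the fine-tuned language model
TEXT = "I liked this movie because"
print(learn.predict(TEXT, 40, temperature=0.75))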

3.2 Fine-tune classifier model

To fine-tune the classifier model we create the data loader in a slightly different way.


# Create text dataloader for classifier model training - using lm vocab
dls_clas = DataBlock(
    blocks=(TextBlock.from_folder(path, vocab=dls_lm.vocab),CategoryBlock),
    get_y = parent_label,
    get_items=partial(get_text_files, folders=['train', 'test']),
    splitter=GrandparentSplitter(valid_name='test')
).dataloaders(path, path=path, bs=128, seq_len=72)

# Create classifier learner
learn = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5, 
                                metrics=accuracy).to_fp16()
# Load encoder from language model
learn = learn.load_encoder('finetuned')

When fine-tuning the classifier, it works best to gradually unfreeze the layers in manual steps, rather than training the whole model at once. The first fit will train just the final layers, as the rest of the model starts out frozen.


# Train model - last layer only
learn.fit_one_cycle(1, 2e-2)

# Unfreeze a few more layers and train some more with discriminative learning rates
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2))

# Unfreeze more layers and train more
learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(5e-3/(2.6**4),5e-3))

# Unfreeze whole model and train more
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3/(2.6**4),1e-3))

On this IMDB dataset we can achieve a classification accuracy of around 95% using this approach.
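We can also try the trained classifier on a new review (a hypothetical example; predict returns the predicted class, its index, and the class probabilities):

# Try the classifier on a new review
learn.predict("I really liked this movie, the acting was superb!")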

4 Conclusion

In this article we have looked in detail at how we can train a text classifier using the three-step ULMFiT fastai approach and achieve a good level of accuracy. We also saw what the fastai library does under the hood to make this process much easier.
