Creating a custom text classifier for movie reviews
In this article we are going to create a deep learning text classifier using the fastai library and the ULMFiT approach.
projects
natural-language-processing
deep-learning
fastai
Author
Pranath Fernando
Published
May 29, 2021
1 Introduction
In this article we are going to train a deep learning text classifier using the fastai library. We will do this for the IMDB movie reviews dataset. In particular, we will look at fastai’s ULMFiT approach, which involves further fine-tuning a pre-trained language model on domain-specific text before using that language model as the basis for a classification model.
2 Text Pre-processing
So how might we proceed with building a language model that we can then use for classification? Consider one of the simplest neural networks, a collaborative filtering model. This uses embedding matrices to encode different items (such as films) and users, combines these with a dot product to calculate a predicted rating, compares that prediction against known ratings, and uses gradient descent to learn the embedding matrices that best predict these ratings.
Optionally, we can turn this into a deeper model by concatenating the embeddings instead of taking their dot product, then passing the result through an activation function and further layers.
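To make this concrete, here is a minimal PyTorch sketch (not from the original article; the class names and sizes are illustrative) of the two variants just described: a dot-product model and a concatenation-plus-layers model.

import torch
import torch.nn as nn

class DotProductCF(nn.Module):
    "Simplest collaborative filtering: dot product of user and item embeddings"
    def __init__(self, n_users, n_items, n_factors=50):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, n_factors)
        self.item_emb = nn.Embedding(n_items, n_factors)

    def forward(self, user_ids, item_ids):
        # Dot product of the two embedding vectors gives a predicted rating
        return (self.user_emb(user_ids) * self.item_emb(item_ids)).sum(dim=1)

class ConcatCF(nn.Module):
    "Deeper variant: concatenate the embeddings and pass them through extra layers"
    def __init__(self, n_users, n_items, n_factors=50, n_hidden=100):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, n_factors)
        self.item_emb = nn.Embedding(n_items, n_factors)
        self.layers = nn.Sequential(
            nn.Linear(n_factors * 2, n_hidden),
            nn.ReLU(),
            nn.Linear(n_hidden, 1),
        )

    def forward(self, user_ids, item_ids):
        # Concatenate the two embeddings rather than taking their dot product
        x = torch.cat([self.user_emb(user_ids), self.item_emb(item_ids)], dim=1)
        return self.layers(x).squeeze(1)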
We could use a similar approach for text, encoding words in an embedding matrix and passing them through a neural network. However, a significant difference from the collaborative filtering approach is the idea of a sequence: word order matters, and the inputs vary in length.
We can proceed with these 5 steps:
Tokenisation: convert words to recognised units
Numericalisation: convert tokens to numbers
Create data loader: create a data loader for training the language model, in which the target variable is the input variable offset by one word (see the sketch after this list)
Train language model: train a model that can take text of variable length and predict the next word at any point in the sequence
Train classifier model: using what the language model has learned about the text as a basis, build on top of it to create and train a classifier
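To make the data loader step concrete, here is a tiny illustrative sketch (the tokens and ids are made up) of how a language-model target is simply the input sequence offset by one token:

# Illustrative only: the language-model target is the input offset by one token
tokens = ['xxbos', 'i', 'loved', 'this', 'film']
nums = [2, 10, 45, 12, 89]   # made-up ids after numericalisation
x = nums[:-1]                # input:  [2, 10, 45, 12]
y = nums[1:]                 # target: [10, 45, 12, 89] - the "next word" at each position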
This is an approach pioneered by fastai called Universal Language Model Fine-tuning (ULMFiT).
2.1 Tokenisation
Let's get the data and tokenise it using the fastai library tools.
from fastai.text.all import *

# Download data
path = untar_data(URLs.IMDB)
files = get_text_files(path, folders=['train', 'test', 'unsup'])
# Show example text data
txt = files[0].open().read(); txt[:75]
'I caught up with this movie on TV after 30 years or more. Several aspects o'
Fastai has an English word tokeniser (by default based on spaCy); let's see how it works.
# Test word tokeniser function
spacy = WordTokenizer()
toks = first(spacy([txt]))
print(coll_repr(toks, 30))
Fastai's Tokenizer class, which wraps a word tokeniser like the one above, goes beyond just converting the text into word tokens. For example, it adds tokens like ‘xxbos’, a special token indicating the beginning of a new text sequence, i.e. ‘beginning of stream’, a standard NLP concept.
The class applies a series of rules and transformations to the text, and we can inspect the list of defaults directly.
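A minimal way to do this, assuming fastai v2 where the default rules are exposed as defaults.text_proc_rules:

# Inspect the default text pre-processing rules fastai applies during tokenisation
for rule in defaults.text_proc_rules:
    print(rule.__name__)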
# Get first 2000 reviews to test
txts = L(o.open().read() for o in files[:2000])
# Create the fastai Tokenizer (which applies the rules above) around the word tokeniser
tkn = Tokenizer(spacy)
# Tokenise the example review
toks = tkn(txt)
# Tokenise a subset of the reviews
toks200 = txts[:200].map(tkn)
# Numericalise tokens - create a vocab
num = Numericalize()
num.setup(toks200)
# Show first 20 tokens of vocab
coll_repr(num.vocab, 20)
So we need to join all the text together and then divide it into fixed-length mini-batches of multiple lines of text, which maintain the correct order of the text within each batch. At every epoch the order of the reviews is shuffled, then they are joined together again and the mini-batches are constructed in order, for our model to process and learn from. This is all done automatically by the fastai library tools.
# Get some example numericalised tokens
nums200 = toks200.map(num)
# Pass to dataloader
dl = LMDataLoader(nums200)
# Get first batch of data and check sizes
x, y = first(dl)
x.shape, y.shape
(torch.Size([64, 72]), torch.Size([64, 72]))
# Examine an example input variable - should be the start of a text
' '.join(num.vocab[o] for o in x[0][:20])
'xxbos i caught up with this movie on xxup tv after 30 years or more . xxmaj several aspects of'
# Examine the corresponding target variable - the same sequence offset by one word; the next word is what we want to predict
' '.join(num.vocab[o] for o in y[0][:20])
'i caught up with this movie on xxup tv after 30 years or more . xxmaj several aspects of the'
3 Training a text classifier
3.1 Fine tune language model
We can simplify the text preparation for training our language model by combining the tokenisation, numericalisation and dataloader creation into one step, using a TextBlock inside a DataBlock to build the dataloaders.
# Helper to collect the review text files from the train/test/unsup folders
get_imdb = partial(get_text_files, folders=['train', 'test', 'unsup'])

# Create text dataloader for language model training
dls_lm = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True),
    get_items=get_imdb,
    splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)
# Create a language model learner - by default this uses cross-entropy loss
learn = language_model_learner(
    dls_lm, AWD_LSTM, drop_mult=0.3,
    metrics=[accuracy, Perplexity()]
).to_fp16()
# Train the model
learn.fit_one_cycle(1, 2e-2)
# Save the model's encoder
learn.save_encoder('finetuned')
3.2 Fine tune classifier model
To fine tune the classifier model we create the data loader in a slightly different way.
# Create text dataloader for classifier model training - using the language model's vocab
dls_clas = DataBlock(
    blocks=(TextBlock.from_folder(path, vocab=dls_lm.vocab), CategoryBlock),
    get_y=parent_label,
    get_items=partial(get_text_files, folders=['train', 'test']),
    splitter=GrandparentSplitter(valid_name='test')
).dataloaders(path, path=path, bs=128, seq_len=72)
# Create classifier learner
learn = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5,
                                metrics=accuracy).to_fp16()
# Load the encoder saved from the fine-tuned language model
learn = learn.load_encoder('finetuned')
When fine-tuning the classifier, it is found to work best if we gradually unfreeze the layers, and this is best done in manual steps. The first fit will train just the last layer.
# Train model - last layer only
learn.fit_one_cycle(1, 2e-2)
# Unfreeze a few more layers and train some more with discriminative learning rates
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2/(2.6**4), 1e-2))
# Unfreeze more layers and train more
learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(5e-3/(2.6**4), 5e-3))
# Unfreeze the whole model and train more
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3/(2.6**4), 1e-3))
On this IMDB dataset we can achieve a classification accuracy of around 95% using this approach.
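As a quick sanity check (this call is not shown in the original run, and the review text below is just an illustrative input), we could query the trained classifier directly:

# Predict the sentiment of a new review with the fine-tuned classifier
# Returns the predicted category, its index, and the class probabilities
learn.predict("I really enjoyed this film, the acting was superb!")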
4 Conclusion
In this article we have looked in more detail at how we can train a text classifier using the three-step ULMFiT approach from fastai, and achieve a good level of accuracy. We also saw in more detail what the fastai library does under the hood to make this process much easier.