Topic Modelling using Non-negative Matrix Factorization (NMF)

Non-negative Matrix Factorization (NMF) is a method from Linear Algebra widely used across science and engineering. In this article we will introduce the concept and show how it can be used for Topic Modelling in Natural Language Processing (NLP).
mathematics
linear-algebra
natural-language-processing
Author

Pranath Fernando

Published

December 28, 2021

1 Introduction

Non-negative Matrix Factorization (NMF) is a method from Linear Algebra that is used in a wide range of applications in science and engineering, similar to Singular Value Decomposition (SVD) which I covered in an earlier article. It can be used for tasks such as missing data imputation, audio signal processing and bioinformatics.

Topic modeling is an unsupervised machine learning technique used in Natural Language Processing (NLP) that’s capable of scanning a set of texts, detecting word and phrase patterns within them, and automatically clustering word groups and similar expressions that best characterize a set of documents.

In this article we will use NMF to perform topic modelling.

This article is based in large part on the material from the fastai linear algebra course.

2 Dataset

We will use the 20 Newsgroups dataset which consists of 20,000 messages taken from 20 different newsgroups from the Usenet bulletin board service, which pre-dates the world-wide-web and websites. We will look at a subset of 4 of these newsgroup categories:

  • rec.motorcycles
  • talk.politics.mideast
  • sci.med
  • sci.crypt

We will now get this data.


from sklearn.datasets import fetch_20newsgroups
import numpy as np

categories = ['rec.motorcycles', 'talk.politics.mideast', 'sci.med', 'sci.crypt']
remove = ('headers', 'footers', 'quotes')
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories, remove=remove)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories, remove=remove)

Let’s check how many posts this gives us in total


newsgroups_train.filenames.shape, newsgroups_train.target.shape
((2351,), (2351,))

Let’s print the first few lines of 3 of the posts to see what the text looks like


print("\n".join(newsgroups_train.data[0].split("\n")[:3]))

I am not an expert in the cryptography science, but some basic things
seem evident to me, things which this Clinton Clipper do not address.

print("\n".join(newsgroups_train.data[2].split("\n")[:3]))
Does the Bates method work?  I first heard about it in this newsgroup 
several years ago, and I have just got hold of a book, "How to improve your
sight - simple daily drills in relaxation", by Margaret D. Corbett, 

print("\n".join(newsgroups_train.data[5].split("\n")[:3]))

Suggest McQuires #1 plastic polish.  It will help somewhat but nothing 
will remove deep scratches without making it worse than it already is.

We can also get the newsgroup category for each post from the ‘target_names’ attribute


np.array(newsgroups_train.target_names)[newsgroups_train.target[:3]]
array(['sci.crypt', 'sci.med', 'sci.med'], dtype='<U21')

To use this text dataset for topic modelling we will need to convert it into a document-term matrix. This is a matrix where the rows correspond to each of the newsgroup posts (a ‘document’ conceptually) and the columns correspond to each of the words that appears across all posts (a ‘term’ conceptually). Each value in the matrix is the count of how many times a given word occurs in a given post.

This method of converting text into a matrix of word counts, without regard for anything else (such as order or context), is called a bag of words model. We can create this matrix using scikit-learn’s CountVectorizer, illustrated on a toy example below before we apply it to the newsgroup data.
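Here is a minimal sketch of the idea on two made-up sentences rather than the newsgroup data (using get_feature_names_out(), the newer name for the get_feature_names() method used later in this article):


from sklearn.feature_extraction.text import CountVectorizer

# Two toy 'documents', purely for illustration
toy_docs = ["the bike needs a new chain", "the new encryption chip"]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_docs)

# Each row is a document, each column a word, each value a word count
print(toy_vectorizer.get_feature_names_out())
print(toy_counts.toarray())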


from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(stop_words='english')
vectors = vectorizer.fit_transform(newsgroups_train.data).todense() # (documents, vocab)
vectors.shape
(2351, 32291)

We can see this matrix has the same number of rows as we have posts (2,351), and the number of columns (32,291) corresponds to the number of unique words found across all posts.


print(len(newsgroups_train.data), vectors.shape)
2351 (2351, 32291)

If we print the matrix, it’s just an array of counts for each of the words in each post


vectors
matrix([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 2, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]])

This matrix does not actually contain the names of the words, so it will be helpful for us to extract these as well to create a vocabulary of terms used in the matrix. We can extract these using get_feature_names()


vocab = np.array(vectorizer.get_feature_names())
vocab.shape
(32291,)

vocab[:32000]
array(['00', '000', '0000', ..., 'yarn', 'yarvin', 'yashir'], dtype='<U79')

While we have the newsgroup categories here, we will not actually use them for our topic modelling exercise, where we want to create topics based on the posts alone. However, we would hope the topics correspond to the newsgroup categories in some way; indeed, this would be a good check that the topic modelling is working.

Now that we have our Document-Term matrix and the vocabulary, we are ready to use Non-negative Matrix Factorization.

3 Non-negative Matrix Factorization (NMF)

NMF is a method of matrix decomposition: for a given matrix A we can factorize it into 2 other matrices, W and H, such that \(A \approx WH\). A must have non-negative values, and as such W and H will also have non-negative values.

K is a value we choose in advance; in our case K will represent the number of topics we want to create for our topic model of the newsgroup posts.

So if we assume that in the original matrix A for our exercise, the N rows are the documents/posts and the M columns are the words of our Document-Term matrix, so that \(A\;(N \times M) \approx W\;(N \times K)\,H\;(K \times M)\), each of these matrices represents the following:

  • W: Feature Matrix, which has N rows for documents/posts and K columns for topics, and indicates which topics best describe which documents/posts.
  • H: Coefficient Matrix, which has K rows for topics and M columns for words, and indicates which words characterise which topics.

So one reason NMF can be more popular to use is the fact that the factors it produces are always non-negative and so are more easily interpretable. Consider that with SVD, for example, we could produce factors with negative values for topics - what would it mean to say a text has ‘negative indications for the topic of bikes’?

Another difference from SVD is that NMF is not an exact decomposition - which means if we multiply the W and H matrices we won’t get back our original matrix A exactly, as the small sketch below illustrates.
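As a quick illustration (a minimal sketch on a small random non-negative matrix, not on our newsgroup data), multiplying the factors back together only approximates the original matrix:


import numpy as np
from sklearn import decomposition

# A small random non-negative matrix, purely for illustration
rng = np.random.RandomState(0)
A_small = rng.rand(6, 5)

# Factorize into 2 components (topics)
nmf_small = decomposition.NMF(n_components=2, random_state=0)
W_small = nmf_small.fit_transform(A_small)
H_small = nmf_small.components_

# The reconstruction W x H is close to, but not exactly equal to, A_small
print(np.abs(A_small - W_small @ H_small).max())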

So we can perform NMF on our Document-Term matrix using the sklearn decomposition module.


# Define constants and functions
from sklearn import decomposition

m,n=vectors.shape
d=10  # num topics
num_top_words=8

# Return the top words for each topic (row) of a topic-word matrix
def show_topics(a):
    top_words = lambda t: [vocab[i] for i in np.argsort(t)[:-num_top_words-1:-1]]
    topic_words = ([top_words(t) for t in a])
    return [' '.join(t) for t in topic_words]

# Create the NMF model (the factorization itself happens in fit_transform below)
%time clf = decomposition.NMF(n_components=d, random_state=1)
CPU times: user 29 µs, sys: 0 ns, total: 29 µs
Wall time: 34.3 µs

We can notice here this ran extremely fast, taking just 34.3 microseconds of wall time - though note that the %time above only measures constructing the NMF estimator; the factorization itself happens when we call fit_transform below. If we recall, in an earlier article for the same dataset, when we performed one of the fastest versions of SVD (Randomised/Truncated SVD), it took around 20 seconds.
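If we wanted a more direct comparison with the SVD timing, we could time the factorization step itself (a sketch only; the timing output will depend on your machine and is not reproduced here):


# Time the actual NMF factorization rather than just constructing the estimator
%time _ = clf.fit_transform(vectors)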


# Extract W and H matrices
W1 = clf.fit_transform(vectors)
H1 = clf.components_
# Show topics from H matrix
print('Top 10 topics, described by top words in each topic')
show_topics(H1)
Top 10 topics, described by top words in each topic
['db mov bh si cs byte al bl',
 'people said didn know don went just like',
 'privacy internet pub eff email information computer electronic',
 'health 1993 use hiv medical 10 20 number',
 'turkish jews turkey armenian jewish nazis ottoman war',
 'anonymous anonymity posting internet anon service people users',
 'key encryption des chip ripem use keys used',
 'edu com cs david ca uk org john',
 'dod rec denizens motorcycle motorcycles doom like terrible',
 'version machines contact type edu pc comments ftp']

So if you recall, our original newsgroup categories were:

  • rec.motorcycles
  • talk.politics.mideast
  • sci.med
  • sci.crypt

We can see that the topics discovered correspond fairly well to these, bar a few anomalies.


# Show dimensions of matrices
print(W1.shape, H1.shape)
(2351, 10) (10, 32291)

The shapes of the matrices also make sense. Given that our original matrix A had 2351 rows for posts and 32291 columns for words, and we requested 10 topics, NMF has returned:

  • Matrix W with 2351 rows for posts and 10 columns for topics
  • Matrix H with 10 rows for topics and 32291 columns for words
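Since each row of W1 holds the topic weights for one post, a quick sanity check (a sketch, not from the original article) is to look at the strongest topic for the first few posts alongside their true newsgroup labels:


# For each of the first 5 posts, find the topic with the largest weight in W1
top_topic_per_post = W1[:5].argmax(axis=1)

# Compare against the true newsgroup category of each post
for doc_idx, topic_idx in enumerate(top_topic_per_post):
    category = newsgroups_train.target_names[newsgroups_train.target[doc_idx]]
    print(f"post {doc_idx}: strongest topic {topic_idx}, newsgroup {category}")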

4 NMF using Gradient Descent

So in the method just used, we performed NMF using a built-in library function from Scikit-Learn. One of the obvious benefits of this is that it runs extremely fast. However, creating this function took many years of research and expertise in the area. Using it also means we are limited: if we want to do something slightly different, we can’t really change it.

Alternatively, we can use a very different method to calculate the NMF matrices using Gradient Descent.

The basic process of Gradient Descent is as follows:

  1. Randomly choose some weights to start
  2. Loop:
  • Use weights to calculate a prediction
  • Calculate the loss (loss is a measure of the difference between the prediction and what we want)
  • Calculate the derivative of the loss
  • Update the weights using this derivative to tell us how much to change them
  3. Repeat step 2 lots of times. Eventually we end up with some decent weights (a toy example of this loop is sketched below)
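Before applying this to NMF, here is a tiny standalone sketch of the loop above (a made-up one-parameter example, not from the original article), minimising the loss \((w-3)^2\) whose minimum is at \(w = 3\):


import numpy as np

w = np.random.randn()    # 1. randomly choose a starting weight
lr = 0.1                 # learning rate: how much to change the weight each step

for step in range(100):  # 3. repeat the loop lots of times
    loss = (w - 3) ** 2  # 2. calculate the loss for the current weight
    grad = 2 * (w - 3)   #    the derivative of the loss with respect to w
    w -= lr * grad       #    update the weight using the derivative

print(w)                 # ends up very close to 3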

In our case, the weights are the values of the two matrices we want to calculate for NMF, namely W and H.

In Stochastic Gradient Descent (SGD) we evaluate our loss function on just a sample of our data (sometimes called a mini-batch). We get different loss values on different samples of the data, which is why it is called stochastic. It turns out that this is still an effective way to optimize, and it’s much more efficient.
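As a rough sketch of the mini-batch idea (an illustration on small made-up matrices; the PyTorch code later in this article actually evaluates the loss on the full matrix), we could estimate the reconstruction loss from a random sample of rows:


import numpy as np

# Small made-up matrices, purely for illustration
rng = np.random.RandomState(0)
A_toy = rng.rand(100, 50)
W_toy = rng.rand(100, 5)
H_toy = rng.rand(5, 50)

# Sample a mini-batch of 10 rows and evaluate the loss on just those rows
batch = rng.choice(A_toy.shape[0], size=10, replace=False)
minibatch_loss = np.linalg.norm(A_toy[batch] - W_toy[batch] @ H_toy)
print(minibatch_loss)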

SGD is also a key technique used in Deep Learning which I have covered in an earlier article.

Applying SGD to NMF

The Frobenius norm is a way to measure how different two matrices are. We can use it to calculate the loss: we multiply W and H together to create a matrix, and then calculate the Frobenius norm of the difference between this matrix and our original matrix A to give us our loss value.
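Concretely, the Frobenius norm of a matrix is the square root of the sum of its squared entries; here is a minimal sketch (on a small made-up matrix) showing the equivalent NumPy computations:


import numpy as np

# A small made-up matrix, just to illustrate the Frobenius norm
M = np.array([[1., 2.], [3., 4.]])

# Square every entry, sum them, take the square root ...
manual = np.sqrt((M ** 2).sum())

# ... which matches NumPy's built-in Frobenius norm
print(manual, np.linalg.norm(M, 'fro'))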

Goal: Decompose \(A\;(m \times n)\) into \(A \approx WH\), where \(W\;(m \times k)\) and \(H\;(k \times n)\), \(W, H \geq 0\), and we’ve minimized the Frobenius norm of \(A - WH\).

Approach: We will pick random positive \(W\) & \(H\), and then use SGD to optimize.

We will also make use of the PyTorch library for these calculations, for 2 key reasons:

  • It facilitates calculations on the GPU, which enables matrix calculations to be run in parallel and therefore much faster
  • PyTorch has autograd functionality, which will automatically calculate the derivatives of functions for us and thereby give us the gradients that we need for the process in a convenient way (a small sketch of this is shown below)
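Here is a minimal sketch of autograd (a made-up example, not part of the original article): PyTorch records the operations applied to a tensor created with requires_grad=True, and calling backward() fills in its .grad attribute with the derivative:


import torch

# A single weight that PyTorch should track gradients for
w = torch.tensor(2.0, requires_grad=True)

# A simple loss: (w - 3)^2, whose derivative with respect to w is 2 * (w - 3)
loss = (w - 3) ** 2
loss.backward()

print(w.grad)  # tensor(-2.), i.e. 2 * (2 - 3)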

# Imports assumed for this block: torch with its CUDA tensor types aliased as tc,
# plus the (older-style) autograd Variable wrapper
import numpy as np
import torch
import torch.cuda as tc
from torch.autograd import Variable

# Define constants and functions required
lam=1e6
lr = 0.05
# Create W and H matrices on the GPU, initialised to small positive random values
pW = Variable(tc.FloatTensor(m,d), requires_grad=True)
pH = Variable(tc.FloatTensor(d,n), requires_grad=True)
pW.data.normal_(std=0.01).abs_()
pH.data.normal_(std=0.01).abs_()
# Define report - print the current loss and how far W and H stray below zero
def report():
    W,H = pW.data, pH.data
    print((A-pW.mm(pH)).norm(2).item(), W.min(), H.min(), (W<0).sum(), (H<0).sum())
# Define penalty - encourage positive and low loss values
def penalty(P):
    return torch.pow((P<0).type(tc.FloatTensor)*torch.clamp(P, max=0.), 2)
# Define penalise - for both W and H matrices we want to improve
def penalize(): return penalty(pW).mean() + penalty(pH).mean()
# Define loss - Calculate the Frobenius norm between Matrix A and Matrices W x H
def loss(): return (A-pW.mm(pH)).norm(2) + penalize()*lam
# Define optimiser to update weights using gradients
opt = torch.optim.Adam([pW,pH], lr=1e-3, betas=(0.9,0.9))
# Load our original document-term matrix A onto the GPU
t_vectors = torch.Tensor(vectors.astype(np.float32)).cuda()
A = Variable(t_vectors).cuda()

Create and run the Stochastic Gradient Descent process


# For 1000 cycles
for i in range(1000): 
    # Clear the previous gradients
    opt.zero_grad()
    # Calculate the loss i.e. the Frobenius norm between Matrix A and Matrices W x H
    l = loss()
    # Calculate the gradients
    l.backward()
    # Update the values of Matrices W x H using the gradients
    opt.step()
    # Every 100 cycles print a report of progress
    if i % 100 == 99: 
        report()
        lr *= 0.9     # learning rate annealing
47.2258186340332 tensor(-0.0010, device='cuda:0') tensor(-0.0023, device='cuda:0') tensor(1013, device='cuda:0') tensor(42676, device='cuda:0')
46.8864631652832 tensor(-0.0008, device='cuda:0') tensor(-0.0027, device='cuda:0') tensor(1424, device='cuda:0') tensor(53463, device='cuda:0')
46.73139572143555 tensor(-0.0004, device='cuda:0') tensor(-0.0031, device='cuda:0') tensor(929, device='cuda:0') tensor(53453, device='cuda:0')
46.66544723510742 tensor(-0.0004, device='cuda:0') tensor(-0.0020, device='cuda:0') tensor(736, device='cuda:0') tensor(54012, device='cuda:0')
46.620338439941406 tensor(-0.0006, device='cuda:0') tensor(-0.0018, device='cuda:0') tensor(631, device='cuda:0') tensor(56201, device='cuda:0')
46.586158752441406 tensor(-0.0003, device='cuda:0') tensor(-0.0018, device='cuda:0') tensor(595, device='cuda:0') tensor(56632, device='cuda:0')
46.576072692871094 tensor(-0.0003, device='cuda:0') tensor(-0.0019, device='cuda:0') tensor(585, device='cuda:0') tensor(54036, device='cuda:0')
46.573974609375 tensor(-0.0003, device='cuda:0') tensor(-0.0018, device='cuda:0') tensor(578, device='cuda:0') tensor(53401, device='cuda:0')
46.573814392089844 tensor(-0.0003, device='cuda:0') tensor(-0.0017, device='cuda:0') tensor(667, device='cuda:0') tensor(52781, device='cuda:0')
46.573760986328125 tensor(-0.0003, device='cuda:0') tensor(-0.0019, device='cuda:0') tensor(662, device='cuda:0') tensor(52658, device='cuda:0')

# Show topics discovered
h = pH.data.cpu().numpy()
show_topics(h)
['msg don people know just food think like',
 'clipper chip phone crypto phones government nsa secure',
 'armenian armenians turkish genocide armenia turks turkey people',
 'jews adam jewish land shostack das harvard arabs',
 'com edu pgp mail faq rsa list ripem',
 'israel israeli lebanese arab lebanon peace israelis arabs',
 'key keys bit chip serial bits 80 number',
 'encryption government technology law privacy enforcement administration use',
 'geb dsl cadre chastity n3jxp pitt intellect shameful',
 'bike bikes ride motorcycle riding dod dog good']

So if you recall, our original newsgroup categories were:

  • rec.motorcycles
  • talk.politics.mideast
  • sci.med
  • sci.crypt

We can see that the topics discovered using SGD correspond fairly well to these, bar a few anomalies.

4.1 Comparing Approaches

If we compare our two approaches to calculating NMF:

Scikit-Learn’s NMF

  • Fast
  • No parameter tuning
  • Relies on decades of academic research; took experts a long time to implement
  • Can’t be customised
  • Method can only be applied to calculating NMF

Using PyTorch and SGD

  • Took an hour to implement; didn’t have to be NMF experts
  • Parameters were fiddly
  • Not as fast
  • Easily customised
  • Method can be applied to a vast range of problems

5 Conclusion

In this article we introduced Non-negative Matrix Factorization (NMF) and saw how it can be applied to the task of topic modelling in NLP. We also compared two approaches to calculating NMF, using Scikit-Learn’s library function as well as Stochastic Gradient Descent (SGD), and highlighted various pros and cons of each approach.
