NLP and Text Classification Without Deep Learning for Business Applications

Deep Learning and AI are powering some of the most impressive recent advances in text and natural language processing (NLP) applications, such as GPT-3, ChatGPT and DALL-E, but these often require specialist resources such as GPU servers. With traditional Machine Learning (ML) it's possible to create useful NLP applications for businesses without using AI and Deep Learning.
pycaret
natural-language-processing
Author

Pranath Fernando

Published

January 8, 2023

1 Introduction

Deep Learning and AI are powering some of the most impressive recent advances in text and natural language processing (NLP) applications, such as GPT-3, ChatGPT and DALL-E, but these often require specialist resources, such as GPU servers, that many businesses new to this technology don't have or can't yet justify. With traditional Machine Learning (ML) it's possible to create useful NLP applications such as text classification without using AI and Deep Learning, and in this article we will look at some examples of how these can provide useful business applications.

2 Business Applications of NLP

NLP (Natural Language Processing) is a branch of Artificial Intelligence (AI) and Data Science that is having a huge effect on all areas of society, including business.

In essence, Natural language processing helps computers communicate with humans in their own language and scales other language-related tasks. For example, NLP makes it possible for computers to read text, hear speech, interpret it, measure sentiment and determine which parts are important.

A recent article by the Harvard Business Review highlighted some of the huge potential NLP has for businesses.

Until recently, the conventional wisdom was that while AI was better than humans at data-driven decision making tasks, it was still inferior to humans for cognitive and creative ones. But in the past two years language-based AI has advanced by leaps and bounds, changing common notions of what this technology can do. The most visible advances have been in what’s called “natural language processing” (NLP), the branch of AI focused on how computers can process language like humans do. It has been used to write an article for The Guardian, and AI-authored blog posts have gone viral — feats that weren’t possible a few years ago. AI even excels at cognitive tasks like programming where it is able to generate programs for simple video games from human instructions.

A recent article on LinkedIn highlighted some of the top business applications of NLP. These include:

2.1 Market Intelligence

Marketers can use natural language processing to understand their clients better and use those insights to develop more effective tactics. Thanks to the power of NLP, they can analyze subjects and keywords and make effective use of unstructured data. It can also identify your consumers' pain points and keep track of your competition.

2.2 Sentiment Analysis

Companies can regularly use sentiment analysis to acquire a better knowledge of their business. Humans can be sarcastic and sardonic during conversations. You may keep an eye on social media mentions and use real-time sentiment analysis to intervene before things get out of hand. Your company may sense the pulse of its customers with this NLP application. It also allows you to evaluate how your clients reacted to your most recent digital marketing campaign.

2.3 Text Classification

Text classification, a text analysis task that also includes sentiment analysis, involves automatically understanding, processing, and categorizing unstructured text.

Let’s say you want to analyze hundreds of open-ended responses to your recent NPS survey. Doing it manually would take you a lot of time and end up being too expensive. But what if you could train a natural language processing model to automatically tag your data in just seconds, using predefined categories and applying your own criteria?

2.4 Topic Modelling

Topic modeling is an approach that can scan a series of documents, find word and phrase patterns within them, and automatically cluster word groupings and related expressions that best represent the set.

Topic modeling doesn’t require a preexisting list of tags or training data that has been previously categorized by humans; it can ‘discover’ the most appropriate categories for a given set of documents by itself, based on which documents seem most similar or different.

2.5 Recruiting And Hiring

We can all agree that picking the right staff is one of the most important duties performed by the HR department. However, HR departments now have so much data that sifting resumes and shortlisting prospects becomes overwhelming.

Natural Language Processing can help to make this work more accessible. HR experts can use information extraction and named entity recognition to extract information from candidates, such as their names, talents, locations, and educational histories. This enables unbiased resume filtering and the selection of the best candidate for the job.

2.6 Text Summarization

This NLP application extracts the most crucial information from a text and summarises it. The primary purpose is to speed up sifting through massive volumes of data in news articles, legal documents, and scientific studies. Text summarization can be done in two ways: extraction-based summarization, which selects crucial words and provides a summary without adding further information, and abstraction-based summarization, which paraphrases the original content to produce new terms.

2.7 Survey Analysis

Surveys are an essential tool for businesses to use in evaluating their performance. Survey analysis is crucial in finding defects and supporting companies in improving their goods, whether gathering input on a new product launch or analyzing how effectively a company’s customer service is doing. When many clients complete these surveys, the resulting volume of data becomes too large for humans to process manually. This is where natural language processing comes in. These methods help organisations get accurate information about their consumers’ opinions and improve their performance.

3 Machine Learning vs Deep Learning for NLP and Business

The most powerful and useful applications of NLP use Deep Learning, a sub-branch of Machine Learning. All of the most recent and most powerful NLP applications, such as GPT-3, ChatGPT and DALL-E, use Deep Learning. Many would argue Deep Learning is perfect for NLP.

In fact, most of my own recent projects in NLP over the last few years have almost exclusively used Deep Learning.

However, NLP existed for many years before Deep Learning and AI were developed, with origins in work from the 1950s. It just used different methods and techniques that, while not as powerful as Deep Learning and AI, still provided useful business applications and benefits at the time they were developed and used. These include the use of traditional machine learning for NLP.

In a recent article I covered in more detail the differences between traditional machine learning and deep learning.

Also, Deep Learning requires the use of specialist resources - namely GPU servers. Many businesses starting to explore the potential benefits of Data, Data Science, Machine Learning and AI don’t always have the resources or infrastructure set up to develop this technology.

Furthermore, some businesses may feel much more cautious about adopting this technology and the associated cost of resources, and may need a more gradual approach: a journey as much about education and learning what this technology can do to solve business problems as about gradually using more and more advanced technology.

Some businesses, especially older and established businesses with existing business practices, may need to learn how to walk first before running with the most advanced technology!

With this in mind, it’s good to know it is actually possible to develop useful and valuable NLP business applications without the use of Deep Learning and the specialist resources it requires. While you might not get near state-of-the-art results, businesses can still gain huge value and benefit from these slightly older methods compared to using none at all.

4 Pycaret and NLP

NLP often requires a significant amount of code and steps to solve business problems. Pycaret is a low-code machine learning library that allows you to perform common Data Science and Machine Learning tasks with very little code, and has been listed in a recent Forbes article as one of the 10 Best Examples Of Low-Code And No-Code AI.

I’ve been using Pycaret myself professionally in my role as a Data Scientist, as well as for personal projects, for over a year now, and have found it incredibly useful for working much more quickly and efficiently. I’ve also written about how Pycaret is actually a Data Science Power Tool.

In this project I will be using Pycaret for the NLP tasks needed to solve certain business problems with machine learning.

5 Text Classification Without Deep Learning

Remembering our common uses of NLP, we are going to solve 2 different business problems to illustrate these methods:

  • Topic Modelling: We will use this method to try to discover the hidden categories in a dataset from Kiva - a crowdfunding platform for loans - which includes the text of each loan application. Put another way: what kind of hidden topics would best describe people’s loan applications? For most businesses, it could be really useful to take customer text, such as customer contact form text, and discover what kind of topics customers were talking about, without knowing or assuming we know what those topics are beforehand.
  • Sentiment Analysis & Classification: We will use this method to learn to predict the sentiment of Amazon customer product reviews, using the review text and the positive or negative labels they have been assigned in the dataset. In other words, given a customer review text, predict whether it is a positive or negative review. This could be very useful for a business wanting to understand whether a product or service was successful, by analysing thousands or even millions of customer reviews automatically and efficiently.

Note, with Topic Modelling we are actually trying to discover new categories for a given set of texts, whereas with Sentiment Analysis & Classification we are using an existing category. These are known as unsupervised machine learning and supervised machine learning respectively. In both cases, we produce something called a model, which we can then use on new text to predict which category that text belongs to.
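To make the distinction concrete, here is a minimal toy contrast in scikit-learn (an illustration only, not the pipeline used in this article): a clustering model that discovers groups without labels, next to a classifier that learns from existing labels.

```python
# Toy contrast (illustration only, not this article's pipeline):
# unsupervised clustering discovers groups without labels, while a
# supervised classifier learns from existing labels.
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

X = [[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9]]  # toy feature vectors
y = [0, 0, 1, 1]                                      # labels, used only below

# Unsupervised (like topic modelling): no labels, groupings are discovered
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Supervised (like sentiment classification): learns from the labels
clf = LogisticRegression().fit(X, y)
prediction = clf.predict([[0.95, 0.95]])[0]  # label predicted for new data
print(len(set(clusters)), prediction)
```

Both produce a fitted model we can apply to new data, which is exactly how the topic model and the sentiment classifier will be used below.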

5.1 Topic modelling - Discovering hidden categories in Kiva loan applications

Pycaret comes with some ready-to-use datasets, such as Kiva. Kiva is a non-profit that allows individuals to lend money to low-income entrepreneurs and students around the world. The Kiva dataset contains individual loan applications, including the text of each application. Let’s load and view the data.

kiva = get_data('kiva')
country en gender loan_amount nonpayment sector status
0 Dominican Republic "Banco Esperanza" is a group of 10 women looking to receive a small loan. Each of them has taken out a very small loan already, so this would be their second. With this loan the group is going to try and expand their small businesses and start generating more income. <P>\n\nEduviges is the group representative and leader of the group. Eduviges has a lot on the line because she has 6 children that she has to take care of. She told me that those children are the reason she wants to be successful. She wants to be able to provide a different life for them and show them that they can be successful as well. <P>\n\nEduviges has a very small business selling shoes and Avon products. She plans to expand using this loan and dreams of success. The whole group is ready for this new challenge and a... F 1225 partner Retail 0
1 Dominican Republic "Caminemos Hacia Adelante" or "Walking Forward" is a group of ten entrepreneurs seeking their second loan from Esperanza International. The groups past loan has been successfully repaid and the group hopes to use additional loan funds for further business expansion. \n\nEstella is one of the coordinators for this group in Santiago. Estella sells undergarments to her community and neighboring communities. Estella used her first loan, which has now been completely repaid, to buy additional products and Estela was able to increase the return on her business by adding inventory. Estella wants to use her second loan to buy more undergarments to sell to her customers. \n\nEstella lives with her mother and sister and dreams of improving the house they live in and plans to use her business ... F 1975 lender Clothing 0
2 Dominican Republic "Creciendo Por La Union" is a group of 10 people hoping to start their own businesses. This group is looking to receive loans to either start a small business or to try and increase their business. Everyone in this group is living in extreme poverty, and they see this as a chance to improve their lives and the lives of their families. \n\n"Dalina" is the group representative and was chosen because she is a very hardworking women. She is a young mother of two children, and she realized that she wanted a better life for her and her family. She is hoping to start a small business of selling clothes to people in her barrio. She hopes to someday have a thriving business and be able to provide for her family. On behalf of Dalina, the rest of the group, and Esperanza International: Thank you ... F 2175 partner Clothing 0
3 Dominican Republic "Cristo Vive" ("Christ lives" is a group of 10 women who are looking to receive their first loans. This is a very young group of women, and they all want to start changing their lives right away. Riquena is the group representative and leader of this group, and she is only 18 years old. She is also married, but has no children. She told me that once she has kids she wants to be able to provide them with a good life, and that is the main reason she is trying to start her own business. She plans on selling used clothes in her area, and hopes to one day have a big clothing store, and also design clothes. She is a very motivated person, and you can see it when you speak with her. She speaks Spanish and Creole fluently, and is studying English. This whole group is ready for this next step, ... F 1425 partner Clothing 0
4 Dominican Republic "Cristo Vive" is a large group of 35 people, 20 of which are hoping to take out a loan. For many of them this is their second loan, and a loan they hope to use to increase their business. The business range from clothing sales to salons. Miline is the chosen group representative due to her hard work and dedication. Miline is a hardworking mother of 5 very young children, the oldest being only 10 years old. She took her first loan and started a small business of selling chicken and other types of food. With this next loan she feels like she can increase her business greatly and start making money to support her family. Her dream is to have her own store someday, and be able to provide her family with comfortable life. On behalf of Miline, the group, and Esperanza International, thank yo... F 4025 partner Food 0

Let’s check how big the dataset is.

kiva.shape[0]
6818

So we have around 7,000 loan applications. Let’s now process and prepare the data.

%time experiment1 = setup(data=kiva, target='en')
Description Value
session_id 2214
Documents 6818
Vocab Size 12383
Custom Stopwords False
CPU times: user 1min 14s, sys: 295 ms, total: 1min 15s
Wall time: 1min 15s

This single line of code performs a large number of tasks that would normally take many lines of code. You can find out more about what this line does for NLP text pre-processing here.
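As a rough, hand-rolled sketch of the kind of steps setup() automates - lowercasing, stripping punctuation and removing stopwords (PyCaret's actual pipeline also lemmatises, builds bigrams and more; the stopword list below is a tiny illustrative one):

```python
# A hand-rolled sketch of the kind of steps setup() automates:
# lowercasing, stripping punctuation and removing stopwords.
# The stopword list here is a tiny illustrative one, not PyCaret's.
import re

STOPWORDS = {"a", "the", "is", "of", "to", "and", "her"}

def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())  # lowercase, keep letter runs only
    return [t for t in tokens if t not in STOPWORDS]

tokens = preprocess("Eduviges has a very small business selling shoes.")
print(tokens)  # ['eduviges', 'has', 'very', 'small', 'business', 'selling', 'shoes']
```

This is why the processed texts shown later look like bare streams of keywords rather than full sentences.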

Now our data is prepared, let’s create our topic model.

For topic modelling we will be using the Latent Dirichlet Allocation (LDA) technique. I’ve written previously about the mathematics behind two other techniques called Non-negative Matrix Factorization (NMF) and Singular Value Decomposition (SVD).

lda_topic_model = create_model('lda', num_topics=4)

So we now have our topic model. Notice we have set num_topics=4 - this means the model tries to discover the 4 topics that seem most relevant to the loan applications. We could set this to a different number if we wanted to.
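To show the underlying technique, here is a comparable 4-topic LDA fitted directly with scikit-learn - a sketch of the method on a tiny made-up corpus, not PyCaret's internals:

```python
# A comparable 4-topic LDA fitted directly with scikit-learn, on a tiny
# made-up corpus (a sketch of the technique, not PyCaret's internals).
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "loan to expand a small retail business",
    "second loan to buy clothing stock to sell",
    "loan for seed and fertiliser for the rice farm",
    "school fees and books for the children",
]

counts = CountVectorizer().fit_transform(docs)  # bag-of-words document-term matrix
lda = LatentDirichletAllocation(n_components=4, random_state=0).fit(counts)

doc_topics = lda.transform(counts)  # one row per document, one column per topic
print(doc_topics.shape)  # each row is a topic distribution summing to ~1
```

The per-document topic distributions this produces are exactly what PyCaret attaches to the data in the next step.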

Now that we have discovered 4 topics for the loan applications and trained a model to recognise them, we can use this model to score all our applications against each of the 4 topics using the assign_model() function.

lda_results = assign_model(lda_topic_model)
lda_results.head()
country en gender loan_amount nonpayment sector status Topic_0 Topic_1 Topic_2 Topic_3 Dominant_Topic Perc_Dominant_Topic
0 Dominican Republic group woman look receive small loan take small loan already second loan group go try expand small business start generate income group representative leader group eduvige lot line child tell child reason want successful want able provide different life show successful well eduvige small business selling shoe avon product plan expand use loan dream success whole group ready new challenge road better live behalf eduvige thank support F 1225 partner Retail 0 0.410590 0.044232 0.001707 0.543472 Topic 3 0.54
1 Dominican Republic caminemos walk forward group entrepreneur seek second loan esperanza_international group loan successfully_repaid group hope use additional loan fund business expansion coordinator group sell undergarment community neighboring community use first loan completely repay buy additional product estela able increase return business add inventory estella want use second loan buy undergarment sell customer live mother sister dream improve house live plan use business profit member art juice ice_cream fry food cake sale behalf esperanza group business entrepreneur like thank support F 1975 lender Clothing 0 0.608610 0.084845 0.001478 0.305067 Topic 0 0.61
2 Dominican Republic por la_union group people hope start business group look receive loan start small business try increase business group poverty see chance improve life live family representative choose hardworke woman young mother child realize want well life family hope start small business sell clothe people barrio hope someday thrive business able provide family behalf thank support F 2175 partner Clothing 0 0.486984 0.012169 0.002022 0.498825 Topic 3 0.50
3 Dominican Republic vive live group woman look receive first loan young group woman want start change life right away riquena group representative leader group year old also marry child tell kid want able provide good life main reason try start business plan sell use clothe area hope day big clothing store also design clothe motivated person see speak speak spanish creole fluently study english whole group ready next step excited_opportunity behalf thank support F 1425 partner Clothing 0 0.289351 0.071750 0.001620 0.637279 Topic 3 0.64
4 Dominican Republic cristo vive large group people hope take loan many second loan hope use increase business business range clothing sale salon miline choose group representative due hard work dedication miline hardworke mother young child old year old take first loan start small business sell chicken type food next loan feel increase business greatly start make money support family dream store someday able provide family comfortable life behalf miline thank support F 4025 partner Food 0 0.562529 0.032050 0.001672 0.403749 Topic 0 0.56

We can see the topic model has given us several new things. Firstly, for each loan application it has given us a measure of how strongly that application scores for each of the 4 topics - a value between 0 and 1. Secondly, for each loan application, Dominant_Topic tells us which is the most important topic. Finally, Perc_Dominant_Topic tells us how highly that loan application scores for its dominant topic.
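These last two columns can be derived directly from the per-topic scores; a sketch with made-up scores for two applications, mirroring assign_model's output:

```python
# Deriving Dominant_Topic and Perc_Dominant_Topic from the per-topic scores
# (made-up scores for two applications, mirroring assign_model's output).
import pandas as pd

scores = pd.DataFrame({
    "Topic_0": [0.41, 0.61], "Topic_1": [0.04, 0.08],
    "Topic_2": [0.00, 0.00], "Topic_3": [0.54, 0.31],
})

dominant = scores.idxmax(axis=1)   # column name with the highest score per row
dominant_pct = scores.max(axis=1)  # the highest score itself

print(list(dominant))      # ['Topic_3', 'Topic_0']
print(list(dominant_pct))  # [0.54, 0.61]
```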

Let’s have a look at how many loan applications fall within each of the 4 topics; Pycaret makes this very easy using the plot_model() function.

plot_model(lda_topic_model, plot = 'topic_distribution')

So we can see that topic 0 covers most of the loan applications, and the other topics much less, with topic 1 having very few examples.

What are the topics actually about? Word counts

How can we find out what these hidden topics are about? We can look at the top 100 words in the text of each topic to give us some idea.
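As a rough sketch of the idea, top words for a topic can be tallied by pooling the pre-processed texts assigned to it (the tiny texts below are made up for illustration):

```python
# Tallying top words for a topic by pooling its (pre-processed) texts.
# The tiny texts below are made up for illustration.
from collections import Counter

topic_0_texts = [
    "business year child old",
    "business year child",
    "business year",
    "business",
]

counts = Counter(" ".join(topic_0_texts).split())
top_words = [word for word, _ in counts.most_common(4)]
print(top_words)  # ['business', 'year', 'child', 'old']
```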

Again, Pycaret makes this very easy using the plot_model() function.

plot_model(lda_topic_model, plot = 'frequency', topic_num = 'Topic 0')

So we can see for topic 0 the top 4 words are:

  • Business
  • Year
  • Child
  • Old

You could imagine that loan applications under this topic might emphasise, for example, how the loan would bring a benefit within a specific year, or would benefit both older and younger people in the community.

Let’s have a look at topic 1.

plot_model(lda_topic_model, plot = 'frequency', topic_num = 'Topic 1')

So we can see for topic 1 the top 4 words are:

  • Year
  • Loan
  • Community
  • Clinic

Perhaps applications under this topic tend to emphasise how the loan might benefit the local community, including healthcare services specifically?

Let’s examine topic 2.

plot_model(lda_topic_model, plot = 'frequency', topic_num = 'Topic 2')

So we can see for topic 2 the top 4 words are:

  • Rice
  • Farmer
  • Use
  • Sector

For this topic, it might be that these loan applications are for projects relating more to agriculture and food production.

Finally, let’s explore topic 3.

plot_model(lda_topic_model, plot = 'frequency', topic_num = 'Topic 3')

The top 4 words for topic 3 are:

  • Loan
  • Child
  • School
  • Sell

You could imagine that perhaps loans under this topic might be related to education and schools, and perhaps also the buying and selling of products for schools or children.

So this has given us some good indications of what the different hidden topics might be about for these loan applications.

How similar or different are topics? Dimensionality Reduction

Another thing we can do is look at these loan application texts spatially. We can convert these texts into numbers that represent their meaning, then plot these numbers as points in 3D space. Each point will then represent an individual loan application; points that are closer together are applications that are more similar, and points further apart are applications that are more different.

This general approach of reducing data down into simplified numbers is called Dimensionality Reduction, and you can find out more about these methods in an earlier project I did on this. We will use a method for this called TSNE.
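Under the hood this amounts to projecting each document's topic weights down to 3 dimensions; a sketch with scikit-learn's TSNE on made-up topic weights:

```python
# Projecting each document's topic weights down to 3 dimensions with t-SNE
# (made-up topic weights for 50 documents; a sketch of what plot_model does).
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
doc_topic_weights = rng.dirichlet([1, 1, 1, 1], size=50)  # 50 docs x 4 topics

embedding = TSNE(n_components=3, perplexity=10, random_state=0).fit_transform(
    doc_topic_weights
)
print(embedding.shape)  # one 3D point per document
```

Each row of the embedding becomes one point in the 3D scatter plot below.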

Again Pycaret makes this very easy to do using the plot_model() function.

plot_model(lda_topic_model, plot = 'tsne')

We can tell a few things from this view of the loan applications and topics:

  • All topics seem to be fairly distinct with little overlap
  • Topics 0, 1 & 3 seem to meet at the edges, suggesting there are a few cases that could fall in either topic
  • Topic 2 seems to be the most unique; it’s the most separated from the others spatially

This seems to confirm what we found when we looked at the top words from each topic: topic 2, about farming and agriculture, really was much more distinct than the other topics, which overlapped a little more with each other.

So we can see that topic modelling can be a very useful technique for businesses, providing insight into a set of texts we may know nothing about. It can help us discover hidden categories among these texts, how many texts fall under each category, how closely related or distinct the categories are - and much more. This could easily be applied to customer queries, survey responses, transcripts of customer conversations or emails, and more, to help businesses gain useful insights from their textual data.

5.2 Sentiment Analysis & Classification - Predict if Amazon product reviews are positive or negative

Pycaret also comes with a dataset of Amazon product reviews; let’s load these and have a look.

amazon_reviews = get_data('amazon')
reviewText Positive
0 This is a one of the best apps acording to a bunch of people and I agree it has bombs eggs pigs TNT king pigs and realustic stuff 1
1 This is a pretty good version of the game for being free. There are LOTS of different levels to play. My kids enjoy it a lot too. 1
2 this is a really cool game. there are a bunch of levels and you can find golden eggs. super fun. 1
3 This is a silly game and can be frustrating, but lots of fun and definitely recommend just as a fun time. 1
4 This is a terrific game on any pad. Hrs of fun. My grandkids love it. Great entertainment when waiting in long lines 1

So we can see we have just a column for the text of the review, and another called ‘Positive’, which is a label indicating whether the review was positive or not, i.e. 1 or 0. Let’s see how many reviews we have.

amazon_reviews.shape[0]
20000

So we have 20,000 reviews. Let’s get a count of how many positive and negative reviews we have.

amazon_reviews['Positive'].value_counts()
1    15233
0     4767
Name: Positive, dtype: int64

So around 75% of the reviews are positive, and 25% are negative.

To create a classification model, we will first need to create some features. These are essentially numbers that represent something about the input; given we are trying to predict whether a review is positive or negative, these features need to capture something about the text that will help us predict that.

There are many methods of turning text into numeric features, but we are actually going to use topic modelling to create some topics to describe our text, and use these as features to help our classifier model predict positive or negative sentiment.
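For comparison, one common alternative way of turning text into numeric features is TF-IDF; a quick sketch (we will stick with the LDA topic weights in this article):

```python
# TF-IDF: a common alternative way of turning text into numeric features
# (a sketch for comparison; this article uses LDA topic weights instead).
from sklearn.feature_extraction.text import TfidfVectorizer

reviews = [
    "this is a pretty good version of the game",
    "really cool game super fun",
    "frustrating but lots of fun",
]

vectoriser = TfidfVectorizer()
features = vectoriser.fit_transform(reviews)  # one weighted word-count row per review

print(features.shape[0])  # one feature vector per review
```

TF-IDF gives one column per vocabulary word, whereas the topic-model approach below compresses each review down to just 4 topic columns.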

Let’s set up and process our review data for topic modelling.

%time experiment2 = setup(data=amazon_reviews, target='reviewText')
Description Value
session_id 497
Documents 20000
Vocab Size 12771
Custom Stopwords False
CPU times: user 1min 28s, sys: 1.51 s, total: 1min 30s
Wall time: 1min 35s

As before, we will create a topic model to generate some new categories.

lda_topic_model2 = create_model('lda')

Let’s now predict these categories for our reviews.

lda_results = assign_model(lda_topic_model2)
lda_results.head()
reviewText Positive Topic_0 Topic_1 Topic_2 Topic_3 Dominant_Topic Perc_Dominant_Topic
0 good app acorde bunch people agree bomb egg pig king pig realustic stuff 1 0.081603 0.309925 0.227132 0.381340 Topic 3 0.38
1 pretty good version game free lot different level play kid enjoy lot 1 0.070119 0.200039 0.249249 0.480594 Topic 3 0.48
2 really cool game bunch level find golden egg super fun 1 0.116654 0.263965 0.197222 0.422159 Topic 3 0.42
3 silly game frustrating lot fun definitely recommend fun time 1 0.077698 0.148072 0.309584 0.464646 Topic 3 0.46
4 terrific game pad fun grandkid love great entertainment wait long line 1 0.072539 0.138212 0.424701 0.364547 Topic 2 0.42

So our data is almost ready. Our classification model doesn’t need the text data now, as we have represented the text using the values for the new categories created by our topic model. We also don’t need the Dominant or Perc topic fields, so let’s drop these columns.

lda_results.drop(['reviewText', 'Dominant_Topic', 'Perc_Dominant_Topic'], axis=1, inplace=True)
lda_results.head()
Positive Topic_0 Topic_1 Topic_2 Topic_3
0 1 0.081603 0.309925 0.227132 0.381340
1 1 0.070119 0.200039 0.249249 0.480594
2 1 0.116654 0.263965 0.197222 0.422159
3 1 0.077698 0.148072 0.309584 0.464646
4 1 0.072539 0.138212 0.424701 0.364547

It’s common practice when training classification models to split the data: some to train the model on, and some to test the model with later. Let’s split this data of 20,000 reviews to give us a small test data set.

train, test = split_data(lda_results)
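An equivalent stratified hold-out split could be done directly with scikit-learn; a sketch on made-up data (the 10% test fraction here is an assumption for illustration, not necessarily what split_data uses):

```python
# An equivalent stratified hold-out split with scikit-learn, on made-up data.
# The 10% test fraction is an assumption for illustration, not necessarily
# the fraction split_data uses.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"Positive": [1, 0] * 50, "Topic_0": range(100)})

train_df, test_df = train_test_split(
    df, test_size=0.1, stratify=df["Positive"], random_state=42
)
print(len(train_df), len(test_df))  # 90 10
```

Stratifying on the label keeps the positive/negative ratio the same in both partitions, which matters here because the classes are imbalanced.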

Let’s now run setup again, this time to prepare our training data for classification model training.

%time experiment3 = setup(data=train, target='Positive')
  Description Value
0 Session id 227
1 Target Positive
2 Target type classification
3 Data shape (19980, 5)
4 Train data shape (13985, 5)
5 Test data shape (5995, 5)
6 Numeric features 4
7 Preprocess True
8 Imputation type simple
9 Numeric imputation mean
10 Categorical imputation constant
11 Fold Generator StratifiedKFold
12 Fold Number 10
13 CPU Jobs -1
14 Log Experiment False
15 Experiment Name clf-default-name
16 USI 22b0
CPU times: user 259 ms, sys: 8.99 ms, total: 267 ms
Wall time: 269 ms

Let’s now train a range of different models to predict the positive or negative sentiment, and choose the best one.

Again, Pycaret makes easy something that would normally take many lines of code.

compare_models(exclude='dummy')
  Model Accuracy AUC Recall Prec. F1 Kappa MCC TT (Sec)
svm SVM - Linear Kernel 0.7618 0.0000 1.0000 0.7618 0.8648 0.0000 0.0000 0.0220
lr Logistic Regression 0.7617 0.6472 0.9981 0.7625 0.8645 0.0053 0.0294 0.0290
ridge Ridge Classifier 0.7617 0.0000 0.9992 0.7620 0.8646 0.0019 0.0150 0.0160
lda Linear Discriminant Analysis 0.7616 0.6474 0.9948 0.7637 0.8641 0.0156 0.0512 0.0210
gbc Gradient Boosting Classifier 0.7610 0.6559 0.9965 0.7626 0.8640 0.0065 0.0282 0.8190
ada Ada Boost Classifier 0.7602 0.6476 0.9937 0.7631 0.8633 0.0103 0.0318 0.2600
catboost CatBoost Classifier 0.7600 0.6468 0.9868 0.7658 0.8624 0.0316 0.0690 6.6620
lightgbm Light Gradient Boosting Machine 0.7583 0.6380 0.9829 0.7661 0.8610 0.0332 0.0675 0.1940
nb Naive Bayes 0.7540 0.6470 0.9608 0.7720 0.8561 0.0727 0.1019 0.0250
xgboost Extreme Gradient Boosting 0.7495 0.6231 0.9590 0.7692 0.8537 0.0528 0.0750 0.8160
qda Quadratic Discriminant Analysis 0.7439 0.6441 0.9504 0.7712 0.8465 0.0333 0.0493 0.0190
rf Random Forest Classifier 0.7233 0.5970 0.8956 0.7758 0.8314 0.0819 0.0892 1.3430
knn K Neighbors Classifier 0.7171 0.5745 0.8887 0.7737 0.8272 0.0683 0.0737 0.0930
et Extra Trees Classifier 0.7058 0.5801 0.8628 0.7760 0.8171 0.0756 0.0786 0.6430
dt Decision Tree Classifier 0.6556 0.5333 0.7667 0.7780 0.7723 0.0657 0.0658 0.0740
SGDClassifier(alpha=0.0001, average=False, class_weight=None,
              early_stopping=False, epsilon=0.1, eta0=0.001, fit_intercept=True,
              l1_ratio=0.15, learning_rate='optimal', loss='hinge',
              max_iter=1000, n_iter_no_change=5, n_jobs=-1, penalty='l2',
              power_t=0.5, random_state=227, shuffle=True, tol=0.001,
              validation_fraction=0.1, verbose=0, warm_start=False)

The F1 score is a good measure of how well a model predicts both positive and negative sentiment, and the best model by this measure is ‘svm’.
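As a quick sanity check, the F1 column is just the harmonic mean of the precision and recall columns; taking the svm row from the table:

```python
# The F1 score is the harmonic mean of precision and recall.
# Checking the svm row of the table above: Prec. 0.7618, Recall 1.0000.
precision, recall = 0.7618, 1.0000

f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.8648 - matches the table
```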

Let’s use this model on our test data to see if it seems to be predicting the correct sentiment for our reviews.

best_model = create_model('svm', verbose=False)
new_predictions = predict_model(best_model, data=test)
new_predictions = new_predictions.join(amazon_reviews)
new_predictions = new_predictions[['reviewText', 'Topic_0', 'Topic_1', 'Topic_2', 'Topic_3', 'Positive', 'Label']]
new_predictions.head()
reviewText Topic_0 Topic_0 Topic_0 Topic_0 Positive Label
60 who doesn't like angrybirds?but the paid version is better as it doesn't have all those annoying adds. blocking your shots! 0.085445 0.085445 0.085445 0.085445 1 1
159 Free and fun, what could be better? The birds are angry, it's everything I expected, and anyway, those pigs had it coming! 0.079090 0.079090 0.079090 0.079090 1 1
1294 I downloaded this to my tablet, as my phone is out of space. Very easy to read the latest tweets that way 0.118320 0.118320 0.118320 0.118320 1 1
4352 I love this App and also use Out Of Milk via the website. It makes creating my lists and sharing it with others, quick and easy! It also keeps track of my cost as I add to is, making budgeting a breeze. 0.081643 0.081643 0.081643 0.081643 1 1
7016 its actualy saying wat I'm going through. its very fun and creative. I will be sure to use it everyday. no complaints. good job guys. :) 0.104748 0.104748 0.104748 0.104748 1 1

‘Positive’ is our original sentiment for our reviews, and ‘Label’ is the sentiment predicted by the model. Looking at the first few reviews seems to confirm that our model is able to predict the sentiment of reviews quite well.

This type of text classification or sentiment analysis model could be used for many different types of business application, for example on customer requests to identify complaints. A customer complaints prediction model could be used to classify thousands of customer requests, which could then be used to prioritise customer requests that are flagged as complaints by the model, or pass these on to a specialist team. This could ensure customer complaints were dealt with quickly regardless of how many total customer messages were incoming.

6 Conclusion

In this article we have looked at the huge benefits NLP applications can bring to businesses. Most state-of-the-art NLP applications use deep learning, which often requires specialist resources that not all businesses will initially be able or willing to support.

We have shown here some examples of how NLP applications without deep learning, such as topic modelling, sentiment analysis and text classification, can bring huge benefits to businesses despite not using state-of-the-art methods, especially for businesses new to Data Science, Machine Learning and AI.
