Sentiment Analysis using NLP Libraries
Sentiment analysis is one of the most popular applications of natural language processing and text analytics, with a vast number of websites, books, and tutorials on the subject. It typically works best on subjective text, where people express opinions, feelings, and moods. From a real-world industry standpoint, sentiment analysis is widely used to analyze corporate surveys, feedback surveys, social media data, and reviews of movies, places, commodities, and more. The idea is to analyze and understand people's reactions toward a specific entity and take insightful actions based on their sentiment.

A text corpus consists of multiple text documents, and each document can range from a single sentence to a complete document with multiple paragraphs. Textual data, in spite of being highly unstructured, can be classified into two major types of documents: factual documents, which typically state facts with no specific feelings or emotions attached to them (also known as objective documents), and subjective documents, which express feelings, moods, emotions, and opinions.
Sentiment analysis is also popularly known as opinion analysis or opinion mining. The key idea is to use techniques from text analytics, NLP, Machine Learning, and linguistics to extract important information or data points from unstructured text. This in turn helps us derive qualitative outputs, like the overall sentiment on a positive, neutral, or negative scale, and quantitative outputs, like the sentiment polarity and the subjectivity and objectivity proportions.
Sentiment polarity is typically a numeric score assigned to the positive and negative aspects of a text document based on subjective parameters like specific words and phrases expressing feelings and emotion. Neutral sentiment typically has a polarity of 0 since it does not express any specific sentiment, positive sentiment has polarity > 0, and negative sentiment has polarity < 0. Of course, you can always change these thresholds based on the type of text you are dealing with; there are no hard constraints on this.
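For instance, here is a minimal sketch of threshold-based labeling, using the TextBlob library purely as a convenient polarity scorer (TextBlob is not part of this project; any scorer producing values in [-1, 1] works the same way):

```python
# A minimal sketch of mapping a numeric polarity score to a sentiment label.
# TextBlob is an assumed, illustrative scorer; swap in any polarity function.
from textblob import TextBlob

def polarity_label(text, pos_threshold=0.0, neg_threshold=0.0):
    """Map a polarity score in [-1.0, 1.0] to positive/neutral/negative."""
    score = TextBlob(text).sentiment.polarity
    if score > pos_threshold:
        return 'positive'
    if score < neg_threshold:
        return 'negative'
    return 'neutral'

print(polarity_label("The movie was absolutely wonderful."))
print(polarity_label("The plot was dull and the acting was worse."))
```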
Problem Statement
The main objective in this project is to predict the sentiment for a number of movie reviews obtained from the Internet Movie Database (IMDb). This dataset contains 50,000 movie reviews that have been pre-labeled with “positive” and “negative” sentiment class labels based on the review content. Besides these, there are additional movie reviews that are unlabeled.

The dataset can be obtained from http://ai.stanford.edu/~amaas/data/sentiment/
courtesy of Stanford University and Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts.
Hence our task will be to predict the sentiment of 15,000 labeled movie reviews, using the remaining 35,000 reviews to train our supervised models.
Step 1: Text Pre-Processing and Normalization
One of the key steps before diving into feature engineering and modeling involves cleaning, pre-processing, and normalizing text to bring text components like phrases and words into some standard format. This enables standardization across a document corpus, which helps build meaningful features and reduces the noise introduced by factors like irrelevant symbols, special characters, and XML and HTML tags.
• Cleaning text:
Our text often contains unnecessary content like HTML tags, which do not add much value when analyzing sentiment. Hence we need to make sure we remove them before extracting features. The BeautifulSoup library does an excellent job of providing the necessary functions for this.
• Removing accented characters:
In our dataset, we are dealing with reviews in the English language, so we need to make sure that characters in any other format, especially accented characters, are converted and standardized into ASCII characters. A simple example would be converting é to e.
• Expanding contractions:
In the English language, contractions are basically shortened versions of words or syllables. These shortened versions of existing words or phrases are created by removing specific letters and sounds. Examples include expanding don’t to do not and I’d to I would. Contractions pose a problem in text normalization because we have to deal with special characters like the apostrophe, and we also have to convert each contraction to its expanded, original form.
• Removing special characters:
Another important task in text cleaning and normalization is to remove special characters and symbols that often add extra noise to unstructured text. Simple regexes can be used to achieve this. It’s your choice whether to retain or remove numbers, depending on whether you want them in your normalized corpus.
• Removing stop words:
Words which have little or no significance, especially when constructing meaningful features from text, are known as stopwords. Words like a, an, and the are considered stopwords. There is no universal stopword list, but we use a standard English-language stopword list from nltk. You can also add your own domain-specific stopwords if needed.
• Stemming and Lemmatization:
A word stem is the base form of a word; new words are created by attaching affixes like prefixes and suffixes to the stem, a process known as inflection. The reverse process of reducing a word to its base form is known as stemming. A simple example is the words WATCHES, WATCHING, and WATCHED, which have the root stem WATCH as their base form. The nltk package offers a wide range of stemmers like the PorterStemmer and LancasterStemmer. Lemmatization is very similar to stemming, where we remove word affixes to get to the base form of a word. However, the base form in this case is known as the root word, not the root stem. The difference is that the root word is always a lexicographically correct word (present in the dictionary), while the root stem may not be. We will use only lemmatization in our normalization pipeline, to retain lexicographically correct words.
My text normalization code combines all of these steps into a single pipeline.
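A minimal sketch of such a pipeline, chaining the steps above, might look like the following (it assumes beautifulsoup4 and nltk are installed, with the required nltk corpora downloaded, and uses only a tiny illustrative contraction map):

```python
# Sketch of the normalization pipeline described above.
# Assumes: pip install beautifulsoup4 nltk, plus
# nltk.download('stopwords') and nltk.download('wordnet').
import re
import unicodedata

import nltk
from bs4 import BeautifulSoup
from nltk.stem import WordNetLemmatizer

STOPWORDS = set(nltk.corpus.stopwords.words('english'))
# Tiny illustrative subset of a contraction map; a real one covers many more forms.
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "i'd": "i would"}
lemmatizer = WordNetLemmatizer()

def normalize_document(doc):
    # 1. Strip HTML tags.
    doc = BeautifulSoup(doc, 'html.parser').get_text()
    # 2. Convert accented characters to ASCII equivalents (é -> e).
    doc = unicodedata.normalize('NFKD', doc).encode('ascii', 'ignore').decode('utf-8')
    # 3. Lowercase and expand contractions.
    doc = doc.lower()
    for contraction, expanded in CONTRACTIONS.items():
        doc = doc.replace(contraction, expanded)
    # 4. Remove special characters and symbols (drop the 0-9 range to remove numbers too).
    doc = re.sub(r'[^a-z0-9\s]', '', doc)
    # 5. Remove stopwords and lemmatize the remaining tokens.
    tokens = [lemmatizer.lemmatize(tok) for tok in doc.split() if tok not in STOPWORDS]
    return ' '.join(tokens)

print(normalize_document("<p>I don't think the movies were THAT bad, é!</p>"))
```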
We will cover two varieties of techniques for analyzing sentiment, which include the following.
- Traditional supervised Machine Learning models
- Unsupervised lexicon-based models
Unsupervised Learning
Unsupervised sentiment analysis models use well-curated knowledge bases, ontologies, lexicons, and databases that have detailed information pertaining to subjective words and phrases, including sentiment, mood, polarity, objectivity, subjectivity, and so on. A lexicon model typically uses a lexicon, also known as a dictionary or vocabulary of words, specifically aligned toward sentiment analysis. Usually these lexicons contain a list of words associated with positive and negative sentiment, polarity (magnitude of negative or positive score), parts of speech (POS) tags, subjectivity classifiers (strong, weak, neutral), mood, modality, and so on. There are several popular lexicon models used for sentiment analysis, some of which are mentioned as follows.
- AFINN Lexicon
- SentiWordNet Lexicon
- VADER Lexicon
Steps:
1. Setting up Dependencies
2. Sentiment Analysis using AFINN
3. Sentiment Analysis using SentiWordNet
4. Sentiment Analysis using VADER
My code:
Model Evaluation using AFINN Lexicon:
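A sketch of how this evaluation might be done (the afinn package supplies the lexicon scorer; test_reviews and test_sentiments are hypothetical names for the 15,000 held-out reviews and their labels):

```python
# Sketch: score each review with AFINN and threshold to get a label.
# Assumes: pip install afinn scikit-learn; test_reviews / test_sentiments
# are stand-in names for the held-out reviews and their 'positive'/'negative' labels.
from afinn import Afinn
from sklearn.metrics import accuracy_score, classification_report

afn = Afinn(emoticons=True)
predicted = ['positive' if afn.score(review) >= 1.0 else 'negative'
             for review in test_reviews]

print('Accuracy:', accuracy_score(test_sentiments, predicted))
print(classification_report(test_sentiments, predicted))
```

The 1.0 cutoff is just one reasonable choice; as noted earlier, such thresholds can be tuned for the text at hand.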

Model Evaluation using SentiWordNet Lexicon:
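SentiWordNet scores individual word senses rather than whole documents, so one simple (and admittedly crude) labeling strategy is to aggregate the positive and negative scores of each token's most common sense. A sketch, assuming nltk's sentiwordnet, wordnet, and punkt resources are downloaded:

```python
# Sketch: aggregate SentiWordNet scores over a review's tokens and
# label by the sign of the net score. Evaluation then proceeds exactly
# as in the AFINN sketch above (accuracy_score / classification_report).
import nltk
from nltk.corpus import sentiwordnet as swn

def sentiwordnet_label(review):
    net_score = 0.0
    for token in nltk.word_tokenize(review.lower()):
        synsets = list(swn.senti_synsets(token))
        if synsets:  # approximate each word by its first (most common) sense
            net_score += synsets[0].pos_score() - synsets[0].neg_score()
    return 'positive' if net_score >= 0 else 'negative'

print(sentiwordnet_label("A remarkably good and heartfelt film."))
```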

Model Evaluation using VADER Lexicon:
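VADER returns negative, neutral, positive, and compound scores per document; a common convention is to threshold the compound score (which lies in [-1, 1]) at around 0.05. A sketch using nltk's bundled VADER implementation (requires nltk.download('vader_lexicon')):

```python
# Sketch: label a review by thresholding VADER's compound score.
# Evaluation proceeds as in the AFINN sketch above.
from nltk.sentiment.vader import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def vader_label(review, threshold=0.05):
    scores = analyzer.polarity_scores(review)  # keys: 'neg', 'neu', 'pos', 'compound'
    return 'positive' if scores['compound'] >= threshold else 'negative'

print(vader_label("The movie was great, I absolutely loved it!"))
```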

Results derived from Unsupervised Learning:

From the results we can conclude that the AFINN lexicon is the most accurate of the three lexicon models, as it achieves the highest accuracy score.
Supervised Learning
Another way to build a model that understands text content and predicts the sentiment of text-based reviews is to use supervised Machine Learning. To be more specific, we will be using classification models for solving this problem.
The major steps to achieve this are as follows:
- Prepare train and test datasets (optionally a validation dataset)
- Pre-process and normalize text documents
- Feature engineering
- Model training
- Model prediction and evaluation
My code:
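A minimal sketch covering all four feature/model combinations below (assuming scikit-learn, with train_reviews/train_sentiments and test_reviews/test_sentiments as hypothetical names for the normalized 35,000/15,000 splits):

```python
# Sketch: BOW and TF-IDF features, each fed to Logistic Regression and a
# linear SVM (SGDClassifier with hinge loss). Labels are assumed to be the
# strings 'positive' / 'negative'.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import accuracy_score, f1_score

vectorizers = {
    'BOW': CountVectorizer(ngram_range=(1, 2)),
    'TF-IDF': TfidfVectorizer(ngram_range=(1, 2)),
}
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'SVM': SGDClassifier(loss='hinge', max_iter=1000),
}

for feat_name, vectorizer in vectorizers.items():
    # Fit the vectorizer on training text only, then transform both splits.
    train_features = vectorizer.fit_transform(train_reviews)
    test_features = vectorizer.transform(test_reviews)
    for model_name, model in models.items():
        model.fit(train_features, train_sentiments)
        predictions = model.predict(test_features)
        acc = accuracy_score(test_sentiments, predictions)
        f1 = f1_score(test_sentiments, predictions, pos_label='positive')
        print(f'{model_name} on {feat_name}: accuracy={acc:.4f}, F1={f1:.4f}')
```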
Logistic Regression model on BOW features

Logistic Regression model on TF-IDF features

SVM model on BOW features

SVM model on TF-IDF features

Results derived from the traditional supervised learning models
From the results, we can say that the overall F1-score and accuracy of these supervised ML models are both around 0.9.
Conclusions
We compared the unsupervised AFINN lexicon-based model against the supervised Logistic Regression model on BOW features.
On comparing the overall F1-score and accuracy of the supervised ML model with the best unsupervised lexicon model (AFINN), we conclude that supervised learning gives more accurate and precise results.
Published by Jay Patel. This problem was part of my internship programme at Suven Consultants & Technology Pvt Ltd.