Sentiment analysis, or opinion mining, is a natural language processing technique used to determine whether data is positive, negative, or neutral. Sentiment analysis is often performed on textual data to help businesses monitor brand and product sentiment in customer feedback and understand customer needs. Developing models is extremely important in sentiment analysis because we can build very specific models for a specific scenario to analyze the customer's sentiment. The more a business knows about its customers, the better the customer experience and the better the overall business. We will use Python to build features and a logistic regression model. First, we will explore the textual data and derive features from the dataset in detail. Then we will build our logistic regression model to predict the sentiment of the data.
# importing necessary libraries
# the nltk library is used for sentiment analysis
import nltk, re, string
nltk.download('stopwords')
from nltk.corpus import stopwords, twitter_samples  # nltk.corpus is used for text analysis and to train models.
import numpy as np
# pickle is used to load and store models
import pickle
from sklearn import metrics
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\PC-8783213\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
# For better understanding of Logistic Regression: https://towardsdatascience.com/introduction-to-logistic-regression-66248243c148
The process_tweet() function is central to the sentiment analysis we are doing here. The function takes in a tweet string from the imported twitter_samples corpus and processes the string data in several ways. We use the re library and the re.sub() function to remove unwanted parts of the strings; this is the first step in removing all the unwanted data. Regular expressions are used to remove the patterns r'\$\w*', r'https?:\/\/.*[\r\n]*', and r'#'. In other words, we strip out any dollar-sign tokens, URLs, and '#' symbols that may be in the text string, because they would not add anything to our analysis. The nltk.TweetTokenizer() function then converts the string into a list of words that can be better managed in our data analysis process. Ex: 'I am having a good day' => ['I', 'am', 'having', 'a', 'good', 'day'].
Part of cleaning the tweet strings is also the nltk.PorterStemmer() function. Ex: running => run or walking => walk. It reduces each word to its root form so the data is easier to use, which helps the process of developing our regression model.
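As a quick, self-contained illustration of these two helpers (a minimal sketch; the example strings are our own):
# Stand-alone look at the tokenizer and stemmer used inside process_tweet()
tokenizer = nltk.TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
print(tokenizer.tokenize('@someone I am having a good day #sunny'))
# ['i', 'am', 'having', 'a', 'good', 'day', '#sunny'] -- handle stripped, text lowercased
stemmer = nltk.PorterStemmer()
print(stemmer.stem('running'), stemmer.stem('walking'))
# run walk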
# Preprocessing of the tweets that is our data
def process_tweet(tweet):
    stemmer = nltk.PorterStemmer()
    stopwords_english = stopwords.words('english')
    # using the re lib and sub function to remove unwanted parts of the strings
    tweet = re.sub(r'\$\w*', '', tweet)                 # remove dollar-sign tokens
    tweet = re.sub(r'^RT[\s]+', '', tweet)              # remove the retweet marker "RT"
    tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)  # remove URLs
    tweet = re.sub(r'#', '', tweet)                     # removing the #
    # we use the tokenizer to take a string and return a list of words.
    # Ex: 'I am having a good day' => ['I', 'am', 'having', 'a', 'good', 'day']
    tokenizer = nltk.TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
    tweet_tokens = tokenizer.tokenize(tweet)
    # Now we use a for loop for further cleaning.
    # Stemming reduces each word to its most basic form.
    # Ex: learning => learn or works => work
    tweets_clean = []
    for word in tweet_tokens:
        if (word not in stopwords_english and
                word not in string.punctuation):
            stem_word = stemmer.stem(word)  # stemming the word using the stemmer
            tweets_clean.append(stem_word)
    return tweets_clean
# This is the most important part of the whole code:
# the feature set on which we will train our model is built here.
def build_freqs(tweets, ys):
    """Build frequencies.
    Input:
        tweets: a list of tweets
        ys: an m x 1 array with the sentiment label of each tweet
            (either 0 or 1)
    Output:
        freqs: a dictionary mapping each (word, sentiment) pair to its
        frequency
    """
    # Convert np array to list since zip needs an iterable.
    # The squeeze is necessary or the list ends up with one element.
    # Also note that this is just a NOP if ys is already a list.
    yslist = np.squeeze(ys).tolist()
    # Start with an empty dictionary and populate it by looping over all tweets
    # and over all processed words in each tweet.
    freqs = {}
    for y, tweet in zip(yslist, tweets):
        # the loop uses the process_tweet function to clean the string
        for word in process_tweet(tweet):
            pair = (word, y)
            if pair in freqs:
                freqs[pair] += 1
            else:
                freqs[pair] = 1
    return freqs
The most important part of the code is the build_freqs(tweets, ys) function. It builds the frequency dictionary on which our logistic regression model will be trained, and it is what lets us determine the sentiment of our overall dataset. The main purpose of this function is to count the frequency of each word in the vocabulary of our dataset, per sentiment class. For example, say the word "learning" comes into the function 12 times during training, and each time it appeared in a positive tweet; then the frequency for the pair ("learning", 1) is 12. The inputs are a list of tweets and the variable ys, which holds the sentiment label for each tweet; that label is either a 0 or a 1.
The output is a dictionary that maps each (word, sentiment) pair to the number of times that combination occurred. This dictionary, named freqs, is what is returned from the function, and it carries the positive and negative class counts we need to develop our model.
# Here is an example of how our function works.
# We have a list of strings that is assigned to tweets
tweets = ['i am happy', 'i am tricked', 'i am sad', 'i am tired', 'i am tired']
# ys normally comes from our nltk.corpus data, but here we use a small sample
ys = [1, 0, 0, 0, 0]
res = build_freqs(tweets, ys)
print(res)
{('happi', 1): 1, ('trick', 0): 1, ('sad', 0): 1, ('tire', 0): 2}
As we can see, we get a dictionary as our output. The functions are doing their job: we only get the base form of each word. In the dictionary returned by build_freqs(tweets, ys), {('happi', 1): 1, ('trick', 0): 1, ('sad', 0): 1, ('tire', 0): 2}, a key is a pair such as ('happi', 1) and its value, 1, is the count. Each entry therefore records the frequency of a word together with its sentiment value, which can be either 1 or 0. This gives us a measurement we can use to start training and developing our model.
As our dataset grows, this function gives us more and more of these numerical values tied to the labels 0 and 1. Recall that a logistic regression model predicts a binary response: the output takes a value of 0 or 1. Based on these counts, we can develop a model that maps the balance between positive and negative frequencies to a sentiment, and in this way we can begin to build our model.
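To make this concrete, here is how the positive and negative counts for a single word could be compared (a small sketch reusing the toy res dictionary from above):
# Compare how often a word appeared under each sentiment label
word = 'tire'
pos_count = res.get((word, 1), 0)  # count in tweets labeled 1
neg_count = res.get((word, 0), 0)  # count in tweets labeled 0
print(word, '-> positive:', pos_count, 'negative:', neg_count)
# tire -> positive: 0 negative: 2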
# Here we download the datasets needed to develop our model.
nltk.download('twitter_samples')
nltk.download('stopwords')
[nltk_data] Downloading package twitter_samples to
[nltk_data]     C:\Users\PC-8783213\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\twitter_samples.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\PC-8783213\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
True
We have downloaded the twitter_samples dataset from the nltk library. It contains the files 'positive_tweets.json' and 'negative_tweets.json', which we will use to start developing our model based on binary labels. Keep in mind 'positive_tweets.json' = 1 and 'negative_tweets.json' = 0. These files contain tweets that have already been labeled as positive or negative, and with that we can start training our model.
# select the set of positive and negative tweets
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')
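As a quick sanity check, we can count the tweets in each file; the twitter_samples corpus ships with 5,000 tweets of each polarity:
print('positive tweets:', len(all_positive_tweets))  # 5000
print('negative tweets:', len(all_negative_tweets))  # 5000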
This part is important because we need to build test and train datasets for both the positive and negative tweet strings that we have stored in the variables all_positive_tweets and all_negative_tweets.
# split the data into two pieces, one for training and one for testing. we split the strings into test and train samples
# for both neg and pos.
test_pos = all_positive_tweets[4000:]
train_pos = all_positive_tweets[:4000]
test_neg = all_negative_tweets[4000:]
train_neg = all_negative_tweets[:4000]
Now we have our independent variables and we can start setting up the inputs we need to train our models. This is the first step in logistic regression modeling: we have our independent variable x for both the train and test datasets.
train_x = train_pos + train_neg
test_x = test_pos + test_neg
Now that we have our independent variables train_x and test_x, which hold the positive and negative tweet strings, the next step is to build the target variable y. Here we assign a 1 to every tweet from train_pos and test_pos and a 0 to every tweet from train_neg and test_neg, and stack these labels so that the target variables train_y and test_y line up with the training and testing sets.
# Combine positive and negative labels
# We are building our y - target variable here
train_y = np.append(np.ones((len(train_pos), 1)), np.zeros((len(train_neg), 1)), axis=0)
test_y = np.append(np.ones((len(test_pos), 1)), np.zeros((len(test_neg), 1)), axis=0)
print("train_y: ",train_y)
print("test_y: ",test_y)
train_y:  [[1.] [1.] [1.] ... [0.] [0.] [0.]]
test_y:  [[1.] [1.] [1.] ... [0.] [0.] [0.]]
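Since each class contributes 4,000 training and 1,000 testing tweets, a quick shape check confirms the label vectors line up with the tweet lists:
print('train_y.shape =', train_y.shape)  # (8000, 1)
print('test_y.shape  =', test_y.shape)   # (2000, 1)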
What we were doing above with the small sample of positive and negative strings was building a reference point for training our model. Now we are going to do the same thing with our full dataset, using the functions we created to build the frequency dictionary we need to develop our model.
# create frequency dictionary using train_x and train_y
freqs = build_freqs(train_x, train_y)
# check the output we get when using our build_freqs function
print("type(freqs) = " + str(type(freqs)))
print("len(freqs) = " + str(len(freqs.keys())))
type(freqs) = <class 'dict'>
len(freqs) = 11339
As we can see from the output, the dictionary named freqs holds 11,339 (word, sentiment) entries.
# Here is a test of the function; we can look at train_x[22]
print('This is an example of a positive tweet: \n', train_x[22])
print('\nThis is an example of the processed version of the tweet: \n', process_tweet(train_x[22]))
This is an example of a positive tweet: 
 @gculloty87 Yeah I suppose she was lol! Chat in a bit just off out x :))

This is an example of the processed version of the tweet: 
 ['yeah', 'suppos', 'lol', 'chat', 'bit', 'x', ':)']
We can see the raw string from the positive tweet. The string has a Twitter handle and a smiley face, things that need to be cleaned and processed, which is what the process_tweet() function is for. As we can see, the function converts the raw string into a list that has been processed with regular expressions, leaving only the base words of the sentence. Note that because we use the stemmer, the word 'suppose' becomes 'suppos'; stemming produces non-words like this for some inputs.
Logistic Regression is a machine learning algorithm which is used for classification problems; it is a predictive analysis algorithm based on the concept of probability.
Logistic Regression is similar to a Linear Regression model, but Logistic Regression uses a more complex cost function. Its hypothesis is built on the 'sigmoid function', also known as the 'logistic function', instead of a linear function.
The hypothesis of logistic regression limits the output to values between 0 and 1. Linear functions fail to represent it, since they can produce values greater than 1 or less than 0, which is not possible under the hypothesis of logistic regression.
What is the sigmoid function? In order to map predicted values to probabilities, we use the sigmoid function. The function maps any real value to a value between 0 and 1.
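Written out, the sigmoid (or logistic) function is:

h(z) = 1 / (1 + e^(-z))

For large positive z, h(z) approaches 1; for large negative z, it approaches 0; and h(0) = 0.5.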
We are going to develop our own logistic regression model based on the above equation. The function will be called sigmoid(). We can then use numpy to develop the logistic regression model, converting the equations into functions, starting with sigmoid().
# We are going to build our Logistic Regression model.
# Logistic regression
# Sigmoid function
def sigmoid(z):
    """
    Input:
        z: the input (can be a scalar or an array)
    Output:
        h: the sigmoid of z
    """
    zz = np.negative(z)
    h = 1 / (1 + np.exp(zz))
    return h
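A quick sanity check of the function: sigmoid(0) must be exactly 0.5, and large inputs saturate toward 0 or 1.
print(sigmoid(0))                       # 0.5
print(sigmoid(np.array([-10, 0, 10])))  # approx. [4.54e-05, 0.5, 0.9999546]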
Two key components of training a logistic regression model are the cost function and gradient descent:
Cost function. We know the cost function from linear regression: it represents the optimization objective, i.e. we create a cost function and minimize it so that we can develop an accurate model with minimum error.
We take the definition of the cost function and translate it into code using numpy. Our cost function is used inside the gradientDescent() function.
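Written out, the cost we minimize is the binary cross-entropy loss, which matches the cost line in the code below:

J(theta) = -(1/m) * sum_i [ y_i * log(h(z_i)) + (1 - y_i) * log(1 - h(z_i)) ]

where m is the number of training examples, z_i is the dot product of the i-th feature row with theta, and h is the sigmoid function.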
Gradient descent. Now the question arises: how do we reduce the cost value? This is done with gradient descent. To minimize our cost function, we run the gradient descent update on each parameter in theta.
The theta update rule shown below is the one used in our gradientDescent() function.
Gradient descent is what produces the weight values we need in the machine learning process: computing the gradient tells us which direction to move, and taking a step corresponds to one iteration of the update to the parameters.
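Concretely, each iteration applies the update rule used inside gradientDescent():

theta = theta - (alpha / m) * X^T (h - y)

where alpha is the learning rate, X is the (m, n+1) feature matrix, and (h - y) is the vector of prediction errors.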
# Cost function and gradient
def gradientDescent(x, y, theta, alpha, num_iters):
    """
    Input:
        x: matrix of features which is (m,n+1)
        y: corresponding labels of the input matrix x, dimensions (m,1)
        theta: weight vector of dimension (n+1,1)
        alpha: learning rate
        num_iters: number of iterations you want to train your model for
    Output:
        J: the final cost
        theta: your final weight vector
    Hint: you might want to print the cost to make sure that it is going down.
    """
    # get 'm', the number of rows in matrix x
    m = x.shape[0]
    for i in range(0, num_iters):
        z = np.dot(x, theta)
        h = sigmoid(z)
        # calculate the cost function
        cost = -1. / m * (np.dot(y.transpose(), np.log(h)) + np.dot((1 - y).transpose(), np.log(1 - h)))
        # update the weights theta (the learned parameter vector)
        theta = theta - (alpha / m) * np.dot(x.transpose(), (h - y))
    cost = float(cost)
    return cost, theta
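A small smoke test on made-up numbers can verify that the cost is finite and the weights move away from zero (the feature values here are arbitrary, chosen only for illustration):
# Tiny synthetic check: two examples with [bias, pos_count, neg_count] features
X_toy = np.array([[1., 8., 2.],   # mostly "positive" words
                  [1., 3., 9.]])  # mostly "negative" words
y_toy = np.array([[1.], [0.]])
J_toy, theta_toy = gradientDescent(X_toy, y_toy, np.zeros((3, 1)), 1e-2, 100)
print(J_toy)      # final cost after 100 iterations (below the initial log(2) ~ 0.693)
print(theta_toy)  # learned (3, 1) weight vector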
# Extracting the features
def extract_features(tweet, freqs):
    """
    Input:
        tweet: a list of words for one tweet
        freqs: a dictionary corresponding to the frequencies of each tuple (word, label)
    Output:
        x: a feature vector of dimension (1,3)
    """
    word_l = process_tweet(tweet)
    x = np.zeros((1, 3))
    # bias term is set to 1
    x[0, 0] = 1
    for word in word_l:
        # increment the word count for the positive label 1
        x[0, 1] += freqs.get((word, 1.0), 0)
        # increment the word count for the negative label 0
        x[0, 2] += freqs.get((word, 0.0), 0)
    assert (x.shape == (1, 3))
    return x
# test on training data
tmp1 = extract_features(train_x[22], freqs)
print(tmp1)
[[1.000e+00 3.006e+03 1.240e+02]]
Let us try to understand what these three numbers from our test on the training data mean. Usually we get a dataset with many features, i.e. columns; here, however, we just have text data. Those three numbers [[1.000e+00 3.006e+03 1.240e+02]] are the feature vector that we built using the build_freqs() and extract_features() functions. Recall that build_freqs() builds a dictionary with (word, label) pairs as keys and the number of times they occur in the corpus as values. Then extract_features() takes the sum of these values for positive and negative words, i.e. tmp1[0, 1] and tmp1[0, 2] (the first entry, tmp1[0, 0], is the bias term).
How are these features used for prediction in logistic regression? First a hypothesis is built, which for our case is:
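h(x) = sigmoid(theta_0 * x_0 + theta_1 * x_1 + theta_2 * x_2)

where x_0 = 1 is the bias term, x_1 is the sum of positive word frequencies, and x_2 is the sum of negative word frequencies; in the code this is simply sigmoid(np.dot(x, theta)).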
# Training the model
# collect the features 'x' and stack them into a matrix 'X'
X = np.zeros((len(train_x), 3))
for i in range(len(train_x)):
    X[i, :] = extract_features(train_x[i], freqs)
# training labels corresponding to X
Y = train_y
# Apply gradient descent
# these values are predefined (Andrew Ng)
J, theta = gradientDescent(X, Y, np.zeros((3, 1)), 1e-9, 1500)
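To see where training landed, we can print the final cost and the learned weights (the exact numbers depend on the data split and iteration count):
print('final cost J =', J)
print('theta =', theta.T)  # transposed only for a compact one-line display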
Our predict_tweet() function ties together all the functions we developed into a regression model that predicts tweet sentiment. The last piece of the puzzle is to measure how accurate our predictions are, so we can tell whether the model is good at predicting sentiment.
def predict_tweet(tweet, freqs, theta):
    """
    Input:
        tweet: a string
        freqs: a dictionary corresponding to the frequencies of each tuple (word, label)
        theta: (3,1) vector of weights
    Output:
        y_pred: the probability of a tweet being positive or negative
    """
    # extract the features of the tweet and store them in x
    x = extract_features(tweet, freqs)
    y_pred = sigmoid(np.dot(x, theta))
    return y_pred
def test_logistic_regression(test_x, test_y, freqs, theta):
    """
    Input:
        test_x: a list of tweets
        test_y: (m, 1) vector with the corresponding labels for the list of tweets
        freqs: a dictionary with the frequency of each pair (or tuple)
        theta: weight vector of dimension (3, 1)
    Output:
        accuracy: (# of tweets classified correctly) / (total # of tweets)
    """
    # the list for storing predictions
    y_hat = []
    for tweet in test_x:
        # get the label prediction for the tweet
        y_pred = predict_tweet(tweet, freqs, theta)
        if y_pred > 0.5:
            y_hat.append(1)
        else:
            y_hat.append(0)
    accuracy = (y_hat == np.squeeze(test_y)).sum() / len(test_x)
    return accuracy
tmp_accuracy = test_logistic_regression(test_x, test_y, freqs, theta)
print(f"Logistic regression model's accuracy = {tmp_accuracy:.4f}")
Logistic regression model's accuracy = 0.9950
As you can see, our logistic regression model's accuracy is 99.50%, which is extremely good, so we can confidently move forward and use our predict_tweet() function to build a function that predicts the sentiment of any input string.
# Predict with your own tweet.
def predict(sentence):
    yhat = predict_tweet(sentence, freqs, theta)
    if yhat > 0.5:
        return 'Positive sentiment'
    elif yhat < 0.5:
        return 'Negative sentiment'
    else:
        return 'Neutral sentiment'
my_tweet = 'It is a very bad day'
res = predict(my_tweet)
print(res)
Negative sentiment
my_tweet = 'It is a very good day'
res = predict(my_tweet)
print(res)
Positive sentiment
As you can see, our function was able to predict sentiment based on the input string we passed to it. Our project has been very successful in performing sentiment analysis using a logistic regression model.