Sentiment Analysis Using a Logistic Regression Model

Sentiment analysis, or opinion mining, is a natural language processing technique used to determine whether text is positive, negative, or neutral. Sentiment analysis is often performed on textual data to help businesses monitor brand and product sentiment in customer feedback and understand customer needs. Developing models is extremely important in sentiment analysis because we can build very specific models for a specific scenario to analyze customer sentiment. The more a business knows about a customer, the better the customer's experience and the better the business overall. We will use Python to build features and a logistic regression model. First, we will explore the textual data and engineer features from the dataset in detail. Then we will build our logistic regression model to predict the sentiment of the data.

The process_tweet() function is extremely important in our sentiment analysis. The function takes in a tweet string imported from the NLTK tweet corpus and processes it in several ways. We use the re library and the re.sub() function to remove unwanted parts of the strings. The regular expressions r'\$\w*', r'https?:\/\/.*[\r\n]*', and r'#' remove any $ tickers, hyperlinks, and # signs that may be in the text string, because they would not add value to our analysis. The nltk.TweetTokenizer() function then converts the string into a list of words that can be better managed in our data analysis process. Ex: 'I am having a good day' => ['I', 'am', 'having', 'a', 'good', 'day'].
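The cleaning steps above can be sketched as follows. This is a minimal, dependency-free sketch: the name clean_and_tokenize() is illustrative, and a plain lowercase split stands in for nltk.TweetTokenizer(), which handles handles, emoticons, and punctuation more carefully.

```python
import re

def clean_and_tokenize(tweet):
    tweet = re.sub(r'\$\w*', '', tweet)                 # remove stock tickers like $GE
    tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)  # remove hyperlinks
    tweet = re.sub(r'#', '', tweet)                     # remove the hash sign, keep the word
    # The notebook uses nltk.TweetTokenizer(); a plain lowercase split
    # stands in for it here to keep the sketch dependency-free.
    return tweet.lower().split()

print(clean_and_tokenize("I love #NLP $AAPL https://t.co/xyz"))
# ['i', 'love', 'nlp']
```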

Part of cleaning the tweet strings is also using the nltk.PorterStemmer() function. Ex: running => run or walking => walk. It reduces each word to its root form so the vocabulary is easier to use later in the process of developing our regression model.
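A short sketch of this stemming step, assuming NLTK is installed (PorterStemmer is a pure algorithm, so no corpus download is needed):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["running", "walking", "happy"]:
    # Each word is reduced to its root form, e.g. running => run
    print(word, "=>", stemmer.stem(word))
```

Note that stems like "happi" are not always dictionary words; they only need to be consistent across the vocabulary.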

The most important part of the code is the build_freqs(tweets, ys) function. This is the function that prepares the counts used to train our logistic regression model, and it determines the sentiment signal of our overall dataset. The main purpose of this function is to count the frequency of each word in the vocabulary of our dataset, per sentiment class. For example, say the word “learning” comes into the function 12 times during training, each time in a positive tweet; then the frequency is 12 for the pair (“learning”, 1). The input is a list of tweets, and the variable ys holds a sentiment label for each tweet. That sentiment label is either a 0 or a 1.

The output is a dictionary that maps each (word, sentiment) pair to the number of times that pair was seen. This dictionary, named freqs, is what the function returns. These per-class counts for the positive and negative classes are what we use to develop our model.
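The counting logic described above can be sketched as follows. This is a minimal sketch: the trivial process_tweet() here is a stand-in for the real cleaning pipeline (regex removal, tokenizing, stemming).

```python
def process_tweet(tweet):
    # Stand-in for the real cleaning pipeline described earlier
    return tweet.lower().split()

def build_freqs(tweets, ys):
    freqs = {}
    for tweet, y in zip(tweets, ys):
        for word in process_tweet(tweet):
            pair = (word, y)                    # key: (word, sentiment label)
            freqs[pair] = freqs.get(pair, 0) + 1
    return freqs

freqs = build_freqs(["happy day", "sad day"], [1, 0])
print(freqs)
# {('happy', 1): 1, ('day', 1): 1, ('sad', 0): 1, ('day', 0): 1}
```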

image.png

As we can see here, we get a dictionary as our output. The functions are doing their job: we are only getting the base form of each word. For the sample input, build_freqs(tweets, ys) returns the dictionary {('happi', 1): 1, ('trick', 0): 1, ('sad', 0): 1, ('tire', 0): 2}, where a key such as ('happi', 1) maps to the value 1. Each value corresponds to the frequency of the word together with its sentiment label, which can be either 1 or 0. This gives us a measurement we can use to start training and developing our model. Please see the diagram.

image-6.png image-9.png

As we can see from the example of our functions, as our datasets grow, we will get more numerical values to interpret as 0 and 1. Once again, logistic regression models predict whether a binary response takes a value of 0 or 1. Based on that output, we can start developing a model that maps the predicted probability between 0 and 1 to a sentiment; this way we can begin to develop our model.

We have downloaded the twitter_samples dataset from the NLTK library. In this library there are 'positive_tweets.json' and 'negative_tweets.json' files that we are going to use to start developing our model based on binary input. Keep in mind 'positive_tweets.json' = 1 and 'negative_tweets.json' = 0. These datasets already contain tweets labeled as positive or negative, and with them we can start training our model.

This part is important because we need to develop test and train datasets for both the positive and negative tweet strings that we have acquired in the variables all_positive_tweets and all_negative_tweets.

Now we have our independent variables, and we can start developing our equations in order to start training our models. This is the first step in logistic regression modeling. We have our independent variable x for both the train and test datasets.

Now that we have our independent variables train_x and test_x, which are based on the negative- and positive-sentiment tweet strings, the next step is to develop the target variable y. Here we label all the positive and negative tweets from train_pos, train_neg, test_pos, and test_neg: we assign a 1 to every tweet from train_pos and test_pos and a 0 to every tweet from train_neg and test_neg. This gives us a y target variable, train_y and test_y, for both the training set and the testing set.
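The split and label construction above can be sketched as follows; small placeholder lists stand in for the 5,000-tweet NLTK samples, and the 80/20 split ratio is illustrative.

```python
import numpy as np

# Placeholder data standing in for the NLTK positive/negative tweet lists
all_positive_tweets = ["great day :)"] * 10
all_negative_tweets = ["awful day :("] * 10

# 80/20 train/test split for each class
train_pos, test_pos = all_positive_tweets[:8], all_positive_tweets[8:]
train_neg, test_neg = all_negative_tweets[:8], all_negative_tweets[8:]

train_x = train_pos + train_neg
test_x = test_pos + test_neg

# y target: a 1 for every positive tweet, a 0 for every negative tweet
train_y = np.append(np.ones((len(train_pos), 1)), np.zeros((len(train_neg), 1)), axis=0)
test_y = np.append(np.ones((len(test_pos), 1)), np.zeros((len(test_neg), 1)), axis=0)
print(train_y.shape, test_y.shape)  # (16, 1) (4, 1)
```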

What we were doing above was using a small sample of the positive- and negative-sentiment strings to develop a training model. We were building a reference point in order to start training our model. Now we are going to do the same thing with our full dataset, using the functions we created to build the dictionary needed to develop our model.

As we can see from our output, the dictionary named freqs has 11,339 entries.

We can see the raw string from a positive tweet. The string has a handle, hyperlink, and smiley-face characters that need to be cleaned and processed, so the process_tweet() function is used. As we can see, the function converts the raw string into a list that has been processed using regular expressions, and we only get the base words of the sentence. If we look closely at the stemmer output, we see that the word 'supposed' is reduced to 'suppos'; the stemmer has some issues with certain words.

Logistic regression is a machine learning algorithm used for classification problems. It is a predictive analysis algorithm based on the concept of probability.

image.png

Logistic regression resembles a linear regression model, but logistic regression uses a more complex cost function. This cost function is based on the 'sigmoid function', also known as the 'logistic function', instead of a linear function.

The hypothesis of logistic regression limits its output to between 0 and 1. Linear functions therefore fail to represent it, as they can take a value greater than 1 or less than 0, which is not possible under the hypothesis of logistic regression.

image-2.png

What is the Sigmoid Function? In order to map predicted values to probabilities, we use the Sigmoid function. The function maps any real value into another value between 0 and 1.

image.png

We are going to develop our own logistic regression model based on the above equation. The function will be called sigmoid(). We can then use NumPy to develop the logistic regression model, converting the equations into functions in order to implement our sigmoid() function.
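A minimal NumPy sketch of the sigmoid function described above:

```python
import numpy as np

def sigmoid(z):
    # Maps any real value (or array of values) into the interval (0, 1)
    return 1 / (1 + np.exp(-z))

print(sigmoid(0))                        # 0.5
print(sigmoid(np.array([-10, 0, 10])))   # values near 0, 0.5, and 1
```

Because the body uses only NumPy operations, the same function works elementwise on whole arrays of scores.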

image-2.png

Typical properties of the logistic regression equation include:

  1. Logistic regression’s dependent variable obeys ‘Bernoulli distribution.’
  2. Estimation/prediction is based on ‘maximum likelihood.’
  3. Logistic regression does not evaluate the coefficient of determination (or R squared) as observed in linear regression.

Cost Function: We learned about the cost function in linear regression; the cost function represents the optimization objective, i.e., we create a cost function and minimize it so that we can develop an accurate model with minimum error.

image-4.png image-5.png

We take the definition of the cost function and translate it into code using NumPy to develop the function. Our cost function is used in the gradientDescent() function. image-6.png
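The binary cross-entropy cost described above can be sketched in NumPy as follows; compute_cost() is an illustrative name, h is the vector of sigmoid predictions, and y holds the 0/1 labels.

```python
import numpy as np

def compute_cost(h, y):
    # J = -(1/m) * [y^T log(h) + (1-y)^T log(1-h)]
    m = y.shape[0]
    J = -(np.dot(y.T, np.log(h)) + np.dot((1 - y).T, np.log(1 - h))) / m
    return J.item()

h = np.array([[0.9], [0.1]])   # confident, correct predictions
y = np.array([[1.0], [0.0]])
print(compute_cost(h, y))      # small cost, about 0.105
```

Confident correct predictions give a cost near 0, while confident wrong predictions are penalized heavily; that asymmetry is what the log terms encode.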

Gradient Descent: now the question arises, how do we reduce the cost value? This can be done by using gradient descent. To minimize our cost function, we run the gradient descent update on each parameter in theta.

image-7.png

This is the theta update that is used in our gradientDescent() function.

image-8.png

Gradient descent is important for developing the parameter values we need in order to calculate the output required in the machine learning process. Computing the gradient is analogous to finding the direction of steepest descent, and taking a step is analogous to one iteration of the update to the parameters.
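The loop above can be sketched as follows. This is an illustrative implementation, not the notebook's exact code: x is an (m, n+1) feature matrix with a bias column, y an (m, 1) label vector, and the tiny dataset at the bottom exists only to show convergence.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def gradientDescent(x, y, theta, alpha, num_iters):
    m = x.shape[0]
    for _ in range(num_iters):
        h = sigmoid(np.dot(x, theta))          # current predictions
        gradient = np.dot(x.T, h - y) / m      # dJ/dtheta
        theta = theta - alpha * gradient       # one update step
        J = -(np.dot(y.T, np.log(h)) + np.dot((1 - y).T, np.log(1 - h))) / m
    return J.item(), theta

# Tiny illustrative run: one bias column plus one feature
x = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([[0.0], [0.0], [1.0], [1.0]])
J, theta = gradientDescent(x, y, np.zeros((2, 1)), 0.1, 1000)
print(J, theta.ravel())   # cost well below the initial ln(2) ~ 0.693
```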

Let us try to understand what the three numbers from our test on the training data above mean. Usually we get a dataset with many features, i.e., columns; however, here we just have text data. Those three numbers, [[1.000e+00 3.006e+03 1.240e+02]], are the feature set that we have built using the build_freqs() and extract_features() functions. We know that our build_freqs() function builds a dictionary having (word, sentiment) pairs as keys and the number of times they have occurred in the corpus as values. The extract_features() function then takes the sum of these values over positive and negative words, giving the second and third feature values.
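A sketch of that feature extraction, assuming freqs maps (word, sentiment) pairs to counts; the small freqs dictionary here is made up for illustration.

```python
import numpy as np

def extract_features(word_list, freqs):
    x = np.zeros((1, 3))
    x[0, 0] = 1                                   # bias term
    for word in word_list:
        x[0, 1] += freqs.get((word, 1.0), 0)      # sum of positive-class counts
        x[0, 2] += freqs.get((word, 0.0), 0)      # sum of negative-class counts
    return x

freqs = {('happi', 1.0): 3, ('sad', 0.0): 2, ('day', 1.0): 1, ('day', 0.0): 1}
print(extract_features(['happi', 'day'], freqs))  # [[1. 4. 1.]]
```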

How will these features be used to predict in logistic regression? First a hypothesis is built, which in our case will be

image-2.png

Our predict_tweet() function combines all the functions we developed for the regression model in order to predict tweet sentiment. The last piece of the puzzle is to determine how accurate our sentiment predictions are, so we can decide whether the model is good at predicting sentiment.
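The prediction step can be sketched as below; the stand-in extract_features() and the small freqs/theta values are illustrative, while in the real pipeline theta comes from gradient descent on the training set.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def extract_features(word_list, freqs):
    x = np.zeros((1, 3))
    x[0, 0] = 1                                   # bias term
    for word in word_list:
        x[0, 1] += freqs.get((word, 1.0), 0)
        x[0, 2] += freqs.get((word, 0.0), 0)
    return x

def predict_tweet(word_list, freqs, theta):
    x = extract_features(word_list, freqs)
    return sigmoid(np.dot(x, theta))              # probability of positive sentiment

freqs = {('happi', 1.0): 40, ('sad', 0.0): 35}
theta = np.array([[0.0], [0.05], [-0.05]])        # illustrative trained weights
print(predict_tweet(['happi'], freqs, theta))     # > 0.5 => positive sentiment
```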

As you can see, our logistic regression model's accuracy is 99.50%, which is extremely good, so we can confidently move forward and use our predict_tweet() function to build a function able to predict the sentiment of a string input.
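The accuracy check amounts to thresholding each predicted probability at 0.5 and comparing against the true labels; the probabilities below are made up for illustration.

```python
import numpy as np

y_hat = np.array([0.9, 0.3, 0.7, 0.2])   # model probabilities (illustrative)
test_y = np.array([1, 0, 1, 1])          # true labels

predictions = (y_hat > 0.5).astype(int)  # threshold at 0.5
accuracy = np.mean(predictions == test_y)
print(accuracy)  # 0.75
```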

As you can see, our function was able to predict sentiment based on the string input that we pass into it. Our project has been very successful at sentiment analysis using a logistic regression model.