Analytics and Visualization of Big Data: Sentiment Analysis with Rapidminer

Tuesday, February 12, 2013

Sentiment Analysis with Rapidminer

Sentiment analysis or opinion mining is an application of Text Analytics to identify and extract subjective information in source materials.

A basic task in sentiment analysisis classifying an expressed opinion in a document, a sentence or an entity feature as positive or negative. This tutorial explains the usage of sentiment analysis in Rapidminer. The example presented here gives the list of movies and its review such as Positive or Negative. This program implements Precision and Recall method. Precision is the probability that a (randomly selected) retrieved document is relevant. Recall is the probability that a (randomly selected) relevant document is retrieved in a search. Or high recall means that an algorithm returned most of the relevant results. High precision means that an algorithm returned more relevant results than irrelevant.

At first, both positive and negative reviews of a certain movie are taken. All of the words are stemmed into root words. Then the words are stored in different polarity(positive and negative). Both vector wordlist and model are created. Then, the required list of movies is given as an input. Model compares each and every word from the given list of movies with that of words which come under different polarity stored earlier. The movie review is estimated based on the majority of number of words that occur under a polarity. For example, when you look at Django Unchained, the reviews are compared with the vector wordlist created at the beginning. The highest number of words comes under positive polarity. So the outcome is Positive. Same happens for Negative outcome.

First step for implementing this analysis is Processing the document from data i.e. extracting the positive and negative reviews of a movie and storing it in different polarity. The model is shown in Figure 1.

Figure 1

Under Process document, click on the Edit List on the right. Load the positive and negative reviews under different class name "Positive" and "Negative" as shown in Figure 2.

Figure 2

Under Process Document operator, nested operation takes place such as Tokenizing the words, Filtering the Stop words, Stemming the words into root words and Filtering the tokens between 4 and 25 characters as shown in Figure 3.

Figure 3

Then two operators are used such as Store and Validation operator as shown in Figure 1. Store operator is used to output word vector to a file and directory of our choosing. Validation operator(Cross-validation) is a standard way to assess the accuracy and validity of a statistical model. Our data set is divided into two parts, a training set and a test set. The model is trained on the training set only and its accuracy is evaluated on the test set. This is repeated n number of times. Double click on validation operator. There will two panels- Training and Testing. Under Training panel, Linear Support Vector Machine(SVM) is used which is a popular set of classifier since the function is a linear combination of all the input variables. In order to test the model, we use the ‘Apply Model’ operator to apply the training set to our test set. To measure the model accuracy we use the ‘Performance’ operator. The operations under Validation is shown in Figure 4.

Figure 4

Then run the model. The result of Class Recall % and Precision % is shown in Figure 5. The model and vector wordlist are stored in a Repository.

Figure 5

Then retrieve both the model and vector wordlist from the Repository you have stored earlier. Then connect out from the retrieve wordlist to the process document operator shown in Figure 6. The operations under Process document is same shown in Figure 3.

Figure 6

Then click on Process Document operator and click edit list on the right. This time I have added the list of 5 movie reviews from Rottentomatoes website and stored it in a directory. Assign the class name as unlabeled shown in Figure 7.

Figure 7

The Apply Model operator takes a model from a Retrieve operator and unlabeled data from Process document as input and outputs the applied model to the ‘lab’ port, so connect that to the ‘res’ (results) port. The result is shown below. When you look at Les Miserables, there is 86.4% confidence that it is positive and 13.6% as negative because the match of the reviews with wordlist under positive polarity is higher compared to negative polarity.