
Tuesday, February 12, 2013

Sentiment Analysis with RapidMiner


Sentiment analysis or opinion mining is an application of Text Analytics to identify and extract subjective information in source materials.

A basic task in sentiment analysis is classifying the opinion expressed in a document, a sentence, or an entity feature as positive or negative. This tutorial explains how to perform sentiment analysis in RapidMiner. The example presented here takes a list of movies, classifies each review as Positive or Negative, and evaluates the results using precision and recall. Precision is the probability that a (randomly selected) retrieved document is relevant; recall is the probability that a (randomly selected) relevant document is retrieved in a search. In other words, high recall means that an algorithm returned most of the relevant results, while high precision means that most of the results it returned were relevant.
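As a quick illustration of these two measures, here is a small Python sketch (the document IDs and sets below are made up for the example):

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for a set of retrieved documents."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved)  # relevant fraction of what we returned
    recall = true_positives / len(relevant)      # returned fraction of what is relevant
    return precision, recall

# Documents 2 and 3 are the only relevant ones we actually retrieved.
p, r = precision_recall(retrieved={1, 2, 3, 4}, relevant={2, 3, 5})
print(p, r)  # 0.5 and 2/3
```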

First, both positive and negative reviews of certain movies are collected. All words are stemmed to their root forms and stored under the corresponding polarity (positive or negative). From these, a word vector list and a classification model are created. Then the list of movies to be classified is given as input. The model compares each word in a movie's reviews against the words stored under each polarity, and the review is classified according to the polarity under which the majority of its words fall. For example, for Django Unchained the reviews are compared with the word vector list created at the beginning; since the largest number of matching words falls under the positive polarity, the outcome is Positive. The same logic produces a Negative outcome.
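A minimal Python sketch of this majority-vote idea (the wordlists here are hypothetical; in the actual process they are learned from the training reviews, and the real model is an SVM rather than a raw count):

```python
POSITIVE = {"great", "brilliant", "masterpiece"}  # hypothetical positive-polarity words
NEGATIVE = {"dull", "boring", "mess"}             # hypothetical negative-polarity words

def classify(review):
    """Label a review by whichever polarity matches more of its words."""
    words = review.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return "Positive" if pos >= neg else "Negative"

print(classify("a brilliant and great film, a masterpiece"))  # Positive
print(classify("a dull and boring mess"))                     # Negative
```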
The first step in implementing this analysis is processing the documents, i.e. extracting the positive and negative reviews of a movie and storing them under different polarities. The model is shown in Figure 1.


Figure 1

Under the Process Documents operator, click Edit List on the right. Load the positive and negative reviews under the class names "Positive" and "Negative", as shown in Figure 2.


Figure 2

Inside the Process Documents operator, a series of nested operations takes place: tokenizing the words, filtering out stop words, stemming the words to their roots, and filtering tokens to between 4 and 25 characters, as shown in Figure 3.


Figure 3
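For intuition, the same preprocessing pipeline can be sketched in plain Python (a rough approximation: the stop-word list is a tiny stand-in, and the suffix-stripping "stemmer" is far cruder than RapidMiner's):

```python
import re

STOP_WORDS = {"the", "is", "a", "and", "of", "this", "with"}  # tiny illustrative list

def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())                  # tokenize
    tokens = [t for t in tokens if t not in STOP_WORDS]           # filter stop words
    tokens = [t[:-1] if t.endswith("s") else t for t in tokens]   # crude "stemming"
    return [t for t in tokens if 4 <= len(t) <= 25]               # length filter

print(preprocess("This is a thrilling film with great performances"))
# ['thrilling', 'film', 'great', 'performance']
```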

Next, two operators are used: the Store operator and the Validation operator, as shown in Figure 1. The Store operator writes the word vector to a file and directory of our choosing. The Validation operator (cross-validation) is a standard way to assess the accuracy and validity of a statistical model: the data set is divided into a training set and a test set, the model is trained on the training set only, its accuracy is evaluated on the test set, and this is repeated n times. Double-click the Validation operator; there will be two panels, Training and Testing. Under the Training panel, a linear Support Vector Machine (SVM) is used, a popular classifier because its decision function is a linear combination of the input variables. To test the model, the Apply Model operator applies the trained model to the test set, and the Performance operator measures its accuracy. The operators inside Validation are shown in Figure 4.


Figure 4
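The cross-validation loop itself can be outlined in plain Python (a generic sketch, not RapidMiner's implementation; a trivial majority-class learner stands in for the SVM):

```python
def cross_validate(data, labels, train_fn, n_folds=10):
    """Train on n-1 folds, test on the held-out fold, and average the accuracy."""
    fold_size = len(data) // n_folds
    accuracies = []
    for i in range(n_folds):
        lo, hi = i * fold_size, (i + 1) * fold_size
        model = train_fn(data[:lo] + data[hi:], labels[:lo] + labels[hi:])
        test_X, test_y = data[lo:hi], labels[lo:hi]
        correct = sum(model(x) == y for x, y in zip(test_X, test_y))
        accuracies.append(correct / len(test_X))
    return sum(accuracies) / n_folds

def majority_learner(X, y):
    """Placeholder learner: always predict the most common training label."""
    prediction = max(set(y), key=y.count)
    return lambda x: prediction

data = list(range(10))
labels = ["Positive"] * 8 + ["Negative"] * 2
print(cross_validate(data, labels, majority_learner, n_folds=5))  # 0.8
```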

Then run the model. The resulting class recall % and precision % are shown in Figure 5. The model and word vector list are stored in a repository.


Figure 5

Next, retrieve both the model and the word vector list from the repository where you stored them earlier, and connect the output of the wordlist Retrieve operator to the Process Documents operator, as shown in Figure 6. The operations inside Process Documents are the same as those shown in Figure 3.



Figure 6

Then click the Process Documents operator and click Edit List on the right. This time I have added a list of 5 movie reviews from the Rotten Tomatoes website and stored them in a directory. Assign the class name "unlabeled", as shown in Figure 7.


Figure 7

The Apply Model operator takes the model from a Retrieve operator and the unlabeled data from Process Documents as input, and outputs the labeled data on its 'lab' port, so connect that to the 'res' (results) port. The result is shown below. Looking at Les Miserables, there is 86.4% confidence that it is positive and 13.6% that it is negative, because its reviews match the wordlist under the positive polarity more strongly than under the negative polarity.


Figure 8







Friday, February 8, 2013

Simple model to generate association rules in RapidMiner


In this post, I am going to show how to build a simple model to create association rules in RapidMiner. To demonstrate the process, I created an example based on the health care example presented on page 6 of the 8th lecture material. In this example, the possibility of experiencing two different side effects is considered based on consuming a combination of 6 different drugs. First, the table was generated in CSV format and imported into RapidMiner. As can be seen in figures 1 and 2, the input table has 9 attributes, all of which are binomial except the PID attribute, which is an integer.
Figure 1.
Figure 2.
For rule generation we need the FP-Growth operator, which accepts only binomial attributes. Since we do not need the PID attribute in our model, we will exclude it using the Select Attributes operator. Add the Select Attributes operator to the process window and connect it to the input data. In the Attribute Filter Type drop-down, select Subset and press the Select Attributes button. The Select Attributes window is displayed as in figure 3.
Figure 3
Add all binomial attributes to the Selected Attribute window as indicated in figure 4.
Figure 4
In the search field of the Operators tab, search for the FP-Growth operator and add it to your model. The FP in FP-Growth stands for Frequent Pattern. Frequent pattern analysis is used in many kinds of data mining and is a necessary component of association rule mining: without the frequencies of attribute combinations, we cannot determine whether any of the patterns in the data occur often enough to be considered rules. One important parameter of this operator is Min Support, the number of times a pattern occurs divided by the number of observations in the data set. For this example we leave its default value.
Figure 5
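Min Support can be illustrated with a few lines of Python (the transactions below are made-up stand-ins for the drug/side-effect rows):

```python
# Each set is one patient's record: drugs taken and side effects observed.
transactions = [
    {"D1", "D2", "SE1"},
    {"D1", "D2", "SE1"},
    {"D2", "D3"},
    {"D1", "D2", "SE1"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

print(support({"D1", "D2"}, transactions))  # 0.75: 3 of 4 rows contain both
```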
Run the model and switch to the results window (figure 6). Some of our attributes appear to contain frequent patterns. In fact, this example produces many frequent patterns because it contains so little data. If your model does not generate any frequent patterns, you may need to decrease the Min Support value until you get a reasonable response.
We can investigate the possible connection further by adding one final operator to our model.
Figure 6
In the search field of the Operators tab, search for the Create Association Rules operator and drag it into your model, as illustrated in figure 7. This operator takes in frequent pattern data and seeks out any patterns that occur frequently enough to be considered rules. The Create Association Rules operator generates both a set of rules (through the rul port) and a set of associated items (through the ite port). In this model we are interested only in generating rules, so we simply connect its rul port to the res port of the process window.
Figure 7
One of the influential parameters of this operator is Min Confidence. Confidence is a measure of how confident we are that when one attribute is flagged as true, the associated attribute will also be flagged as true. It is calculated by dividing the number of times a rule occurs by the number of times it could have occurred. If your model does not generate any rules, you may need to decrease the confidence threshold. In this example, since we used only a limited amount of artificial input data, we ended up with many rules with high confidence percentages, which is not the case in real-world problems.
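Confidence can be computed the same way (again with made-up transactions): for a rule such as D1, D2 → SE1, divide the number of rows containing all three items by the number of rows containing the premise.

```python
transactions = [
    {"D1", "D2", "SE1"},
    {"D1", "D2", "SE1"},
    {"D1", "D2"},        # premise occurs here, but without the side effect
    {"D2", "D3"},
]

def confidence(premise, conclusion, transactions):
    """confidence = count(premise and conclusion together) / count(premise)"""
    both = sum((premise | conclusion) <= t for t in transactions)
    prem = sum(premise <= t for t in transactions)
    return both / prem

# D1+D2 appear in 3 rows; SE1 accompanies them in 2 of those, so 2/3.
print(confidence({"D1", "D2"}, {"SE1"}, transactions))
```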
As you see in figure 8, many redundant rules were generated that have either side effects (SE#) in the premises or drugs (D#) in the conclusion. Looking at the input data, one can say that the correct rules should contain drugs in the premises and side effects in the conclusion. So, as shown in the picture below, only rules of this type are highlighted by red arrows; the rest are redundant.
Figure 8
Again, one should keep in mind that the limited amount of artificial input data and the particular structure of this example led to many rules with high levels of confidence and support. In a real problem this is not the case, and usually a limited number of rules is generated with high or moderate levels of confidence and support.

Friday, February 1, 2013

How to define Date attributes in RapidMiner


In this post I am going to show a method to prepare the OSHA data for text mining in RapidMiner. The first step is to import the raw data into RapidMiner, and there are different ways to do this. In this post, we will use the 'Import CSV File' function. In the Repositories area, the second icon from the left contains the 'Import CSV File' option. Click on it, as indicated by the red rectangle in figure 1.
Figure 1.

When the Data Import Wizard opens, navigate to the location where you have stored the OSHA data set, then click Next.
Figure 2.

In the second step choose the comma for column separation, as indicated in figure 3, then click Next.
Figure 3.

At step 3, we can identify the first row as the attribute names. Click the annotation drop-down on the first row and select 'Name', as indicated in figure 4. Click Next.
Figure 4.

In the next window, we can define the data type of each attribute. RapidMiner proposes its best guess for each attribute, and we have the option to accept the proposed data types and change them later with operators, or to modify them now. Let's change them at this step. We know that the second and third columns are dates, so change the attribute type of these columns to 'Date' and then define the date format in the highlighted box in figure 5. RapidMiner proposes some predefined date and time formats, selectable from the drop-down, but in this case none of them match our date format, so we need to define our own. As indicated in figure 5, the format of our date columns is 'MM/dd/yyyy', so type this format in the 'Date format' box. Notice that you must enter the month in upper case and the day and year in lower case; otherwise RapidMiner will not distinguish the months and will treat every month as January.
Notice there is a check box above each attribute. If you do not want to import a particular column, simply uncheck its box at this step.


Figure 5.
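The case sensitivity comes from the fact that RapidMiner's date patterns follow Java's SimpleDateFormat conventions, where 'MM' means months but 'mm' means minutes. For comparison, the equivalent parse in Python uses strptime with explicit directives:

```python
from datetime import datetime

# RapidMiner's pattern 'MM/dd/yyyy' corresponds to Python's '%m/%d/%Y'.
parsed = datetime.strptime("02/01/2013", "%m/%d/%Y")
print(parsed.year, parsed.month, parsed.day)  # 2013 2 1 (February 1st)
```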

The final step is to store the data set in the Repository folder and give it a name.

Figure 6.

Now we can see that the OSHA data set is available under the Repository folder. To add this data set to our model, drag the OSHA icon into the Process window. Your Process window should look like figure 7.
Figure 7

Now, if you run the model the results should look like figures 8 and 9.
Figure 8 



Figure 9
As you can see, the attributes "Summary Report Date" and "Date of Incident" are date types and the last two attributes are text types.
As I mentioned before, if you accept the default attribute types while importing the data set into RapidMiner, you can always change their type or role later in your model. In the Operators area, under the 'Data Transformation' folder tree, you will see the 'Type Conversion' folder, which contains different operators for converting between attribute types.
Figure 10

Assume that for our analysis we need to extract the incident months. To do this, drag the 'Date to Numerical' operator into the Process window and set up its properties as indicated in figure 11.
Figure 11


Since we checked the 'Keep old attribute' box, RapidMiner keeps the old attribute and adds the new attribute to our model. We may want to rename the default name proposed by RapidMiner. In the search box of the Operators area, type 'Rename', add the Rename operator to the model, and set its properties as indicated in figure 12.
Figure 12

Run the model and make sure your results look like figures 13 and 14.
Figure 13

Figure 14





Thursday, January 31, 2013

Cumulus Clouds



OK, so today's class (January 31, 2013) for me seemed to be very computer science oriented. I too was a little overwhelmed at first when I started working with cloud computing. I am very accustomed to working with data found locally on my computer, and thus to plugging in a flash drive, saving the data to some local directory, and opening Excel. This is most familiar to me because, like the rest of you, it was everyday practice growing up. Before we get into cloud computing, let's think about a few things…

What is the largest flash drive you own?

What is the largest external (or internal) hard drive you are using? 

I'm going to venture out on a limb and make a gross assumption that no one has any local file storage capacity larger than 5 terabytes. That is a very large hard drive, which can store lots of information (for a single person). Now imagine you work for a company that mines (analyzes) Twitter and Instagram. Your company has been hired by the National Football League (NFL) to store all Twitter posts and Instagram photos relating to the playoff games, commercial spots, and the Super Bowl itself. The league wants to see how social media is "playing out" during the games, and to use this data to increase the price of commercial airtime in the future.

All of that data will NOT fit on one, two, or even three computers, so the question becomes: where do you store it all? Your company could spend a tremendous amount of money on hard drives, but this could prove very costly if your sales team does not have another great lead lined up; your data storage should be flexible given the demand you might have. Then you remember this class you took on data mining in college and recognize a potential solution! You pitch the idea of cloud computing to your superior, and because of this, your company invests in cloud storage. This allows you to buy space as it is needed.

Imagine the NFL comes to your office the week after the Super Bowl and says that they want to know how often a particular word or combination of words was used. The league wants to show the power of advertising. How can you do this?

Yikes, in the past we could easily pull up Excel, but the data is so large that Excel will not help, and it is located on the Amazon cloud. This is where Python code comes into play. With a few simple commands, you can extract valuable information about what is going on in the data.

In class today we looked at word count. Let's put this idea to use:
1. We have an extremely large file on the Amazon cloud.
2. We wish to examine the word count to show how often certain words are used. (In the NFL example, these could be touchdown, 49ers, Ravens, and so on.)
3. We can use a simple Python script found simply by Googling for one.
4. Once we have run the job and saved the results to an output file, we can use Orange, RapidMiner, or even Excel to visualize the results.
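The word-count step itself is only a few lines of Python (a sketch for text that fits in memory; the cloud version distributes the same idea across machines):

```python
from collections import Counter
import re

def word_count(text):
    """Count how often each word appears, ignoring case and punctuation."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

counts = word_count("The cat and the dog went to the park.")
print(counts.most_common(1))  # [('the', 3)]
```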
In the project we outlined in class, we have a text file that shows word counts. Simply opening this in Excel and sorting shows that the most commonly used word is "the", mentioned 31 times, followed by "to". As you can see, data mining can be quite easy and informative. The data we mine, though, might be extremely large, leaving us unable to perform such tasks on our local hard drives. This is where the advantages of cloud computing come to fruition.



Monday, January 28, 2013

Prototype on Paper - App Development

Thus far, we have been covering different methods of data mining using applications such as RapidMiner and Orange. We've begun to discuss the framework associated with extracting relevant data and displaying it in an understandable way. The next step, therefore, is considering the audience this information will be shared with: our customer.

We must consider that the people making decisions in politics, business, and the service industries are not necessarily skilled statisticians, nor are they skilled in the data extraction tools we use. So the question becomes: how can we let a user who is not a mathematician or statistician access relevant information and make decisions based on it without a baby-sitter? To answer this question, we must first think like a designer...

First, we need to empathize with the customer/user and understand his/her environment and motivations. Then, focus in on the things that he/she holds as valuable. Next, generate a range of different ideas in order to arrive at a tool that will meet the customer's needs.

*This idea of design thinking will be something I post about in the near future, but isn't a significant part of the context of what we're discussing. However, it is important to think about if you're considering using this tool to develop a prototype.

So, after we've identified elements of a tool.. what next? We have to prototype and make something, right? Well, what if the answer you've arrived at isn't something you know how to make... say an iOS application?

That's where the Prototype on Paper iOS app comes in. This application lets you literally DRAW out exactly how you see an app being laid out and turn it into a working mockup. Thus, an engineer with next to zero knowledge of app development can communicate to a developer what he or she is thinking and how the idea came about. It also suggests a new way to look at app development.


Currently, app development is somewhat of a mystical process to those who aren't in the "know". A great deal of time is spent on apps so they can be ready for mass distribution. BUT what if the market changed from a public focus to an individual one? What if, instead of spending months creating an app for the public, you could make a quick-and-dirty app with very few functions that worked for the small scope you needed it for? This is a really neat thought and something to talk about more, but for now I'm focusing on the case where I need an app that serves a specific purpose and I want to see how my user will interact with it.

For example, I'm working with the Lee County Emergency Management Agency on how they approach natural disaster relief. One of the specific areas we're analyzing is how social media is considered. On April 27, 2011, a series of horrific tornadoes swept through our state. Because of the devastating carnage that ensued, 911 operators were tied up, and those in peril could not contact anyone to let them know their plight. So, being resourceful, these people turned to social media to let anyone and everyone know what was wrong, where they were, and what they needed. This, in effect, created a whole litany of other problems, but the one we'll consider for the sake of this conversation is that the information was not going to the right people. Emergency responders were not notified of those in need of help and therefore could not coordinate the proper relief efforts. So people rushed to help wearing flip-flops and t-shirts, then stepped on rusty nails and became victims themselves. This picture leaves us with some very distinct needs: the coordinating entity needs a view of what information is traveling over local social media channels, a way to manage tasks, and a way to send correspondence of needs/locations to people who can help.

Thus, I developed an app that will allow these things to happen. And here's how I did it:
  1. Download the app "Prototype on Paper" from iTunes
  2. Using some sort of methodology (I used design thinking as defined by the d.school at Stanford) to develop the "pages" of your app. Just like you would a website.
  3. Launch the app
  4. Touch the "+" in the top left-hand corner of the home screen after you've gone through the tutorial.
  5. Enter a title for your app (or project as it's defined in the app)
  6. Begin by touching the camera in the bottom left-hand corner of the screen
  7. Take a picture of each of your "pages"
  8. On the project screen (this is where all of your pictured pages sit in rows), select one of your pages.
  9. On the top right-hand corner of the screen, touch the "+" that is inside a box. A red square will appear on your screen.
  10. Touch and drag the red square to any place on your page where you intend for the user to touch to engage a new page. Resize by dragging one of the square corners at a time.
  11. After reaching the desired location and size, touch the prompt "Link To".
  12. On the next page, select the page you want that button to go to when pressed by the user. Note the bottom of the current page has 5 different selections for how the transition from one page to the next can occur.
  13. After selecting, press "Done" in the top right-hand corner of the page.
  14. Repeat this process until you have placed links to all the buttons on your drawn pages.
  15. When you're ready to test your app, select the play button on either the top right (when close up to one of your drawn pages) or bottom center (when on the project's main page).
  16. Navigate through your app and take note of anything you've forgotten.
  17. If you forgot to paste a link, pinch your fingers together on the screen and go back to step 9.
  18. Most important step, keep in mind you just threw together a quick and dirty app in like an hour. Now, give it to your user and see how they interact with it. Receive their criticism as an anthropologist, not an analyst. After all, what's to get upset about? You just spent a minimal amount of time creating this super useful tool and all you have to do to change it is erase something and draw something new or touch a few buttons.
I've created a video on my YouTube channel to show how this bad boy works. See below.

I hope you enjoy!