Analytics and Visualization of Big Data: Tutorial


Friday, February 15, 2013

Geo chart - Response to Motion Chart Tutorial

This is a response to the video tutorial on motion charts. I found a data set on cigarette smoking rates by state. I thought it would be valuable to create a geo chart so that it would be easy to compare the magnitudes by state. I know it is kind of hard to see in this picture, but in the actual chart you can zoom in on certain parts of the map to better interpret it.

From the heat map produced, it is a little easier to see the magnitude of cigarette smoking rates by state. It appears the states with the highest rates are those with more rural populations. Historically, cigarette companies have marketed towards the blue-collar portion of America (e.g., Marlboro and cowboys), and this helps explain the distribution seen above. It would be interesting to see a motion chart of the smoking rates over time as well.

The data can be found at the following link.

http://www.infochimps.com/datasets/current-cigarette-smoking-by-sex-and-state-2005/downloads/133926

Thursday, February 14, 2013

Video Tutorial - How to create a Motion Chart

Tuesday, February 12, 2013

Sentiment Analysis with RapidMiner

Sentiment analysis or opinion mining is an application of Text Analytics to identify and extract subjective information in source materials.

A basic task in sentiment analysis is classifying an opinion expressed in a document, a sentence or an entity feature as positive or negative. This tutorial explains how to do sentiment analysis in RapidMiner. The example presented here takes a list of movies and rates each movie's reviews as Positive or Negative. The model is evaluated using precision and recall. Precision is the probability that a (randomly selected) retrieved document is relevant. Recall is the probability that a (randomly selected) relevant document is retrieved in a search. Put another way, high recall means that an algorithm returned most of the relevant results, and high precision means that it returned more relevant results than irrelevant ones.
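To make the two measures concrete, here is a minimal Python sketch that computes precision and recall for the "Positive" class from two lists of labels. The function name and the six sample labels are made up for illustration; they are not taken from the RapidMiner process or its results.

```python
def precision_recall(actual, predicted, positive="Positive"):
    """Return (precision, recall) for one class label."""
    pairs = list(zip(actual, predicted))
    # Reviews predicted Positive that really are Positive.
    true_pos = sum(1 for a, p in pairs if a == positive and p == positive)
    # Reviews predicted Positive that are actually something else.
    false_pos = sum(1 for a, p in pairs if a != positive and p == positive)
    # Positive reviews the model failed to label Positive.
    false_neg = sum(1 for a, p in pairs if a == positive and p != positive)
    predicted_pos = true_pos + false_pos
    actual_pos = true_pos + false_neg
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    return precision, recall


# Made-up labels for six reviews (illustrative only).
actual = ["Positive", "Positive", "Positive", "Negative", "Negative", "Negative"]
predicted = ["Positive", "Positive", "Negative", "Positive", "Negative", "Negative"]

precision, recall = precision_recall(actual, predicted)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case, two of the three reviews predicted Positive are correct, and two of the three truly Positive reviews are found, so both measures come out to two thirds.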

First, both positive and negative movie reviews are collected. All of the words are stemmed into root words and stored under their polarity (positive or negative). Both a vector wordlist and a model are created. Then the required list of movies is given as input. The model compares every word in the reviews of the given movies with the words stored earlier under each polarity. Each movie is rated according to the polarity under which the majority of its words fall. For example, for Django Unchained, the reviews are compared with the vector wordlist created at the beginning. Most of the words fall under positive polarity, so the outcome is Positive. The same logic produces a Negative outcome.
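The majority-of-words idea described above can be sketched in a few lines of Python. This is a simplification for intuition only: the RapidMiner process trains an SVM on word vectors rather than counting matches, and the word lists and sample review below are hypothetical.

```python
def classify_review(tokens, positive_words, negative_words):
    """Label a review by which polarity list matches more of its words."""
    pos_hits = sum(1 for t in tokens if t in positive_words)
    neg_hits = sum(1 for t in tokens if t in negative_words)
    if pos_hits > neg_hits:
        return "Positive"
    if neg_hits > pos_hits:
        return "Negative"
    return "Undecided"  # tie, or no known words at all


# Hypothetical polarity word lists (illustrative only).
positive_words = {"great", "brilliant", "gripping", "funny"}
negative_words = {"slow", "boring", "dull", "weak"}

review = "a great and brilliant film with one slow scene".split()
label = classify_review(review, positive_words, negative_words)
print(label)
```

Here two words match the positive list and one matches the negative list, so the review is labeled Positive.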

The first step in implementing this analysis is processing the documents from data, i.e. extracting the positive and negative reviews of a movie and storing them under different polarities. The model is shown in Figure 1.

Figure 1

Under the Process Document operator, click on Edit List on the right. Load the positive and negative reviews under the class names "Positive" and "Negative", as shown in Figure 2.

Figure 2

Inside the Process Document operator, several nested operations take place: tokenizing the words, filtering the stop words, stemming the words into root words, and keeping only the tokens between 4 and 25 characters long, as shown in Figure 3.
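For readers who want to see what these four nested steps do to a sentence, here is a rough Python sketch that applies them in the same order. It is not RapidMiner's implementation: the stop-word list is a tiny hypothetical sample, and `crude_stem` is a deliberately naive suffix stripper standing in for a real stemmer.

```python
import re

# Tiny, hypothetical stop-word list (a real list is much longer).
STOP_WORDS = {"the", "a", "an", "and", "is", "was", "it", "this", "of", "to", "in"}
# Suffixes tried in order by the naive stemmer below.
SUFFIXES = ("ing", "ed", "ly", "es", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text, min_len=4, max_len=25):
    tokens = re.findall(r"[a-z]+", text.lower())          # 1. tokenize
    tokens = [t for t in tokens if t not in STOP_WORDS]   # 2. filter stop words
    tokens = [crude_stem(t) for t in tokens]              # 3. stem
    return [t for t in tokens if min_len <= len(t) <= max_len]  # 4. filter by length


tokens = preprocess("The acting was amazing and the plot is gripping")
print(tokens)
```

Note that "acting" is stemmed to "act" and then dropped by the length filter, which shows how the order of the steps affects the final wordlist.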

Figure 3

Then two operators are used, Store and Validation, as shown in Figure 1. The Store operator writes the word vector to a file and directory of our choosing. The Validation operator (cross-validation) is a standard way to assess the accuracy and validity of a statistical model. Our data set is divided into two parts, a training set and a test set. The model is trained on the training set only and its accuracy is evaluated on the test set. This is repeated n times. Double-click on the Validation operator. There will be two panels: Training and Testing. Under the Training panel, a linear Support Vector Machine (SVM) is used, a popular classifier whose decision function is a linear combination of all the input variables. To test the model, we use the ‘Apply Model’ operator to apply the trained model to our test set. To measure the model's accuracy we use the ‘Performance’ operator. The operations under Validation are shown in Figure 4.
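The train/test/repeat loop behind cross-validation can be sketched in plain Python. This is only an outline of the idea, assuming k is no larger than the number of examples. The `train_fn` and `predict_fn` arguments are hypothetical placeholders for any learner, and the majority-class learner below is a trivial stand-in, not the linear SVM used in the RapidMiner process.

```python
from collections import Counter


def k_fold_indices(n_items, k):
    """Yield (train_idx, test_idx) pairs for k roughly equal folds."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def cross_validate(examples, labels, train_fn, predict_fn, k=5):
    """Average accuracy over k rounds of train-on-rest, test-on-one-fold."""
    accuracies = []
    for train_idx, test_idx in k_fold_indices(len(examples), k):
        model = train_fn([examples[j] for j in train_idx],
                         [labels[j] for j in train_idx])
        correct = sum(1 for j in test_idx
                      if predict_fn(model, examples[j]) == labels[j])
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / len(accuracies)


# Stand-in learner: always predicts the most common training label.
def train_majority(examples, labels):
    return Counter(labels).most_common(1)[0][0]


def predict_majority(model, example):
    return model


# Made-up data: ten placeholder reviews, six Positive and four Negative.
examples = [f"review {i}" for i in range(10)]
labels = ["Positive"] * 6 + ["Negative"] * 4
print(cross_validate(examples, labels, train_majority, predict_majority, k=5))
```

Each example lands in the test set exactly once, so the averaged accuracy reflects performance on data the model did not see during training.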

Figure 4

Then run the model. The resulting class recall and precision percentages are shown in Figure 5. The model and vector wordlist are stored in a Repository.

Figure 5

Then retrieve both the model and the vector wordlist from the Repository where you stored them earlier. Connect the output of the retrieved wordlist to the Process Document operator, as shown in Figure 6. The operations inside Process Document are the same as those shown in Figure 3.

Figure 6

Then click on the Process Document operator and click Edit List on the right. This time I have added reviews of 5 movies from the Rotten Tomatoes website, stored in a directory. Assign the class name "unlabeled", as shown in Figure 7.

Figure 7

The Apply Model operator takes a model from a Retrieve operator and unlabeled data from the Process Document operator as input and outputs the labeled data at the ‘lab’ port, so connect that to the ‘res’ (results) port. The result is shown below. For Les Miserables, there is 86.4% confidence that it is positive and 13.6% that it is negative, because its reviews match the wordlist under positive polarity more closely than the wordlist under negative polarity.

Figure 8

Monday, January 28, 2013

Prototype on Paper - App Development

Thus far, we have been covering differing methods of data mining using applications such as RapidMiner and Orange. We've begun to discuss the framework associated with extracting relevant data and displaying that in an understandable way. Therefore, the next step will be considering the audience that this information will be shared with, our customer.

We must consider that many of the people making decisions in politics, business, and service industries are not necessarily skilled statisticians. Nor are they skilled in the tools to extract data as we are. So, the question becomes: How can we allow the user (who is not a mathematician or statistician) to access relevant information and make decisions based on it without a baby-sitter? Well, in order to answer this question, we must first think like a designer...

First, we need to empathize with the customer/user and understand his/her environment and motivations. Then, focus in on the things that he/she holds as valuable. Next, generate a number of different ideas that vary in order to arrive at a tool that will meet the needs of the customer.

*This idea of design thinking will be something I post about in the near future, but isn't a significant part of the context of what we're discussing. However, it is important to think about if you're considering using this tool to develop a prototype.

So, after we've identified the elements of a tool, what next? We have to prototype and make something, right? Well, what if the answer you've arrived at isn't something you know how to make... say an iOS application?

That's where the Prototype on Paper iOS app comes in. This application allows you to literally DRAW out exactly how you see an app being mapped out and make it. Thus, an engineer with next to zero knowledge of app development can communicate and show a developer what he/she is thinking and how he/she arrived at the idea. However, this also suggests a new way to look at app development.

Currently, app development is somewhat of a mystical process to those that aren't in the "know". A great deal of time is spent on them so they can be readily available for mass spread. BUT, what if the market changed from public focus to individual? What if instead of spending months on creating an app for the public, you could make a quick and dirty app that had very few functions, but worked for the small scope that you needed it to? This is a really neat thought and something to definitely talk about more, but for now I'm focusing on the instance where I need to make an app that serves a specific purpose and I want to see how my user will interact.

For example, I'm working with the Lee County Emergency Management Agency on how they approach natural disaster relief. One of the specific areas we're analyzing is how social media is considered. On April 27, 2012 there were a series of horrific tornadoes that swept through our state. Because of the devastating carnage that ensued, 911 operators were tied up and those in peril could not contact anyone to let them know their plight. So, being resourceful, these people turned to social media to let anyone and everyone know what was wrong, where they were, and what they needed. This, in effect, created a whole litany of other problems, but the one we'll consider for the sake of this conversation was that this information was not going to the right people. Emergency responders were not notified of the people who were in need of help and therefore could not coordinate the proper relief efforts. So, people were rushing to help while wearing flip-flops and t-shirts and then stepping on rusty nails and becoming another victim in the picture. This image leaves us with some very distinct needs. The entity that is coordinating needs to have a picture of what information is traveling over local social media channels and have a way to manage tasks and send correspondence of needs/locations to people who can help.

Thus, I developed an app that will allow these things to happen. And here's how I did it:

1. Download the app "Prototype on Paper" from iTunes.
2. Use some sort of methodology (I used design thinking as defined by the d.school at Stanford) to develop the "pages" of your app, just like you would for a website.
3. Launch the app.
4. After you've gone through the tutorial, touch the "+" in the top left-hand corner of the home screen.
5. Enter a title for your app (or project, as it's called in the app).
6. Begin by touching the camera in the bottom left-hand corner of the screen.
7. Take a picture of each of your "pages".
8. On the project screen (this is where all of your pictured pages sit in rows), select one of your pages.
9. In the top right-hand corner of the screen, touch the "+" that is inside a box. A red square will appear on your screen.
10. Touch and drag the red square to any place on your page where you intend for the user to touch to open a new page. Resize it by dragging one of the square's corners at a time.
11. After reaching the desired location and size, touch the prompt "Link To".
12. On the next page, select the page you want that button to go to when pressed by the user. Note that the bottom of the current page has 5 different selections for how the transition from one page to the next can occur.
13. After selecting, press "Done" in the top right-hand corner of the page.
14. Repeat this process until you have placed links on all the buttons on your drawn pages.
15. When you're ready to test your app, select the play button on either the top right (when close up to one of your drawn pages) or bottom center (when on the project's main page).
16. Navigate through your app and take note of anything you've forgotten.
17. If you forgot to place a link, pinch your fingers together on the screen and go back to step 9.
18. Most important step: keep in mind you just threw together a quick and dirty app in like an hour. Now, give it to your user and see how they interact with it. Receive their criticism as an anthropologist, not an analyst. After all, what's to get upset about? You just spent a minimal amount of time creating this super useful tool and all you have to do to change it is erase something and draw something new or touch a few buttons.

I've created a video on my YouTube channel to show how this bad-boy works. See below.

I hope you enjoy!

Monday, January 14, 2013

Dear Big Data Students,

First...Thanks to everyone for helping establish a collegial, inviting and thoughtful classroom environment. Your willingness to critically engage with and talk about topics from the presentations suggests that we'll have a great opportunity to learn about big data, while having fun in the process.

With a large section of 55+ intellectually adept INSY students, we will rarely (if ever) have enough time to adequately address all topics and questions. Thankfully, you now have a digital space for such intellectual endeavors. The guidelines for using the blog are highlighted in both the Syllabus and in the lecture materials for the first two classes. In addition, please feel free to use the blog to:

elaborate upon an idea from in-class discussion; or
engage in another mode of critical reflection.

From the Dashboard, all you need to do is click “New Post”.

If questions or suggestions arise, please don’t hesitate to contact me. Don't forget, this blog is made available on the web not only to educate your INSY 4970 colleagues, but also to share your thoughts, reflections, tutorials and Big Data-related news with the rest of the world. Essentially, this is our space for showcasing the skills of ENG students at Auburn University in making sense of Big Data and using the knowledge captured to tackle large-scale engineering problems.

For the time being, please familiarize yourself with the Big Data Discussion blog. Whenever you are ready, feel free to compose an engaging and critically reflective post pertaining to big data analytics.

Let the digital critical engagement begin...

War Eagle!!
Fadel Megahed
www.fadelmegahed.com