
Tuesday, March 19, 2013

Video Tutorial: Using Statwing to Analyze NBA Player Statistics



              



While searching for information related to our Big Data course, I found a web-based data analysis tool called Statwing. It is a free program that anyone can use to check the statistical significance of one variable's effect on another. For the video tutorial below, I chose to look at NBA player statistics to see whether the team a player is on or the position he plays correlates with games played, minutes played, usage rate, true shooting percentage, percentage of field goals assisted, rate of assists per possession used, turnover rate, offensive rebound rate, defensive rebound rate, total rebound rate, and player efficiency rating. I gathered my data from www.hoopdata.com, an affiliate of ESPN. Some of the results surprised me, while many others were as expected. I hope you find the following video useful. Thanks for watching.


 

If you would like to use this program, please go to www.statwing.com. It is easy to use and only requires an email address and password to sign up.
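
If you would rather reproduce this kind of check outside Statwing, here is a minimal Python sketch of the same idea: testing whether a categorical variable (player position) is associated with a numeric one (player efficiency rating). The file name and column names ("Position", "PER") are hypothetical placeholders for whatever your hoopdata.com export contains, and the one-way ANOVA here is only roughly the kind of test a tool like Statwing runs for this comparison.

```python
# Minimal sketch: does player position relate to player efficiency rating?
# File name and column names are hypothetical placeholders.
import pandas as pd
from scipy import stats

players = pd.read_csv("nba_player_stats.csv")   # hypothetical export from hoopdata.com

# Group the numeric column by position and run a one-way ANOVA.
groups = [g["PER"].dropna().values for _, g in players.groupby("Position")]
f_stat, p_value = stats.f_oneway(*groups)

print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
# A small p-value (e.g. < 0.05) suggests position is associated with
# differences in player efficiency rating.
```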



Big Data Conferences


 

Recently, while looking for more information on Big Data, I came across some interesting conferences. Conferences allow a huge quantity of information to be shared among peers, and that sharing can help advancements in the field occur at a much faster rate.

One particular conference seemed very interesting and had an enticing list of speakers. It focused on Big Data across many fields, including the US Army, NASA, and numerous private industries, and brought together key agencies from the government and the private sector to share knowledge and help steer Big Data's future. The conference let the government learn from recent lessons in private industry and from the new methods and techniques being used to process this type of data. It was held on March 5-6, 2013 in Washington, D.C. This is a conference I would try to attend in the future if you or your company sees the need.

Another large conference, Big Data TechCon, is coming up on April 8-10, 2013 in Boston, MA. It is made up mostly of tutorials, classes, and training sessions, so attendees receive hands-on training in Hadoop, MapReduce, NoSQL databases, and more. This conference also has a very impressive list of speakers. The hands-on classes and tutorials make it a great learning experience for furthering your education in the Big Data field.

Below is a partial list of some of the speakers attending the two conferences.

• Ms. Jo Strang
Associate Administrator, Safety, Federal Railroad Administration, DOT

• Dr. Sasi K. Pillay
Chief Technology Officer, Office of the CIO, NASA

• Dr. Sastry Pantula
Director, Mathematical Sciences, National Science Foundation

• Mr. Alan Shaffer – invited
Principal Deputy, OASD/R&E

• Mr. Greg Elin
Chief Data Officer, Federal Communications Commission

• Mr. Michael Simcock
Chief Data Architect, Department of Homeland Security

• Dr. Ashit Talukder
Chief, Information Access Division, NIST

• Dr. Mark Luker
Associate Director, National Coordination Office of NITRD

• Dr. Ashok Srivastava
Principal Scientist, Data Sciences, NASA Ames Research Center

• Mr. Niall Brennan
Director, Office of Information Products and Data Analytics, CMS/DHHS

• Mr. Jeff Butler
Director, Research Databases, Internal Revenue Service

• Mr. Shawn Kingsberry
Chief Information Officer, Recovery Accountability and Transparency Board

• Mr. Ted Okada
Senior Advisor for Technology, Office of the Administrator, FEMA

• Ms. Sophie Raseman
Director for Smart Disclosure, Department of the Treasury

• Mr. Dominic Sale
Policy Analyst, Office of Management and Budget

• Mr. Paul Reynolds
Information Architect, Department of Homeland Security

• Mr. John Montel
eRecords Service Manager, Department of the Interior (DOI)

• Ms. Marina Martin
Head, Education Data Initiative, Department of Education

• Senior Representative
Defense Advanced Research Projects Agency (DARPA)

• Dr. Nancy Grady
Technical Fellow, Data Scientist, Homeland and Civilian Solutions, SAIC

• Ms. Susie Adams
Vice President, Federal Sector, Microsoft

• Ms. Caron Kogan
Strategic Planning Director-Big Data, Lockheed Martin

• Mr. Bruce Weed
Program Director, Worldwide Big Data Business Development, IBM

• Mr. Kevin Jackson
Vice President and General Manager, Cloud Services, NJVC

• Mr. Bill Hartman
President, TerraEchos

• Mr. Mike Daconta
Vice President, Advanced Technology, InCadence Solutions

• Dr. Flavio Villanustre
Vice President, Technology Architecture and Product, LexisNexis

• Mr. Scott Gnau
Chief Development Officer, Teradata

• Mr. Dante Ricci
Director, SAP Federal Innovation

• Mr. Tom Plunkett
Senior Consultant, Oracle Public Sector

• Mr. Sean Brophy
Senior Analyst, Tableau Software

 

I think more conferences of this kind should be held to allow further collaboration between colleagues and to help attendees build a larger network. Anyone who has the means to go should take advantage of the massive amount of information available at a conference of this sort.



 

Monday, March 18, 2013

Tutorial-Using Macros to Easily Graph WorldDataBank data in Google Spreadsheets

This tutorial is designed to help you easily create motion charts in Google Spreadsheets. I have linked to an Excel file that lets you format the data as needed. The advantage of this method is that data downloaded from WorldDataBank is not in the format Motion Charts require; the macros contained in the Excel file will quickly put the data into the correct shape. Enjoy the video and happy motion charting!

READ BELOW: THIS IS IMPORTANT FOR USING THIS TECHNIQUE!


I was not able to upload the Excel file to Blogger directly because it is not allowed for security reasons, so I have linked to a Dropbox folder containing the Excel file. In the video, the first step I give is to open the Excel file embedded in this Blogger post. Instead,

 CLICK ON THIS LINK TO GET THE EXCEL FILE NEEDED TO USE THIS TECHNIQUE, THEN FOLLOW THE STEPS AS SHOWN IN THE VIDEO

https://www.dropbox.com/sh/rh9sfu0ojt27313/6hnJ8y4aNE
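
If you prefer to skip the Excel macros, here is a minimal Python/pandas sketch of the same reshaping idea, assuming the usual WorldDataBank export layout (one row per country and indicator, one column per year). The file and column names are placeholders for your own download; this is an illustration, not the macro from the video.

```python
# Minimal sketch (pandas instead of Excel macros): reshape a wide WorldDataBank
# export into the one-row-per-country-and-year form that motion charts expect.
# File and column names are hypothetical placeholders.
import pandas as pd

wide = pd.read_csv("worlddatabank_export.csv")

id_cols = ["Country Name", "Series Name"]             # adjust to your export
year_cols = [c for c in wide.columns if c not in id_cols]

long = wide.melt(id_vars=id_cols, value_vars=year_cols,
                 var_name="Year", value_name="Value")
long["Value"] = pd.to_numeric(long["Value"], errors="coerce")   # ".." becomes NaN

# One column per indicator, one row per (country, year).
tidy = long.pivot_table(index=["Country Name", "Year"],
                        columns="Series Name", values="Value").reset_index()
tidy.to_csv("motion_chart_ready.csv", index=False)    # import this into Google Spreadsheets
```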


Sunday, March 17, 2013

Video Tutorial: A tip to enhance RapidMiner performance



In this video, I show you how to arrange the running order of parallel processes in RapidMiner so that computer memory is allocated to the operators efficiently.

Saturday, March 16, 2013

Video Tutorial: Neural Network Toolbox in MATLAB


Following my previous video about building a neural network model in RapidMiner, I made an introductory video showing how to work with the Neural Network Toolbox in MATLAB.

Friday, March 15, 2013

Text Processing Tutorial with RapidMiner

I know that a while back it was requested (on either Piazza or in class, I can't remember which) that someone post a tutorial on how to process a text document in RapidMiner, and no one followed up. In this tutorial, I will try to fulfill that request by showing how to tokenize and filter a document into its individual words and then count the occurrences of each word (essentially the same assignment as HW 2, plus filtering, but done through RapidMiner instead of AWS).

1) I first downloaded my document (The Entire Works of Mark Twain) from Project Gutenberg's website as a text file. Save the document to a folder on your computer.

2) Open RapidMiner and click "New Process". On the left hand pane of your screen, there should be a tab that says "Operators"- this is where you can search and find all of the operators for RapidMiner and its extensions. By searching the Operators tab for "read", you should get an output like this (you can double click on the images below to enlarge them):


There are multiple read operators depending on which file you have, and most of them work the same way. If you scroll down, there is a "Read Documents" operator. Select this operator and enter it into your Main Process window by dragging it. When you select the Read Documents operator in the Main Process window, you should see a file uploader in the right-hand pane. 


 Select the text file you want to use.



 3) After you have chosen your file, make sure that the output port on the Read Documents operator is connected to the "res" node in your Main Process. Click the "play" button to check that your file has been received correctly. Switch to the results perspective by clicking the icon that looks like a display chart above the "Process" tab at the top of the Main Process pane. Click the "Document (Read Document)" tab. Your output text should look something like this depending on the file you have chosen to process:


4) Now we will move on to processing the document to get a list of its different words and their individual count. Search the Operators list for "Process Documents". Drag this operator the same way as you did for the "Read Documents" operator into the main pane.


Double click the Process Documents operator to get inside the operator. This is where we will link operators together to take the entire text document and split it down into its word components. This consists of several operators that can be chosen by going into the Operator pane and looking at the Text Processing folder. You should see several more folders such as "Tokenization", "Extraction", "Filtering", "Stemming", "Transformation", and "Utility". These are some of the descriptions of what you can do to your document.  The first thing that you would want to do to your document is to tokenize it. Tokenization creates a "bag of words" that are contained in your document. This allows you to do further filtering on your document. Search for the "Tokenize" operator and drag it into the "Process Documents" process. 



Connect the "doc" node of the process to the "doc" input node of the operator if it has not been connected automatically. Now we are ready to filter the bag of words. In the "Filtering" folder under the "Text Processing" operator folder, you can see the various filtering methods you can apply to your process. For this example, I want to filter out certain words that don't really carry any meaning for the document itself (such as a, and, the, as, of, etc.); therefore, I will drag "Filter Stopwords (English)" into my process because my document is in English. I also want to filter out any remaining words shorter than three characters. Select "Filter Tokens by Length" and set the parameters as desired (in this case, I want my minimum number of characters to be 3 and my maximum to be an arbitrarily large number, since I don't care about an upper bound). Connect the nodes of each subsequent operator as shown in the picture.


 After I filtered the bag of words by stopwords and length, I want to transform all of my words to lowercase since the same word would be counted differently if it was in uppercase vs. lowercase. Select the operator "Transform Cases" and drag it into the process.



 5) Now that I have the sufficient operators in my process for this example, I check all of my node connections and click the "Play" button to run my process. If all goes well, your output should look like this in the results view:


 Congrats! You are now able to see a word list containing all the different words in your document, with each word's occurrence count next to it in the "Total Occurences" column. If you do not get this output, make sure that all of your nodes are connected correctly and to the right type; some errors occur because the output at one node does not match the type expected at the input of the next operator. If you are still having trouble, please comment or check out the Rapid-i support forum.
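
For comparison, here is a minimal Python sketch of the same pipeline (tokenize, transform cases, filter stopwords, filter tokens by length, count). It is only an illustration of the idea, not what RapidMiner does internally: the file name is a placeholder and the stopword set is a small illustrative subset, not the full English list RapidMiner uses.

```python
# Minimal sketch of the same pipeline outside RapidMiner:
# tokenize -> lowercase -> filter stopwords -> filter by length -> count.
# The stopword set below is only a small illustrative subset.
import re
from collections import Counter

STOPWORDS = {"a", "an", "and", "the", "as", "of", "to", "in", "is", "it", "that"}

with open("mark_twain.txt", encoding="utf-8") as f:   # hypothetical file name
    text = f.read()

tokens = re.findall(r"[A-Za-z]+", text)               # tokenize on letters only
tokens = [t.lower() for t in tokens]                  # transform cases
tokens = [t for t in tokens if t not in STOPWORDS]    # filter stopwords (English)
tokens = [t for t in tokens if len(t) >= 3]           # filter tokens by length

counts = Counter(tokens)
for word, n in counts.most_common(20):                # top 20 words and their counts
    print(word, n)
```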


Video Tutorial: Web Scraping w/ Mozenda


Part of analyzing data is extracting it from the web into a usable format.  Web scraping allows the user to bypass many of the formatting issues of simply copying and pasting from the web.  One of the more interesting features is that each job, or agent, is saved to the Mozenda server, which reduces the storage space used on the user's PC.  It also allows agents to be scheduled, which can be very useful for mining data from a website that is constantly changing.  For example, if a person wanted to keep track of the prices of all televisions on eBay, a scheduled agent could keep an updated file of those prices.  In this tutorial, I show how to use some of the basic features of Mozenda (for the curious, a plain-Python sketch of the same scheduled-scraping idea follows the links below).

Links:

Mozenda Website: http://www.mozenda.com/

Tutorials: http://www.mozenda.com/video01-overview

Data used in this video: http://espn.go.com/college-sports//football/recruiting/playerrankings/_/view/rn300/sort/rank/class/2013
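
Mozenda is point-and-click, but the underlying idea of a scheduled scraping agent can also be sketched in code. Below is a minimal Python example using the open-source requests and BeautifulSoup libraries; the URL, CSS selector, and output file are hypothetical placeholders, and this is a generic illustration of the concept, not how Mozenda itself works.

```python
# Minimal sketch of a "scheduled agent" in plain Python (not Mozenda):
# fetch a page, pull out price text, and append timestamped rows to a CSV.
# The URL and CSS selector below are hypothetical placeholders.
import csv
from datetime import datetime

import requests
from bs4 import BeautifulSoup

URL = "https://example.com/televisions"   # placeholder listing page

def scrape_prices(url):
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    # ".price" is a made-up selector; inspect the real page to find the right one.
    return [tag.get_text(strip=True) for tag in soup.select(".price")]

if __name__ == "__main__":
    rows = [(datetime.now().isoformat(), p) for p in scrape_prices(URL)]
    with open("tv_prices.csv", "a", newline="") as f:
        csv.writer(f).writerows(rows)
    # Run this script on a schedule (e.g., with cron) to keep the price file updated.
```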

Server Virtualization and its Effect on Data Management




 
     I recently stumbled upon a very interesting seven-part series of videos on data management.  The series is conducted by Jon Toigo, the “Data Management Institute Chairman and Toigo Partners International CEO.”  This particular part, and its accompanying video, covers data storage needs that result from server virtualization.
 
     When server virtualization was introduced, many companies jumped on board without fully understanding the concept or its effect on storage.  However, Toigo explains that server virtualization has actually led to an increase in demand for storage capacity.  Original estimates from 2011 predicted demand would grow 30% annually through 2014; one year later that figure was revised to 300%, and another firm estimates 650% growth.  These figures are staggering from a storage capacity perspective: the original estimate called for 46 exabytes of total installed external storage capacity, while the updated figures rise to 168 EB or 212 EB, depending on which firm you believe.  In this video, Toigo explains the effect server virtualization has on data storage requirements.  Many of the concepts are above my limited understanding, but it appears that server virtualization often depends on data replication as a failsafe, and this replication of large chunks of data is driving demand for capacity through the roof.  Another problem is the support for “proprietary functionality in the server hypervisor software.”
     
     In this link, Toigo discusses this concept, which he describes as another part of the “storage infrastruggle.”  There are several links on this page, though some require a free membership.  One link in particular provides a brief tutorial on managing a server virtualization environment.  The rest of the seven-part series can be found on the left side of the page, about halfway down.


Thursday, March 14, 2013

Video Tutorial: Neural Network in RapidMiner


This tutorial shows how to build a Neural Network model in RapidMiner.

Distance-based clusterings


Clustering is an important unsupervised learning method. The main idea is to group data points (or feature vectors, observations) into clusters (Jain, Murty & Flynn, 1999) and so obtain a classified structure in a collection of unlabelled data, based on a similarity criterion. In contrast to classification, clustering assigns data points to groups whose members have similar properties in some way (Moore, 2001).
The main similarity criterion is distance: data points that are closer to each other than to the other points are considered to belong to the same cluster. This is called distance-based clustering. The distance between two points is typically the Euclidean distance:
d(x, y) = √((x₁ − y₁)² + (x₂ − y₂)²),
where x = (x₁, x₂) and y = (y₁, y₂) are any two data points in two-dimensional space.

The first distance-based clustering procedure is hierarchical clustering.
A hierarchical clustering is a set of nested clusters. Clustering based on Euclidean distance works by merging the two closest clusters at each step.
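
As a minimal illustration (not part of the original references), here is how an agglomerative hierarchical clustering on Euclidean distance might look in Python with SciPy; the toy 2-D points are made up.

```python
# Minimal sketch: agglomerative hierarchical clustering on Euclidean distance.
# The toy 2-D points are made up for illustration.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.8], [9.0, 1.0]])

# "single" linkage merges, at each step, the two clusters whose closest members
# are nearest in Euclidean distance.
Z = linkage(X, method="single", metric="euclidean")

labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into 3 clusters
print(labels)
```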

Another important distance-based clustering algorithm is K-Means Clustering. 

K-means is a simple unsupervised clustering algorithm in data mining. The general idea is to separate a set of observations into k clusters. The separation is done according to the means of the clusters: each observation is assigned to the cluster with the nearest mean (centroid) (MacQueen, 1967).
Dramatic differences between cluster sizes and densities, empty clusters, and outliers can be problems for this algorithm (Kumar, 2002).
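
To make the "assign each point to the nearest centroid, then recompute the means" loop concrete, here is a minimal NumPy sketch of k-means on made-up 2-D points. It is illustrative only, with no handling of empty clusters or bad initializations.

```python
# Minimal sketch of the k-means loop: assign each point to the nearest centroid
# (Euclidean distance), then recompute each centroid as the mean of its points.
# The toy data and the choice of k are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.5, size=(30, 2)) for loc in ([0, 0], [5, 5], [0, 5])])
k = 3

centroids = X[rng.choice(len(X), size=k, replace=False)]   # random initial centroids
for _ in range(100):
    # distances[i, j] = Euclidean distance from point i to centroid j
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)                      # nearest-centroid assignment
    new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    if np.allclose(new_centroids, centroids):              # stop when centroids settle
        break
    centroids = new_centroids

print(centroids)   # final cluster means
```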

Friday, March 8, 2013

Topsy: One stop shop for all things Social Media

Brand management has become a central focus for big business. With tools like Topsy, companies can observe their impact on social media both historically and in real time. The site has a great many tools that allow in-depth analysis. If you sign up, you can use it for free for 30 days. However, you'll have to use your Auburn email, as they require a "corporate" email account (gmail won't work). If you don't want to do that, you can visit the site and use some of the truly free tools (not in Topsy Pro) here.

In Topsy (free), you can search for hashtags in tweets (example: "#winning") or any phrase you'd like to observe (example: "welfare reform"). Things get especially interesting when you start doing comparisons. If you visit the site, click on "Social Analytics" at the top of the homepage and observe the graph (seen below). The graph compares the activity of the terms "ipad", "kindle", and "galaxy nexus" over the specified time period.

 
 
In Topsy Pro, you can manage different hashtags and phrases on the same dashboard by selecting/deselecting check boxes. This allows you to compare a very large number of phrases and hashtags all in one place. You can also select which social media sites you want to apply your search to; Topsy Pro isn't limited to Twitter, and you can search Flickr, Facebook, Tumblr, and several others.
 
Lastly, one of the greatest features of Topsy Pro is the "related words" feature. After a user enters an input value (hashtag or phrase), the user can select a tab on the dashboard that reveals a tile GUI with other related hashtags and phrases associated with the input. This is a tremendous tool because it allows a company to view not only which words are associated with its brand, but also which words and phrases are gathering significant momentum.
 
This tool has a wide variety of other features, but these are some of the most significant that I thought I should mention. Get a free subscription and try it out.
 
I used Topsy Pro until my subscription ran out (which is why I don't have any other screen captures to post). I applied the tool to the SGA elections to see who was trending each day. I'm going to post a tutorial using that as an example in the near future.