Tuesday, April 23, 2013

OutWit Hub: Web-scraping made easy

I read a blog earlier this term on web-scraping and decided to check it out. I started with the suggested software and quickly realized that there are only a few really good web-scraping tools that support Mac OS. So, after reading a few reviews, I landed on OutWit Hub.

OutWit Hub comes in two versions: Basic and Pro. The difference is in the available tools. In Basic, the "words" tool isn't available; it lets you see how frequently each word occurs on the page you are currently viewing. Several of the scraping tools are disabled as well. I upgraded to Pro; it's only $60 per year, and I was curious to see what else it could do.

I'm not a computer scientist, by a long shot, but I have a general grasp of coding and how computers operate. For this reason, I really like OutWit Hub. The tutorials on their site are incredible: they walk you through examples, and you can interact with the UI while the tutorial is running. Many of the tools are also pretty intuitive to use. If you're not sold on getting the Pro version, I'd encourage you to visit their website and download the free version just to check out the tutorials. They're really great.

I've used the software on several examples just to test it. I needed to get all of the emails off of an organization's website, so instead of copying and pasting everything and praying for the best, I used the "email" feature in OutWit, and the names and emails of every member on the page populated an exportable table. #boom

Then I wanted to see if it could be harnessed for Twitter and Facebook. Using the source-code approach to scraping, I was able to extract text from the loaded parts of my Twitter and Facebook feeds. The problems I encountered were not knowing enough about the code to make the scraper dynamic enough to walk through unloaded pages, and not knowing how to automate it to build a larger dataset (i.e., continuously run the scraper over a set amount of time by repeatedly reloading the page and harvesting the data; it's possible, I just didn't figure it out).

So, I've recorded a video tutorial on how to use OutWit Hub Pro's scraper feature to scrape the loaded part of your Facebook news feed. Below are the written instructions, and the video at the bottom gives you the visuals.

Essentially, you will:
1.) Launch OutWit Hub (presuming you've downloaded and upgraded to Pro).
2.) Log in to your profile on Facebook.
3.) Take note of whatever text you want to capture as a reference point for when you go looking in the code. (This assumes you don't know how to read HTML.) For example, if the first person on your news feed says, "Hey check out this video!", then take note of their statement "Hey check out this video!"
4.) Click the "scrapers" item on the left side of the screen.
5.) In the search window, type in the text "Hey check out this video" and observe the markers in the code that sit at the beginning and end of that text.
6.) In the window below the code, click the "New" button.
7.) Type in a name for the scraper.
8.) Click the checkbox in row 1 of the window.
9.) Enter a title/description for the information you're collecting in the first column. Using the same example: "Stuff friends say on FB" or "Text". It really only matters if you're going to be extracting other data from the same page and want to keep it separate.
10.) Under the "Marker Before" column, type in the HTML code you identified as the beginning of the data you want to extract.
11.) Repeat step 10 for the next column, using the HTML code you identified as the end of the data. (A rough code sketch of what these two markers do is shown just below this list.)
12.) Click "Execute".
13.) Your data is now available for export in several formats: CSV, Excel, SQL, HTML, TXT.
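For anyone curious what the "Marker Before" / "Marker After" step is doing under the hood, here is a rough Python sketch of the same idea. The markers and the sample HTML are made up for illustration; this is not OutWit's actual code.

import re

html = """
<div class="post"><span class="msg">Hey check out this video!</span></div>
<div class="post"><span class="msg">Another status update</span></div>
"""

marker_before = '<span class="msg">'   # text that appears right before the data you want
marker_after = '</span>'               # text that appears right after it

# Grab everything between the two markers, anywhere it occurs on the page
pattern = re.escape(marker_before) + r"(.*?)" + re.escape(marker_after)
for match in re.findall(pattern, html, flags=re.DOTALL):
    print(match.strip())
# Prints:
#   Hey check out this video!
#   Another status update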

Here is a YouTube video example of me using it to extract and display comments made by my Facebook friends that appeared on my news feed.











Monday, April 22, 2013

Visualization - Wilson Lyle and Jessica Clemmensen


This is the link to our visualization project. Our project tracked obesity rates in all 50 states beginning in 2005 and showed how obesity has grown in the United States. We then mined this data to provide information on population and obesity growth rates for the same years, ending in 2011.

http://www.youtube.com/watch?v=tgMJiIsTOEA



Image Comparison in One Mobile App

An app called "Sleep If You Can," which is available for both Android and iOS, is very interesting. I learned about it from a friend on a social network. The app's introduction page for Android is: https://play.google.com/store/apps/details?id=droom.sleepIfUCan&hl=en

Briefly, here's how it works. It's an alarm app. Like other alarm apps, it sounds the alarm when the preset time is up. The difference, and the interesting part, is how you turn the alarm off: you have to take a photo similar to one you took with the device when setting the alarm. If the app decides the new picture matches, it turns the alarm off. This way, you actually have to get up and take the picture, which is more effective than regular alarms that can be dismissed just by swiping the screen.

I was curious how the app compares the two pictures. I found one method on the internet, which is to compare the pixels of the two images. This blog post (http://jeffkreeftmeijer.com/2011/comparing-images-and-creating-image-diffs/) describes the approach.

So how do you compare two pixel colors? Intuitively, if the colors of two pixels are different, the pixels are probably different. The same author mentions a measure called "Delta E" (http://en.wikipedia.org/wiki/Color_difference#CIE76).

Colors are defined in a color space, and Delta E uses "Lab" (http://en.wikipedia.org/wiki/L*a*b*). This space has three components, shown in the picture from Wikipedia: L runs from black to white, a from green to magenta, and b from blue to yellow.

The difference between two colors is then calculated by the formula from the wiki:
Delta E = sqrt((L1 - L2)^2 + (a1 - a2)^2 + (b1 - b2)^2)

I think that if the Delta E value is too high, the two pixels are different, and the pictures are probably different.
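To make the idea concrete, here is a minimal sketch of the per-pixel CIE76 comparison in Python. It assumes two same-size image files named before.jpg and after.jpg (hypothetical names), plus numpy, Pillow, and scikit-image; the threshold at the end is just a guess, not what the app actually uses.

import numpy as np
from PIL import Image
from skimage.color import rgb2lab

def delta_e_map(path_a, path_b):
    # Load both images as RGB arrays scaled to [0, 1]
    rgb_a = np.asarray(Image.open(path_a).convert("RGB")) / 255.0
    rgb_b = np.asarray(Image.open(path_b).convert("RGB")) / 255.0
    lab_a, lab_b = rgb2lab(rgb_a), rgb2lab(rgb_b)
    # CIE76: Euclidean distance in Lab space, computed per pixel
    return np.sqrt(np.sum((lab_a - lab_b) ** 2, axis=-1))

de = delta_e_map("before.jpg", "after.jpg")
print("mean Delta E:", de.mean())

# One possible rule: treat the photos as "the same" if the average Delta E is small.
# A Delta E around 2 is barely noticeable to the eye; 10 here is an arbitrary cutoff.
if de.mean() < 10:
    print("Images look similar enough -- turn the alarm off")
else:
    print("Images look different -- keep ringing")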

I also searched the internet and found many posts about comparing two images. Do you guys have any ideas about how this app works?


Sunday, April 21, 2013

I don't always buy beer, but when I do I buy diapers


There aren't many data mining memes, but the most prominent one involves the correlation between sales of beer and diapers. The association could be due to a number of reasons, including the following:

- A new mother is too tired or too weak to carry a giant bag of diapers, so she sends her burly husband out to pick some up. While he's out, he realizes he's off work and it's time to put down a few beers, so he picks up his favorite brew.
- A diaper emergency happens late at night. While the mother stays home and takes care of the baby, the husband is RUSHED to the grocery store to pick up some diapers. Being annoyed, he also picks up a 12-pack to relax.
- Just the convenience factor: "Well, I don't have any beer at the house, guess I'll pick some up while I'm out."

When studying how these two kinds of sales relate, you have to be careful about the correlation-doesn't-equal-causation problem. Without a background in statistics and hypothesis testing, you can easily reach a misleading conclusion. That's a shame, because there is a lot of power in discovering how things relate. If the relationship between diapers and beer shows up in a large enough percentage of transactions, businesses should react accordingly. Possibly putting the diaper and beer shelves closer together would show a dramatic change in sales (and demonstrate further association between the two). A toy example of how that association gets measured is sketched below.
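Here is a toy sketch in Python of how support, confidence, and lift for the diapers-to-beer rule are computed. The transactions are completely made up; real association-rule mining just does this at scale.

transactions = [
    {"beer", "diapers", "chips"},
    {"diapers", "wipes"},
    {"beer", "diapers"},
    {"beer", "pretzels"},
    {"diapers", "milk", "beer"},
]

n = len(transactions)

def support(items):
    # Fraction of transactions that contain every item in the set
    return sum(items <= t for t in transactions) / n

sup_beer = support({"beer"})
sup_diapers = support({"diapers"})
sup_both = support({"beer", "diapers"})

confidence = sup_both / sup_diapers   # P(beer | diapers)
lift = confidence / sup_beer          # lift > 1 suggests a real association

print(f"support(beer & diapers) = {sup_both:.2f}")
print(f"confidence(diapers -> beer) = {confidence:.2f}")
print(f"lift = {lift:.2f}")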

source: http://blog.patternbuilders.com/tag/retail-analytics/

How Obama is gonna get what he wants


President Obama has big plans for his next term and a big vision for energy efficiency. In the State of the Union address, he challenged every US citizen to do more with the energy they use and become more efficient. President Obama said, "Let's cut in half the energy wasted by our homes and businesses over the next 20 years." It is estimated that a change this radical would create 1,300,000 jobs and reduce the yearly household electric bill by $1,000. Yes, construction is becoming more efficient with the use of LEED-certified silver, gold, and platinum buildings, but we have hardly tapped into what we are capable of reaching. In order to realize these massive savings in energy consumption, we have to optimize the operation of current urban life.

This is where big data analytics comes into play. To analyze how buildings operate, we first need a way to gather data on electricity usage. Currently, there is no technology that fits this role without paying a large chunk of money. One company, Seto, has come to the rescue. Seto designed a device called a WebMeter. This device is easy to install, has a low cost, and monitors the electric flow of up to 36 individual circuits in a building's circuit panel. Readings from the device are recorded on a server chosen by the user and can be accessed at any point of operation. Now that this technology has been released, companies can dive into the applications of big data for energy consumption.

source: http://theenergycollective.com/tyhamilton/187006/big-data-key-unlocking-big-gains-energy-productivity

Visualization TedTalks: data changing views like never before






This is a TED Talk by David McCandless on various data visualizations that I found very interesting. He showed a visualization of "global media panics" over time and the level at which the media reported them. The killer-wasp graphic was quite funny. He goes on to show that one particular media panic, violent video games, has a cyclic pattern in which every November and April there is a peak. He suggests the November peak may be due to the surge of video games coming out for Christmas, while April is the anniversary of the Columbine High School tragedy.




At 9:20 in the video, he shows a visualization representing an analogy for the "bandwidth of the senses," which, from my understanding, basically says, "if your senses were computers, these would be their bit rates." The visualization clearly shows that eyesight has the highest "bandwidth." He goes on to make more analogies about the senses: the throughput (bit rate) of the eyes is comparable to that of a computer network, whereas the throughput of taste is only on par with a basic pocket calculator.
Another data set he turns into an interesting visualization is a comparison of GDP to military budget. I always thought of the US as a military powerhouse, but if you compare its military spending to its GDP, it isn't actually at the top; it has the 8th-highest military-budget-to-GDP ratio.
He wraps up the presentation by saying that many information problems come down to having far too much data, and that visualizations offer a quick solution by letting large amounts of data be understood very quickly.

Association Rules with RapidMiner



Here is a video I made on how to do association rules among text documents in RapidMiner.

Saturday, April 20, 2013

John Deere's Future in Big Data


With the world's population rising and expected to increase by another two billion in the next thirty-five to forty years, food production must increase by 60% to sustain this massive number of people (UN Food and Agriculture Org.). This article on BigData Startups' website highlights that John Deere has entered the realm of big data with the introduction of new products.

John Deere has recently put sensors on its newest pieces of equipment that help decrease equipment downtime and fuel usage. It is also now possible for up-to-date data on weather, soil conditions, and crop features to be communicated directly through the equipment to farmers. This lets users know the best locations and times to harvest their crops in order to maximize the productivity and efficiency of their land.

While not using as many data points as corporations such as Wal-Mart or Amazon, John Deere is working on collecting more data to process, because more data generally means more uses and more accurate analysis. To do this, John Deere is using the programming language R. Once R has been used to forecast information such as demand and crop yield, the results are exported to channels such as FarmSight, the FarmSight Mobile Farm Manager, and MyJohnDeere.com. The article states that FarmSight is used to increase productivity by improving three different areas: machine optimization, logistics, and decision support. With John Deere keeping up with the times and investing in the possibilities of big data, its future looks promising. A rough sketch of what a yield forecast might look like is below.
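The article says the forecasting is done in R; just to illustrate the general idea, here is a tiny sketch in Python with completely made-up numbers. It is not John Deere's actual model, just a simple regression of yield on weather conditions.

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical history: (rainfall in inches, avg temperature in F) -> yield in bushels/acre
X = np.array([[30, 72], [25, 75], [35, 70], [28, 74], [32, 71], [27, 76]])
y = np.array([160, 148, 170, 152, 165, 150])

model = LinearRegression().fit(X, y)

# Forecast yield for expected conditions next season
forecast = model.predict([[31, 73]])
print(f"forecast yield: {forecast[0]:.1f} bushels/acre")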


At the very end of the article there is an interesting video made by John Deere following a day in the life of a farmer with future John Deere technology.

Information from:

Friday, April 19, 2013

RapidMiner Data Aggregation Tutorial

I have used the aggregation function in RapidMiner many times while working on David's and my fantasy football project. It is very useful for compiling weekly statistics on athletes. Here is a step-by-step tutorial on how to use it. First, open up RapidMiner and begin a new process. Next, import your data. In my case I am importing an Excel sheet that contains weekly NFL QB stats from 2008-2011.

After selecting your mode of import choose the correct sheet or file to import.


Select your sheet and click Next.


The next page (step 3 of the import wizard) will ask if you want to make any annotations. This isn't necessary for this data, so I'll go ahead and click Next, which brings us to step 4, shown above. De-select any columns that you do not want to import. In this case I don't care what teams the QBs play for. You MUST change the role of the column you want to group by to ID; you can see that the first column, which contains the names, was changed from attribute to ID. After you do that, click Next and save your data.


Once you get into your main process, drag and drop your data onto the process area. Then look to the left-hand side and click Data Transformation > Aggregation. Drag and drop the Aggregate operator onto the process area. Next, connect the out port of the data to the exa port on the left side of the Aggregate operator, then connect the exa port on the right side of the operator to the result port.


After you connect the ports, click Edit List next to "aggregation attributes." Here, make an entry for each attribute you want to aggregate and select the aggregation functions you want to use. When you're done, click OK.


Next, click Select Attributes next to "group by attributes." Here, move your ID column (in this case, Name) to the right side by selecting it and clicking the right-pointing arrow. Click OK.


Now just click the play button on the toolbar and you get your results! From here you can export the data however you like, or switch to plot view or advanced charts. I hope this tutorial was helpful. For comparison, a rough sketch of the same aggregation in pandas is below.
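Here is roughly the same aggregation done in pandas. The file name and the stat column names ("PassYds", "TD", "Week") are my guesses, not the exact ones from my spreadsheet.

import pandas as pd

df = pd.read_excel("qb_weekly_stats_2008_2011.xlsx")   # hypothetical file name

season_totals = (
    df.groupby("Name")                                  # the "group by attribute" / ID column
      .agg(total_yards=("PassYds", "sum"),
           total_tds=("TD", "sum"),
           games=("Week", "count"))                     # one row per QB, like the Aggregate operator
      .reset_index()
)
print(season_totals.head())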

How Data Crunchers Helped Obama Win Election

The 2008 presidential campaign run by Obama's team marked the first time that campaign strategies were based on quantitative data instead of hunches and subjectivity. With the success found in 2008, Obama's campaign team decided to invest even more heavily in data analytics for the 2012 election.

Getting Money
With all the success found in 2008 using big data analytics, a major weakness also became apparent: too many separate datasets. So one of the first tasks for Obama's campaign team going into the 2012 election was merging all of these data sets into one huge data set, as well as expanding the data analytics team to five times its previous size. Going into the 2012 campaign, the goal was to raise $1 billion, a seemingly enormous amount of money. They accomplished it, however, by strategically targeting possible donors through emails and attractive fundraising events. One example of this strategy: "In late spring, the backroom number crunchers who powered Barack Obama’s campaign to victory noticed that George Clooney had an almost gravitational tug on West Coast females ages 40 to 49. The women were far and away the single demographic group most likely to hand over cash, for a chance to dine in Hollywood with Clooney — and Obama." A very interesting approach, to say the least. The same was done on the East Coast, with Sarah Jessica Parker chosen as the celebrity for a fundraising event.

Swing Voters
In order to gain insight into the behavior of swing voters, the campaign team collected huge amounts of polling data in swing states. This proved to be a huge advantage, as they were able to allocate resources more efficiently. They also dipped into social networks such as Reddit and Facebook because they found these very effective at swaying swing voters. Another unique marketing strategy that came from their big data analytics was the decision to target ads during TV shows instead of the past practice of airing ads between news programs.

With the huge success Obama's campaign team had in 2012 using data mining and big data analytics, a new way of quantitatively developing campaign strategies was born, and it should prove to be the norm going forward.


http://swampland.time.com/2012/11/07/inside-the-secret-world-of-quants-and-data-crunchers-who-helped-obama-win/2/

NFL Draft Data Visualization

After working on David's and my fantasy football project, I did some research on NFL draft prospects. The main website we looked at for draft prospects and their ratings is scout.com, the same rating site that ESPN uses. They have a list of every notable player entering the draft, along with a rating of 1-5 stars, college attended, and hometown. Fadel showed us earlier in the semester a visualization from ESPN that showed a heat map of where high school football recruits come from and how it has changed over the years: < http://espn.go.com/blog/playbook/visuals/post/_/id/10989/graphic-football-recruiting-then-now >. This visualization inspired me to create a heat map of where NFL draft prospects hail from, and I also wanted to expand on that. First, here is a map simply showing where the NFL draft hopefuls are from. The darker states are the ones with more prospects.



As you can see, this is very similar to the visualization that ESPN provided, which makes complete sense. California, Texas, Florida, Georgia, and Ohio rank as the highest states, and these states are also extremely populous. So I decided to make a chart showing draft prospects per capita, which may give a better indication of which states provide the most "bang for your buck" talent-wise. (A quick sketch of the calculation is below.)
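Here is a quick sketch of the per-capita calculation in Python, assuming two hypothetical CSVs: prospects.csv with columns "state" and "prospects", and population.csv with "state" and "population".

import pandas as pd

prospects = pd.read_csv("prospects.csv")
population = pd.read_csv("population.csv")

# Join the two tables on state, then normalize prospect counts by population
merged = prospects.merge(population, on="state")
merged["per_million"] = merged["prospects"] / merged["population"] * 1_000_000

print(merged.sort_values("per_million", ascending=False).head(10))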


It looks like the hotspots have changed. Louisiana, South Carolina, Georgia, Kansas, and, surprisingly, Hawaii look to be the highest per-capita states. Do these states just naturally have a lot of talent? Possibly. It may also be a factor of players from these states going to in-state schools that have great NFL player development. LSU recruits a lot of in-state players and is also great at developing next-level talent; the same goes for Clemson and South Carolina. Also, not surprisingly, the upper Midwest and the Northeast produce very few NFL prospects.

Google's Data Navy

Based on a conversation I had with a friend and a comment I made on one of my classmates' blog posts, I wanted to expand on this concept and its implications (plus a couple of conspiracy theories).

Google's Data Navy *

Like any country, Google is apparently planning to arm itself with its own navy of "data center boats" and go to war in the information/data arena. Back in 2008, Google filed for a patent that described a "water-based data center." These new "water world" data centers would give Google the ability to put computer centers closer to some hard- or costly-to-reach customers. In addition, the patent raises the possibility of these ships generating their own electricity from ocean water and being completely autonomous. I am sure environmental groups will give them a lot of support and free advertising. However, the question is: what is Google's real intent with such an ambitious development?

Microsoft seems to be working on a very similar idea; they are in the middle of creating a massive "mother" data center made of several data centers inside standard metal shipping containers to maximize the use of space. Microsoft's containers have the advantage of being "modules that can be moved around to get the most computing power possible per square foot."* Other major server manufacturers like HP, IBM, Dell, and Sun Microsystems "have created their own data centers in shipping containers that they sell to service providers, the military and research labs."* Google had considered the idea of using containers in the past; the company filed a patent on the containers and even built a prototype system in the garage of its Mountain View, California headquarters.

Google's arguments for the new technology are the following:
1) "Bring data closer to customers"*
2) "Floating data centers could aid the military or help out during a large event"*

What about purposes 3, 4, 5, etc? Do we really believe these are the only two purposes?

At the end of the day, all this "power" (I personally see information as power) is extremely dangerous in the hands of a few. It really scares me to see companies like Google and Walmart whose profits, if they were countries, would put them among the top 30-50 GDPs in the world. Since Google's owners already have so much money (each of them has a "customized" Boeing 727), will the company continue to live in gray areas and start considering unethical and aggressive uses of data? In "lawless" international waters, could this become a paradise for the corrupt?

The following articles/blogs are concerning; they already document several unethical behaviors by Google. What's next? Manipulating countries into wars? It has been said that the third world war will be "ethnic based" - would the fourth world war be "information based"? This is very scary stuff.


Google Faces Antitrust Complaint In India For Possible Unethical Practices

http://www.ibtimes.com/google-faces-antitrust-complaint-india-possible-unethical-practices-702224


Google PR Nightmare: Search Giant Apologizes for Evildoing

http://www.searchenginejournal.com/google-misuses-mocality-database/38789/


Google Still Dealing with Unethical Behavior Allegations in Italy and Beyond

http://siliconangle.com/blog/2011/01/18/google-still-dealing-with-unethical-behavior-allegations-in-italy-and-beyond/

Search & Destroy - Why you can't trust Google!
 http://www.searchanddestroybook.com/book.php


Main Reference/Source
* This post is based on a New York Times blog post called "Google's Search Goes Out to Sea"

http://bits.blogs.nytimes.com/2008/09/07/googles-search-goes-out-to-sea/



Thursday, April 18, 2013

DM Project - Super Bowl Teams

     The following video showcases the analysis of the distance education group consisting of Jay Long, Julian Olander, and Wesley McDonald. 


Machine Learning: the Bridge to AI






Given that machine learning essentially deals with developing algorithms and systems that "learn" from data, it's not surprising that it has a great deal of application in the development of artificial intelligence. As a science fiction fan, I see it as quite possibly the bridge to autonomous, bipedal, humanoid robots, which hopefully won't end up like the Terminator.
Here is a lecture from the Electrical Engineering and Computer Sciences department at UC Berkeley.




At the beginning of the lecture, he explains an experiment in which he attempts to train a remote-control model helicopter (not the real, big kind) to perform some rather intricate maneuvers that can normally only be done by an expertly skilled model-helicopter pilot.
The data collected, which I imagine serves as the training set for this scenario, is the position, velocity, orientation, and angular rate of the helicopter, along with the controller inputs. He shows the trajectories of various iterations of the same helicopter air show, gathered so machine learning can be applied to them. Interestingly, he uses the Needleman-Wunsch sequence alignment algorithm across the iterations of the human pilot's trajectories in order to get a single reference trajectory that, I imagine, he feeds to the machine learning algorithm. (I discussed Needleman-Wunsch in one of my previous posts, but there it dealt with bioinformatics, not AI.) A bare-bones sketch of the algorithm is below.
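Here is a bare-bones Needleman-Wunsch global alignment in Python, using a simple match/mismatch/gap scoring scheme on strings. In the helicopter work the same idea is applied to trajectory data rather than letters; this sketch is just to show the mechanics.

def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    n, m = len(a), len(b)
    # score[i][j] = best score aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            score[i][j] = max(diag, score[i-1][j] + gap, score[i][j-1] + gap)

    # Trace back through the table to recover one optimal alignment
    ai, bi, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch):
            ai.append(a[i-1]); bi.append(b[j-1]); i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i-1][j] + gap:
            ai.append(a[i-1]); bi.append("-"); i -= 1
        else:
            ai.append("-"); bi.append(b[j-1]); j -= 1
    return score[n][m], "".join(reversed(ai)), "".join(reversed(bi))

print(needleman_wunsch("GATTACA", "GCATGCU"))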
He mentions that the primary purpose of machine learning in this context is to train a computer to do what is normally associated with the learning a human pilot undergoes (i.e., a human pilot spends many hours training and repeating the same maneuvers until they become "muscle memory," so that he knows exactly what some action on the controls will do before he does it).
Around 27:46 in the video, he begins lecturing on how machine learning can be used to train robots to perform surgery. The surgery, however, is not terribly crazy like a brain transplant; it's just a basic surgical knot tie. At the end of the video, he demonstrates how he applied machine learning to teach a quadrupedal robot to walk across a rough surface. Just check out the video.

The lecturer in the video is  
Pieter Abbeel, Department of Electrical Engineering and Computer Sciences, UC Berkeley




Data Mining Project - Fantasy Baseball Analytics










Project Team Members:
Jason Buckner
Sam Green
Chris Shaw
Justin Willette

Introducing Global Terrorism Data

The Global Terrorism Database is one of the data sources introduced in the first class. The database (GTD) is maintained by the University of Maryland. It has records from 1970 through 2011 and keeps adding recent data on terrorist activity. It is provided in CSV format, which is plain text; when imported into Excel, the data set looks like the screenshot below.

Each record consists of 98 attributes, including ID, time, location, target, weapon, and so on. It is basically a time-series data set. Taking advantage of this, the GTD site also provides a graphical interface for the data, called GTD Data Rivers, which generates a diagram that looks like a big flow of water.


The combo box at the top of the screen shows the list of attributes users can choose. The main part of the screen shows how terrorist activity for the chosen attribute changes over time. By adjusting the red bar at the bottom left, users can narrow the time range they want to see. The bottom-right part of the screen shows the amount of each element in the Data River above. A rough sketch of how to build a similar view yourself is below.
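If you want to roll your own version, here is a rough sketch of a home-made "data river" (really just a stacked area chart of incidents per year by attack type) using pandas and matplotlib. I'm assuming the GTD column names iyear and attacktype1_txt and a local file called gtd.csv; check the GTD codebook for the exact names.

import pandas as pd
import matplotlib.pyplot as plt

gtd = pd.read_csv("gtd.csv", low_memory=False)

counts = (
    gtd.groupby(["iyear", "attacktype1_txt"])
       .size()
       .unstack(fill_value=0)           # years as rows, one column per attack type
)
counts.plot.area(figsize=(12, 6))        # stacked area chart ~ a simple "river"
plt.xlabel("Year")
plt.ylabel("Number of incidents")
plt.show()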

Tuesday, April 16, 2013

Kaggle competition, Titanic: Machine Learning from Disaster

This video is about the Kaggle competition called Titanic: Machine Learning from Disaster.
In the video, we compared 4 different models for solving the problem (a rough code sketch of the decision-tree approach appears after the list). The models are:

  1. Regression with MINITAB
  2. Regression with RapidMiner
  3. Decision Tree with RapidMiner
  4. Artificial Neural Network with MATLAB
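Here is a minimal sketch of the decision-tree approach with scikit-learn, assuming Kaggle's train.csv is in the working directory. It shows the general shape of that kind of model, not the exact one from the video.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

train = pd.read_csv("train.csv")

# Minimal preprocessing: encode sex and fill missing ages with the median
train["Sex"] = train["Sex"].map({"male": 0, "female": 1})
train["Age"] = train["Age"].fillna(train["Age"].median())

features = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare"]
X_train, X_val, y_train, y_val = train_test_split(
    train[features], train["Survived"], test_size=0.2, random_state=0
)

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)
print("validation accuracy:", accuracy_score(y_val, tree.predict(X_val)))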


Team Members:
Shahab Derhami
MohammadNaser Ansari

Monday, April 15, 2013

Basic Tableau Tutorial

This is a pretty basic tutorial on how to load data and do basic functions and visualizations in Tableau.
Here is a link to the free trial of the software:
http://www.tableausoftware.com/trial-resource-center

Google Geo Chart Tutorial

I have been playing around with geo charts in Google Spreadsheets and thought I would post an easy, straightforward tutorial for anyone who has not tried it yet. Start by logging into your Google account and clicking Drive on the toolbar. Next, click Spreadsheet.
After this, paste your data into columns A and B. Column A must hold the locations and column B must hold the values. The locations can be countries or states. For my example I am showing the number of NFL draft prospects from each state.

After this, select columns A and B, then click Insert > Chart.


Once you click Chart, select Charts, then Map, and make sure the geo chart "regions" option is selected. Then click Insert.


After clicking Insert, you will see the chart box show up. Click on the arrow in the upper right-hand corner, then click Advanced Edit.
Almost done. Once you get to the edit area, change the region to match your list of locations, then choose your color scale for the min, mid, max, and no-data values.
 

Click Update and you are done! It's a very simple visualization, and it also offers a world map and maps for each continent. If you'd rather do the same thing in code, a sketch is below.
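For a programmatic alternative, here is a sketch of the same kind of map using Plotly Express in Python. The CSV and its column names ("state" with two-letter state codes, "prospects" with the counts) are hypothetical.

import pandas as pd
import plotly.express as px

df = pd.read_csv("draft_prospects_by_state.csv")

fig = px.choropleth(
    df,
    locations="state",            # e.g. "TX", "CA"
    locationmode="USA-states",
    color="prospects",
    scope="usa",
    color_continuous_scale="Greens",
)
fig.show()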

Predicting the Future with Data from the Past

Historians have long searched the past for answers about today's world, such as "Why do civilizations collapse?" While historians look for answers in language, mathematicians like Peter Turchin, a professor at the University of Connecticut, are now using math to gain further insight. Turchin is the driving force behind a field called "cliodynamics," where scientists and mathematicians analyze history in the hope of finding patterns they can then use to predict the future. And unless something changes, according to Turchin, the U.S. should expect a large amount of violence (terrorist activity, uprisings) around the year 2020. A summary of these "waves of violence" in the U.S. can be seen below. It is interesting to note that these spikes in violence occur in roughly 50-year cycles in the U.S., and that longer "secular cycles" occurred in all past agrarian states for which records were available (i.e., Ancient Rome, Medieval England, Dynastic China, Russia).



Although Turchin is not able to apply many big data techniques in his analysis due to the lack of large historical data sets, he notes that building models on historical data was not even possible until recently, when old documents started to become digital.

http://www.wired.com/wiredenterprise/2013/04/cliodynamics-peter-turchin/