Industrial engineering students at Auburn University blog about big data. War Eagle!!
Tuesday, April 23, 2013
Kaggle Project
Wilson and I completed the Kaggle Project on the Titanic and machine learning, similar to the presentation today in class. We uploaded a video to YouTube explaining our approach. We utilized both Excel and Python to obtain a model which predicts whether a passenger will survive or not based on machine learning principles learned in class. The link is posted below.
http://www.youtube.com/watch?v=WqousZZSLFs
OutWit Hub: Web-scraping made easy
I read a blog earlier this term on web-scraping and decided to check it out. I started with the suggested software and quickly realized that there are only a few really good web-scraping tools that are supported on Mac OS. So, after reading a few reviews, I landed on OutWit Hub.
OutWit Hub has 2 versions: Basic and Pro. The difference is in the available tools. In Basic, the "words" tool isn't available. This tool lets you see the frequency of any word as it occurs on the page you are currently viewing. Several of the scraping tools are unavailable in Basic as well. I've upgraded to Pro; it's only $60 per year and I was curious to see what else it can do.
I'm not a computer scientist, by a long shot, but I have a general grasp on coding and how computers operate. For this reason, I really like OutWit Hub. The tutorials on this site are incredible. They walk you through examples and you can interact with the UI while the tutorial is going. Also, a lot of the tools are pretty intuitive to use. If you're not sold on getting the Pro version, I'd encourage you to visit their website and download the free version just to check out the tutorials. They're really great.
I've used the site for several examples just to test. I needed to get all of the emails off of an organization's website, so instead of copy/pasting everything and praying for the best, I used the "email" feature on OutWit and all of the names and emails of every member on the page populated an exportable table. #boom
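For anyone curious what an email-harvesting feature like this might be doing, here is a minimal Python sketch. It is not OutWit Hub's actual implementation, just an illustration of the idea using a simplified regular expression. The function name `extract_emails` and the sample text are made up for this example.

```python
import re

# Simplified pattern (an assumption for illustration); real addresses can be
# more varied than this pattern allows.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(page_text):
    """Return the unique email addresses in page_text, in order of first appearance."""
    found = []
    for address in EMAIL_RE.findall(page_text):
        if address not in found:
            found.append(address)
    return found


# Hypothetical page text standing in for an organization's member page.
sample = (
    "Contact Jane Doe <jane.doe@example.org> or John at john@example.org. "
    "Jane again: jane.doe@example.org"
)
emails = extract_emails(sample)
print(emails)
```

The result is a plain list, which could then be written out to a CSV file much like OutWit's exportable table.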
Then, I wanted to see if it could be harnessed for Twitter and Facebook. So, using the source-code approach to scraping, I was able to extract text from the loaded parts of my Twitter and Facebook feeds. The problems I encountered were not knowing enough about the coding to make the scraper dynamic enough to work through unloaded pages, and not knowing how to automate it to build a larger dataset (i.e. continuously run the scraper over a set amount of time by reloading the page and harvesting the data). It's possible, I just didn't figure it out.
So, I've videoed a tutorial on how to use OutWit Hub Pro's scraper feature to scrape the loaded part of your Facebook news feed. Below are the written instructions and the video at the bottom gives you the visual.
Essentially, you will:
1.) Launch OutWit Hub (presuming you've downloaded and upgraded to Pro).
2.) Login to your profile on Facebook.
3.) Take note of whatever text you want to capture as a reference point when you go to look in the code. This is assuming you don't know how to read html. For example, if the first person on your news feed says: "Hey check out this video!", then take note of their statement "Hey check out this video!"
4.) Click the "scrapers" item on the left side of the screen.
5.) In the search window, type in the text "Hey check out this video" and observe the indicators in the code that mark the beginning and end of that text.
6.) In the window below the code, click the "New" button.
7.) Type in a name for the scraper.
8.) Click the checkbox in row 1 of the window.
9.) Enter a title/description for the information you're collecting in the first column. Using the same example: "Stuff friends say on FB" or "Text". It really only matters if you're going to be extracting other data from the same page and want to keep it separate.
10.) Type in the HTML code that you identified as the beginning of the data you want to extract under the "Marker Before" column.
11.) Repeat step 10 for the next column using the HTML code that you identified as the end of the data.
12.) Click "Execute".
13.) Your data is now available for export in several formats: CSV, Excel, SQL, HTML, TXT.
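The before-marker/after-marker idea in the steps above can also be sketched in a few lines of Python. This is only an illustration of the concept, not how OutWit Hub works internally. The function name `scrape_between` and the HTML snippet are invented for the example, and the snippet is not Facebook's real markup.

```python
import re


def scrape_between(source, marker_before, marker_after):
    """Return every piece of text found between the two markers."""
    # re.escape makes the markers match literally; (.*?) is non-greedy so
    # each match stops at the first closing marker it meets.
    pattern = re.escape(marker_before) + r"(.*?)" + re.escape(marker_after)
    return [text.strip() for text in re.findall(pattern, source, flags=re.DOTALL)]


# Made-up page source with two posts (hypothetical markup).
html = (
    '<div class="post"><span class="msg">Hey check out this video!</span></div>'
    '<div class="post"><span class="msg">War Eagle!</span></div>'
)
rows = scrape_between(html, '<span class="msg">', "</span>")
print(rows)
```

Just like in the scraper window, the whole job comes down to picking a before-marker and an after-marker that surround only the text you want.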
Here is a YouTube video example of me using it to extract and display comments made by my Facebook friends that appeared on my news feed.
Monday, April 22, 2013
Visualization - Wilson Lyle and Jessica Clemmensen
This is the link to our visualization project. Our project focused on obesity rates beginning in 2005 in all 50 states and tracked the growing obesity rates in the United States. We then mined this data to provide information on population and obesity growth rates for the same years, ending in 2011.
Image Comparison in One Mobile App
One interesting app is "Sleep if you can", which is available for both Android and iOS. I learned about it from a friend on a social network. The app's introduction page for Android is: https://play.google.com/store/apps/details?id=droom.sleepIfUCan&hl=en
Here is a brief description of how it works. It is an alarm app. Like other alarm apps, it sounds the alarm when the preset time is up. The interesting difference is how you turn the alarm off: you have to take a photo similar to one you took with your phone when you set the alarm. If the app thinks the pictures are the same, it turns off the alarm. This way the user has to get up and take a picture, which makes it more effective than regular alarms that can be turned off just by swiping the screen.
I am interested in how the app compares the two pictures. I found one method on the internet, which is to compare the pixels of the two images. This blog post (http://jeffkreeftmeijer.com/2011/comparing-images-and-creating-image-diffs/) describes the method.
Next, how do you compare two pixel colors? In my opinion, if the colors of two pixels are different, the pixels are probably different. The blog post mentions one method called "Delta E" (http://en.wikipedia.org/wiki/Color_difference#CIE76).
Colors are defined within a color space, and Delta E uses "Lab" (http://en.wikipedia.org/wiki/L*a*b*). This space has three components, shown in the picture from Wikipedia. L: black (-) <-> white (+), a: green (-) <-> magenta (+), b: blue (-) <-> yellow (+).
Then the difference between two colors is calculated with the formula from Wikipedia:
Delta E = sqrt((L1-L2)^2 + (a1-a2)^2 + (b1-b2)^2)
I think if the Delta E value is too high, the pixel counts as different, and the pictures are probably different.
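To make the formula concrete, here is a small Python sketch of the CIE76 Delta E calculation and a simple pixel-by-pixel comparison built on it. This is only my guess at one possible approach, not how the app actually works. The function names and both threshold values are assumptions chosen for illustration, and the pixels are assumed to already be converted to Lab.

```python
import math


def delta_e_cie76(lab1, lab2):
    """CIE76 color difference: Euclidean distance between two (L, a, b) colors."""
    return math.sqrt(sum((c1 - c2) ** 2 for c1, c2 in zip(lab1, lab2)))


def images_look_alike(pixels_a, pixels_b, pixel_threshold=10.0, max_changed_fraction=0.2):
    """Compare two equally sized lists of Lab pixels.

    A pixel counts as "different" when its Delta E exceeds pixel_threshold;
    the images count as alike when at most max_changed_fraction of the
    pixels are different. Both thresholds are illustrative assumptions.
    """
    if not pixels_a or len(pixels_a) != len(pixels_b):
        raise ValueError("images must be non-empty and the same size")
    changed = sum(
        1 for p, q in zip(pixels_a, pixels_b) if delta_e_cie76(p, q) > pixel_threshold
    )
    return changed / len(pixels_a) <= max_changed_fraction


# Two tiny made-up "images" of four Lab pixels each.
reference = [(50, 0, 0), (60, 10, 10), (70, -5, 20), (30, 0, 0)]
retake = [(52, 1, 0), (61, 9, 11), (69, -5, 19), (31, 0, 1)]
print(delta_e_cie76((50, 0, 0), (50, 3, 4)))
print(images_look_alike(reference, retake))
```

A real photo retake will never line up pixel for pixel, so an actual app would probably need something more tolerant than this, such as comparing shrunken or blurred versions of the images.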
I also searched the internet and found many posts about comparing two images. Do you guys have any ideas about how this app works?
Sunday, April 21, 2013
I don't always buy beer, but when I do I buy diapers
There aren't many data mining memes but the most prominent one involves the correlation between the sales of beer and diapers. These associations could be due to a number of reasons, including the following:
-A woman who has just given birth is too tired or too weak to carry a giant bag of diapers, so she sends her burly husband out to pick some up. While he is out, he realizes that he is off work and it is now time to put down a few beers, so he picks up his favorite brew.
-A poo-poo happens late at night. While the mother watches and takes care of the baby, the husband is RUSHED to the grocery store to pick up some diapers. Being annoyed, he also picks up a 12-pack to relax.
-Just the convenience factor. "Well, I don't have any beer at the house, so I guess I'll pick some up while I'm out."
When studying how these 2 sales relate, you have to be careful with the "correlation doesn't equal causation" phenomenon. Without a background in statistics and hypothesis testing, you can easily reach a misleading conclusion. This is a shame because there is much power in discovering how things relate. If diapers and beer turn out to be associated in a large enough percentage of purchases, businesses should react accordingly. Possibly putting the diaper and beer shelves closer to one another would show a dramatic change in sales (and show further association between the 2).
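To show what "associated to a large enough percentage" could mean in practice, here is a short Python sketch that computes the usual association rule measures (support, confidence, and lift) for a rule like diapers -> beer. The shopping baskets are made up for illustration, and `rule_metrics` is just a name chosen for this example.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    n = len(transactions)
    with_antecedent = sum(1 for t in transactions if antecedent in t)
    with_consequent = sum(1 for t in transactions if consequent in t)
    with_both = sum(1 for t in transactions if antecedent in t and consequent in t)
    if n == 0 or with_antecedent == 0 or with_consequent == 0:
        raise ValueError("both items must appear in at least one transaction")
    support = with_both / n                    # share of all baskets with both items
    confidence = with_both / with_antecedent   # share of antecedent baskets that also hold the consequent
    lift = confidence / (with_consequent / n)  # above 1 means bought together more than chance predicts
    return support, confidence, lift


# Made-up baskets, not real sales data.
baskets = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
]
support, confidence, lift = rule_metrics(baskets, "diapers", "beer")
print(support, confidence, lift)
```

Even a lift above 1 only says the two items show up together more often than chance would predict. It says nothing about why, which is exactly the correlation versus causation caution above.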
source: http://blog.patternbuilders.com/tag/retail-analytics/
How Obama is gonna get what he wants
President Obama has big plans for his next term and a big vision for efficient energy. In the State of the Union address he challenged every US citizen to do more with the energy they use and become more efficient. President Obama stated, “Let’s cut in half the energy wasted by our homes and businesses over the next 20 years.” It is estimated that a change this radical would create 1,300,000 jobs and reduce the yearly household electrical bill by $1000. Yes, construction is becoming more efficient with the use of LEED-certified silver, gold and platinum buildings, but we have hardly tapped into what we are capable of reaching. In order to reach massive savings in energy consumption, we have to optimize the operation of current urban life.
This is where big data analytics comes into play. To analyze how buildings operate, we first need a way to gather data on electrical usage. Until now, no technology has filled this role without costing a large chunk of money. One company, Seto, has come to the rescue. Seto designed a device called a WebMeter. This device is easy to install, has a low cost, and monitors the electric flow of up to 36 individual circuits in a building's circuit board. Readings from this device are recorded on a server chosen by the user and can be accessed at any point of operation. Now that this technology has been released, companies can dive into the applications of big data concerning energy consumption.
source: http://theenergycollective.com/tyhamilton/187006/big-data-key-unlocking-big-gains-energy-productivity
Visualization TedTalks: data changing views like never before
This is a TedTalks presentation given by David McCandless on various data visualizations that I found very interesting. He showed a visualization of “global media panic” over time and the level at which the media was reporting it. The killer wasp graphic was quite funny. He goes on to show that one particular media panic, violent video games, has a cyclic pattern in which there is a peak every November and April. He claims that the November peak may be due to a surge in video games coming out for Christmas, while April is the anniversary of the Columbine High School tragedy.
At 9:20 in the video, he shows a visualization representing an analogy for the “bandwidth of the senses” which, from my understanding, basically says “if your senses were computers, this would be their bit rates.” The visualization clearly shows that eyesight has the highest “bandwidth.” He goes on to make more analogies about the senses by saying that the throughput (bit rate) of the eyes is analogous to the throughput of a computer network, whereas the throughput of taste is only on par with that of a basic pocket calculator.
Another data set that he turns into an interesting visualization is a comparison of GDP to military budget. I always thought of the US as a military powerhouse, but if you compare its military spending to its GDP, it isn’t actually at the top. It has the 8th highest military budget to GDP ratio.
He wraps up the presentation by saying that many information problems come from having far too much data, and visualizations provide a quick solution because they let large amounts of data be understood very quickly.
Association Rules with RapidMiner
Here is a video I made on how to do Association rules among text documents in RapidMiner
Saturday, April 20, 2013
John Deere's Future in Big Data
With the world’s population rising and expected to increase by another two billion in the next thirty-five to forty years, food production must increase by 60% to sustain this massive number of people (UN Food and Agriculture Org.). This article on BigData Startup’s website highlights that John Deere has entered the realm of big data with the introduction of new products.
John Deere has recently put sensors on their newest pieces of equipment that help decrease equipment downtime and fuel usage. It is also now possible for up-to-date data on weather, soil conditions, and crop features to be communicated directly to farmers through the equipment. This allows users to know the best locations and times to harvest their crops in order to maximize the productivity and efficiency of their land.
While not using as many data points as corporations such as Wal-Mart or Amazon, they are working on increasing the amount of data they process, because more data generally equates to more uses and a more accurate analysis. In order to do this, John Deere is using the programming language R. Once R is used to forecast information such as demand and crop yield, the information is exported to channels such as FarmSight, FarmSight Mobile Farm Manager, and MyJohnDeere.com. The article states that FarmSight is used to increase productivity by improving three different areas: machine optimization, logistics, and decision support. With John Deere keeping up with the times and investing in the possibilities of big data, their future looks promising.
At the very end of the article there is an interesting video made by John Deere following a day in the life of a farmer with future John Deere technology.
Information from:
Friday, April 19, 2013
Rapid Miner Data Aggregation Tutorial
I have used the aggregation function in Rapid Miner many times while working on the fantasy football project David and I are doing. It is very useful for compiling weekly statistics of athletes. Here is a step-by-step tutorial on how to use it. First, open up Rapid Miner and begin a new process. Next, import your data. In my case I am importing an Excel sheet that contains weekly NFL QB stats from 2008-2011.
After selecting your mode of import choose the correct sheet or file to import.
Select your sheet and click Next.
The next page (step 3 of the import wizard) will ask if you want to make any annotations. This is not necessary for this data, so I will go ahead and click Next, which brings us to step 4 shown above. De-select any columns that you do not want to import. In this case I do not care to see what teams the QBs play for. You MUST set the role of the column you want to group by to ID. You can see that the first column, which contains the names, was changed from attribute to ID. After you do that, click Next and save your data.
Once you get into your main process, drag and drop your data onto the process area. Then look to the left-hand side. Click Data Transformation > Aggregation. Drag and drop the Aggregate widget onto the process area. Next, connect the out port of the data to the exa port on the left side of the Aggregate widget. Then connect the exa port on the right side of the widget to the result port.
After you connect the ports, select Edit List next to aggregation attributes. Here, make an entry for each attribute you want to aggregate and select the functions you want to use. After you do this, click Ok.
Next, click Select Attributes next to group by attributes. Here, move your ID column (in this case Name) to the right side by selecting your ID and clicking on the arrow pointing right. Click Ok.
Now just click the play button on the toolbar and you get your results! From here you can export your data as you like or enter plot view or advanced charts. Hope this tutorial was helpful.
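For readers who would rather see the same idea in code, here is a rough Python sketch of what a group-by aggregation does: group the rows by the ID column, then apply a function to each attribute. It is not RapidMiner's implementation. The column names, player names, and numbers are made up rather than taken from my actual spreadsheet.

```python
from collections import defaultdict


def aggregate(rows, group_by, attribute):
    """Group rows by one column and compute sum, average and count of another."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_by]].append(row[attribute])
    return {
        key: {"sum": sum(values), "average": sum(values) / len(values), "count": len(values)}
        for key, values in groups.items()
    }


# Hypothetical weekly QB stats (made-up names and numbers).
weekly_stats = [
    {"Name": "QB A", "Week": 1, "PassYds": 300},
    {"Name": "QB A", "Week": 2, "PassYds": 250},
    {"Name": "QB B", "Week": 1, "PassYds": 200},
]
totals = aggregate(weekly_stats, group_by="Name", attribute="PassYds")
print(totals)
```

Here `group_by` plays the part of the group by attributes setting, and the sum, average, and count play the part of the aggregation functions picked in the edit list.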
How Data Crunchers Helped Obama Win Election
The 2008 presidential campaign run by Obama's campaign team marked the first time that campaign strategies were based on quantitative data instead of hunches and subjectivity. With the success found in 2008, Obama's campaign team decided to invest even more heavily in data analytics for the 2012 election.
Getting Money
Along with all the success found in 2008 using big data analytics, a major weakness also became apparent: too many datasets. So one of the first tasks for Obama's campaign team going into the 2012 election was merging all these data sets into one huge data set, as well as expanding the data analytics team to 5 times its size. Going into the 2012 campaign, the goal was set to raise $1 billion, a seemingly huge amount of money. They accomplished this, however, by strategically targeting possible donors through emails and attractive fundraising events. An example of this strategy came "In late spring, the backroom number crunchers who powered Barack Obama’s campaign to victory noticed that George Clooney had an almost gravitational tug on West Coast females ages 40 to 49. The women were far and away the single demographic group most likely to hand over cash, for a chance to dine in Hollywood with Clooney — and Obama." A very interesting approach, to say the least. The same was done on the East Coast, where Sarah Jessica Parker was the celebrity chosen for a fundraising event.
Swing Voters
In order to gain insight into the behavior of swing voters, the campaign team obtained huge amounts of polling data in swing states. This proved to be a huge advantage, as they were able to allocate resources more efficiently. They also dipped into social networks such as Reddit and Facebook because they found these to be very successful in swaying swing voters. Another unique marketing strategy that came from their big data analytics was the decision to target ads during specific TV shows instead of the past method of airing ads between news programs.
With the huge success Obama's campaign team experienced in 2012 using data mining and big data analytics, a new way of quantitatively developing campaign strategies was born, and it should prove to be the norm going into the future.
http://swampland.time.com/2012/11/07/inside-the-secret-world-of-quants-and-data-crunchers-who-helped-obama-win/2/
NFL Draft Data Visualization
After working on the fantasy football project David and I are doing, I did some research on the NFL draft prospects. The main website we looked at for draft prospects and their ratings is scout.com. This is the same rating site that ESPN uses. They have a list of every notable player entering the draft along with a rating of 1-5 stars, college attended, and hometown. Earlier in the semester, Fadel showed us a visualization from ESPN with a heat map of where high school football recruits come from and how it has changed over the years. < http://espn.go.com/blog/playbook/visuals/post/_/id/10989/graphic-football-recruiting-then-now >. This visualization inspired me to create a heat map of where NFL draft prospects hail from, but I also want to expand on that. First, this is a map simply showing where the NFL draft hopefuls are from. The darker states are the states that have more prospects.
As you can see, this is very similar to the visualization that ESPN provided. Makes complete sense. California, Texas, Florida, Georgia, and Ohio rank as the highest states. These states are also extremely populous. I decided to make a chart showing the draft prospects per capita. This may give a better indication of which state provides the most "bang for your buck" talent-wise.
Looks like the hotspots have changed. Louisiana, South Carolina, Georgia, Kansas, and surprisingly Hawaii look to be the highest per capita states. Do these states just naturally have a lot of talent? Possibly. It may also be a result of players from these states going to in-state schools that have great NFL player development. LSU recruits a lot of in-state players and is also great at developing next-level talent. The same goes for Clemson and South Carolina. Also, not surprisingly, the upper Midwest and the Northeast do not provide any NFL talent.
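The per capita adjustment itself is simple arithmetic: divide each state's prospect count by its population. Here is a minimal Python sketch of that step. The state names and numbers are placeholders made up for illustration, not the scout.com counts or real census figures, and `prospects_per_million` is just a name chosen for this example.

```python
def prospects_per_million(prospect_counts, populations):
    """Return draft prospects per one million residents for each state."""
    return {
        state: prospect_counts[state] * 1_000_000 / populations[state]
        for state in prospect_counts
    }


# Placeholder data: a big state with many prospects and a small state with few.
prospect_counts = {"Big State": 40, "Small State": 10}
populations = {"Big State": 20_000_000, "Small State": 2_000_000}

rates = prospects_per_million(prospect_counts, populations)
ranking = sorted(rates, key=rates.get, reverse=True)
print(rates)
print(ranking)
```

In this toy example the small state comes out ahead once population is taken into account, which is the same kind of flip the second map shows.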
Google's Data Navy
Based on a conversation I had with a friend and a comment I made on one of my classmates' blog posts, I wanted to expand on this concept and its implications (plus a couple of conspiracy theories).
Google's Data Navy *
Like any country, I guess Google is planning to arm itself with its own navy of "data center boats" and go to war in the information/data arena. Back in 2008, Google filed for a patent that described a "water-based data center." These new "water world" data centers will give Google the ability to get computer centers closer to some "hard/costly to reach" customers. In addition, the patent presents the possibility of these ships generating their own electricity from ocean water and being completely autonomous. I am sure that environmental groups will give them a lot of support and free advertising. However, the question is: what is Google's real intent with such an ambitious development?
Microsoft seems to be working on a very similar idea; they are in the middle of creating a massive "mother" data center made of several data centers inside standard shipping containers (metal containers) to maximize the use of space. Microsoft's containers have the advantage of being "modules that can be moved around to get the most computing power possible per square foot."* Other major server manufacturers like HP, IBM, Dell, and Sun Microsystems "have created their own data centers in shipping containers that they sell to service providers, the military and research labs."* Google had considered the idea of using containers in the past; the company filed a patent on the containers and even built a prototype system in the garage of their Mountain View, California headquarters.
Google's arguments for the new technology are the following:
1) "Bring data closer to customers"*
2) "Floating data centers could aid the military or help out during a large event"*
What about purposes 3, 4, 5, etc? Do we really believe these are the only two purposes?
At the end of the day, all this "power" (I personally see information as power) is extremely dangerous in the hands of a few. It really scares me to see companies like Google and Walmart whose profits, if the companies were considered countries, could put them among the top 30 - 50 GDPs in the world. Since Google's owners already have so much money (each of them has a "customized" Boeing 727), would the company continue to live in gray areas and start considering unethical and aggressive uses of data? In "lawless" international waters, could this become a paradise for the corrupt?
The following articles/blogs are concerning; they already describe several unethical behaviors from Google. What's next? Manipulating countries into wars? It has been said that the third world war will be "ethnic based" - would the fourth world war be "information based"? This is very scary stuff.
Google Faces Antitrust Complaint In India For Possible Unethical Practices
http://www.ibtimes.com/google-faces-antitrust-complaint-india-possible-unethical-practices-702224
Google PR Nightmare: Search Giant Apologizes for Evildoing
http://www.searchenginejournal.com/google-misuses-mocality-database/38789/
Google Still Dealing with Unethical Behavior Allegations in Italy and Beyond
http://siliconangle.com/blog/2011/01/18/google-still-dealing-with-unethical-behavior-allegations-in-italy-and-beyond/
Search & Destroy - Why you can't trust Google!
http://www.searchanddestroybook.com/book.php
Main Reference/Source
* This post is based on New York Times blog post called "Google's Search Goes Out to Sea"
Google's Data Navy *
Like any country, I guess Google is planning to arm itself with its own navy of "data center boats" and go to war in the information/data arena. Back in 2008, Google filed for a patent that described a "water-based data center." These "water world" data centers would let Google place computing capacity closer to customers who are hard or costly to reach. In addition, the patent raises the possibility of these ships generating their own electricity from ocean water and being completely autonomous. I am sure that environmental groups will give them a lot of support and free advertisement. However, the question is: what is Google's real intent with such an ambitious development?
Microsoft seems to be working on a very similar idea: it is building a massive "mother" data center made up of several smaller data centers housed inside standard metal shipping containers to maximize the use of space. Microsoft's containers have the advantage of being "modules that can be moved around to get the most computing power possible per square foot."* Other major server manufacturers like HP, IBM, Dell, and Sun Microsystems "have created their own data centers in shipping containers that they sell to service providers, the military and research labs."* Google had considered the idea of using containers in the past; the company filed a patent on the containers and even built a prototype system in the garage of its Mountain View, California headquarters.
Google's arguments for the new technology are the following:
1) "Bring data closer to customers"*
2) "Floating data centers could aid the military or help out during a large event"*
What about purposes 3, 4, 5, etc.? Do we really believe these are the only two purposes?
At the end of the day, all this "power" (I personally see information as power) is extremely dangerous in the hands of a few. It really scares me to see companies like Google and Walmart whose profits, if the companies were considered countries, could place them among the top 30 - 50 GDPs in the world. Since Google's owners already have so much money (each of them has a "customized" Boeing 727), would the company continue to live in gray areas and start considering unethical and aggressive uses of data? In "lawless" international waters, could this become a paradise for the corrupt?
The following articles/blogs are concerning; they already describe several allegations of unethical behavior by Google. What's next? Manipulating countries into wars? It has been said that the third world war will be "ethnic based" - would the fourth world war be "information based"? This is very scary stuff.
Google Faces Antitrust Complaint In India For Possible Unethical Practices
http://www.ibtimes.com/google-faces-antitrust-complaint-india-possible-unethical-practices-702224
Google PR Nightmare: Search Giant Apologizes for Evildoing
http://www.searchenginejournal.com/google-misuses-mocality-database/38789/
Google Still Dealing with Unethical Behavior Allegations in Italy and Beyond
http://siliconangle.com/blog/2011/01/18/google-still-dealing-with-unethical-behavior-allegations-in-italy-and-beyond/
Search & Destroy - Why you can't trust Google!
http://www.searchanddestroybook.com/book.php
Main Reference/Source
* This post is based on a New York Times blog post called "Google's Search Goes Out to Sea"
http://bits.blogs.nytimes.com/2008/09/07/googles-search-goes-out-to-sea/
Thursday, April 18, 2013
DM Project - Super Bowl Teams
The following video showcases the analysis of the distance education group consisting of Jay Long, Julian Olander, and Wesley McDonald.
Machine Learning: the Bridge to AI
Given that machine learning essentially deals with developing algorithms and systems which "learn" from data, it is not surprising that machine learning has a great many applications in the development of artificial intelligence. As a science fiction fan, I see it as quite possibly a bridge to autonomous, bipedal, humanoid robots, which hopefully won't end up like the Terminator.
Here is a lecture from the Department of Electrical Engineering and Computer Sciences at UC Berkeley.
At the beginning of the lecture, he explains an experiment in which he attempts to train a remote-controlled model helicopter (not the real, big kind) to perform some rather intricate maneuvers which can normally only be done by an expertly skilled model helicopter pilot.
The data collected, which I imagine serves as the training set for this scenario, consists of the position, velocity, orientation, and angular rate of the helicopter. They also collected data on the controller inputs. He shows the trajectories from several repetitions of the same helicopter air show, which supply further data for the machine learning. Interestingly, he uses the Needleman-Wunsch sequence alignment algorithm across the repetitions of the human-flown trajectories to obtain a learning trajectory, which I imagine he then feeds to the machine learning algorithm (I discussed Needleman-Wunsch in one of my previous posts, but that dealt with bioinformatics, not AI).
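For readers who have not seen Needleman-Wunsch before, here is a minimal sketch of the textbook global alignment algorithm in Python. It is not the lecturer's code, and the scoring values (match +1, mismatch -1, gap -1) and the function name are my own assumptions for illustration. It works on any two sequences whose items can be compared for equality, so a real trajectory would first have to be turned into comparable steps.

```python
GAP = None  # marker inserted where one sequence has no counterpart


def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align sequences a and b.

    Returns (score, aligned_a, aligned_b); the aligned lists have equal
    length and use GAP where an item was skipped. Scoring values are
    illustrative assumptions, not taken from the lecture.
    """
    n, m = len(a), len(b)

    def pair_score(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(i, j),  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,                   # gap in b
                score[i][j - 1] + gap,                   # gap in a
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair_score(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append(GAP)
            i -= 1
        else:
            out_a.append(GAP)
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], out_a[::-1], out_b[::-1]


total, top, bottom = needleman_wunsch("ABC", "AC")
print(total, top, bottom)
```

Aligning "ABC" with "AC" leaves a gap opposite the "B", which is the same idea as lining up two flights where one pilot took an extra step in the middle of a maneuver.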
He mentions that the primary purpose of machine learning in this context is to train a computer in the same way a human pilot would learn (i.e., a human pilot spends many hours training and repeating the same maneuvers until they become "muscle memory," and he may know exactly what some action on the controls will do before he performs it).
Around 27:46 in the video, he begins lecturing on how machine learning can be used to train robots to perform surgery. The surgery, however, is nothing as crazy as a brain transplant. It's just a basic surgical knot tie. At the end of the video, he demonstrates how he applied machine learning to teach a quadrupedal robot to walk across a rough surface. Just check out the video.
The lecturer in the video is
Pieter Abbeel, Department of Electrical Engineering and Computer Sciences, UC Berkeley
Data Mining Project - Fantasy Baseball Analytics
Project Team Members:
Jason Buckner
Sam Green
Chris Shaw
Justin Willette
Introducing Global Terrorism Data
The Global Terrorism Database (GTD) is one of the data sources introduced in the first class. It is maintained by the University of Maryland, has records from 1970 through 2011, and keeps adding recent data on terrorist activity. It is provided in CSV format, which is plain text, so it can be imported into Excel and viewed as an ordinary spreadsheet.
Each record consists of 98 attributes including ID, time, location, target, weapon, and so on. It is basically a time-series data set. With this property in mind, the GTD site also provides a graphical interface to the data, called GTD Data Rivers, which generates a diagram that looks like a big flow of water.
The combo box at the top of the screen lists the attributes users can choose from. The main part of the screen shows how terrorist activity changes over time for the chosen attribute. By adjusting the red bar at the bottom left, users can narrow the time range they want to focus on. The bottom right part of the screen shows the amount of each element in the Data Rivers diagram above.
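Since the GTD comes as plain-text CSV, it can also be explored with a few lines of Python instead of Excel. The sketch below is only an illustration: the column names (`year`, `country`, `weapon`) and the three sample rows are made up so the example runs on its own, and the real header names should be checked against the downloaded file.

```python
import csv
import io
from collections import Counter


def list_attributes(csv_file):
    """Return the column names from the header row of an open CSV file."""
    return csv.DictReader(csv_file).fieldnames


def count_by_field(csv_file, field):
    """Count how many records share each distinct value of `field`."""
    reader = csv.DictReader(csv_file)
    return Counter(row[field] for row in reader)


# Made-up stand-in for the real file; column names are placeholders.
SAMPLE = """id,year,country,weapon
1,1970,Country A,Explosives
2,1970,Country B,Firearms
3,1971,Country A,Explosives
"""

attributes = list_attributes(io.StringIO(SAMPLE))
per_year = count_by_field(io.StringIO(SAMPLE), "year")

print(len(attributes), "attributes:", attributes)
for year, n in sorted(per_year.items()):
    print(year, n)
```

With the real download, the `io.StringIO(SAMPLE)` argument would be replaced by a file opened with `open(path, newline="")`, where `path` points to wherever the CSV was saved. Counting records per year is a rough, text-only version of what Data Rivers draws.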
Reference: http://www.start.umd.edu/gtd/
Tuesday, April 16, 2013
Kaggle competition, Titanic: Machine Learning from Disaster
This video is about the Kaggle competition called Titanic: Machine Learning from Disaster.
In this video, we compared 4 different models to solve the problem. The models are:
- Regression with MINITAB
- Regression with RapidMiner
- Decision Tree with RapidMiner
- Artificial Neural Network with MATLAB
Team Members:
Shahab Derhami
MohammadNaser Ansari
Monday, April 15, 2013
Basic Tableau Tutorial
This is a pretty basic tutorial on how to load data into Tableau and use its basic functions and visualizations.
Here is a link to the free trial of the software:
http://www.tableausoftware.com/trial-resource-center
Google Geo Chart Tutorial
I have been playing around with Geo Charts using Google Spreadsheets and thought I would post an easy, straightforward tutorial for anyone who has not tried it out yet. Start by logging into your Google account and clicking Drive on the toolbar. Next, click Spreadsheet.
After this, paste your data into columns A and B. Column A must hold the locations and column B must hold the values. These locations can be countries or states. For my example I am showing the number of NFL draft prospects for each state.
Next, select columns A and B, then click Insert > Chart.
Once the chart dialog opens, select Charts, then Map. Also make sure the Geo Chart regions choice is selected. Then click Insert.
After clicking Insert you will see the chart box show up. Click on the arrow in the upper right-hand corner, then click on Advanced Edit.
Almost done. Once you get to this edit area, change the region to match your list of locations, then choose your color scale for min, mid, max, and no values.
Click Update and you are done! It is a very simple visualization. Geo Charts also has a global map and maps for each continent.
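If your raw data is a long list with one row per item (for example, one row per draft prospect) rather than one total per state, you first need to tally it into the two-column layout described above. Here is a small Python sketch of that step. The function names and the prospect list are made up for illustration. It prints tab-separated lines, which spreadsheets generally split into columns A and B when pasted.

```python
from collections import Counter


def to_geochart_rows(locations):
    """Tally a list of location names into sorted (location, count) pairs."""
    return sorted(Counter(locations).items())


def to_tsv(rows):
    """Format (location, value) pairs as tab-separated lines for pasting."""
    return "\n".join(f"{location}\t{value}" for location, value in rows)


# Made-up example: the home state of each prospect, one entry per player.
prospects = ["Alabama", "Texas", "Alabama", "Florida", "Texas", "Alabama"]

rows = to_geochart_rows(prospects)
print(to_tsv(rows))
```

Each output line holds a location and its value, matching the column A / column B layout the map chart expects.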