Wednesday, April 3, 2013

Web Crawling with RapidMiner



For this blog post I am going to show you how to use RapidMiner to crawl a website for you.  First, when you open up RapidMiner, make sure you have the Web Mining extension installed.  If not, click on the Help menu at the top of the screen and click on "Update RapidMiner."
Then select and download the Web Mining Extension.

Once you have the Web Mining Extension downloaded, open the Web Mining folder under the Operators section, then select Crawl Web and drag it onto the Process section.



Once you have done this, you have to choose a website to crawl. How about we crawl this very blog?  So, we copy and paste the URL into the url box on the right side of the screen, under the Parameters tab.


Then, you have to select an output directory for RapidMiner to save your files to. I've just chosen a folder on my desktop called "tmp". Then you want to select a file extension; I've chosen .txt, so RapidMiner will save the pages it crawls as text files.  The max depth is how many consecutive links the crawl will follow; I've chosen the default of 2.  The domain parameter determines whether the crawl stays on the same server or is allowed to crawl the entire web; I've left it at the default of web.  I also set max threads, the number of parallel threads the crawler will use; I set it to 4 in order to speed up the crawl.  Finally, I changed the user agent to that of my browser. To do this, just go to http://whatsmyuseragent.com/ and copy and paste your user agent into the box.
 
Now we need to set up some crawling rules. So, click on the button next to crawling rules that says "Edit lists."
Then a dialog box will open.
As you see, you can add crawling rules.  The first rule establishes which links the crawl should follow.  I have set it to .+auburnblogspot.+, which allows the crawler to follow any link with auburnblogspot in the URL; the .+ means any number of characters can appear before and after auburnblogspot.  The second rule saves only the pages that have auburnblogspot in the URL.

Ok, everything should be set.  Hit the play button, and RapidMiner will crawl the web and save the webpages to the directory you specified.  Then you can perform your analysis on the saved files.
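If you would rather script the same idea yourself, below is a minimal Python sketch of the crawl configured above, using only the standard library. The start URL and output folder are placeholder assumptions, the regex rules mirror the ones from the crawling rules dialog, and this is an illustration of the technique rather than RapidMiner's actual implementation.

import os
import re
import urllib.request
from html.parser import HTMLParser

# Settings mirroring the RapidMiner operator (values here are assumptions).
START_URL = "http://example.blogspot.com/"        # placeholder: the blog to crawl
OUTPUT_DIR = "tmp"                                # output directory
MAX_DEPTH = 2                                     # follow links two levels deep
FOLLOW_RULE = re.compile(r".+auburnblogspot.+")   # which links to follow
STORE_RULE = re.compile(r".+auburnblogspot.+")    # which pages to save
USER_AGENT = "Mozilla/5.0"                        # paste your own user agent here

class LinkParser(HTMLParser):
    # Collects href attributes from anchor tags.
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def fetch(url):
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")

def crawl(start_url):
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    seen, frontier, count = {start_url}, [(start_url, 0)], 0
    while frontier:
        url, depth = frontier.pop(0)   # breadth-first crawl
        try:
            page = fetch(url)
        except Exception:
            continue                   # skip pages that fail to load
        if STORE_RULE.match(url):      # store rule: save matching pages as .txt
            count += 1
            path = os.path.join(OUTPUT_DIR, "page_%d.txt" % count)
            with open(path, "w", encoding="utf-8") as f:
                f.write(page)
        if depth < MAX_DEPTH:          # follow rule: queue matching links
            parser = LinkParser()
            parser.feed(page)
            for link in parser.links:
                if link.startswith("http") and FOLLOW_RULE.match(link) and link not in seen:
                    seen.add(link)
                    frontier.append((link, depth + 1))

crawl(START_URL)

Each stored page lands in the output folder as a numbered .txt file, just like the operator's output directory setting.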

Your Data


An aspect that hasn’t been touched on in regards to Big Data is the handling of consumers’ data exhaust, the trail of data we leave behind. The average consumer has no idea how companies are acquiring and distributing their data. There are ‘data brokers’ that have acquired all of our data, yet refuse to mention how it was obtained. Coupled with perpetually changing privacy policies, it is almost as though these companies are deliberately obscuring how our data is handled.

California’s Right to Know Act (AB 1291) is attempting to address this problem.  It would require a company, upon request and free of charge, to give users access to the personal data it has stored on them and to disclose which companies, online or offline, that data has been sold to. Currently, California law dictates that customers have the right to acquire a list of the companies that hold their personal data for marketing purposes (junk mail, spam, etc.).

The new act, coupled with the preexisting law, will bring California’s transparency law into the digital age.  Users will be able to track how their information is being trafficked (online ads, data brokers, third-party apps) and follow the flow of their data from online interactions.

“It’s not just about knowing what a company is sharing, it’s about knowing what a company is storing.”

Let it be known that these new laws will not prohibit the selling and transferring of data between companies or provide additional security measures for storage and anonymization. The act is merely about transparency and access for the user.

The new law has three safeguards to make compliance easier for smaller startups:

  1. They may choose not to store unnecessary data.
  2. They may provide automated user notification when data is disclosed, if it is too cumbersome for the company to respond to each individual request.
  3. They only have to provide the user one accounting per 12-month period.




Though this is a very important step for our digital age, I can’t help but compare this to YouTube’s situation, where the site eventually had to pay its users for their content. It might be a hard metaphor to follow, but we, the people, are providing these companies with fine-tuned profit opportunities. Going back to YouTube: users eventually started getting paid for their content, as they were the reason for the site’s increased traffic and ad revenue. See where I’m going?

I suspect that one day, if it isn’t already, a popular question will be, “Where’s my check?”

Article: https://www.eff.org/deeplinks/2013/04/new-california-right-know-act-would-let-consumers-find-out-who-has-their-personal

Big Data - Travel Industry

Big Data for the Travel Industry

Big Data is applied in several ways, and one application that is rarely discussed is the range of possibilities it offers the travel industry. It can effectively improve the customer experience and lead to better sales. Big Data refers to datasets that are beyond the capabilities of typical database tools, and analytics refers to the technologies that extract meaning from that large, voluminous data.
A lot of data is encountered in travel. A prime example is the analytics logs of an online travel agency. For years, analytics tools have enabled companies to keep track of detailed demographic statistics and other pertinent information, such as which pages convert the best, which have the highest bounce rates, and so on. With the advent of cloud storage and web services, the proliferation of cheap storage, along with distributed file systems that spread storage across dozens of commodity computers, enables the efficient storage of petabytes of data without massive cost. This gives travel agencies the capacity to handle more data, and a larger set of data points to tell them which areas to focus on, what kinds of products to advertise, and which audiences they are specifically targeting.
The state of big data in travel serves as a technological primer and will improve services across the travel ecosphere. The ability of big data technology to find intelligence in vast amounts of data presents a clear, massive opportunity to reshape the way consumers are marketed and sold to in travel.
Some of the cool applications the travel industry is heading toward are discussed below. Big data applications are moving from profiling to true personalization. For example, true personalization would enable a site to recommend a specific hotel to a specific traveler based on that traveler's specific wants, needs, and previous purchase patterns, rather than a generic set of recommendations based on the type of traveler.
Geo-fencing, the process of knowing when a traveler is near a certain attraction or vendor, is starting to emerge.

An example of this is the recently launched Foursquare Radar feature, which alerts you when you are near a place you at one time wanted to be reminded of. This technology is pure big data: gathering your coordinates in real time via your mobile phone’s GPS and recognizing when you are inside a certain boundary.
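To make the boundary check concrete, here is a minimal Python sketch of the core geo-fencing test. The coordinates and radius are made-up values, and a real service like Foursquare Radar would of course run this server-side against your saved places:

import math

def haversine_m(lat1, lon1, lat2, lon2):
    # Great-circle distance between two points, in meters.
    r = 6371000  # Earth's radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def in_geofence(user, place, radius_m):
    # True when the user's coordinates fall inside the place's circular fence.
    return haversine_m(user[0], user[1], place[0], place[1]) <= radius_m

# Hypothetical example: is the traveler within 200 meters of a saved attraction?
traveler = (32.6099, -85.4808)    # current GPS fix (made-up coordinates)
attraction = (32.6095, -85.4812)  # saved place (made-up coordinates)
if in_geofence(traveler, attraction, 200):
    print("Send reminder: you are near a place you saved!")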
Another application I have heard of is the ‘meet and seat’ service being implemented by certain airlines. My dad, who has worked extensively in the travel industry, has told me of several instances where this initiative was discussed at travel technology conferences a few years ago. Travel companies have acknowledged that data is extremely valuable to their success and that they can take this data, use it to build innovative ideas and applications, and increase their performance as a whole. People usually spend a lot of time on long-haul flights, and this time can be made productive by letting people choose to sit next to someone with whom they share a lot in common, or with whom they can connect on a professional level and treat the flight as a business opportunity. This can be implemented by incorporating LinkedIn or other social networking profiles into the online ticketing portal, so that users have the opportunity to choose their travel companion or simply opt out of the service. This is only a brief explanation of what this kind of application can offer, and there are many extensions to it.

As mentioned, Big Data can provide infinite possibilities to the travel industry, which is always encountering challenges and looking for innovative ideas. This topic has tremendous potential for discussion; this post is just a brief glimpse of the numerous opportunities.



References: Alex Kremer, Tnooz - Talking Travel Technology








Big Data requires better security


In almost all businesses that deal with large volumes of data, IT departments are starting to deal with issues around big-data deployments. However, one issue that is starting to concern IT more and more is security, especially as big-data analysis usually requires access to thousands of pieces of personal information, including social security and credit card numbers.
IT already knows that these massive datasets can cause problems. In fact, 80 percent of Apache Hadoop users want to know if there is sensitive data stored in their environment, while 77 percent know it's important to protect sensitive data within a big-data deployment and control who has access to that data.
These and other findings come from a new survey released earlier this month by Dataguise, a company that makes security intelligence and other data protection tools. The report involved more than 60 different enterprise users who attended either the recent RSA conference or the O'Reilly Strata Conference.
The big-data security report specifically focused on Hadoop users and not other types of big-data environments such as Riak.
IT departments that use Hadoop should be careful about storing data from several different sources as part of their big-data analysis, since this can lead to a number of unforeseen problems and security issues.
Since there are no easy answers, it's at least best to stay aware of what your company is collecting.
The challenge for IT is keeping track of how other departments, such as marketing and sales, are using big data and what information and datasets they want analyzed as part of their projects. According to the report, 33 percent of businesses store sensitive data within their Hadoop environment, including Social Security and credit card information.
What other types of data are within these Hadoop environments? About 55 percent of participants reported that their company is storing log files, while 36 percent store some type of structured database management system (DBMS) data, and another 24 percent have mixed data types.
The Dataguise report does offer some practical, if rather simple, advice for those IT departments dealing with big-data deployments and who want to ensure that the privacy of the data they are using is protected and within compliance. These include:
  • Make sure that IT managers have the ability to locate and identify sensitive data across different big-data clusters, so that they can inform management of any potential risk.
  • Make sure that security tools, including data masking and data quarantine, remain a priority (a small masking sketch follows this list).
  • Finally, make sure the big-data environment can be centrally managed, with scheduled detection and protection features deployed throughout the clusters to ensure the environment meets compliance rules and regulations.
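To make "locating and masking sensitive data" concrete, here is a minimal Python sketch that scans text records for values shaped like Social Security or credit card numbers and masks them. The patterns and records are illustrative assumptions, not part of any Dataguise tool:

import re

# Illustrative patterns for two kinds of sensitive data.
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")    # e.g. 123-45-6789
CARD = re.compile(r"\b(?:\d[ -]?){13,16}\b")  # 13- to 16-digit card numbers

def mask(text):
    # Replace sensitive values, keeping only the last four digits.
    text = SSN.sub(lambda m: "***-**-" + m.group()[-4:], text)
    text = CARD.sub(lambda m: "**** " + re.sub(r"\D", "", m.group())[-4:], text)
    return text

# Made-up records standing in for lines pulled from a Hadoop cluster.
records = [
    "customer=jdoe ssn=123-45-6789 purchase=49.99",
    "card=4111 1111 1111 1111 status=approved",
]
for rec in records:
    flagged = SSN.search(rec) or CARD.search(rec)
    print(("SENSITIVE  " if flagged else "clean      ") + mask(rec))

A real deployment would run checks like these as scheduled jobs across the clusters, which is exactly the kind of centrally managed detection the report recommends.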
Source:  http://www.enterpriseconversation.com/author.asp?section_id=2669&doc_id=260974

The Hidden Biases in Big Data

     With big data hype reaching newfound heights, Kate Crawford, writing in the Harvard Business Review, examines potential faults in the analysis of large data sets.  According to her, "The hype becomes problematic when it leads to what I call "data fundamentalism," the notion that correlation always indicates causation, and that massive data sets and predictive analytics always reflect objective truth."

     Hidden biases in the collection and analysis of data present risks that must be accounted for in the big picture of big data.  For example, consider the Twitter data generated during Hurricane Sandy: 20 million tweets from October 27 to November 1.  Examining the data showed that the majority of tweets came from Manhattan, which could lead one to believe Manhattan was affected the most.  As power outages spread and batteries lost charge, even fewer tweets came from the harder-hit areas, skewing the data even more.  Kate Crawford calls situations like this a "signal problem" within big data.

     Another example of this "signal problem" comes from Boston, where the Street Bump smartphone app was developed to get citizens involved in reporting potholes spotted on city streets.  The "signal problem" shows up because smartphones are less common in lower-income areas and among the elderly.  For Boston, this means the city is missing input from a significant part of its population.

     So what should big data scientists do to avoid these hidden biases? In the short term, Kate Crawford suggests they should "take a page from social scientists, who have a long history of asking where the data they're working with comes from, what methods were used to gather and analyze it, and what cognitive biases they might bring to its interpretation." In essence, big data scientists must first ask the question "why?" and not just "how many?"  Only then will the depths of big data be revealed.
     
http://blogs.hbr.org/cs/2013/04/the_hidden_biases_in_big_data.html

Easy Way to Collect Twitter Posts

Let’s say that you are working on a project and want to gather the latest Twitter posts containing a certain word or set of words. There is a simple web URL that you can use that will return this information.
Let’s say that I want to see all the latest posts that contain the word “auburn.” I can paste the following URL into any web browser and Twitter will send me the last ten tweets containing the word “auburn.”

http://search.twitter.com/search.json?q=auburn&result_type=recent

What it returns is posted below. Stripping this return of its markup can be done using RapidMiner or other software packages.



The last part of the URL (“&result_type=recent”) requests the most recent posts. If you remove it you will get a mix of the most recent posts and some of the most popular tweets.

By default this will only give you the last 10 posts that Twitter finds containing “auburn.” If you want to increase the number of posts that are loaded, you can add to the end of the URL. Let’s say I want it to load the most recent 100 posts; I would just add “&rpp=100”, where rpp stands for results per page and 100 is the number we wish to see.

http://search.twitter.com/search.json?q=auburn&result_type=recent&rpp=100
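If you want to pull these results into a script instead of a browser, here is a minimal Python sketch using only the standard library (it assumes the endpoint above is still reachable and returns JSON like the sample shown later in this post):

import json
import urllib.request

# Same query as above: the 100 most recent tweets containing "auburn".
url = "http://search.twitter.com/search.json?q=auburn&result_type=recent&rpp=100"

with urllib.request.urlopen(url, timeout=10) as resp:
    data = json.load(resp)

# Each element of "results" is one tweet, as in the sample response below.
for tweet in data.get("results", []):
    print(tweet["from_user"], "->", tweet["text"])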

If you want to search for multiple terms at a time, you can do that in two different ways. First, you can put the terms together in quotation marks (for example, search for "auburn tigers" by using q="auburn tigers"). This will automatically be transformed into the second form: q=%22auburn%20tigers%22. The second form, with the quotes and spaces percent-encoded, should be used when you are going to add further search parameters after the query text.
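Python's standard library can produce that percent-encoded form for you, which is handy when building these URLs in a script; a quick sketch:

from urllib.parse import quote

query = '"auburn tigers"'   # a quoted phrase search
encoded = quote(query)      # percent-encodes the quotes and the space
print(encoded)              # prints: %22auburn%20tigers%22

url = "http://search.twitter.com/search.json?q=" + encoded + "&result_type=recent"
print(url)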

The data that is returned contains a lot of information that might not be useful for your project, but some of it is interesting.  The following is the result for the most recent single Twitter post containing “auburn”.

{"completed_in":0.056,"max_id":319540687655825409,"max_id_str":"319540687655825409","next_page":"?page=2&max_id=319540687655825409&q=auburn&rpp=1","page":1,"query":"auburn","refresh_url":"?since_id=319540687655825409&q=auburn","results":[{"created_at":"Wed, 03 Apr 2013 20:03:31 +0000","from_user":"M0******","from_user_id":21097****,"from_user_id_str":"21097****","from_user_name":"M*******M******","geo":null,"id":3195406876********,"id_str":"3195406876*********","iso_language_code":"en","metadata":{"result_type":"recent"},"profile_image_url":"http:\/\/a0.twimg.com\/profile_images\/1790229258\/image_normal.jpg","profile_image_url_https":"https:\/\/si0.twimg.com\/profile_images\/1790229258\/image_normal.jpg","source":"<a href="http:\/\/twitter.com\/download\/iphone">Twitter for iPhone<\/a>","text":"RT @AuburnUPC: We are excited to announce the Auburn Airwaves concert line-up!  We are presenting Train, Hot Chelle Rae, and Green River Ordinance."}],"results_per_page":1,"since_id":0,"since_id_str":"0"}

The first and last lines of the results identify what the search was looking for. The text in between describes the post (some of the information was redacted for the purposes of this post, to protect the privacy of the random user whose tweet I grabbed for the example); a short parsing sketch follows the list:

  • When the post was created: Wed, 03 Apr 2013 20:03:31
  • Who the user was: "from_user":"M0******", "from_user_id":21097****, "from_user_id_str":"21097****", "from_user_name":"M******* M******" (all partially redacted for this post)
  • In this case the user doesn't geotag her tweets, but if she did you would see where she posted from: "geo":null
  • The tweet's unique ID number: 319540687********* (partially redacted for this post)
  • The URL link to the user's profile picture: "profile_image_url":"http:\/\/a0.twimg.com\/profile_images\/1790229258\/image_normal.jpg", "profile_image_url_https":"https:\/\/si0.twimg.com\/profile_images\/1790229258\/image_normal.jpg"
  • That the post was submitted using the user's iPhone: "source": Twitter for iPhone
  • Then the text of the post: "text":"RT @AuburnUPC: We are excited to announce the Auburn Airwaves concert line-up!  We are presenting Train, Hot Chelle Rae, and Green River Ordinance."
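Continuing the fetching sketch from earlier, pulling those same fields out of the parsed JSON would look something like this (the field names come straight from the sample response above, and "data" is the parsed response from the earlier sketch):

# Assumes `data` holds the parsed JSON from the earlier sketch.
for tweet in data.get("results", []):
    print("created: ", tweet["created_at"])
    print("user:    ", tweet["from_user_name"], "(@" + tweet["from_user"] + ")")
    print("geo:     ", tweet["geo"])     # None unless the user geotags
    print("tweet id:", tweet["id_str"])
    print("avatar:  ", tweet["profile_image_url"])
    print("source:  ", tweet["source"])  # e.g. a "Twitter for iPhone" link
    print("text:    ", tweet["text"])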


Final thoughts: I don't use Twitter, but for those of you who do, please know that this data is retrievable without your permission by anybody with an internet connection. I chose to redact the private information of the Twitter user whose tweet I used as an example, but anyone with a web browser could have gotten it.  We can use this for big data because Twitter makes the data available so that developers can access it free of charge. So Twitter users beware!!

There are many more search operators that can be used to record tweets and narrow the search. They can be found here: https://dev.twitter.com/docs/using-search

Sources:

This post contains information from the website http://nealcaren.web.unc.edu/pizza-twitter-and-apis/, a great resource for information on Big Data as it relates to sociology. It is written by a professor from UNC Chapel Hill who uses social media to conduct research, and his website has a large number of tutorials on Python and APIs.