Friday, March 29, 2013

k-Means Clustering Tutorial in RapidMiner



In this tutorial, I will demonstrate how to use the k-Means clustering method in RapidMiner. The dataset I am using is contained in the Zip_Jobs folder (which contains multiple files) used for our March 5th Big Data lecture.
  •  Save the files you want to use in a folder on your computer.
  • Open RapidMiner and click “New Process”. On the left-hand pane of your screen there should be a tab labeled "Operators" - this is where you can search for and find all of the operators for RapidMiner and its extensions. Searching the Operators tab for "process documents" should give you an output like this (you can double-click the images below to enlarge them):

You should see several Process Documents operators, but the one I will use for this tutorial is the “Process Documents from Files” operator, because it allows you to generate word vectors from text stored across multiple files. Drag this operator into the Main Process frame.


  • Click “Edit List” beside the “text directories” label in the right-hand pane in order to choose the files that you wish to run the clustering algorithm on.



You can give the directory whatever name you wish.


Click the folder icon to select the folder that contains your data files. Click “Apply Changes”.


  • Double-click the “Process Documents from Files” operator to get inside it. This is where you will link operators together to break the (in my case) HTML documents down into their word components (note that you can run the k-Means clustering algorithm on other file types as well). As highlighted in my previous tutorial, there are several operators designed specifically to break down text documents. Before you get to that point, you need to strip the HTML code out of the documents in order to get at their word components. Insert the “Extract Content” operator into this process by searching for it in the Operators tab.

  • The next thing you will want to do to your files is tokenize them. Tokenization creates a "bag of words" from the words contained in your documents. Search for the "Tokenize" operator and drag it into the "Process Documents from Files" process after the “Extract Content” operator. The only other operator that is necessary for appropriate clustering of documents is the “Transform Cases” operator; without it, documents containing the same words in different cases would be treated as more distant (less similar) than they really are. You should get a process similar to this:

  • Now for the clustering! Click out of the “Process Documents from Files” process. Search for “Clustering” in the Operators Tab:

As you can see, there are several clustering operators, and most of them work in much the same way. For this tutorial I chose to demonstrate k-Means clustering, since that is the clustering type we have discussed most in class. In RapidMiner, you can choose from three variants of the k-Means clustering operator. The first is the standard k-Means, in which similarity between objects is based on a measure of the distance between them. The k-Means (Kernel) operator uses kernels to estimate the distance between objects and clusters. The k-Means (fast) operator uses the triangle inequality to accelerate the algorithm. For this example, use the standard k-Means operator by dragging it into the Main Process frame after the “Process Documents from Files” operator. I set the k value to 4 (since I have 19 files, this should give me roughly 5 files per cluster) and max runs to 20. (A rough Python equivalent of this whole pipeline appears at the end of this post.)


  • Connect the output nodes from the “Clustering” operator to the res nodes of the Main Process frame. Click the “Play” button at the top to run the process. Your ExampleSet output should look like this:



By clicking the folder view under the “Cluster Model” output, you can see which documents were placed into each cluster.



If you do not get this output, make sure that all of your nodes are connected correctly and are of the right type. Some errors occur because the output at one node does not match the type expected at the input of the next operator. If you are still having trouble, please comment or check out the Rapid-i support forum.
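If you would like to see roughly the same pipeline outside of RapidMiner, here is a minimal Python sketch using scikit-learn. It is only an illustration under my own assumptions (a local folder of HTML files, TF-IDF word vectors, k = 4); the folder path and file pattern are placeholders, and scikit-learn's steps are not identical to RapidMiner's operators.

```python
# Rough Python equivalent of the RapidMiner pipeline above: read HTML files,
# strip tags, tokenize/lowercase, build word vectors, and run k-means.
import glob
import re

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

files = sorted(glob.glob("Zip_Jobs/*.html"))          # placeholder path/pattern
docs = []
for path in files:
    with open(path, encoding="utf-8", errors="ignore") as f:
        html = f.read()
    docs.append(re.sub(r"<[^>]+>", " ", html))        # crude "Extract Content"

# TfidfVectorizer lowercases and tokenizes for us ("Transform Cases" + "Tokenize").
vectors = TfidfVectorizer(stop_words="english").fit_transform(docs)

km = KMeans(n_clusters=4, n_init=20, random_state=0)  # k = 4, 20 restarts (~"max runs")
labels = km.fit_predict(vectors)

for path, label in zip(files, labels):
    print(label, path)
```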

Tutorial: Web Scraping on Google Spreadsheet

Web scraping is a very useful technique for collecting information from different URLs within the same website.
Web crawling in RapidMiner cannot handle every kind of rule, so I use Google Spreadsheets to make it easier to collect the information and then import it into RapidMiner for the next step. Here is a tutorial I made.

Tutorial: Web Scraping on Google Spreadsheet
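The tutorial itself is linked above. As a rough illustration of the same idea in code, the sketch below pulls the title from a list of pages and writes the results to a CSV that could then be imported into RapidMiner. The URLs and output filename are placeholders, and the requests and BeautifulSoup libraries are assumed; in Google Sheets you would instead lean on built-in functions such as IMPORTXML.

```python
# Minimal scraping sketch: fetch several URLs, pull one piece of text from each,
# and save the results as a CSV for import into RapidMiner.
import csv

import requests
from bs4 import BeautifulSoup

urls = [                                   # placeholder URLs
    "http://example.com/page1",
    "http://example.com/page2",
]

rows = []
for url in urls:
    resp = requests.get(url, timeout=10)
    soup = BeautifulSoup(resp.text, "html.parser")
    title = soup.title.string.strip() if soup.title and soup.title.string else ""
    rows.append({"url": url, "title": title})

with open("scraped.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["url", "title"])
    writer.writeheader()
    writer.writerows(rows)
```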

Tutorial 4-How to Create a Polyviz widget using Orange

Polyviz widget in Orange
Polyviz is a visualization technique in Orange in which data points are tied to attribute anchors at value-dependent positions. Different anchors, or attributes, can be compared visually, with each data point pinpointed with respect to its attributes. It can be applied to electoral analysis, analyzing the spread of epidemics, the sales distribution of goods, and so on. In this tutorial the data set lists different age groups, the type of prescription lens, whether the patient is astigmatic (astigmatism is caused by an irregular shape of the cornea), the tear production rate, and finally the type of lens used. If a patient is astigmatic, a specific type of lens known as a toric lens must be used. I have used screenshots to develop this tutorial; my previous tutorial shows how to bring a dataset into Orange, extend it to the Data Table, and bring widgets into the scheme. This tutorial follows similar lines but uses the Polyviz widget under the Visualize category. The first picture is of the dataset used, generated by the Data Table widget. It shows how the data is categorized (the picture shows only the first few lines of the data set; the rest follows the same pattern).
 
The second picture shows all of the widgets used in the scheme for this project. The Scatter Plot and Distributions widgets are only used for inference; the Polyviz widget can be built without any of them.
 
Once I feed the data signal into the Polyviz widget, it allows us to visualize the data in many interesting ways. It assigns an attribute to each side of a polygon, automatically creates a scale, and plots the data points. As shown in the picture below, age, astigmatism, and tear rate are the attributes compared against the type of lens, which is color-coded for easy comprehension. Various combinations, such as *young and not astigmatic* or *myope and in the pre-presbyopic age band*, can be analyzed with respect to the lenses. The data points can be accessed by clicking on them, and from the Polyviz widget they can be visualized in whatever format we like, subject to the widgets available on the Orange widget tab.
Various combinations can be visualized by adding and removing parameters in the dialog box on the right-hand side of the Polyviz window. Polygons with different numbers of sides can be produced by adding or removing parameters.


The intuitions gained from the Polyviz widget can be tested using various other widgets in Orange. Interesting correlations can be found in the given data set using the Polyviz widget.
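For anyone who prefers Orange's scripting interface to the canvas, the short sketch below loads Orange's bundled lenses sample dataset and prints its attributes and first few rows. It assumes Orange 3's Python API, and the bundled file may differ slightly from the dataset shown in my screenshots.

```python
# Load the lenses dataset through Orange's Python API and inspect it.
# Assumes Orange 3 is installed (pip install Orange3), which ships a "lenses" sample file.
import Orange

data = Orange.data.Table("lenses")

print("Attributes:", [attr.name for attr in data.domain.attributes])
print("Class variable:", data.domain.class_var.name)

# Show the first few rows, similar to what the Data Table widget displays.
for row in data[:5]:
    print(row)
```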

M2M - Future of Code



Machine-to-Machine Technology





                How far can big data go? What is next for big data analytics? According to GCN, the next horizon for big data may be machine-to-machine (M2M) technology. As big data tooling advances, Oracle now considers big data “an ecosystem of solutions” that will incorporate embedded devices to do real-time analysis of events and information coming in from the “Internet of Things,” according to the Dr. Dobb's website. A huge amount of data is being generated by all of the sensors and scanners we have today, and all of that data is useless unless taken in context with other sparse data. Each strand of data may only be a few kilobytes in size, but when put together with other sensor readings it can create a much fuller picture. Applications are needed not only to enable devices to talk to each other using M2M, but also to collect all the data and make sense of it.

                The future of sparse data could even include what some call thin data. Thin data could come from simple sensors and threshold monitors built into furniture and ancillary office equipment. Viewing all the sensors on a floor over time might show the impact of changing the temperature in the space, or of moving the coffee machine, and you could look at the actual usage data of fixtures like doors and lavatories. There is huge potential for inferential data mining here. Taking thin data to the next level, imagine self-replicating nanotechnology embedded in plant seeds: the nano agent would become part of the plant and relay state information as it grows, allowing massive crop harvesters to know if and when the plants are in distress. Other areas of interest for thin data include monitoring traffic on bridges and roadways, and a variety of weather monitors and tsunami-prediction systems.

                Machina Research, a trade group for mobile device makers, predicts that within the next eight years the number of connected devices using M2M will top 50 billion worldwide. The connected-device population will include everything from power and gas meters that automatically report usage data, to wearable heart monitors that automatically tell a doctor when a patient needs to come in for a checkup, to traffic monitors and cars that will, by 2014, automatically report their position and condition to authorities in the event of an accident. One of the most popular M2M setups has been to create a central hub that can be reached by wireless and wired signals. Sensors in the field record an event of significance, be it a temperature change, inventory leaving a specific area, or even doors opening. The hub then sends that information to a central location where an operator might turn down the AC, order more toner cartridges, or tell security about suspicious activity. The future model for M2M would eliminate the central hub or the human interaction: the devices would communicate with each other and work out problems on their own. This smart technology would decrease the logistics downtime associated with, for example, replacing an ink cartridge in a printer. Once the toner reached a low threshold, the printer would send a requisition to the toner supplier and a replacement would immediately be shipped, so it could be swapped in as soon as it arrived. This turnaround would be drastically better than having the printer fail because of low toner, then ordering the toner, waiting on shipping, and finally replacing it.
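To make the hub model a little more concrete, here is a tiny sketch of sensor-side logic that reports a threshold event to a central hub over HTTP. Everything in it (the hub URL, the toner threshold, the payload fields) is hypothetical, and real M2M deployments typically use purpose-built protocols such as MQTT rather than plain HTTP.

```python
# Hypothetical sensor-side logic for the hub model described above:
# when a reading crosses a threshold, report the event to a central hub.
import json
import urllib.request

HUB_URL = "http://hub.example.local/events"   # placeholder hub endpoint
TONER_THRESHOLD = 0.10                        # reorder when toner drops below 10%


def report_event(device_id: str, reading: float) -> None:
    """POST a JSON event to the central hub (all field names are illustrative)."""
    payload = json.dumps({
        "device": device_id,
        "type": "toner_low",
        "level": reading,
    }).encode("utf-8")
    req = urllib.request.Request(
        HUB_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        print("hub responded:", resp.status)


toner_level = 0.07  # pretend sensor reading
if toner_level < TONER_THRESHOLD:
    report_event("printer-42", toner_level)
```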

                Humans won’t be completely removed from the equation. They will still need to be in the chain to oversee the different processes, but they will be more of a second pair of eyes and less of a direct supervisor. Humans will let the machines do the work and will only get involved when a machine reports a problem, like a communications failure. More application software development will be needed in the future to connect those 50 billion devices. Another place to learn more about M2M development is the Eclipse Foundation.

Kayak.com Travel Guide Using Big Data



    
Airfare Huntsville, AL to Las Vegas, NV in September 2013
     Kayak.com is a travel site that allows users to book hotels, flights, rental cars, cruises, as well as all inclusive vacation packages.  There have always been guidelines when it comes to travel prices in regard to which months are best to travel.  However, Kayak recently analyzed over one billion search queries involving airfare.  Instead of a general guide that specifies the cheapest months to travel domestic and abroad, this analysis gives insight into specific popular destinations worldwide. 
     Some interesting results show an annual increase in airfare for the “popular destinations” except for Toronto.  The data also showed which cities were the most popular for the year, number one being Vegas.  This analysis was able to establish a top three cheapest months for domestic travel which consist of September, January, and October in ascending order.  February and March make up the best months to fly overseas with regard to price while January and February are the least busy months. 
     Some other trends were also prevalent in the data.  Many locations that are not typically considered “popular destinations” were analyzed due to an increase in popularity.  Some of the most helpful results show which of these rising destinations maintained the same ticket price and which increased.  The only domestic location that both rose in popularity and remained the same in ticket price was Nashville, Tennessee.  There were also results for those locations that have decreased in popularity yet increased in price. 
     Many times, people know where they wish to travel or at least have a pretty good idea.  However, this data could aid in deciding where or when to go.  If the date is already set, then Kayak can help the customer find a popular destination for a good price.  If the customer chooses to study the results, they may wish to visit Las Vegas, for instance, since it is ranked number one in popularity and number five in lowest price among “popular destinations.”  One important thing to note, however, is that these results are based on search queries rather than booked flights, so they could be skewed by would-be travelers who were simply curious about the cost of a flight to a particular popular destination.

Source:  http://venturebeat.com/2013/03/21/kayak-analyzes-a-billion-queries-to-uncover-secrets-behind-cheap-flights/
 

Why are enterprises failing at Big Data management?


We have heard about many successful Big Data implementations across different industries, and bloggers and experts keep talking about why Big Data is good. Some companies have indeed had wild success reporting, analyzing, and predicting from their massive datasets, but thousands of companies are still struggling through the process. A new survey, From Overload to Impact: An Industry Scorecard on Big Data Business Challenges, sponsored by Oracle, asked 333 C-level executives in North America, by phone or online, how they're handling the "data deluge" and how well they're able to extract business intelligence from it to "improve operations, capitalize on new opportunities, and drive new revenue." Respondents graded themselves from A (best) to F (worst). Given all the success stories, you might expect the results to be pretty good. They are not: 29% of executives gave their organization a “D” or “F” in preparedness to manage the huge data flow, 31% gave a “C”, and 38% said they don’t have the right systems.

The most important problem is that many businesses believe they can purchase a piece of software that will directly point out the problems and hand them the solution, with shiny reports and graphs just a few days later. There is another story: a CIO ordered his staff to acquire hundreds of servers with the most capacity available. He wanted to proclaim to the world (and on his resume) that his company had built the largest Hadoop cluster on the planet. And that was it. He had no plan for what to do with his servers and data, and he left after two years with no business case or Big Data business value.

Big Data needs to be thought about from the highest levels to the lowest, starting with the question: “How can we better utilize the data we have to make better business decisions?” This is very important, because if top managers don’t understand what Big Data can do, the effort is bound to fail.




Thursday, March 28, 2013

Splunk as a Big Data Platform for Developers






Damien Dallimore, the developer evangelist at Splunk, presents this video about Splunk as a Big Data platform for developers. In this video you will see an overview of the Splunk platform, how to use Splunk, the Splunk Java SDK, Splunk integration extensions, and some other JVM/Java-related tools.

Splunk is an engine for collecting, aggregating, and correlating machine data. At the same time, Splunk provides visibility, reports, and searches across IT systems and infrastructure, and it does not lock you into a fixed schema. You can download Splunk, install it in five minutes, and run it on all modern platforms. In addition, Splunk has an open and extensible architecture. It can index any machine data: it can capture events from logs in real time, run scripts to gather system metrics, connect to APIs and databases, listen to syslog and raw TCP/UDP, gather Windows events, universally index any data format without needing adapters, stream in data directly from your application code, and decode binary data and feed it in.

Splunk can also centralize data across the environment. The Splunk Universal Forwarder sends data from remote systems to the Splunk Indexer; it uses minimal system resources, is easy to install and deploy, and delivers secure, distributed, real-time universal data collection for tens of thousands of endpoints. Splunk scales to terabytes per day and thousands of users: automatic load balancing scales indexing linearly, and distributed search and MapReduce scale search and reporting linearly. Splunk provides strong machine-data governance, with comprehensive controls for data security, retention, and integrity, and single sign-on integration enables pass-through authentication of user credentials.

Splunk is an implementation of the MapReduce algorithmic approach, but it is not Apache Hadoop MapReduce (MR) the product. Splunk is not agnostic about its underlying data source and is optimal for time-series data. It is an end-to-end, integrated Big Data solution with fine-grained protection of access and data through role-based permissions, and it offers data retention and aging controls; users can submit “MapReduce” jobs without needing to know how to code a job. Splunk has four primary functions: searching and reporting, indexing and search services, local and distributed management, and data collection and forwarding. Developers can use Splunk to accelerate development and testing, to integrate data from Splunk into their existing IT environment for operational visibility, and to build custom solutions that deliver real-time business insights from Big Data. In conclusion, Splunk is an integrated, enterprise-ready Big Data platform.
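The talk focuses on the Java SDK, but the same idea can be sketched with Splunk's Python SDK (splunklib). The host, credentials, and search string below are placeholders, and helper classes vary between SDK versions, so treat this as an outline rather than a drop-in script.

```python
# Outline of running a search against a Splunk instance with the Python SDK.
# Assumes splunk-sdk is installed (pip install splunk-sdk) and Splunk is reachable locally.
import splunklib.client as client
import splunklib.results as results

service = client.connect(
    host="localhost", port=8089,          # placeholder connection details
    username="admin", password="changeme",
)

# Run a small one-shot search and print each returned event.
stream = service.jobs.oneshot("search index=_internal | head 5")
for event in results.ResultsReader(stream):
    if isinstance(event, dict):
        print(event)
```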

Mining data for discovery of high productivity process characteristics.



A data-driven approach has been widely used to study trends in customer or market behavior in industrial sectors, finance, retail, and services. Recently, mining data warehouses has attracted attention in the biotechnology sector because of the rapid expansion of genomics-based data. The increase in biologics manufacturing also presents an area of data mining that is yet to be explored.
Today’s manufacturing facilities are advanced and highly automated in their operation and data acquisition. Thousands of process parameters are constantly acquired and stored electronically. Fluctuations in process productivity and product quality invariably occur during production. Understanding the root cause of these abnormalities and increasing process robustness will have major economic implications for the product. Mining bio-process data to identify the parameters that may cause process fluctuations holds a lot of potential for enhancing productivity and process efficiency.
Many techniques for exploring bio-process data have been employed in past studies. Principal component analysis (PCA), partial least squares (PLS), and unsupervised clustering have been proposed to analyze and monitor bio-processes. A decision tree based classification approach was proposed to identify the process trends that best differentiate runs with high and low productivity. Artificial neural networks (ANNs) are also a popular tool for modeling the non-linear interactions in temporal process data. Despite these attempts, mining huge volumes of production-scale process data and implementing such schemes on-line remain tedious.
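As a toy illustration of the decision-tree approach mentioned above, the sketch below trains a classifier to separate "high" from "low" productivity runs on synthetic process data. The feature names, the data, and the labeling rule are all made up for the example, and scikit-learn is assumed.

```python
# Toy decision-tree example: classify synthetic bio-process runs as high/low productivity.
# The data, feature names, and labeling rule are fabricated purely for illustration.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
n_runs = 200

# Pretend process parameters: feed rate, pH deviation, peak cell density.
X = np.column_stack([
    rng.normal(1.0, 0.2, n_runs),   # feed rate
    rng.normal(0.0, 0.1, n_runs),   # pH deviation
    rng.normal(10.0, 2.0, n_runs),  # peak cell density
])
# Invented rule: runs with high cell density and small pH deviation are "high productivity".
y = ((X[:, 2] > 10) & (np.abs(X[:, 1]) < 0.1)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

print("test accuracy:", tree.score(X_test, y_test))
print(export_text(tree, feature_names=["feed_rate", "ph_dev", "cell_density"]))
```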
Bio-process data sets are unique in that the frequency of measurement varies from parameter to parameter. In addition to temporal measurements of viability, cell densities, and the consumption and production rates of nutrients and metabolites, a large number of process parameters are commonly recorded. The complexities associated with the vast and unique characteristics of bio-process data present substantial challenges, as well as opportunities, for the data mining process. The data mining steps involve applying descriptive and predictive pattern recognition methods to discover significant changes in the data. The identified models can then be interpreted by process experts to gain further insights for process improvement.
Support vector machines (SVMs) are a class of predictive machine learning algorithms built on Vapnik-Chervonenkis theory and the principle of structural risk minimization (SRM). A support vector machine identifies a linear decision boundary that separates objects from the two classes with the maximum distance, called the margin. Each object is described by a set of features, and non-linear support vector machines can be constructed using kernel transformation functions.
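Here is a minimal scikit-learn sketch of the kernel SVM idea described above, using synthetic two-class data rather than real bio-process measurements; the parameters are illustrative defaults.

```python
# Kernel SVM sketch: fit an RBF-kernel SVC on synthetic two-class data and
# report the number of support vectors that define the margin.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)

print("support vectors per class:", clf.n_support_)
print("test accuracy:", clf.score(X_test, y_test))
```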
This model-based data mining is an important step toward establishing process-data-driven knowledge discovery in bio-processes. Implementing this methodology on the manufacturing floor can facilitate real-time decision making and hence improve the robustness of large-scale bio-processes.
Reference - Mining manufacturing data for discovery of high productivity process characteristics.