Sunday, March 31, 2013

Using Data Mining Techniques to Predict the Survival Rate For Heart Transplants

Last week I posted a blog in which I introduced my research problem and gave some statistics about the current supply of donated hearts and the gap between supply and demand.
This week I will continue to share my research with my classmates.
As I mentioned, we focus on predicting how many years a specific person can live with a donated heart.
To solve this problem, the first and most important task is to determine the factors (variables) affecting the outcome.

Conventionally, researchers have dealt with small datasets using conventional statistical techniques that do not take collinearity and nonlinearity into account, as discussed in the previous blog. They also use some non-parametric and non-statistical techniques that are computationally expensive and require prior knowledge about the data.

The biggest advantage of today's world is the flood of big data in health informatics, which can be analyzed with data mining techniques. These techniques reveal better and more accurate predictions of the survival of organ transplant recipients than any of the conventional methods used by previous studies.
We started the research by obtaining a very large dataset from UNOS, a tax-exempt, medical, scientific, and educational organization that operates the national Organ Procurement and Transplantation Network. The dataset has 443 variables and 43,000 cases, all belonging to heart transplant operations. These variables include socio-demographic and health-related factors for both donors and recipients, as well as procedure-related factors.

After preprocessing the data (cleaning, dealing with missing values, reorganizing the data for the specific studies, etc.), we used variable selection methods to determine the potential predictive factors.
These potential predictive factors are the ones whose predictive power we then tested using data mining algorithms such as Support Vector Machines, Decision Trees and Artificial Neural Networks.
After cross-tabulation and sensitivity analysis, we observed that all three methods gave quite satisfactory results.
For the 3-year survival study, Support Vector Machines gave the best prediction rate, classifying 94.43% of the cases correctly, while Artificial Neural Networks classified 81.18% and Decision Trees 77.65% of them correctly.
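For readers who want to try a similar comparison themselves, below is a minimal sketch, not the models we actually ran, assuming a preprocessed feature table stored in a hypothetical file heart_transplant_preprocessed.csv with a binary 3-year survival column:

```python
# Sketch: comparing SVM, decision tree, and neural network classifiers on a
# (hypothetical) preprocessed transplant dataset. The file name and column
# names are placeholders, not the actual UNOS variables.
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier

data = pd.read_csv("heart_transplant_preprocessed.csv")   # hypothetical file
X = data.drop(columns=["survived_3_years"])               # predictive factors
y = data["survived_3_years"]                              # 1 = alive at 3 years

models = {
    "Support Vector Machine": make_pipeline(StandardScaler(), SVC()),
    "Decision Tree": DecisionTreeClassifier(max_depth=5),
    "Neural Network": make_pipeline(StandardScaler(),
                                    MLPClassifier(hidden_layer_sizes=(20,),
                                                  max_iter=1000)),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
    print(f"{name}: mean accuracy = {scores.mean():.4f}")
```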

What do these results mean?
For the Support Vector Machine, the accuracy rate is 94.43%, which means that if the Support Vector Machine tells us that a specific person will live or die after receiving the donated organ, it is correct 94.43% of the time and has a 5.57% chance of predicting incorrectly.
It also lets us know which factors play a role in these predictions.

These accuracy rates are higher than anything reached with conventional statistical techniques, which is very promising for the future success of heart transplants.


The Mental Approach to Baseball Hitting-Big Data to Analyze Hitters Brain Function


Considered one of the most difficult tasks in sports, hitting a thrown baseball, especially at the professional level, is something only the most gifted athletes on the planet can do. The difficulty is that performing this task requires a complex interaction between the brain and the muscles of the body; even the most physically gifted athletes are unable to hit a baseball if their mental prowess is subpar. According to this article, http://baseballanalysts.com/archives/2009/09/unraveling_the.php, a professional-level batter has approximately 50 msec (.05 sec) to react after a pitch is thrown in order to hit it. After that .05 sec, the batter is unable to alter his swing in any way from what he has decided to do. For comparison, an average human eye blink takes between 300 and 400 msec. This means the batter must decide whether or not to swing anywhere from 6 to 8 times faster than someone can blink, not an easy task. Making things even more difficult, most pitchers throw three or four different pitches, many of which move in the air. So the batter must identify the type of pitch, decide whether it is a ball or strike, and send electrical signals to his muscles in time to successfully hit it. No wonder failing 7 times out of 10 is considered an elite level of hitting performance (a .300 average). The paper at this link from the 2013 Sloan Sports Conference details research on batter brain function in recognizing pitches: http://www.sloansportsconference.com/wp-content/uploads/2013/02/A-System-for-Measuring-the-Neural-Correlates-of-Baseball-Pitch-Recognition-and-Its-Potential-Use-in-Scouting-and-Player-Development.pdf

The study was done using three Division 1 college baseball players. Each player looked at 468 simulated pitches and was asked to identify the pitch type using a keyboard as soon as the simulated pitch was thrown. An fMRI and EEG scanner were used to study the subjects’ brain activity while they were identifying pitches. A linear equation was formulated to try and determine which independent variables were related to the time it took to recognize a pitch and whether it was correctly identified or not. The brain scans were used to evaluate brain activity in different areas as time passed after the pitch was “thrown”. The studies found that for all pitch types, brain activity peaked around 400 msec and 900 msec. As a reference, 400 msec would be approximately the time the pitch would cross the plate at a normal pitch speed, while the researchers speculated the second peak was a type of post-decision thinking about their choice. This study found that different regions of the brain are active for different pitch types and different regions are active for incorrect vs. correct pitch identification within a pitch type group (an example being one area is active for a correct fastball ID while another is active for incorrect fastball ID).
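As a rough illustration of the kind of linear model the researchers describe, not their actual specification, one could regress recognition time on pitch and subject attributes; the file name and columns below are hypothetical stand-ins:

```python
# Sketch: a simple linear model relating pitch variables to recognition time.
# The CSV and its columns are hypothetical, not the study's actual data.
import pandas as pd
from sklearn.linear_model import LinearRegression

trials = pd.read_csv("pitch_recognition_trials.csv")      # hypothetical file
X = pd.get_dummies(trials["pitch_type"], prefix="pitch", drop_first=True)
X["pitch_speed_mph"] = trials["pitch_speed_mph"]
y = trials["recognition_time_ms"]

model = LinearRegression().fit(X, y)
for name, coef in zip(X.columns, model.coef_):
    # Each coefficient estimates how that variable shifts recognition time.
    print(f"{name}: {coef:+.1f} ms")
```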

Some applications of this research may lie in the future of baseball scouting. Players who have the physical tools necessary to hit a baseball have long been coveted, but perhaps this research could better enable scouts to find out who has the mental ability to identify pitches properly, which is key to hitting the ball. The researchers also hypothesize the information could be used in scouting reports by identifying which pitches batters are bad at recognizing. If a team knows their hitter is bad at recognizing curveballs, they could work with him to try to correct that. Conversely, if a pitcher knows a batter has difficulty recognizing a pitch, he could use that particular pitch more often or in particular situations to try to get him out. The amount of data which could be generated by this research is vast and untested, but it could have an important impact on how the value of baseball players is determined in the future.

Saturday, March 30, 2013

More on the Motion Chart

I posted a motion chart regarding the percent of GDP spent on the military and GDP per capita. While I discussed a little of what caught my eye at first glance at the visualization, there is more that I wanted to briefly address. I removed a large portion of the countries in order to clean it up a little. I kept all European and North American nations, Japan, New Zealand, Australia, and South Korea. Two topics I want to discuss about these nations are the economic differences between eastern and western European nations, and the similarities in the economies of Canada, Australia, New Zealand, and South Korea.

Below is the image of the chart described above. You can see the western European nations (represented by blue), eastern European nations (yellow-green), North American nations (light green), Australia and New Zealand (yellow), and Japan and South Korea (red). As you can tell, with the exception of the United States, Australia, New Zealand, Japan, and Canada not only lie within the cluster of western European nations but also react almost identically; so much so that if they were indicated by the same color, they would be indistinguishable. This can be attributed to how closely their economies rely on the same variables. Next I am going to try to create a similar chart going back decades to observe how much more slowly nations on different continents react to economic changes.




When watching the motion chart, the differences between eastern European nations (represented by blue in the image below) and western European nations (represented by red), in terms of GDP per capita, are astonishing. While it is probably no surprise that there is a difference, I would not have expected clusters as well defined as the two seen throughout the sixteen years of data. What causes this difference? Much of it can be attributed to decades of communism in the eastern nations. While they have moved away from that form of economy, the effects are obviously still seen today.





It would be expected that over the next few decades these eastern European nations will begin to migrate up the chart and join those of western Europe.

To see the motion chart of 130 nations, see my visualization from last week.


How Google Search fights Spam

In class we discussed how Google's search program works and how it was better than its predecessors due to its ability to find the most relevant web pages based on what you were searching for.  But as always, there are people who are going to try to cheat the system.  These people are referred to as spammers: people who try to get their unrelated websites to come up in searches, usually in order to push some product on the user.


There are three main ways spammers try to beat the search engine.


1. Cloaking- We talked about this in class.  This is the practice of putting the searched-for word in the same color as the background, thereby hiding it from the user on the site, while still including enough occurrences of the word that the search engine will read them and think the site is relevant.

2. Keyword Stuffing- This is similar to cloaking.  This is when a website plasters a large number of copies of the keyword on the page, usually at the bottom, in order to get the search engine to believe the page is relevant to the search.

3.  Paid Links-  This is when a website pays other websites to link to its page in order to increase its PageRank, which, as we discussed in class, is how Google measures the importance of a webpage based on the "votes" cast by links on other webpages.


Paid links are a little harder to discover, but usually if a site has been selling links, Google will no longer trust the links from that page.
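To make the "votes by links" idea concrete, here is a minimal sketch of the textbook PageRank power iteration on a toy four-page link graph; this illustrates the general algorithm, not Google's actual production system:

```python
# Sketch: basic PageRank by power iteration on a tiny toy web graph.
# links[page] = list of pages that `page` links out to.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

damping = 0.85
pages = list(links)
rank = {p: 1.0 / len(pages) for p in pages}

for _ in range(50):  # iterate until the ranks stabilize
    new_rank = {}
    for p in pages:
        # Sum the "votes" from every page that links to p, each vote
        # weighted by the linking page's own rank and split among its links.
        incoming = sum(rank[q] / len(links[q]) for q in pages if p in links[q])
        new_rank[p] = (1 - damping) / len(pages) + damping * incoming
    rank = new_rank

for p, r in sorted(rank.items(), key=lambda kv: -kv[1]):
    print(f"{p}: {r:.3f}")
```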



Source:

http://www.google.com/competition/howgooglesearchworks.html#section4

smart grid and data mining (2)



Speaking of the smart grid, power engineers and energy companies are pretty excited, not only because it will bring a new technical revolution but also because it involves a huge amount of money: only about $356 million of it today, but potentially $4.2 billion by 2015, which means a cumulative $11.3 billion between 2011 and 2015. That's what Pike Research predicts for the global market for smart grid data analytics, or software and services that can mine data and provide intelligence for smart grid vendors, utilities and consumers.



As a result, most utilities around the world have to face new problems: how to deal with a flood of smart grid data in the upcoming years, and how to mine that data to find ways to cut costs, improve customer adoption and better predict future power needs. In a sense, how well utilities deal with these challenges will affect the destiny of the whole smart grid industry.



There is no doubt that applying the smart algorithms and applications of the Internet industry to the smart grid could generate a host of new ways of doing business. On the utility operations side, smart meters and distribution automation systems can be data-mined to optimize the flow of power or predict when equipment is most likely to fail. On the customer end, behavioral data and market analysis can be applied to entice more people into energy efficiency programs, or to help them choose which energy-efficient appliances to buy.
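As a toy illustration of the utility-side idea, and only that, here is a sketch that flags meters whose consumption suddenly deviates from their recent history; the file layout and thresholds are hypothetical, not any vendor's actual analytics:

```python
# Sketch: flag smart-meter readings that deviate sharply from each meter's
# recent rolling average, a crude stand-in for "predicting trouble" from
# meter data. The CSV layout (meter_id, timestamp, kwh) is hypothetical.
import pandas as pd

readings = pd.read_csv("smart_meter_readings.csv", parse_dates=["timestamp"])

def flag_anomalies(group, window=24, z_threshold=3.0):
    rolling = group["kwh"].rolling(window)
    z = (group["kwh"] - rolling.mean()) / rolling.std()
    return group[z.abs() > z_threshold]

anomalies = (readings.sort_values("timestamp")
                      .groupby("meter_id", group_keys=False)
                      .apply(flag_anomalies))
print(anomalies[["meter_id", "timestamp", "kwh"]])
```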

A host of IT giants are already involved in smart grid data analytics, including Accenture, Capgemini, HP, IBM, Microsoft, Oracle, SAIC, SAP and Siemens. Smaller, newer entrants include OPOWER, OSIsoft, Telvent, Ecologic Analytics and eMeter.

Utilities are also seeking help in upgrading customer relationship management to handle the shift from monthly power bills to daily or hourly interactions via smart meters. As for concerns over home energy data security and privacy, Pike predicts that smart grid IT players such as Cisco, IBM, Microsoft and Oracle will play an important role.

reference:
http://gigaom.com/2011/08/15/big-data-meets-the-smart-grid/

Sports fraud detection using data mining techniques


People love sports in the United States, and people hate fraudulent activity in sports. It is believed that too many sports frauds will cause a sport to lose popularity and audiences, and eventually destroy the whole sport. However, for various reasons, fraud in sports has always existed. Therefore, it is essential to find ways to detect fraudulent activity in sports.

Before introducing sports fraud detection, I'd like to talk a little bit about sports fraud itself. There are mainly three categories of fraudulent activity in sports: poor player performance, a pattern of unusual calls from the referee, and lopsided wagering. The first two are basically attempts to manipulate the betting line. Several cases have been reported by the media; a recent example was in the summer of 2007, when NBA referee Tim Donaghy was investigated and convicted for compromising basketball games to pay off gambling debts. Lopsided wagering can be used as an indicator of a compromised game. This type of wagering could involve betting in excess of what is normally expected or betting heavily against the favorite.

As you might know, Las Vegas Sports Consultants Inc. (LVSC), which sets betting lines for 90% of Las Vegas casinos, is one of the organizations that actively looks for fraudulent sports activity. LVSC statistically analyzes both betting lines and player performance in order to look for any unusual activity. Player performance is judged on a letter-grade scale (i.e., A-F) and takes into account variables such as weather, luck and player health. Taken together, games are rated on a 15-point scale of seriousness: a game rated at 4 or 5 points may undergo an in-house review, while 8-9 point games will involve contact with the responsible league. Leagues are similarly eager to use the services of LVSC to maintain game honesty, and LVSC counts several NCAA conferences, the NBA, NFL, and NHL among its clients.
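As a very rough sketch of the lopsided-wagering idea, and not LVSC's actual model, one could flag games whose betting volume or split is far outside the norm; the file and column names below are hypothetical:

```python
# Sketch: flag games with unusually heavy or lopsided wagering using simple
# z-scores. The columns (game_id, total_bets, pct_on_underdog) are hypothetical.
import pandas as pd

bets = pd.read_csv("betting_activity.csv")

for col in ["total_bets", "pct_on_underdog"]:
    mean, std = bets[col].mean(), bets[col].std()
    bets[f"{col}_z"] = (bets[col] - mean) / std

# Games more than 3 standard deviations out on either measure get reviewed.
suspicious = bets[(bets["total_bets_z"].abs() > 3) |
                  (bets["pct_on_underdog_z"].abs() > 3)]
print(suspicious[["game_id", "total_bets", "pct_on_underdog"]])
```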

In practice, Las Vegas Sports Consultants is not the only gambling institution with an interest in honest and fair sports events; offshore betting operations are starting to fill this role as well. One popular offshore gambling site, Betfair.com, has signed an agreement with the Union of European Football Associations (UEFA) to help monitor games for match-fixing, unusual results, or suspicious activity.

Big Data and Film

After cloud computing changed people's minds about business, Big Data came along and changed how businesses operate and analyze, increasing efficiency and mobility. Many industries benefit from Big Data, and the entertainment business has taken it up as well. For example, analysts can use the number of nominations, wins at different award shows, and data from betting websites to predict the major award winners at a specific award ceremony; music executives can analyze data to learn customers' listening habits and musical inclinations, and use it to decide where, and with what kinds of musicians, to put on shows; and movie studios, distributors and other movie organizations can use Big Data to make decisions about promotion methods, release dates and release locations to make sure their films turn a profit.


Now let's look at ten movies that can help us get familiar with Big Data in an entertaining way.
V for Vendetta (2005 - James McTeigue) Because of the V-Trinity, which stands for Velocity, Volume and Variety, V means more than Vendetta in Big Data. Velocity means real-time processing, Volume means extracting useful and sufficient information from a very large amount of data, and Variety means using and relating different kinds of data to help make the final decisions.

The Fast and the Furious (2001 - Rob Cohen) Keep your eyes on your data and make it "speak out" useful information. You can use Big Data to analyze your business and predict which decisions could make a difference and which could bankrupt you. With a data-driven culture, you learn at high speed whether you are going to fail or succeed. Fail fast, or you are going to get furious!

The Gold Rush (1925 - Charlie Chaplin) Data is the new gold mine, and a large number of companies want to exploit its potential to get more information and make money. But building a data-driven culture inside an organization is very difficult, much like Chaplin's hard trip in Alaska. Organizations that want an easier and happier revolution should know how to avoid the fatal mistakes of the Big Data revolution.

Up (2009 – Pete Docter, Bob Peterson) Up is a very touching movie with a lot of fun in the cloud scenes; likewise, the volume of Big Data calls for an elastic cloud infrastructure. Big Data with the map-reduce paradigm lets these problems be solved on different cloud infrastructures (a tiny map-reduce sketch follows this list).

The Elephant Man (1980 – David Lynch) Hadoop (named after the toy elephant of creator Doug Cutting's son) is the yellow elephant in the Big Data room. In the beginning, Hadoop was just a project inspired by Google's papers; now it is a cornerstone of the Big Data foundations.

Titanic (1997 – James Cameron) Titanic shows how a decision made without enough analysis of uncertainty can end in tragedy. Big Data can surface useful information you could not see in your data before. With Big Data, you can see the "iceberg" under the "water" and gain a new view of your data and your decisions.

Minority Report (2002 – Steven Spielberg) In the pre-crime department, Anderton stops criminals before they kill their victims by predicting the future. Predictive analytics, which forecasts what will happen by analyzing data, is the "killer application" of Big Data.

No Country for Old Men (2007 – E. & J. Coen) Big Data is a brand new skill set, and the old database "men" have to adjust to the new technology. Big Data also means many kinds of data from a large number of sources, which lets decision makers find the potential hidden in "big" data.

Big Fish (2003 – Tim Burton) Big Fish is a movie with a connection to Auburn University. Big Data so far is like Ed Bloom's shocking story told from his deathbed: sometimes you cannot clearly decide what is reality and what is a fairy tale, because the fairy tale may become reality in the future.

Black Swan (2010 - Darren Aronofsky) A black swan is not only a role but also a theory telling us that rare events are hard to predict and can have a significant influence on the final result. Big Data can help you see the rare events in your data; by analyzing it, you may spot the "black swan" and get a clearer idea of what to do in the future.
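Since map-reduce came up in the Up entry above, here is a tiny, purely illustrative word-count example in plain Python; real map-reduce jobs run distributed across a cluster, for example on Hadoop:

```python
# Sketch: the map-reduce idea in miniature - map each document to (word, 1)
# pairs, group by word, then reduce each group by summing the counts.
from collections import defaultdict

documents = ["big data is big", "data drives decisions"]

# Map phase: emit (key, value) pairs.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group values by key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: combine each key's values.
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)  # {'big': 2, 'data': 2, 'is': 1, 'drives': 1, 'decisions': 1}
```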

Source:

http://bigdata-doctor.com/big-data-explained-in-10-movies/

Tutorial: Naive Bayes Classification Algorithm


Naive Bayes is a simple classification algorithm. It is based on calculating, as probabilities, the effect of each feature on the result. All of the training documents and newly added documents are used to calculate the probability that each newly added term affects the categories (Adsiz, 2006). By Bayes' Theorem (Han & Kamber, 2001),

        P(cj│d) = P(cj) P(d│cj) / P(d),

where P(cj) is the prior probability of category cj and d = (t1, …, tM) is the document, represented by its terms, that is going to be classified.

Because there are many features in the dataset, it would be very difficult to calculate P(d│cj) directly. Therefore, in this algorithm the features in a document are assumed to be independent (Adsiz, 2006).

        P(d│cj) = ∏ P(ti│cj),   i = 1, …, M,

where ti is the ith term in the document. P(cj│d) is the probability of a document belonging to category cj, so under the independence assumption

        P(cj│d) ∝ P(cj) ∏ P(ti│cj),   i = 1, …, M,

where P(ti│cj) is the conditional probability of term ti occurring in category cj. It is also an indicator of how much ti contributes to the category. In this method, we try to find the most appropriate category for the document. For this purpose, the best approach is the maximum a posteriori (MAP) category cj_map (Manning, Raghavan & Schütze, 2008).
        cj_map = argmax_cj P(cj│d) = argmax_cj P(cj) ∏ P(ti│cj).

We do not know the exact values of the parameters P(cj) and P(ti│cj). However, by using maximum likelihood estimation (MLE), we can estimate these parameters. Let P̃(cj) be the estimate of P(cj), and P̃(ti│cj) be the estimate of P(ti│cj).

        P̃(cj) = Nj / N,

where N is the number of all documents and Nj is the number of documents in category cj. And

        P̃(ti│cj) = (1 + Nij) / (M + ∑i Nij),   i = 1, …, M,

where Nij is the number of documents belonging to category cj and containing the ith term ti.
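To make the estimates above concrete, here is a small sketch that computes exactly these counts, P̃(cj) = Nj/N and the smoothed P̃(ti│cj), for a toy set of labeled documents; the training documents are made up purely for illustration:

```python
# Sketch: Naive Bayes training and classification using the document-count
# estimates defined above. The tiny training set is made up for illustration.
import math
from collections import defaultdict

training = [
    ("sports", "the team won the game"),
    ("sports", "great game and great team"),
    ("tech",   "new data mining software released"),
    ("tech",   "software update improves data analysis"),
]

# Count documents per category (Nj) and documents containing each term (Nij).
N = len(training)
N_j = defaultdict(int)
N_ij = defaultdict(lambda: defaultdict(int))
vocabulary = set()

for category, text in training:
    N_j[category] += 1
    terms = set(text.split())          # count each term once per document
    vocabulary |= terms
    for term in terms:
        N_ij[category][term] += 1

M = len(vocabulary)

def posterior_score(category, document_terms):
    # log P~(cj) + sum over terms of log P~(ti | cj), using the
    # (1 + Nij) / (M + sum Nij) smoothing defined above.
    score = math.log(N_j[category] / N)
    denominator = M + sum(N_ij[category].values())
    for term in document_terms:
        score += math.log((1 + N_ij[category][term]) / denominator)
    return score

new_doc = "data mining for team performance".split()
best = max(N_j, key=lambda c: posterior_score(c, new_doc))
print("Predicted category:", best)
```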







1. Adsiz, A. (2006). Text Mining (dissertation). Ahmet Yesevi University.
2. Han, J. & Kamber, M. (2001). Data Mining. Morgan Kaufmann Publishers, San Francisco, CA.
3. Manning, C.D., Raghavan, P. & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.






Tutorial- Topsy


There are plenty of tools out there to analyze social media topics and trends. One that Chris and I have found especially helpful in analyzing these large amounts of data is called “Topsy”. He is probably better at understanding all of the cool things it can do, but I will give a short tutorial on the basic information it can provide.

**To be able to try this tutorial, you will need to create a trial account. It is free and it lasts for two weeks, so you’ll have plenty of time to be able to play around with it.



Once you have created your account and have logged in, you will see this as the main screen:



The first thing you will want to do is type in the terms that you want to search for in the bar at the top. After each keyword or phrase, hit enter. In this example, I will search three phrases: “data analysis”, “big data”, and “big data analytics”. To compare the three terms, I need to be sure that the check box beside each phrase (under the search bar) is checked. You will then be looking at your Dashboard. This feature gives an overview of the information Topsy has collected. The timeline shown is based on the last seven days, but you can choose a specific date range if you would like. In this case, my Dashboard looks like this:



On the Dashboard, you are able to see Tweet activity over time. It is easy to see that the phrase “big data” is a lot more prevalent on Twitter than the other two phrases that were searched.  You are also presented with Top Tweets, Top Links, and Top Media.

If you click on the Geography tab at the top, you are able to see where the Tweets are coming from. Topsy is gathering most of its Tweets (at least about these topics) from Twitter users in the US. There are almost 25,000 tweets from the US, and the next closest is the UK with only about 4,000 tweets. You can see the breakdown of the tweets from around the world below:



You can even click on the United States link, and Topsy breaks down the tweets by state. Pretty amazing! Alabama doesn’t have much to say about these topics. There are only 43 tweets with these phrases from Alabama, whereas California has over 3,000. Big data must be a hot topic in Silicon Valley!


Now I am going to go back to the Dashboard. Right around March 29, I see a huge peak in the frequency of the term "big data". Let's find out why!
When I click on this peak, it leads me to the Activity tab, where I am able to see a list of the Top Tweets about "big data" for March 29 at 23:00. It looks like this:



I am able to click on any links that may be connected to those tweets to read more about what was so popular about “big data” on this day. One of the main tweets that seems to have gotten the most action at this time was about how Doctors can use big data to improve cancer treatments. Since a link is attached to this tweet, I am able to check it out myself!



Like I said, Topsy is a great tool for analyzing the huge amounts of data found in the social media world. These are just a couple of the things that the program can do. I encourage you to check it out and find out more things that this tool can be used for!

PS. I focused mainly on Twitter data in this example, but it is my understanding that you can search Facebook, Tumblr, and Pinterest as well.

Friday, March 29, 2013

k-Means Clustering Tutorial in RapidMiner



In this tutorial, I will attempt to demonstrate how to use the k-Means clustering method in RapidMiner. The dataset I am using is contained in the Zip_Jobs folder (contains multiple files) used for our March 5th Big Data lecture.
  •  Save the files you want to use in a folder on your computer.
  • Open RapidMiner and click “New Process”. On the left hand pane of your screen, there should be a tab that says "Operators"- this is where you can search and find all of the operators for RapidMiner and its extensions. By searching the Operators tab for "process documents", you should get an output like this (you can double click on the images below to enlarge them):

You should see several Process Documents operators, but the one that I will use for this tutorial is the “Process Documents from Files” operator because it allows you to generate word vectors from text stored in multiple files. Drag this operator into the Main Process frame.


  • Click “Edit List” beside the “text directories” label in the right-hand pane in order to choose the files that you wish to run the clustering algorithm on.



You can choose whatever name you wish to name your directory.


Click the folder icon to select the folder that contains your data files. Click “Apply Changes”.


  • Double click the “Process Documents from Files” operator to get inside it. This is where you will link operators together to take the (in my case) HTML documents and break them down into their word components (please note that you can run the k-Means clustering algorithm with a different type of file). As highlighted in my previous tutorial, there are several operators designed specifically to break down text documents. Before you get to that point, you need to strip the HTML code out of the documents in order to get to their word components. Insert the “Extract Content” operator into the Main Process frame by searching for it in the Operators tab.

  • The next thing that you want to do to your files is tokenize them. Tokenization creates a "bag of words" contained in your documents. Search for the "Tokenize" operator and drag it into the "Process Documents from Files" process after the “Extract Content” operator. The only other operator that is necessary for appropriate clustering of documents is the “Transform Cases” operator; without it, documents that contain the same words in different cases would be treated as more distant (less similar) than they really are. You should get a process similar to this:

  • Now for the clustering! Click out of the “Process Documents from Files” process. Search for “Clustering” in the Operators Tab:

As you can see, there are several clustering operators, and most of them work about the same. For this tutorial, I chose to demonstrate k-Means clustering since that is the clustering type we have discussed most in class. In RapidMiner, you have the option to choose three different variants of the k-Means clustering operator. The first one is the standard k-Means, in which similarity between objects is based on a measure of the distance between them. The k-Means (Kernel) operator uses kernels to estimate the distance between objects and clusters. The k-Means (fast) operator uses the triangle inequality to accelerate the k-Means algorithm. For this example, use the standard k-Means algorithm by dragging it into the Main Process frame after the “Process Documents from Files” operator. I set the k value equal to 4 (since I have 19 files, this should give me roughly 5 files in each cluster) and max runs to about 20.


  • Connect the output nodes from the “Clustering” operator to the res nodes of the Main Process frame. Click the “Play” button at the top to run. Your ExampleSet output should look like this:



 By clicking the folder view under the “Cluster Model” of the output, you can see which documents got placed into each cluster.



If you do not get this output, make sure that all of your nodes are connected correctly and also to the right type. Some errors are because your output at one node does not match the type expected at the input of the next node of an operator. If you are still having trouble, please comment or check out the Rapid-i support forum.
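If you prefer working in code rather than the RapidMiner GUI, the same pipeline (strip HTML, tokenize, lowercase, vectorize, cluster with k-means) can be sketched in a few lines of Python with scikit-learn. The folder path and k value below mirror this tutorial, but everything else is a placeholder, not an exact equivalent of the RapidMiner operators:

```python
# Sketch: the same document-clustering pipeline in Python - read files,
# strip HTML tags, lowercase/tokenize via TF-IDF, then run k-means (k=4).
import glob, re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

paths = glob.glob("Zip_Jobs/*")                    # folder of lecture files
docs = []
for path in paths:
    with open(path, encoding="utf-8", errors="ignore") as f:
        text = re.sub(r"<[^>]+>", " ", f.read())   # crude HTML tag stripping
        docs.append(text)

vectors = TfidfVectorizer(lowercase=True, stop_words="english").fit_transform(docs)
kmeans = KMeans(n_clusters=4, n_init=20, random_state=0).fit(vectors)

for path, label in zip(paths, kmeans.labels_):
    print(label, path)                             # cluster id for each document
```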

Tutorial: Web Scraping on Google Spreadsheet

Web scraping is a very useful technique for collecting information from different URLs on the same webpage.
Web crawling in RapidMiner cannot handle every kind of rule, so I use a Google spreadsheet to make it easier to collect the information and then import it into RapidMiner for the next step. Here is a tutorial I made.

Tutorial: Web Scraping on Google Spreadsheet
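Since the spreadsheet itself is not embedded here, below is a rough Python equivalent of that kind of scraping step: collect one piece of information from each URL in a list. The URLs and the extracted field are placeholders, not the ones from my spreadsheet:

```python
# Sketch: fetch a list of pages and pull one piece of information from each,
# roughly what the spreadsheet-based scraping step does. URLs are placeholders.
import requests
from bs4 import BeautifulSoup

urls = [
    "http://example.com/page1",
    "http://example.com/page2",
]

rows = []
for url in urls:
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    title = soup.find("title")                      # swap in any element you need
    rows.append((url, title.get_text(strip=True) if title else ""))

for url, title in rows:
    print(url, "->", title)                         # ready to export for RapidMiner
```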

Tutorial 4-How to Create a Polyviz widget using Orange

Polyviz widget in Orange
Polyviz is a visualization technique in Orange in which data points are related to anchors at value-dependent positions. A visual comparison of various anchors or attributes can be made, pinpointing each data point with respect to its attributes. It can be applied to electoral analysis, analyzing the spread of epidemics, sales distribution of goods and so on. In this tutorial the dataset shows different age groups and their type of prescription lens: whether they are astigmatic (astigmatism is caused by an irregular shape of the cornea), the tear rate of the lens, and finally the type of lenses used. If a patient is astigmatic, a specific type of lens known as a toric lens must be used. I have used screenshots to develop this tutorial, as my previous tutorial shows how to bring the dataset into Orange, extend it to the Data Table and bring widgets into the scheme. This tutorial follows similar lines but uses the Polyviz widget under the Visualize category. The first picture is of the dataset used, generated by the Data Table widget. It shows how the data is categorized (the picture shows only the first few lines of the dataset, which continues along similar lines).
 
The second picture shows all the widgets used in the scheme of this project. The scatter plot and distributions widgets are only used for inference; the Polyviz widget can be built without any of them.
 
Once I give the data signal as input to the Polyviz widget, it allows us to visualize the data in many interesting ways. It assigns an attribute to each side of a polygon and automatically creates a scale and plots the data points. As shown in the picture below, age, astigmatism and tear rate are the attributes compared with the type of lens, color coded for easy comprehension. Various combinations, such as *young and not astigmatic* or *myope in the pre-presbyopic age band*, can be analyzed with respect to lenses. The data points can be accessed by clicking on them, and from the Polyviz widget they can be visualized in whatever format we would like, subject to availability in the Orange widget tab.
Various combinations can be visualized by adding and removing parameters in the dialog box on the right-hand side of the Polyviz window. Polygons with any number of sides can be created by adding or removing parameters.


The intuitions gained from the Polyviz widget can be tested using various other widgets in Orange. Interesting correlations can be made with the given dataset using the Polyviz widget.

M2M - Future of Code



Machine-to-Machine Technology





                How far can big data go? What is next for big data analytics? According to GCN, the next horizon for big data may be machine-to-machine (M2M) technology. As coding for big data advances, Oracle is now considering big data “an ecosystem of solutions” that will incorporate embedded devices to do real-time analysis of events and information coming in from the “Internet of Things,” according to the Dr. Dobbs website. There is a huge amount of data being generated by all of the sensors and scanners we have today, and each piece of it is useless unless taken in context with other sparse data. Each strand of data may only be a few kilobytes in size, but when put together with other sensor readings, they can create a much fuller picture. Applications are needed not only to enable devices to talk with each other using M2M, but also to collect all the data and make sense of it.

                The future of sparse data could even include what some consider thin data. Thin data could come from simple sensors and threshold monitors built into furniture and ancillary office equipment. Viewing all the sensors on a floor over time might show the impact of changing the temperature in the space, or of moving the coffee machine. You could look at the actual usage data of fixtures like doors and lavatories. There is huge potential for inferential data mining here. To take thin data to the next level, consider self-replicating nanotechnology embedded in plant seeds: the nano agent would become part of the plant and relay state information as the plant grows, which would allow massive crop harvesters to know if and when the plants are in distress. Other areas of interest for thin data include monitoring traffic on bridges and roadways, and a variety of weather monitors or tsunami prediction systems.

                Machina Research, a trade group for mobile device makers, predicts that within the next eight years, the number of connected devices using M2M will top 50 billion worldwide. The connected-device population will include everything from power and gas meters that automatically report usage data, to wearable heart monitors that automatically tell a doctor when a patient needs to come in for a checkup, to traffic monitors and cars that will, by 2014, automatically report their position and condition to authorities in the event of an accident. One of the most popular M2M setups has been to create a central hub that can be used by wireless and wired signals. The sensors in the field record an event of significance, be it a temperature change, inventory leaving a specific area or even doors opening. The central hub then sends that information to a central location, where an operator might turn down the AC, order more toner cartridges or tell security about suspicious activity. The future model for M2M would eliminate the central hub and the human interaction: the devices would communicate with each other and work out the problems on their own. This smart technology would decrease the logistics downtime associated with replacing an ink cartridge in a printer. Once the toner reached a low threshold, the printer would send a request to the toner supplier and a replacement would immediately be shipped; once the toner was received, it could be installed. This turnaround time would be drastically better than having the printer fail because of low toner, then ordering a cartridge, waiting on shipping, and finally replacing the toner.
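As a toy sketch of that threshold-driven workflow, with an entirely hypothetical supplier endpoint and device name, just to illustrate the idea of a device acting without a central hub or human in the loop:

```python
# Sketch: a printer monitors its own toner level and places a reorder once it
# crosses a threshold - no human or central hub involved. The endpoint URL
# and payload format are hypothetical.
import requests

REORDER_ENDPOINT = "https://supplier.example.com/api/orders"  # hypothetical
TONER_THRESHOLD = 0.15  # reorder when 15% of toner remains

def check_and_reorder(printer_id: str, toner_level: float) -> None:
    if toner_level <= TONER_THRESHOLD:
        payload = {"device": printer_id, "item": "toner-cartridge", "qty": 1}
        response = requests.post(REORDER_ENDPOINT, json=payload, timeout=10)
        print(f"{printer_id}: reorder placed (status {response.status_code})")
    else:
        print(f"{printer_id}: toner at {toner_level:.0%}, no action needed")

check_and_reorder("printer-3f", 0.12)
```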

                Humans won’t be completely removed from the equation. They will still need to be in the chain to oversee the different processes, but they will act more as a second pair of eyes and less as a direct supervisor. Humans will let the machines do the work, and will only get involved when a machine reports a problem, like a communications failure. More application software development will be needed in the future to connect those 50 billion devices. Another place to learn more about M2M development is the Eclipse Foundation.