Friday, April 5, 2013

Use OpenHeatMap to introduce the world to your data.




             I recently came across a great visualization tool, similar to the heat maps produced in Google Docs, called OpenHeatMap. It was created by Pete Warden, a former employee of Apple Inc., and it is free, easy-to-use, open source software. If you would like to try it out, please go to the following website:

            There are a few examples on the website of how the program works. The one pictured below, created by Pete Warden, shows how the political views of each state have changed over time. It is a video that starts with data from 1868 and progresses through the political standpoint of each state up to recent history. It shows how much political views have changed across the nation, which I found very interesting. I have included a picture of what the start of this video looks like, but if you would like to see the actual video, please click this link:  OpenHeatMap



            Another really interesting map already on the site shows the unemployment rate for each county across the USA. This map demonstrates several other features of the program, such as the ability to zoom out to see the entire country, as in the top picture below. The next photo shows the map zoomed in slightly on the state of Alabama. I have placed the mouse cursor over Madison County, and as you can see at the bottom, it displays an unemployment rate of 2.7%. This type of map could be extremely useful for representing nationwide survey data, to see whether certain states or even cities have needs that others do not.



              If you would like to check this map out, please click the link below and move your mouse over the area you are interested in.


I also found a great website that gives step-by-step directions on how to use OpenHeatMap; I have provided the link to the instructions below this paragraph. In approximately 60 seconds or less you should be able to set up your own map using your own data. There are several programs out there that offer similar features, but very few are as easy to operate as this one.

“HOW TO VISUALIZE YOUR DATA ON A MAP WITH OPENHEATMAP”

           Most of the reviews I read about this program said it is great, though a few complained about occasional freezing; they also noted that Mr. Warden responds quickly to emails about issues with the program. I personally didn't have any problems while using it, although some maps with huge amounts of data, such as the unemployment rate map, took a little while to load. I hope you find this program as useful as I have.
 

 



Rules Generation by Partial Decision Trees


There is an alternative approach to rule induction that avoids global optimization but nevertheless produces accurate, compact rule sets. The method combines the divide-and-conquer strategy for decision tree learning with the separate-and-conquer one for rule learning. It adopts the separate-and-conquer strategy in that it builds a rule, removes the instances it covers, and continues creating rules recursively for the remaining instances until none are left.
However, it differs from the standard approach in the way that each rule is created. In essence, to make a single rule, a pruned decision tree is built for the current set of instances, the leaf with the largest coverage is made into a rule, and the tree is discarded.
The prospect of repeatedly building decision trees only to discard most of them is not as bizarre as it first seems. Using a pruned tree to obtain a rule, instead of pruning a rule incrementally by adding conjunctions one at a time, avoids a tendency to overprune, which is a characteristic problem of the basic separate-and-conquer rule learner. Using the separate-and-conquer methodology in conjunction with decision trees adds flexibility and speed. It is indeed wasteful to build a full decision tree just to obtain a single rule, but the process can be accelerated significantly without sacrificing the advantages. The key idea is to build a partial decision tree instead of a fully explored one. A partial decision tree is an ordinary decision tree that contains branches to undefined subtrees. To generate such a tree, the construction and pruning operations are integrated in order to find a "stable" subtree that can be simplified no further. Once this subtree has been found, tree building ceases and a single rule is read off.

The tree-building algorithm is summarized in the figure below: it splits a set of instances recursively into a partial tree. The first step chooses a test and divides the instances into subsets accordingly. The choice is made using the same information-gain heuristic that is normally used for building decision trees. Then the subsets are expanded in increasing order of their average entropy. The reason for this is that the later subsets will most likely not end up being expanded, and a subset with low average entropy is more likely to result in a small subtree and therefore produce a more general rule. This proceeds recursively until a subset is expanded into a leaf, and then continues further by backtracking. But as soon as an internal node appears that has all its children expanded into leaves, the algorithm checks whether that node is better replaced by a single leaf. This is just the standard subtree replacement operation of decision tree pruning. If replacement is performed, the algorithm backtracks in the standard way, exploring siblings of the newly replaced node. However, if during backtracking a node is encountered not all of whose children expanded so far are leaves—and this will happen as soon as a potential subtree replacement is not performed—then the remaining subsets are left unexplored and the corresponding subtrees are left undefined. Due to the recursive structure of the algorithm, this event automatically terminates tree generation.
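The core of the idea above can be sketched in a few lines of Python. This is a much-simplified illustration, not the full algorithm: it omits pruning and backtracking entirely and simply grows one rule by choosing the best information-gain test and then expanding only the lowest-entropy subset, which captures why that expansion order tends to reach a leaf, and hence a rule, quickly. The toy weather data and all function names are invented for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def split(instances, attr):
    """Partition (features, label) pairs by the value of one attribute."""
    subsets = {}
    for features, label in instances:
        subsets.setdefault(features[attr], []).append((features, label))
    return subsets

def one_rule(instances, attrs):
    """Grow a single rule: repeatedly choose the best test, then follow
    only the lowest-entropy branch until it collapses into a leaf."""
    conditions = []
    while True:
        labels = [label for _, label in instances]
        if len(set(labels)) <= 1 or not attrs:
            return conditions, Counter(labels).most_common(1)[0][0]

        def gain(attr):  # the usual information-gain heuristic
            n = len(instances)
            return entropy(labels) - sum(
                len(sub) / n * entropy([l for _, l in sub])
                for sub in split(instances, attr).values())

        best = max(attrs, key=gain)
        # expand the lowest-entropy subset first: it is the most likely to
        # become a leaf quickly and therefore yield a general rule
        value, subset = min(split(instances, best).items(),
                            key=lambda kv: entropy([l for _, l in kv[1]]))
        conditions.append((best, value))
        instances, attrs = subset, [a for a in attrs if a != best]

# toy weather data: does the person play outside?
data = [({'outlook': 'sunny', 'windy': 'no'},  'yes'),
        ({'outlook': 'sunny', 'windy': 'yes'}, 'no'),
        ({'outlook': 'rain',  'windy': 'no'},  'yes'),
        ({'outlook': 'rain',  'windy': 'yes'}, 'no')]
conditions, prediction = one_rule(data, ['outlook', 'windy'])
```

On this toy data the rule read off is "if windy = no then yes": the windy attribute has the highest gain, and its "no" branch is already pure. In the real method, the instances this rule covers would then be removed and the process repeated.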

References:
Witten, I. H., Frank, E., & Hall, M. A. (2011). Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann Series in Data Management Systems.

Scatter Plots and Singular Value Decomposition (SVD)


Singular value decomposition (SVD) can be looked at from three mutually compatible points of view. On the one hand, we can see it as a method for transforming correlated variables into a set of uncorrelated ones that better expose the various relationships among the original data items. At the same time, SVD is a method for identifying and ordering the dimensions along which data points exhibit the most variation. This ties in to the third way of viewing SVD, which is that once we have identified where the most variation is, it's possible to find the best approximation of the original data points using fewer dimensions. Hence, SVD can be seen as a method for data reduction. (Baker, 2005)
These are the basic ideas behind SVD: taking a high-dimensional, highly variable set of data points and reducing it to a lower-dimensional space that exposes the substructure of the original data more clearly and orders it from most variation to least. What makes SVD practical for NLP (Natural Language Processing) applications is that you can simply ignore variation below a particular threshold to massively reduce your data, but be assured that the main relationships of interest have been preserved. (Baker, 2005)
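The data-reduction idea can be seen in a few lines of NumPy. This sketch uses a made-up 4x4 term-document matrix; the decomposition and the rank-k truncation are the standard operations, and dropping the smaller singular values is exactly the "ignore variation below a threshold" step described above.

```python
import numpy as np

# toy term-document matrix (rows: terms, columns: documents);
# a real NLP application would have thousands of rows
A = np.array([[3., 1., 0., 0.],
              [2., 2., 0., 0.],
              [0., 0., 1., 3.],
              [0., 0., 2., 2.]])

# full decomposition: A = U @ diag(s) @ Vt, singular values sorted largest first
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# keep only the k strongest dimensions (drop variation below the threshold)
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# A_k is the best rank-k approximation of A in the least-squares sense
err = np.linalg.norm(A - A_k)
```

The reconstruction error `err` measures how much of the original variation was discarded; the block structure of the toy matrix (two topics) survives the reduction to two dimensions.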




Experiment
For this experiment, the data set was taken from the tripadvisor.com website. We performed Singular Value Decomposition (SVD) on the data set and retrieved the first 20 components for analysis. We also created a scree plot to see how many singular values are useful for the analysis. We can tell by looking at the following scree plot that the first component explains almost 13% of the variance, the second explains slightly more than 6%, and together these two components explain about 19% of the total variance.
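The percentages read off the scree plot come from the singular values themselves: each component's share of the variance is its squared singular value over the sum of all squared singular values. A short sketch, using random data in place of the TripAdvisor matrix (which is not reproduced here):

```python
import numpy as np

# stand-in for the centered document-term matrix from the reviews
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
s = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)

# proportion of total variance explained by each component
explained = s**2 / np.sum(s**2)

# a scree plot is just these proportions against the component index;
# one looks for the "elbow" where the curve flattens out
for i, p in enumerate(explained[:5], start=1):
    print(f"component {i}: {p:.1%}")
```

In the experiment above, this computation is what yields the roughly 13% and 6% figures for the first two components.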



By using these two components, we created the following scatter plots to visualize the relationships between the words. Points that appear close to each other are related to each other.



The two plots are exactly the same, except that in the second one we showed the labels for each point and assigned colors to the points with negative or positive meaning. For example, we assigned green to words such as clean, close, love, and good, and purple to words such as bad, dirty, far, never, expense, and so on. Words that we could not assign to either group, such as center, Russian, breakfast, and so on, we left black. The nearest points to Istanbul are mosque, love, neighborhood, enjoy, convenient, quiet, and complain. Similarly, the nearest points to New York are comfort, clean, nice, shower, help, small, locate, and staff, and to Moscow, better, close, food, expense, never, old, busy, bar, and serve. In the circles we drew around each city there are some green, purple, and black words: the green words are positive, the purple ones negative, and the black ones neutral.
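A small sketch of how such a plot is assembled: each word's 2-D position comes from the first two SVD components, and the colors are hand-assigned sentiment labels. The word list and the tiny word-by-review matrix here are invented stand-ins for the actual review data.

```python
import numpy as np

# toy word-by-review count matrix (real data: TripAdvisor reviews)
words = ['clean', 'love', 'good', 'dirty', 'bad', 'breakfast']
A = np.array([[2., 0., 1., 3.],
              [1., 0., 2., 2.],
              [2., 1., 1., 3.],
              [0., 3., 0., 1.],
              [0., 2., 1., 0.],
              [1., 1., 1., 1.]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
coords = U[:, :2] * s[:2]      # each word's position in the 2-D plot

# color words by hand-labelled sentiment, as in the plots above
positive = {'clean', 'love', 'good'}
negative = {'dirty', 'bad'}
colors = ['green' if w in positive else
          'purple' if w in negative else 'black' for w in words]

# with matplotlib one would now call
#   plt.scatter(coords[:, 0], coords[:, 1], c=colors)
# and annotate each point with its word label
```

Words with similar usage patterns across the reviews end up with similar coordinates, which is why related words cluster together in the plots.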


References:
Baker, K. (2005, March 29). Singular Value Decomposition Tutorial.

She got a Big Data, so I call her Big Data




Big Data plays a huge role in health care. With technological advances, it will soon be the norm for people to have wearable or internal sensors gathering data on our biological processes.

“It is likely to happen even before we figure out the etiquette and laws around sharing this knowledge.”
Companies such as Nike, with the Nike+ FuelBand, already track people's daily activity, from steps taken to calories burned. Their focus is to help people develop an exercise routine, and weight loss wouldn't hurt either.

Another company, MC10, will offer 'stretchable electronics' for clothing, as temporary tattoos, or installed within the body. The company says they will be capable of measuring heart rate, brain activity, body temperature, and hydration levels. Much like MC10, another company, Proteus, will launch its "Digital Health Feedback System", which will include microchips in pill form that capture vital data, powered by your own stomach fluids.

“Ultimately, we see ourselves as a part of the healthcare ecosystem. Data will need to be shared seamlessly between customers, providers, and payers in order to reduce healthcare costs and simultaneously deliver the best possible care.” – Amar Kendale, MC10’s VP of market strategy and development

With all these companies utilizing big data methodologies and concepts, it's exciting to see how healthcare will progress in the future. The only thing that puts me off in the grand scheme of things is how 'deterministic' the world is becoming.

To each his own, but I guess in a few years, we’ll probably know exactly what those are.

Article: http://bits.blogs.nytimes.com/2012/09/07/big-data-in-your-blood/