Analytics and Visualization of Big Data: Scatter Plots and Singular Value Decomposition (SVD)

Singular value decomposition (SVD) can be looked at from three mutually compatible points of view. On the one hand, we can see it as a method for transforming correlated variables into a set of uncorrelated ones that better expose the various relationships among the original data items. At the same time, SVD is a method for identifying and ordering the dimensions along which data points exhibit the most variation. This ties in to the third way of viewing SVD, which is that once we have identified where the most variation is, it's possible to find the best approximation of the original data points using fewer dimensions. Hence, SVD can be seen as a method for data reduction. (Baker, 2005)
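The first point of view, turning correlated variables into uncorrelated ones, can be illustrated with a small sketch. The two variables below are synthetic and purely hypothetical (they are not taken from any real data set), and the data is column-centered before the decomposition, which is an assumption of this example rather than something the tutorial prescribes.

```python
import numpy as np

# Hypothetical toy data: y is built to be strongly correlated with x.
rng = np.random.default_rng(42)
x = rng.normal(size=200)
y = 0.8 * x + 0.2 * rng.normal(size=200)
data = np.column_stack([x, y])

# Center each column so the decomposition describes variation around the mean.
centered = data - data.mean(axis=0)

# Thin SVD: centered = u @ diag(s) @ vt
u, s, vt = np.linalg.svd(centered, full_matrices=False)

# Coordinates of every data point along the singular directions.
scores = centered @ vt.T

corr_before = np.corrcoef(centered, rowvar=False)[0, 1]  # close to 1
cov_after = np.cov(scores, rowvar=False)[0, 1]           # numerically zero

print(f"correlation between original variables: {corr_before:.3f}")
print(f"covariance between transformed variables: {cov_after:.2e}")
```

The transformed coordinates are uncorrelated because the columns of `u` are orthogonal, and the singular values in `s` come back sorted from largest to smallest, which is the ordering by variation described above.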

These are the basic ideas behind SVD: taking a high-dimensional, highly variable set of data points and reducing it to a lower-dimensional space that exposes the substructure of the original data more clearly and orders it from the most variation to the least. What makes SVD practical for natural language processing (NLP) applications is that you can simply ignore variation below a particular threshold to massively reduce your data while being assured that the main relationships of interest have been preserved. (Baker, 2005)
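Ignoring variation below a threshold amounts to keeping only the largest singular values. Here is a minimal sketch of that idea, assuming we simply keep the top k components. The function name `truncate_svd` and the toy matrix are hypothetical and only serve as an illustration.

```python
import numpy as np

def truncate_svd(matrix, k):
    """Rebuild `matrix` from its k largest singular values only."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k] @ np.diag(s[:k]) @ vt[:k, :]

# Hypothetical toy data: a rank-2 matrix plus a small amount of noise.
rng = np.random.default_rng(0)
base = rng.normal(size=(6, 2)) @ rng.normal(size=(2, 5))
noisy = base + 0.01 * rng.normal(size=(6, 5))

approx = truncate_svd(noisy, 2)

# Relative size of what was thrown away (Frobenius norm).
relative_error = np.linalg.norm(noisy - approx) / np.linalg.norm(noisy)
print(f"relative error of the rank-2 approximation: {relative_error:.4f}")
```

Because the toy matrix has essentially two-dimensional structure, dropping every component after the second discards only the noise, so the relative error stays small.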

Experiment

For this experiment, the data set was taken from the tripadvisor.com website. We performed singular value decomposition (SVD) on the data set and retrieved the first 20 components for analysis. We also created a scree plot to see how many singular values are useful for the analysis. The scree plot shows that the first component explains almost 13% of the total variance and the second slightly more than 6%, so together these two components explain about 19% of the total variance.
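The percentages read off a scree plot can be computed directly from the singular values. The sketch below assumes the common convention that each component's share of the variance is its squared singular value divided by the sum of all squared singular values, computed on column-centered data. The post does not say which tool or convention produced its numbers, and the count matrix here is random, hypothetical data rather than the TripAdvisor data set.

```python
import numpy as np

def explained_variance_ratio(matrix):
    """Share of total variance carried by each singular component."""
    centered = matrix - matrix.mean(axis=0)
    s = np.linalg.svd(centered, compute_uv=False)
    return s ** 2 / np.sum(s ** 2)

# Hypothetical stand-in for a document-by-word count matrix.
rng = np.random.default_rng(1)
counts = rng.poisson(lam=2.0, size=(50, 30)).astype(float)

ratios = explained_variance_ratio(counts)
cumulative = np.cumsum(ratios)

# Print the first few components, as one would read them off a scree plot.
for i in range(3):
    print(f"component {i + 1}: {ratios[i]:.1%} (cumulative {cumulative[i]:.1%})")
```

Plotting `ratios` against the component number gives the scree plot itself, and `cumulative[1]` is the combined share of the first two components.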

Using these two components, we created two scatter plots to visualize the relationships between the words. Points that appear close to each other are related to each other.

The two plots show exactly the same points. In the second one, we added a label to each point and assigned colors to the words that have a positive or negative meaning. For example, we assigned green to words such as clean, close, love, and good, and purple to words such as bad, dirty, far, never, expense, and so on. Words that we could not assign to either group, such as center, Russian, and breakfast, were left black. The nearest points to Istanbul are mosque, love, neighborhood, enjoy, convenient, quiet, and complain. Similarly, the nearest points to New York are comfort, clean, nice, shower, help, small, locate, and staff, and the nearest points to Moscow are better, close, food, expense, never, old, busy, bar, and serve. The circles that we drew around each city contain some green, purple, and black words. The green words are positive, the purple words are negative, and the black words are neutral.
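Reading the nearest points off a scatter plot can also be done numerically. The following is a minimal sketch, assuming each word is placed at its coordinates on the first two SVD components and that closeness means Euclidean distance in that plane. The helper `nearest_words` and the tiny word-by-review count matrix are hypothetical and are not the data behind the plots.

```python
import numpy as np

def nearest_words(coords, words, target, n=3):
    """Return the n words whose 2-D coordinates lie closest to `target`."""
    idx = words.index(target)
    dists = np.linalg.norm(coords - coords[idx], axis=1)
    order = np.argsort(dists)
    return [words[i] for i in order if i != idx][:n]

# Hypothetical counts: rows are words, columns are reviews.
words = ["istanbul", "mosque", "clean", "dirty", "moscow"]
term_doc = np.array([
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 2, 1],
    [0, 0, 3, 2],
], dtype=float)

u, s, vt = np.linalg.svd(term_doc, full_matrices=False)

# Position of each word on the first two components (the scatter plot axes).
coords = u[:, :2] * s[:2]

neighbors = nearest_words(coords, words, "istanbul", n=1)
print(neighbors)
```

In this toy matrix, istanbul and mosque occur in the same reviews with almost the same counts, so they land on nearly the same spot in the two-component plane and mosque comes back as the nearest word.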

1. Kirk Baker, Singular Value Decomposition Tutorial, March 29, 2005.

Friday, April 5, 2013
