Singular
value decomposition (SVD) can be looked at from three mutually compatible
points of view. On the one hand, we can see it as a method for transforming
correlated variables into a set of uncorrelated ones that better expose the
various relationships among the original data items. At the same time, SVD is a
method for identifying and ordering the dimensions along which data points
exhibit the most variation. This ties in to the third way of viewing SVD, which
is that once we have identified where the most variation is, it's possible to find
the best approximation of the original data points using fewer dimensions.
Hence, SVD can be seen as a method for data reduction. (Baker,
2005)
These
are the basic ideas behind SVD: taking a high dimensional, highly variable set
of data points and reducing it to a lower dimensional space that exposes the
substructure of the original data more clearly and orders it from most
variation to the least. What makes SVD practical for NLP (Natural Language
Processing) applications is that you can simply ignore variation below a
particular threshold to massively reduce your data but be assured that the
main relationships of interest have been preserved. (Baker,
2005)
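To make the data-reduction idea concrete, here is a minimal sketch in Python (the tiny hand-made term-document matrix and the use of plain NumPy are assumptions for illustration, not part of the post): compute the full SVD, keep only the k largest singular values, and rebuild a rank-k approximation of the matrix.

# Minimal sketch; the toy counts below are purely illustrative.
import numpy as np

# Rows = terms, columns = documents (hypothetical counts).
X = np.array([
    [3, 0, 1, 0],
    [0, 2, 0, 1],
    [2, 1, 0, 0],
    [0, 0, 3, 2],
], dtype=float)

# Full SVD: X = U * diag(s) * Vt, singular values sorted largest first.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep only the k largest singular values ("ignore variation below a threshold").
k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print("singular values:", s)
print("rank-%d approximation:" % k)
print(X_k.round(2))

The rank-k matrix X_k is the best approximation (in the least-squares sense) of the original data using only k dimensions, which is exactly the "data reduction" view described above.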
Experiment
For this experiment, the data set was taken from the tripadvisor.com website. We
performed Singular Value Decomposition (SVD) on the data set and retrieved the
first 20 components for analysis. We also created a scree plot to see how many
singular values are useful for the analysis. Looking at the following scree
plot, we can tell that the first component explains almost 13% of the total
variance, the second one explains slightly more than 6%, and together these two
components explain about 19%.
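A scree plot of this kind could be produced along the following lines. This is only a sketch: the actual TripAdvisor term matrix is not included in the post, so a random stand-in matrix is used, and the percentages are assumed to come from squared singular values as a share of total variance (the usual convention).

# Sketch of a scree plot for the first 20 components.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = rng.random((200, 50))            # stand-in for the word-by-review matrix
X = X - X.mean(axis=0)               # center before looking at variance

U, s, Vt = np.linalg.svd(X, full_matrices=False)
explained = s**2 / np.sum(s**2)      # proportion of variance per component

plt.plot(range(1, 21), explained[:20] * 100, marker="o")
plt.xlabel("Component")
plt.ylabel("Explained variance (%)")
plt.title("Scree plot (first 20 components)")
plt.show()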
Using these two components, we created the following scatter plots to visualize
the relationships between the words. Points that appear close to each other are
related to each other.
The two plots are exactly the same, except that in the second one we showed the
labels for each point and assigned colors to the points that carry positive or
negative meaning. For example, we assigned green to words such as clean, close,
love, and good, and purple to words such as bad, dirty, far, never, expense,
and so on. Words that we could not assign to either group, such as center,
Russian, breakfast, and so on, were left black. The nearest points to Istanbul
are mosque, love, neighborhood, enjoy, convenient, quiet, and complain.
Similarly, the nearest points to New York are comfort, clean, nice, shower,
help, small, locate, and staff, and to Moscow, better, close, food, expense,
never, old, busy, bar, and serve. In the circles that we drew around each city,
there are some green, purple, and black words: the green ones are positive, the
purple ones are negative, and the black ones are neutral words/variables.
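A labeled scatter plot of this kind could be drawn roughly as follows. Again this is only a sketch: the coordinates are a random stand-in (in practice they would be the first two SVD components of the word matrix), and the positive/negative word lists are just the examples mentioned above.

# Sketch of the labeled, color-coded word scatter plot.
import numpy as np
import matplotlib.pyplot as plt

words = ["clean", "good", "love", "bad", "dirty", "far", "center", "breakfast"]
positive = {"clean", "good", "love", "close", "enjoy", "nice"}
negative = {"bad", "dirty", "far", "never", "expense", "old"}

rng = np.random.default_rng(1)
coords = rng.normal(size=(len(words), 2))   # stand-in for the first two components

for (x, y), w in zip(coords, words):
    color = "green" if w in positive else "purple" if w in negative else "black"
    plt.scatter(x, y, color=color)
    plt.annotate(w, (x, y), textcoords="offset points", xytext=(3, 3))

plt.xlabel("Component 1")
plt.ylabel("Component 2")
plt.title("Words on the first two SVD components")
plt.show()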
1. Baker, Kirk. Singular Value Decomposition Tutorial. March 29, 2005.