Clustering is an important unsupervised learning method. The main idea is to partition data points (also called feature vectors or observations) into groups (Jain, Murty & Flynn, 1999) and so obtain a structure in a collection of unlabelled data, based on a similarity criterion. In contrast to classification, where the groups are defined in advance by labels, clustering assigns data points to groups whose members have similar properties in some way (Moore, 2001).
The most common similarity criterion is distance: data points that are closer to each other than to the remaining points are considered to belong to the same cluster. This is called distance-based clustering.
The distance between two points is given by the Euclidean distance:
$$d(x, y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2}$$
where x = (x_1, x_2) and y = (y_1, y_2) are any two data points in two-dimensional space.
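As a small illustration, the Euclidean distance can be computed in a few lines of Python. This is only a sketch: the function name `euclidean_distance` is chosen here for illustration, and the function accepts points of any equal dimension, not just the two-dimensional case described above.

```python
import math


def euclidean_distance(x, y):
    """Return the Euclidean distance between two points.

    x and y are sequences of coordinates with the same length
    (two coordinates each in the two-dimensional case).
    """
    if len(x) != len(y):
        raise ValueError("points must have the same number of dimensions")
    # Sum the squared coordinate differences, then take the square root.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```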
The first distance-based clustering procedure is Hierarchical Clustering. Hierarchical clustering produces a set of nested clusters. When based on Euclidean distance, it works by merging the two closest clusters at a time.
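The merging procedure can be sketched as follows. This is a minimal, unoptimized illustration rather than a reference implementation. It assumes each point starts as its own cluster and that the distance between two clusters is the smallest distance between any of their members (single linkage); the text above does not fix a linkage rule, so that choice is an assumption. The names `single_linkage` and `agglomerative` are hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import math


def single_linkage(c1, c2):
    """Distance between two clusters: the closest pair of members (assumed rule)."""
    return min(math.dist(p, q) for p in c1 for q in c2)


def agglomerative(points, num_clusters):
    """Merge the two closest clusters repeatedly until num_clusters remain."""
    if num_clusters < 1:
        raise ValueError("num_clusters must be at least 1")
    clusters = [[p] for p in points]  # every point starts as its own cluster
    merges = []                       # record of (cluster_a, cluster_b, distance)
    while len(clusters) > num_clusters:
        best = None
        # Examine every pair of clusters and remember the closest pair.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        # Replace the two merged clusters with their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters, merges


points = [(0, 0), (0, 1), (5, 5), (5, 6)]
clusters, merges = agglomerative(points, num_clusters=2)
print(clusters)
```

The `merges` list records the order in which clusters were joined, which is the nested structure that hierarchical clustering produces.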
Another important distance-based clustering algorithm is K-Means Clustering. K-Means is a simple unsupervised clustering algorithm used in data mining. The general idea is to partition a set of observations into k clusters. The partition is determined by the cluster means: each observation is assigned to the cluster with the nearest mean (centroid) (MacQueen, 1967).
Large differences in cluster size or density, empty clusters, and outliers can cause problems for this algorithm (Kumar, 2002).
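The assignment-to-nearest-mean idea can be sketched with the common batch iteration (often called Lloyd's algorithm), which may differ in detail from MacQueen's original formulation. In this sketch the function name `k_means`, the explicit `centroids` starting values, and the `max_iter` limit are all assumptions made for illustration. An empty cluster simply keeps its previous centroid, which is one possible way to handle the empty-cluster problem mentioned above, not the only one. `math.dist` requires Python 3.8 or later.

```python
import math


def k_means(points, centroids, max_iter=100):
    """Batch k-means: assign points to the nearest centroid, then recompute means."""
    centroids = [tuple(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                mean = tuple(sum(coord) / len(members) for coord in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(old)  # assumed handling of an empty cluster
        if new_centroids == centroids:
            break  # centroids stopped moving, so the assignment is stable
        centroids = new_centroids
    return centroids, clusters


points = [(0, 0), (0, 2), (10, 10), (10, 12)]
centroids, clusters = k_means(points, centroids=[(0, 0), (10, 10)])
print(centroids)  # [(0.0, 1.0), (10.0, 11.0)]
```

The result depends on the starting centroids, so in practice the algorithm is often run several times with different starting values.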
1- MacQueen, J. B. (1967). "Some Methods for Classification and Analysis of Multivariate Observations". Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability, 1. University of California Press. pp. 281–297.
2- Jain, A. K. (Michigan State University), Murty, M. N. (Indian Institute of Science) & Flynn, P. J. (The Ohio State University) (1999). "Data Clustering: A Review". ACM Computing Surveys, Vol. 31, No. 3, September 1999.
3- Moore, A. (2001). "K-means and Hierarchical Clustering - Tutorial Slides". Carnegie Mellon University, November 16th 2001. http://www-2.cs.cmu.edu/~awm/tutorials/kmeans.html
4- Kumar, V. (2002). Lecture Notes. "Parallel Issues in Data Mining", VECPAR 2002.