Analytics and Visualization of Big Data: Distance-based clusterings

Clustering is an important unsupervised learning method. The main idea is to group data points (also called feature vectors or observations) into clusters (Jain, Murty & Flynn, 1999) and so obtain a structure in a collection of unlabelled data, based on a similarity criterion. In contrast to classification, where the class labels are known in advance, clustering assigns data points to groups whose members have similar properties in some way (Moore, 2001).

The main similarity criterion is distance. Data points that are closer to each other than to the remaining points are placed in the same cluster. This is called distance-based clustering. The distance between two points is given by the Euclidean distance:

$$d(x, y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2}$$

where x = (x_1, x_2) and y = (y_1, y_2) are any two data points in two-dimensional space.
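As a small illustration of the Euclidean distance, here is a minimal Python sketch. The function name euclidean_distance is my own choice for this example, and the function accepts points with any number of coordinates, not only two.

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two points given as equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of coordinates")
    # Square the coordinate-wise differences, sum them, take the square root.
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Two points in the plane that form a 3-4-5 right triangle with the axes.
print(euclidean_distance((0.0, 0.0), (3.0, 4.0)))  # 5.0
```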

The first distance-based clustering procedure is Hierarchical Clustering.

Hierarchical clustering produces a set of nested clusters. The agglomerative variant based on Euclidean distance starts with every data point as its own cluster and works by merging the two closest clusters at a time.
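The merging procedure can be sketched in a few lines of Python. This is only an illustration under some assumptions of mine: the distance between two clusters is taken as the smallest distance between their members (single linkage; the post does not say which linkage is meant), merging stops at a requested number of clusters, and the names single_linkage and agglomerative are hypothetical.

```python
import math

def single_linkage(a, b):
    """Smallest Euclidean distance between a point of a and a point of b."""
    return min(math.dist(p, q) for p in a for q in b)

def agglomerative(points, num_clusters):
    """Merge the two closest clusters until num_clusters remain."""
    clusters = [[p] for p in points]  # start: every point is its own cluster
    merges = []                       # merge history, i.e. the nested structure
    while len(clusters) > num_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: single_linkage(clusters[ij[0]],
                                                        clusters[ij[1]]))
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        del clusters[j]               # j > i, so index i stays valid
        clusters[i] = merged
    return clusters, merges

points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6)]
clusters, merges = agglomerative(points, 2)
print(clusters)
```

With these five points the three points near the origin end up in one cluster and the two points near (5, 5) in the other. The merges list records which two clusters were joined at each step, which is the information a dendrogram would display.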

Another important distance-based clustering algorithm is K-Means Clustering.

K-Means is a simple unsupervised clustering algorithm used in data mining. The general idea is to partition a set of observations into k clusters. The partition is driven by the cluster means: each observation is assigned to the cluster with the nearest mean (centroid) (MacQueen, 1967).
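To make the idea concrete, here is a minimal Python sketch of the common batch form of K-Means (often called Lloyd's iteration), which is not necessarily the exact procedure in MacQueen's paper. The assumptions are mine: the first k points serve as initial centroids, a cluster that becomes empty keeps its previous centroid, and the names mean and k_means are hypothetical.

```python
import math

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, k, max_iter=100):
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    centroids = [tuple(p) for p in points[:k]]  # assumption: first k points as seeds
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: recompute the means; an empty cluster keeps its old centroid.
        new_centroids = [mean(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:  # no centroid moved, so we have converged
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centroids, clusters = k_means(points, 2)
print(centroids)
print(clusters)
```

On this toy data the three points near (1, 1) form one cluster and the three points near (9, 9) form the other. In practice the result depends on the initial centroids, so real implementations usually try several random starts.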

Large differences in the sizes or densities of clusters, empty clusters, and outliers can all be a problem for this algorithm (Kumar, 2002).
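The outlier problem is easy to see with a little arithmetic, because a centroid is just a mean and a mean is sensitive to extreme values. The numbers below are made up for illustration and the helper name centroid is hypothetical.

```python
def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

cluster = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0)]
outlier = (50.0, 50.0)

print(centroid(cluster))              # (1.5, 1.5)
print(centroid(cluster + [outlier]))  # (11.2, 11.2)
```

A single far-away point drags the centroid from (1.5, 1.5) to (11.2, 11.2), well outside the four points it is supposed to represent.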

1- MacQueen, J. B. (1967). Some Methods for Classification and Analysis of Multivariate Observations. Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1. University of California Press, pp. 281–297.

2- Jain, A. K., Murty, M. N. & Flynn, P. J. (1999). Data Clustering: A Review. ACM Computing Surveys, Vol. 31, No. 3, September 1999.

3- Moore, A. (2001). K-means and Hierarchical Clustering - Tutorial Slides. Carnegie Mellon University, November 16, 2001. http://www-2.cs.cmu.edu/~awm/tutorials/kmeans.html

4- Kumar, V. (2002). Parallel Issues in Data Mining. Lecture Notes, VECPAR 2002.

Thursday, March 14, 2013