Analytics and Visualization of Big Data: Basics of Clustering and problems

Definition of a Cluster: A cluster is a set of two or more server nodes dedicated to keeping application services alive. The nodes communicate with each other through the cluster software/framework, test and probe the health status of the server nodes and services, and use quorum-based decisions together with switchover/failover techniques to keep the application services running on them available. That is, should a node that runs a service unexpectedly lose functionality or its connection, the other nodes take over and run the services, so that availability is maintained. The main goal of a cluster is to provide availability while strictly sticking to a consistent cluster configuration.
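To make the failover idea concrete, here is a minimal sketch in Python. It assumes a simple majority vote as the quorum rule and picks the first healthy node by name as the takeover target. The function names and node names are hypothetical and do not correspond to any real cluster framework, which would typically add tie-breakers and fencing on top of this.

```python
from typing import Dict, Optional


def has_quorum(healthy_count: int, total_count: int) -> bool:
    # Assumption: quorum means strictly more than half of the configured nodes.
    return healthy_count > total_count / 2


def place_service(current_owner: str, health: Dict[str, bool]) -> Optional[str]:
    """Return the node that should run the service, or None without quorum."""
    healthy = sorted(node for node, ok in health.items() if ok)
    if not has_quorum(len(healthy), len(health)):
        # Without quorum the cluster configuration cannot be trusted,
        # so this sketch refuses to start the service anywhere.
        return None
    if current_owner in healthy:
        # The owner is fine: no switchover needed.
        return current_owner
    # Failover: hand the service to the first healthy node (by name).
    return healthy[0]


# node1 owns the service but has failed; node2 and node3 still form a majority.
new_owner = place_service("node1", {"node1": False, "node2": True, "node3": True})
```

Note that with this plain majority rule a two-node cluster cannot survive the loss of either node, which is one reason real setups need an additional tie-breaking mechanism.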

At this point we have to add that this defines an HA cluster, a High-Availability cluster, where the cluster nodes are planned to run the services in an active-standby, or failover, fashion. An example could be a single-instance database. Some applications can instead be run in a distributed or scalable fashion. In that case, instances of the application run actively on separate cluster nodes, serving service requests simultaneously. An example of this version could be a web server that forwards connection requests to many backend servers in a round-robin way, or a database running in an active-active RAC setup.
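The round-robin forwarding mentioned above can be sketched in a few lines of Python. The backend names are made up for illustration, and a real load balancer would also skip backends that fail health checks, which this sketch does not do.

```python
from itertools import cycle

# Hypothetical backend servers sitting behind the front-end web server.
backends = ["backend-1", "backend-2", "backend-3"]

# cycle() yields the backends in order and starts over after the last one.
next_backend = cycle(backends)

# Assign five incoming requests (numbered 0..4) to backends in turn.
assignments = [(request_id, next(next_backend)) for request_id in range(5)]
```

Requests 0, 1 and 2 go to the three backends in order, and request 3 wraps around to the first backend again.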

Now, what is a cluster made of? Servers, right. These servers (the cluster nodes) need to communicate. This of course happens over the network, usually over dedicated network interfaces interconnecting all the cluster nodes. These connections are called interconnects.

How many cluster nodes are in a cluster? There are different cluster topologies. The simplest one is the clustered pair topology, involving only two cluster nodes.

There are several more topologies, which are described in the relevant documentation.
Also, to answer the question: Solaris Cluster allows you to run up to 16 servers in a cluster.

Where should these cluster nodes be placed? A very important question. The right answer is: it depends on what you plan to achieve with the cluster. Do you only need to survive a server outage? Then you can place them right next to each other in the data center. Do you need to survive a data center outage? In that case you should place them at least in different fire zones, or in two geographically distant data centers to withstand disasters like floods, large-scale fires or power outages. We call this a stretched or campus cluster, with the cluster nodes several kilometers away from each other. To cover really large distances, you probably need to move to a GeoCluster, which is a different kind of animal.

There are a number of problems with clustering in the data-analysis sense, that is, grouping similar data items together. Among them:

current clustering techniques do not address all the requirements adequately (and concurrently);
dealing with a large number of dimensions and a large number of data items can be problematic because of time complexity;
the effectiveness of the method depends on the definition of “distance” (for distance-based clustering);
if an obvious distance measure doesn’t exist we must “define” it, which is not always easy, especially in multi-dimensional spaces;
the result of the clustering algorithm (which in many cases can itself be arbitrary) can be interpreted in different ways.
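The point about the definition of “distance” can be illustrated with a small Python sketch. The point, the two cluster centers and the helper names are made up for illustration. The same point is assigned to a different center depending on whether Euclidean or Manhattan distance is used.

```python
import math


def euclidean(p, q):
    # Straight-line distance.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    # Sum of absolute coordinate differences ("city block" distance).
    return sum(abs(a - b) for a, b in zip(p, q))


def nearest_center(point, centers, distance):
    # Return the name of the center closest to the point under `distance`.
    return min(centers, key=lambda name: distance(point, centers[name]))


# Hypothetical cluster centers and a point to assign.
centers = {"A": (3, 3), "B": (0, 5)}
point = (0, 0)

by_euclidean = nearest_center(point, centers, euclidean)  # A: about 4.24 vs 5
by_manhattan = nearest_center(point, centers, manhattan)  # B: 6 vs 5
```

Neither answer is wrong. The two measures simply encode different notions of similarity, which is why choosing one is part of the modelling work.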

References:

1. http://guide.couchdb.org/draft/clustering.html

2. http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/index.html

1 comment:

Unknown, March 27, 2013 at 1:44 AM
I was just working on a paper when I found out about a Java-based software for mining literature data. "CiteSpace" clusters the literature based on keywords, authors, citations, etc.
The software is Java-based, so there is no setup required and it can work on almost anything. If you are working on a paper, I really suggest that you use CiteSpace for reviewing the literature available in that field.

Tuesday, March 19, 2013
