K-means (MacQueen, 1967) is a partitional clustering algorithm. It is a simple yet widely used unsupervised method: given a set of feature vectors as input, it distributes n observations into k clusters so that each observation belongs to the cluster with the nearest mean. The membership of an observation in a cluster is determined using the cluster mean (centroid). K-means clustering is used in numerous applications in the domains of network analysis and traffic classification, where it has provided excellent results.

Let the set of data points D be {x_{1}, x_{2}, …, x_{n}}, where x_{i} = (x_{i1}, x_{i2}, …, x_{ir}) is a vector in ℝ^{r}, and r is the number of dimensions.

The k-means algorithm partitions the given data into k clusters, where k is specified by the user; each cluster has a cluster center called a centroid. The idea behind the k-means algorithm, discussed by Hartigan, is that each of the k clusters can be represented by the mean of the documents assigned to that cluster, which is called the centroid of that cluster. As discussed by Berkhin, two versions of the k-means algorithm are known. The first version is the batch version, also known as Forgy's algorithm.

It consists of the following two-step major iterations:

(i) Reassign all the documents to their nearest centroids.

(ii) Recompute the centroids of the newly assembled groups.

Before the iterations start, k documents are selected as the initial centroids. Iterations continue until a stopping criterion, such as no reassignments occurring, is met.

## K-means algorithm

Given k, the k-means algorithm works as follows:

The k-means algorithm performs a heuristic search: it iteratively improves an initial partition rather than examining all possible partitions.

- Choose k (random) data points (seeds) to be the initial centroids (cluster centers). The number of clusters k must be specified in advance, and the initial centroids may be chosen arbitrarily or deliberately.

- Assign each data point to the closest centroid: iterate over the samples and, for each sample, find the closest centroid and assign the sample to the corresponding cluster. Then recompute each cluster's centroid and the distances between each observation and the centroids. If an observation is nearer the centroid of a cluster other than the one to which it currently belongs, reassign it to the nearer cluster.

- Repeat Step 2 until all observations are nearest the centroid of the cluster to which they belong.

- If the number of clusters cannot be specified with confidence in advance, repeat Steps 1 to 3 with a different number of clusters and evaluate the results.
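The steps above can be sketched as a short NumPy implementation. This is an illustrative sketch, not a reference implementation; the function name `kmeans` and the optional `init` parameter (for supplying deliberate initial centroids, as Step 1 allows) are our own.

```python
import numpy as np

def kmeans(X, k, max_iter=100, init=None, seed=0):
    """Batch (Forgy-style) k-means on an (n, r) array X."""
    rng = np.random.default_rng(seed)
    # Step 1: choose k data points as initial centroids (random or user-supplied).
    if init is None:
        centroids = X[rng.choice(len(X), size=k, replace=False)]
    else:
        centroids = np.asarray(init, dtype=float)
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points;
        # an empty cluster keeps its previous centroid.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop when centroids no longer move (no further reassignments follow).
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

For two well-separated groups of points, a few iterations suffice: each group ends up in its own cluster, with the centroid at the group mean.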

## K-means convergence (stopping) criterion

- no (or minimum) re-assignments of data points to different clusters,
- no (or minimum) change of centroids,
- minimum decrease in the sum of squared error (SSE),

SSE = Σ_{j=1}^{k} Σ_{x ∈ C_{j}} d(x, m_{j})²

where

C_{j} is the jth cluster,

m_{j} is the centroid of cluster C_{j} (the mean vector of all the data points in C_{j}),

d(x, m_{j}) is the (Euclidean) distance between data point x and centroid m_{j}.
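The SSE criterion follows directly from these definitions; a minimal sketch (the function name `sse` is ours):

```python
import numpy as np

def sse(X, labels, centroids):
    """Sum over clusters C_j of the squared Euclidean distance d(x, m_j)^2
    from each point x to the centroid m_j of its own cluster."""
    diffs = X - centroids[labels]  # x - m_j, matched per point via its label
    return float((diffs ** 2).sum())
```

Tracking this value across iterations gives the "minimum decrease in SSE" stopping test: stop when successive SSE values differ by less than a small tolerance.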

## Why use K-means?

**1.) Strengths:**

- Simple: easy to understand and to implement
- Efficient: time complexity is O(tkn), where n is the number of data points, k is the number of clusters, and t is the number of iterations.
- Since both k and t are usually small compared to n, k-means is considered a linear algorithm.

**2.) K-means is the most popular clustering algorithm.**

**3.) If SSE is used as the objective, k-means terminates at a local optimum; the global optimum is hard to find due to computational complexity.**

## Weaknesses of K-means

1.) The algorithm is only applicable if the mean is defined.

- For categorical data, the k-modes variant is used, where the centroid is represented by the most frequent values.
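As a sketch of that idea (the helper name `mode_centroid` is hypothetical), a k-modes centroid takes the most frequent value in each attribute column instead of the mean:

```python
from collections import Counter

def mode_centroid(rows):
    """Columnwise mode: the k-modes analogue of the mean centroid for
    categorical data (ties broken by first occurrence in the column)."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))

# Most frequent value per attribute across the cluster's rows:
mode_centroid([("red", "S"), ("red", "M"), ("blue", "M")])  # → ("red", "M")
```

Distances are then measured by counting mismatched attributes rather than by Euclidean distance.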

2.) The user needs to specify k.

3.) The algorithm is sensitive to outliers.

- Outliers are data points that are very far away from other data points.
- Outliers could be errors in the data recording or some special data points with very different values.
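A quick numerical illustration of why this matters: a single extreme value drags the mean (and hence a k-means centroid) far from the bulk of the data, while a robust statistic such as the median barely moves.

```python
import numpy as np

cluster = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
with_outlier = np.append(cluster, 100.0)  # one erroneous or extreme reading

print(cluster.mean(), with_outlier.mean())          # 3.0 vs ~19.17: mean is dragged
print(np.median(cluster), np.median(with_outlier))  # 3.0 vs 3.5: median barely moves
```

This is why outlier removal, or a medoid-based variant such as k-medoids, is often recommended before or instead of plain k-means on noisy data.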

## K-means summary

1.) Despite its weaknesses, k-means is still the most popular clustering algorithm due to its simplicity and efficiency.

2.) No clear evidence that any other clustering algorithm performs better in general

3.) Comparing different clustering algorithms is a difficult task. No one knows the correct clusters!