Agglomerative Hierarchical Clustering aggregation methods

XLSTAT proposes several aggregation methods:

  • Ward's method (variance)
  • Complete linkage
  • Simple linkage
  • Strong linkage
  • Average linkage 
  • Flexible linkage
  • Unweighted pair-group average
  • Weighted pair-group average

So far we have defined a distance between items. The linkage function tells you to measure the distance between clusters. Again, there are many choices.  Typically you consider either a new item that summarizes the items in the cluster, or a new distance that summarizes the distance between the items in the cluster and items in other clusters.  Here is a list of four methods.  In each example, x is in one cluster and y is in the other.

Single (string-like, long)f=min(d(x,y))
Complete (ball-like, compact)f=max(d(x,y))
Average (ball-like, compact)f=average(d(x,y))
Centroid (ball-like, compact)d(ave(X),ave(Y)) where we take the average over all items in each cluster

For example, suppose we have two clusters C1 and C2 with elements xij  where i is the cluster and j is the item in the cluster. D(C1,C2) is a function of the distances f{d(x1i,x2j)}.  Single linkage clusters looks at all the pairwise distances between the items in the two clusters and takes the distance between the clusters as the minimum distance. Complete linkage, which is more popular, takes the maximum distance. Average linkage takes the average, which as it turns out is fairly similar to complete linkage. Centroid linkage sounds the same as average linkage but instead of using the average distance, it creates a new item which is the average of  all the individual items and then uses the distance between averages. Single and complete linkage give the same dendrogram whether you use the raw data, the log of the data or any other transformation of the data that preserves the order because what matters is which ones have the smallest distance. The other methods are sensitive to the measurement scale. The most popular methods for gene expression data are to use log2(expression + 0.25), correlation distance and complete linkage clustering.

Maximum or complete linkage clustering: 

It computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2, and considers the largest value (i.e., maximum value) of these dissimilarities as the distance between the two clusters. It tends to produce more compact clusters. The complete Linkage method is also known as the Maximum Linkage (MAX) method. In the Complete Linkage technique, the distance between two clusters is defined as the maximum distance between an object (point) in one cluster and an object (point) in the other cluster. And this method is also known as the furthest neighbor method.

Pros of Complete Linkage

  • Complete Linkage algorithms are less susceptible to noise and outliers.
  • That means the Complete Linkage method also does well in separating clusters if there is any noise between the clusters.

Cons of Complete Linkage

  • Complete linkage methods tend to break large clusters.
  • Complete Linkage is biased towards globular clusters.

Minimum or single linkage clustering:

It computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2, and considers the smallest of these dissimilarities as a linkage criterion. It tends to produce long, “loose” clusters. Simple Linkage is also known as the Minimum Linkage (MIN) method. In the Single Linkage method, the distance of two clusters is defined as the minimum distance between an object (point) in one cluster and an object (point) in the other cluster. This method is also known as the nearest neighbor method.

Pros of Simple Linkage

  • Simple Linkage methods can handle non-elliptical shapes.
  • Single Linkage algorithms are the best for capturing clusters of different sizes.

Cons of Simple Linkage

  • Simple Linkage methods are sensitive to noise and outliers.
  • That means Simple Linkage methods can not group clusters properly if there is any noise between the clusters.

Mean or average linkage clustering:

It computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2, and considers the average of these dissimilarities as the distance between the two clusters. In the Average Linkage technique, the distance between two clusters is the average distance between each cluster’s point to every point in the other cluster. This method is also known as the unweighted pair group method with arithmetic mean.

Pros of Average Linkage

  • The average Linkage method also does well in separating clusters if there is any noise between the clusters.

Cons of Average Linkage

  • The average Linkage method is biased towards globular clusters.

Centroid linkage clustering:

It computes the dissimilarity between the centroid for cluster 1 (a mean vector of length p variables) and the centroid for cluster 2. In the Centroid Linkage approach, the distance between the two sets or clusters is the distance between two mean vectors of the sets (clusters). At each stage, we combine the two sets that have the smallest centroid distance. In simple words, it is the distance between the centroids of the two sets.

Pros of Centroid Linkage

  • The Centroid Linkage method also does well in separating clusters if there is any noise between the clusters.

Cons of Centroid Linkage

  • Similar to Complete Linkage and Average Linkage methods, the Centroid Linkage method is also biased towards globular clusters.

Ward’s minimum variance method: 

It minimizes the total within-cluster variance. At each step the pair of clusters with minimum between-cluster distance are merged. Ward's Linkage method is the similarity of two clusters. Which is based on the increase in squared error when two clusters are merged, and it is  similar to the group average if the distance between points is distance squared. 

Pros of Ward's Linkage

  • In many cases, Ward’s Linkage is preferred as it usually produces better cluster hierarchies
  • Ward’s method is less susceptible to noise and outliers.

Cons of Ward's Linkage

  • Ward’s linkage method is biased towards globular clusters.

Weighted pair group method 

The method used in this example is called WPGMA (weighted pair group method with averaging) because the distance between clusters is calculated as a simple average.  For example, in the last step the WPGMA distance between (AB) and C+(DE) =  (55 + 90) / 2 = 72.5 . Though computationally easier, when there are unequal numbers of taxa in the clusters, the distances in the original matrix do not contribute equally to the intermediate calculations.

Unweighted pair group method 

A superior method is UPGMA (unweighted PGMA), in which averages are weighted by the number of taxa in each cluster at each step. This makes the calculation slightly more complicated. For example, in the last step the UPGMA distance between (AB) and C+(DE) =  (55 + 2x90) / 3 = 78.33, because the distance is the average of three distances, (AB) to C and to D and to E . As a result, each distance contributes equally to the final result.