Agglomerative Hierarchical Clustering

Agglomerative clustering is a common type of hierarchical clustering, also called agglomerative nesting (AGNES). Each observation is initially considered a single-element cluster (a leaf). At each step of the algorithm, the two most similar clusters are combined into a new, bigger cluster (a node). This procedure is iterated until all points are members of one single big cluster (the root). The result is a tree that can be displayed as a dendrogram. Agglomerative hierarchical clustering is popularly known as a bottom-up approach: each observation is treated as its own cluster, and pairs of clusters are combined until everything is merged into one big cluster that contains all the data.

In unsupervised machine learning, many clustering algorithms are used to group objects for analysis and pattern finding. One commonly known technique is agglomerative clustering, where objects that are close to each other are placed in one group. In the beginning, every object is a single cluster (a leaf), and the algorithm keeps merging clusters until a single cluster (the root) remains. The merging process forms a tree-like structure called a dendrogram: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.

  • It works from the dissimilarities between the objects to be grouped. The type of dissimilarity measure can be chosen to suit the subject studied and the nature of the data.
  • One of the results is the dendrogram, which shows the progressive grouping of the data. It then becomes possible to gauge a suitable number of classes into which the data can be grouped.

That means the algorithm initially considers each data point a single cluster and then starts combining the closest pairs of clusters. It repeats this process until all clusters are merged into a single cluster that contains all the data points.

This method follows a bottom-up approach: the algorithm first forms small clusters, then merges them into larger clusters, and so on until one big cluster remains. For example, suppose we have five points a, b, c, d and e. In the first step, a and b form a cluster C1. Next, c and d form a cluster C2. In the final step, C1, C2 and e are merged into a single cluster. Generally, clusters formed first are small, and clusters formed at later stages are bigger. To visualize the clusters, dendrogram diagrams are used.

Agglomerative Hierarchical Clustering Algorithm

  1. Start by treating each data point as a singleton cluster.
  2. Compute the distance between all pairs of points. The result is a square distance matrix.
  3. On the basis of these distances, select the pair of points with the minimum distance and merge them into one cluster. Represent the newly formed cluster in the dendrogram. I will discuss how to draw the dendrogram in a later section.
  4. Update the distance matrix by removing the rows and columns corresponding to the points that were merged into the cluster.
  5. Moving further, calculate the distances between the new cluster and all remaining points or clusters.
  6. Again merge the pair between which the distance is minimum.
  7. Keep merging and forming clusters until one big cluster remains, representing each merge in the dendrogram as you go.

Note: When merging points into a cluster, there are several linkage criteria that can be used to measure the distance between clusters, such as single, complete and average linkage. I will be talking about these later.
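
As an illustration of these steps, here is a minimal sketch in Python/NumPy using single (or complete) linkage. The function name naive_agglomerative and the five sample points are made up for this example; a real implementation would update the distance matrix far more efficiently.

  import numpy as np

  def naive_agglomerative(points, linkage="single"):
      """Illustrative single/complete-linkage agglomerative clustering.

      Returns the list of merges as (cluster_a, cluster_b, distance).
      """
      # Step 1: every point starts as its own singleton cluster.
      clusters = {i: [i] for i in range(len(points))}
      # Step 2: square matrix of pairwise Euclidean distances.
      dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
      merges = []
      while len(clusters) > 1:
          # Steps 3 and 6: find the pair of clusters with the minimum distance.
          best = None
          for a in clusters:
              for b in clusters:
                  if a < b:
                      pair_d = dist[np.ix_(clusters[a], clusters[b])]
                      d = pair_d.min() if linkage == "single" else pair_d.max()
                      if best is None or d < best[2]:
                          best = (a, b, d)
          a, b, d = best
          merges.append((clusters[a][:], clusters[b][:], d))
          # Steps 4 and 5: merge b into a and drop b; cluster-to-cluster
          # distances are recomputed from the original matrix on the next pass.
          clusters[a].extend(clusters.pop(b))
      return merges

  # Tiny usage example with five 2-D points.
  pts = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 0]], dtype=float)
  for left, right, d in naive_agglomerative(pts):
      print(left, "+", right, "at distance", round(d, 3))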

Parameters

mode : This parameter specifies the cluster mode, i.e. the linkage criterion. Range: selection

  • SingleLink: In single-link hierarchical clustering, we merge in each step the two clusters whose two closest members have the smallest distance (or: the two clusters with the smallest minimum pairwise distance).
  • CompleteLink: In complete-link hierarchical clustering, we merge in each step the two clusters whose merger has the smallest diameter (or: the two clusters with the smallest maximum pairwise distance).
  • AverageLink: Average-link clustering is a compromise between the sensitivity of complete-link clustering to outliers and the tendency of single-link clustering to form long chains that do not correspond to the intuitive notion of clusters as compact, spherical objects.
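
In symbols, writing d(x, y) for the dissimilarity between two points and A, B for two clusters, these three criteria correspond to the standard definitions:

  d_{\mathrm{single}}(A, B) = \min_{x \in A,\ y \in B} d(x, y)
  d_{\mathrm{complete}}(A, B) = \max_{x \in A,\ y \in B} d(x, y)
  d_{\mathrm{average}}(A, B) = \frac{1}{|A|\,|B|} \sum_{x \in A} \sum_{y \in B} d(x, y)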

measure_types : This parameter is used to select the type of measure for computing the distance between points. The following options are available: mixed measures, nominal measures, numerical measures and Bregman divergences. Range: selection

mixed_measure : This parameter is available when the measure type parameter is set to 'mixed measures'. The only available option is 'Mixed Euclidean Distance'. Range: selection

nominal_measure : This parameter is available when the measure type parameter is set to 'nominal measures'. This option cannot be applied if the input ExampleSet has numerical attributes; in this case the 'numerical measure' option should be selected. Range: selection

numerical_measure : This parameter is available when the measure type parameter is set to 'numerical measures'. This option cannot be applied if the input ExampleSet has nominal attributes; in this case the 'nominal measure' option should be selected. Range: selection

divergence : This parameter is available when the measure type parameter is set to 'Bregman divergences'. Range: selection

kernel_type : This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance'. The type of the kernel function is selected through this parameter. The following kernel types are supported. Range: selection

  • dot: The dot kernel is defined by k(x,y) = x*y, i.e. it is the inner product of x and y.
  • radial: The radial kernel is defined by exp(-g ||x-y||^2) where g is the gamma specified by the kernel gamma parameter. The adjustable parameter gamma plays a major role in the performance of the kernel and should be carefully tuned to the problem at hand.
  • polynomial: The polynomial kernel is defined by k(x,y) = (x*y + 1)^d where d is the degree of the polynomial, specified by the kernel degree parameter. Polynomial kernels are well suited for problems where all the training data is normalized.
  • neural: The neural kernel is defined by the two-layered neural net function tanh(a x*y + b), where a is alpha and b is the intercept constant. These parameters can be adjusted using the kernel a and kernel b parameters. A common value for alpha is 1/N, where N is the data dimension. Note that not all choices of a and b lead to a valid kernel function.
  • sigmoid: This is the sigmoid kernel. Please note that the sigmoid kernel is not valid under some parameters.
  • anova: This is the anova kernel. It has adjustable parameters gamma and degree.
  • epachnenikov: The Epanechnikov kernel is the function (3/4)(1-u^2) for u between -1 and 1 and zero for u outside that range. It has two adjustable parameters, kernel sigma1 and kernel degree.
  • gaussian_combination: This is the gaussian combination kernel. It has adjustable parameters kernel sigma1, kernel sigma2 and kernel sigma3.
  • multiquadric: The multiquadric kernel is defined by the square root of ||x-y||^2 + c^2. It has adjustable parameters kernel sigma1 and kernel sigma shift.
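
The exact internals of the 'Kernel Euclidean Distance' measure are not shown here, but a common way to read it is as the Euclidean distance in the feature space induced by the chosen kernel, d(x, y) = sqrt(k(x,x) - 2 k(x,y) + k(y,y)). The sketch below illustrates this for the dot, radial and polynomial kernels; the function names are illustrative, not the operator's internal API.

  import numpy as np

  # Illustrative kernel functions (see the list above).
  def dot_kernel(x, y):
      return float(np.dot(x, y))

  def radial_kernel(x, y, gamma=1.0):
      return float(np.exp(-gamma * np.sum((x - y) ** 2)))

  def polynomial_kernel(x, y, degree=2):
      return float((np.dot(x, y) + 1.0) ** degree)

  def kernel_euclidean_distance(x, y, kernel):
      # Distance in the feature space induced by the kernel:
      # ||phi(x) - phi(y)|| = sqrt(k(x,x) - 2 k(x,y) + k(y,y)).
      return np.sqrt(kernel(x, x) - 2.0 * kernel(x, y) + kernel(y, y))

  x = np.array([1.0, 2.0])
  y = np.array([2.0, 0.5])
  print(kernel_euclidean_distance(x, y, dot_kernel))         # equals the plain Euclidean distance
  print(kernel_euclidean_distance(x, y, radial_kernel))      # distance under the radial (RBF) kernel
  print(kernel_euclidean_distance(x, y, polynomial_kernel))  # distance under the polynomial kernel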

kernel_gamma : This is the SVM kernel parameter gamma. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to radial or anova. Range: real

kernel_sigma1 : This is the SVM kernel parameter sigma1. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to epachnenikov, gaussian combination or multiquadric. Range: real

kernel_sigma2 : This is the SVM kernel parameter sigma2. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to gaussian combination. Range: real

kernel_sigma3 : This is the SVM kernel parameter sigma3. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to gaussian combination. Range: real

kernel_shift : This is the SVM kernel parameter shift. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to multiquadric. Range: real

kernel_degree : This is the SVM kernel parameter degree. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to polynomial, anova or epachnenikov. Range: real

kernel_a : This is the SVM kernel parameter a. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to neural. Range: real

kernel_b : This is the SVM kernel parameter b. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to neural. Range: real

How Does the Agglomerative Hierarchical Clustering Algorithm Work?

Agglomerative Hierarchical Clustering (AHC) is an iterative classification method whose principle is simple.

  1. The process starts by calculating the dissimilarity between the N objects.
  2. Then the two objects which, when clustered together, minimize a given agglomeration criterion are clustered together, creating a class comprising these two objects.
  3. The dissimilarity between this class and the N-2 other objects is then calculated using the agglomeration criterion, and again the two objects or classes of objects whose clustering minimizes the criterion are merged.

This process continues until all the objects have been clustered. These successive clustering operations produce a binary clustering tree (dendrogram), whose root is the class that contains all the observations. This dendrogram represents a hierarchy of partitions. It is then possible to choose a partition by truncating the tree at a given level, the level depending upon either user-defined constraints (the user knows how many classes are to be obtained) or more objective criteria.

Algorithm, for a set of N observations to be clustered:

  1. Start assigning each observation as a single point cluster, so that if we have N observations, we have N clusters, each containing just one observation. 
  2. Find the closest (most similar) pair of clusters and merge them into one cluster; we now have N-1 clusters. Similarity and dissimilarity can be measured in various ways (explained in a later section).
  3. Find the two closest clusters and merge them into one cluster. We now have N-2 clusters. The distance between clusters is computed using an agglomerative linkage technique (explained in a later section).
  4. Repeat steps 2 and 3 until all observations are clustered into one single cluster of size N.
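
The same four steps are available off the shelf; for example, a short SciPy sketch (the random data and the choice of three clusters are purely illustrative):

  import numpy as np
  from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
  import matplotlib.pyplot as plt

  rng = np.random.default_rng(0)
  X = rng.normal(size=(20, 2))              # N = 20 observations, each its own cluster at first

  Z = linkage(X, method="average")          # steps 2-4: repeatedly merge the closest pair

  labels = fcluster(Z, t=3, criterion="maxclust")   # truncate the tree into 3 groups
  print(labels)

  dendrogram(Z)                             # visualize the merge hierarchy
  plt.show()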

Clustering algorithms use various distance or dissimilarity measures to develop different clusters. A lower (closer) distance indicates that two observations are similar and will be grouped into the same cluster; equivalently, the higher the similarity, the more alike the observations are.

This is called a greedy algorithm: it looks only at the current state, does the best it can at that stage, and does not look ahead to see whether another choice would be better in the long run. If you join two items into the same group early on, you cannot later determine whether a cluster that develops afterwards is actually closer to one of those items. For this reason, you never get to 'shuffle' and put an item back into a better group.

A problem with the algorithm occurs when there are two pairs that could be merged at a particular stage. Only one pair is merged - usually the pair that comes first in the data matrix. After this pair is merged the distance matrix is updated, and it is possible that the second pair is no longer closest. If you had picked the other pair first, you could get a different clustering sequence. This is typically not a big problem but could be if it happens early on. The only way to see whether this has happened is to shuffle the items and redo the clustering to see if you get a different result. The distance in step 2 can be measured in various ways (a quick SciPy sketch follows the list below), namely:

  • Euclidean Distance
  • Manhattan Distance
  • Minkowski Distance
  • Jaccard Similarity Coefficient
  • Cosine Similarity
  • Gower’s Similarity Coefficient
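
Most of these measures are available directly in SciPy's scipy.spatial.distance module (Gower's similarity coefficient is not part of SciPy and is usually taken from a separate package). A quick sketch with made-up vectors:

  import numpy as np
  from scipy.spatial.distance import euclidean, cityblock, minkowski, jaccard, cosine

  x = np.array([1.0, 0.0, 2.0, 1.0])
  y = np.array([0.0, 1.0, 2.0, 3.0])

  print(euclidean(x, y))            # Euclidean distance
  print(cityblock(x, y))            # Manhattan (city-block) distance
  print(minkowski(x, y, p=3))       # Minkowski distance of order p
  print(cosine(x, y))               # cosine distance = 1 - cosine similarity

  # Jaccard works on boolean vectors (1 - Jaccard similarity coefficient).
  a = np.array([1, 0, 1, 1], dtype=bool)
  b = np.array([1, 1, 0, 1], dtype=bool)
  print(jaccard(a, b))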

Agglomerative clustering with Scikit-Learn

First, import all the required libraries. Then, create a 2D array of all the data points with their coordinates. After you initialize the AgglomerativeClustering model, call the fit method on it. Lastly, plot the dendrogram to see the clustering results.

The AgglomerativeClustering class takes distance_threshold and n_clusters as parameters (a short sketch follows the parameter descriptions below).

distance_threshold : The linkage distance threshold above which clusters will not be merged; it is the height at which the dendrogram tree is cut.

n_clusters : The number of clusters to find.
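
A minimal sketch of that workflow with scikit-learn (the sample coordinates are made up, and the dendrogram is drawn with SciPy here for simplicity):

  import numpy as np
  import matplotlib.pyplot as plt
  from sklearn.cluster import AgglomerativeClustering
  from scipy.cluster.hierarchy import linkage, dendrogram

  # A small 2-D array of made-up points.
  X = np.array([[1, 2], [1, 4], [1, 0],
                [10, 2], [10, 4], [10, 0]], dtype=float)

  # Either fix the number of clusters ...
  model = AgglomerativeClustering(n_clusters=2)
  print(model.fit_predict(X))                  # e.g. [0 0 0 1 1 1]

  # ... or cut the tree at a linkage-distance threshold instead
  # (n_clusters must then be None).
  model = AgglomerativeClustering(n_clusters=None, distance_threshold=5.0)
  print(model.fit_predict(X))

  # Dendrogram of the same data, drawn with SciPy for simplicity.
  dendrogram(linkage(X, method="ward"))
  plt.show()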

Application of hierarchical clustering to gene expression data analysis

  • In gene expression data analysis, clustering is generally used as one of the first steps to explore the data. We are interested in whether there are groups of genes or groups of samples that have similar gene expression patterns.
  • Several clustering distance measures have been described for assessing the similarity or the dissimilarity between items, in order to decide which items have to be grouped together or not. These measures can be used to cluster genes or samples that are similar.
  • For most common clustering software, the default distance measure is the Euclidean distance. The most popular approach for gene expression data is to use log2(expression + 0.25), correlation distance and complete-linkage agglomerative clustering.
  • Single and Complete linkage give the same dendrogram whether you use the raw data, the log of the data or any other transformation of the data that preserves the order because what matters is which ones have the smallest distance. The other methods are sensitive to the measurement scale.
  • In principle it is possible to cluster all the genes, although visualizing a huge dendrogram might be problematic. Usually, some type of preliminary analysis, such as differential expression analysis is used to select genes for clustering.
  • Selecting genes based on differential expression analysis removes genes which are likely to have only chance patterns. This should enhance the patterns found in the gene clusters.
  • Note that, when the data are scaled (z-scores), the squared Euclidean distance between two profiles is proportional to their correlation distance (1 - r), so Euclidean distance on z-scores and correlation distance rank pairs of profiles identically (see the short derivation after this list).
  • Pearson’s correlation is quite sensitive to outliers. When clustering genes, it is important to be aware of the possible impact of outliers. An alternative option is to use Spearman’s correlation instead of Pearson’s correlation.
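
A short derivation of the z-score point above, assuming each profile is standardized with the population standard deviation so that \sum_i z_i^2 = n and \sum_i z_{x,i} z_{y,i} = n r_{xy}:

  \|z_x - z_y\|^2 = \sum_i z_{x,i}^2 + \sum_i z_{y,i}^2 - 2 \sum_i z_{x,i} z_{y,i} = 2n\,(1 - r_{xy})

so the Euclidean distance between z-scored profiles is a monotone function of the correlation distance 1 - r_{xy}, and both choices order pairs of profiles in the same way.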

Advantages

  • The agglomerative technique is easy to implement.
  • It can produce an ordering of objects, which may be informative for the display.
  • In agglomerative Clustering, there is no need to pre-specify the number of clusters.
  • The agglomerative approach first creates smaller clusters, which may help discover fine-grained similarities in the data.

Disadvantages

  • The agglomerative technique gives the best result in some cases only.
  • The algorithm can never undo what was done previously: if objects are incorrectly grouped at an earlier stage, the mistake cannot be corrected later.
  • The use of different distance metrics for measuring distances between clusters may produce different results, so performing multiple experiments and comparing the results is recommended to check the veracity of the final clustering.

Summary

Hierarchical clustering is a cluster analysis method which produces a tree-based representation (i.e. a dendrogram) of the data. Objects in the dendrogram are linked together based on their similarity.

To perform hierarchical cluster analysis in R, the first step is to calculate the pairwise distance matrix using the function dist(). Next, the result of this computation is used by the hclust() function to produce the hierarchical tree. Finally, you can use the function fviz_dend() [in the factoextra R package] to easily plot a dendrogram.

It’s also possible to cut the tree at a given height for partitioning the data into multiple groups (R function cutree()).
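
For readers working in Python rather than R, a rough SciPy analogue of this workflow is sketched below (dist() roughly corresponds to pdist, hclust() to linkage and cutree() to fcluster; the random data and the cut height of 2.5 are arbitrary, and there is no direct equivalent of fviz_dend(), so a basic dendrogram plot is used instead):

  import numpy as np
  from scipy.spatial.distance import pdist
  from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
  import matplotlib.pyplot as plt

  X = np.random.default_rng(1).normal(size=(15, 3))

  D = pdist(X, metric="euclidean")     # pairwise distances (cf. dist() in R)
  Z = linkage(D, method="complete")    # hierarchical tree (cf. hclust())

  dendrogram(Z)                        # basic dendrogram plot
  plt.show()

  groups = fcluster(Z, t=2.5, criterion="distance")   # cut at height 2.5 (cf. cutree(h = 2.5))
  print(groups)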