Introduction to Hierarchical Clustering

Hierarchical clustering organizes data in the form of a tree, or a well-defined hierarchy. In many analytical projects, clustering is carried out after data cleaning and preparation and before predictive or other analytical modeling. There are a number of approaches to generating clusters, such as K-means clustering, density-based clustering, Gaussian Mixture Model clustering, and hierarchical clustering, each with its own advantages and disadvantages. Most clustering algorithms only group the objects into clusters, and the resulting groups can be viewed on a cluster plot. A hierarchical clustering algorithm, in addition to splitting the objects into clusters, also shows the hierarchy of merge distances and therefore how dissimilar one cluster is from another. In this article we focus only on hierarchical clustering.

Clustering is a popular technique for creating homogeneous groups of entities or objects: for a given set of data points, we group them into a number of clusters so that similar data points end up close to each other in the same cluster. Hierarchical clustering, also known as Hierarchical Cluster Analysis (HCA), is one of the most popular clustering techniques alongside K-means and is used to group unlabelled datasets into clusters.

The technique builds clusters based on the similarity between the objects in the set. It goes through the features of the data points, looks for similarity between them, and continues until the whole dataset has been grouped, creating a hierarchy of clusters along the way. This is why the algorithm is called hierarchical clustering. The hierarchy can be built in two ways:

  • Agglomerative: Hierarchy created from bottom to top. 
  • Divisive: Hierarchy created from top to bottom.

Clustering falls under the category of unsupervised learning, meaning there is no labeled class or target variable for the dataset; we are only interested in grouping similar records or objects into clusters. Hierarchical clustering, also known as hierarchical cluster analysis, is an algorithm that groups similar objects into groups called clusters. The endpoint is a set of clusters where each cluster is distinct from every other cluster, and the objects within each cluster are broadly similar to each other. Hierarchical clustering arranges these clusters in a hierarchical tree-like structure (also called a dendrogram): the root node corresponds to the entire dataset, and branches split off from the root to form progressively smaller clusters of similar data.

Why Hierarchical Clustering?

K-means clustering uses a pre-specified number of clusters: it requires advance knowledge of K, i.e., the number of clusters into which one wants to divide the data. In hierarchical clustering, by contrast, there is no need to pre-specify the number of clusters; one can stop at any number of clusters. Hierarchical clustering has a further advantage over K-means clustering in that it produces an attractive tree-based representation of the observations, called a dendrogram.
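As a quick illustration of this difference, here is a minimal sketch (assuming scikit-learn and SciPy are installed, and using a small synthetic dataset): K-means needs the number of clusters before fitting, whereas the hierarchical linkage is built once and can be cut into any number of clusters afterwards.

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))  # small synthetic dataset

# K-means: the number of clusters K must be chosen before fitting.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Hierarchical clustering: build the full merge tree once...
Z = linkage(X, method="ward")

# ...then cut it into however many clusters you like, after the fact.
labels_3 = fcluster(Z, t=3, criterion="maxclust")
labels_5 = fcluster(Z, t=5, criterion="maxclust")
```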

How does it work?

Hierarchical clusters can be built in either an agglomerative or a divisive fashion. In the agglomerative approach, the two closest nodes are combined into one node, and this process is repeated until all the nodes have been grouped together. In the divisive approach, all the nodes start together in one group, which is split into two groups, and the splitting continues until only individual nodes remain. In most cases, we will be using the agglomerative approach to create hierarchical clusters.

Hierarchical clustering groups similar objects into clusters, often without prior information about the structure of the data. The objects within a group are similar to each other, while objects in one group are dissimilar to the objects in another group. For example, if several patients visit your clinic, you can probably group them by their symptoms so that the patients within each group are broadly similar to each other. Hierarchical clustering can be divided into two main types: agglomerative and divisive.

  1. Agglomerative clustering: Also known as AGNES (Agglomerative Nesting), it works in a bottom-up manner. Each object is initially considered a single-element cluster (a leaf). At each step of the algorithm, the two most similar clusters are combined into a new, bigger cluster (a node). This procedure is repeated until all points are members of one single big cluster (the root). The result is a tree that can be plotted as a dendrogram.
  2. Divisive hierarchical clustering: Also known as DIANA (Divisive Analysis), it works in a top-down manner, the inverse order of AGNES. It begins with the root, in which all objects belong to a single cluster. At each iteration, the most heterogeneous cluster is divided into two. The process is repeated until every object is in its own cluster.

Steps to Perform Hierarchical Clustering

Following are the steps involved in agglomerative clustering (a minimal code sketch follows the list):

  1. At the start, treat each data point as one cluster. Therefore, the number of clusters at the start will be K, where K is an integer representing the number of data points.
  2. Form a cluster by joining the two closest data points, resulting in K-1 clusters.
  3. Form more clusters by joining the two closest clusters, resulting in K-2 clusters.
  4. Repeat steps 2 and 3 until one big cluster is formed.
  5. Once a single cluster is formed, dendrograms are used to divide it into multiple clusters depending upon the problem. We will study the concept of the dendrogram in detail in an upcoming section.
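As a minimal SciPy sketch of these steps (assuming a small synthetic two-dimensional dataset), `linkage` performs the repeated merging of the two closest clusters, and `dendrogram` draws the resulting tree:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(42)
X = rng.normal(size=(12, 2))  # 12 data points, i.e. K = 12 starting clusters

# Agglomerative clustering: each row of Z records one merge of the two
# closest clusters, so there are K - 1 rows in total.
Z = linkage(X, method="single", metric="euclidean")

# The dendrogram visualises the full merge hierarchy, from K singleton
# clusters at the bottom to one big cluster at the root.
dendrogram(Z)
plt.xlabel("data point index")
plt.ylabel("merge distance")
plt.show()
```

Each row of `Z` records one merge: the two clusters joined, the distance between them, and the size of the newly formed cluster.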

There are different ways to measure the distance between clusters. The underlying point-to-point distance can itself be Euclidean or Manhattan distance. Following are some of the options for measuring the distance between two clusters (see the sketch after this list):

  1. Measure the distance between the closest points of the two clusters.
  2. Measure the distance between the farthest points of the two clusters.
  3. Measure the distance between the centroids of the two clusters.
  4. Measure the distance between all possible combinations of points between the two clusters and take the mean.
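In SciPy these four options correspond, roughly, to the `method` argument of `linkage`: "single" for the closest points, "complete" for the farthest points, "centroid" for the centroids, and "average" for the mean over all pairs. A small sketch, assuming the same kind of synthetic data as above:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 2))

Z_single = linkage(X, method="single")      # 1. closest points of the two clusters
Z_complete = linkage(X, method="complete")  # 2. farthest points of the two clusters
Z_centroid = linkage(X, method="centroid")  # 3. distance between the centroids
Z_average = linkage(X, method="average")    # 4. mean over all pairs of points
```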

Applications

Hierarchical clustering can be used in a number of different areas where we would like to understand the “natural” grouping of the objects. Here are some examples:

  • Cluster search results so that similar results are grouped and shown together
  • Relate different species into groups based on their genetic DNA makeup
  • Track viral outbreaks based on mutations of viral DNA
  • Group loan applicants into high/medium/low risk based on their attributes
  • Organize customers into groups/segments with similar traits and product preferences
  • Social network analysis based on tweets
  • Anomaly detection in medical imaging
  • Analyze diagnostic questionnaires to identify patients with similar symptoms
  • Identify homogeneous groups of customers with similar needs and attitudes
  • Group students based on psychological, aptitude, and achievement scores

Distance Calculation

There are several ways to calculate the distance between two points. In the examples above, we used Euclidean distance, which is the default and the most commonly used. However, depending on your application, you may need to pick a different way to calculate the distances. Here are the available options (see the sketch after this list):

  • Binary: The two points are regarded as binary bit vectors with non-zero elements as “on” and zero elements as “off”. The distance between them is the proportion of bits in which only one is “on” among the bits in which at least one is “on”.
  • Canberra: The Canberra distance is based on the sum of a series of fractional differences between the coordinates of a pair of objects.
  • Euclidean: The Euclidean distance is the straight-line distance between two points. It is also termed the L2 distance.
  • Manhattan: The Manhattan distance is the distance between two points measured along axes at right angles (the L1 distance).
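These point-to-point metrics are all available through SciPy's `scipy.spatial.distance.pdist`; for the binary option, SciPy's "jaccard" metric applied to boolean data matches the description above. A quick sketch with a tiny made-up matrix:

```python
import numpy as np
from scipy.spatial.distance import pdist

X = np.array([[1.0, 2.0, 0.0],
              [3.0, 0.0, 4.0],
              [1.0, 1.0, 1.0]])

d_euclidean = pdist(X, metric="euclidean")  # straight-line (L2) distance
d_manhattan = pdist(X, metric="cityblock")  # Manhattan / L1 distance
d_canberra = pdist(X, metric="canberra")    # sum of fractional differences

# Binary distance: treat non-zero as "on", zero as "off", then use Jaccard.
d_binary = pdist(X != 0, metric="jaccard")
```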

Distance Between Clusters

Computing the distance between two individual nodes is straightforward using the distance calculations listed above. However, if each cluster contains a different number of nodes, how do we compute the distance between the clusters? There are several approaches, and there is no single right answer; you will need to pick the method that best suits your application (see the sketch after this list).

  • Single Linkage: The distance between two clusters is computed as the distance between their two closest members. This method often results in clusters in which individuals are added sequentially to a single group.
  • Complete Linkage: The distance between two clusters is computed as the distance between their two farthest-apart members. This method often results in clusters that are well separated and compact.
  • Average Linkage: The distance is calculated between every pair of nodes drawn from the two clusters, and the overall average is reported as the cluster distance.
  • Centroid: The distance between two clusters is calculated as the distance between their centroids. This method should only be used with Euclidean distances and may result in a dendrogram that is difficult to interpret.
  • Median: The distance between two clusters is calculated as the weighted distance between their centroids, with weights proportional to the number of nodes in each cluster. This method should only be used with Euclidean distances and may result in a dendrogram that is difficult to interpret.
  • Ward’s Distance: Groups are formed so that the pooled within-group sum of squares is minimized: the two clusters selected to fuse are the ones whose merger results in the smallest increase in the within-group sum of squares.
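There is no universally best choice among these. One way to compare linkage methods on your own data is the cophenetic correlation, which measures how faithfully each resulting tree preserves the original pairwise distances. A sketch, assuming SciPy and a small synthetic dataset:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

rng = np.random.default_rng(7)
X = rng.normal(size=(25, 3))
d = pdist(X)  # original pairwise (Euclidean) distances

for method in ["single", "complete", "average", "centroid", "median", "ward"]:
    Z = linkage(X, method=method)
    # cophenet returns the correlation between the original distances and
    # the distances implied by the tree (higher means a more faithful tree).
    c, _ = cophenet(Z, d)
    print(f"{method:>8}: cophenetic correlation = {c:.3f}")
```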

Data Types

In the real world, we often get objects that contain mixed data types. For example, let’s say we want to cluster bank clients based on the following characteristics: age, industry, job title, marital status, education level, past loan default status, bank balance, owns house, has kids. Some of these characteristics are numeric (like age and bank balance), and some are categorical (like industry, job title, marital status, education level, past loan default status, owns house, has kids).

  • For numeric values, we can calculate the distance using either Euclidean or Manhattan distance.
  • For categorical values, we need to determine whether they are binary, ordinal, or nominal.
  • For ordinal values, we can first rank the variables and then compute the distance as we would for numeric values.
  • For nominal values, we can use dummy variables to convert them into binary variables and then use the distance formula for binary variables.

In addition, if you have mixed data types, you may want to normalize the variables before you combine the distances into a single value. You can use the Gower distance to handle this situation if you are using the R software.
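SciPy has no built-in Gower distance, but a simplified Gower-style dissimilarity can be sketched by hand in Python (the column names and data below are purely hypothetical): range-normalise the numeric columns, use a simple mismatch indicator for the categorical ones, and average the pieces.

```python
import numpy as np
import pandas as pd

# Hypothetical client data mixing numeric and categorical columns.
df = pd.DataFrame({
    "age":          [25, 47, 33, 61],
    "bank_balance": [1200.0, 5300.0, 150.0, 9800.0],
    "marital":      ["single", "married", "married", "divorced"],
    "owns_house":   [False, True, False, True],
})
numeric_cols = ["age", "bank_balance"]
categorical_cols = ["marital", "owns_house"]

def gower_like(row_i, row_j, ranges):
    """Simplified Gower-style dissimilarity between two rows (0 = identical)."""
    parts = []
    for col in numeric_cols:
        # Numeric: absolute difference scaled by the column's range.
        parts.append(abs(row_i[col] - row_j[col]) / ranges[col])
    for col in categorical_cols:
        # Categorical: 0 if the values match, 1 otherwise.
        parts.append(0.0 if row_i[col] == row_j[col] else 1.0)
    return sum(parts) / len(parts)

ranges = {col: df[col].max() - df[col].min() for col in numeric_cols}
print(gower_like(df.iloc[0], df.iloc[1], ranges))
```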

Advantages

  • Easy to understand and relatively straightforward to implement.
  • No need to pre-specify the number of clusters a priori; any desired number of clusters can be obtained by cutting the dendrogram at the proper level (see the sketch after this list).
  • The resulting clusters may correspond to meaningful classifications.
  • The main output, the dendrogram, is appealing to users, and the number of clusters can often be decided by merely looking at it.
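Cutting the dendrogram at a chosen level is a one-liner with SciPy's `fcluster`. The sketch below (the threshold of 1.5 is arbitrary) cuts at a distance on the tree rather than at a fixed number of clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 2))
Z = linkage(X, method="average")

# Cut the dendrogram at height 1.5: every merge above this distance is
# ignored, and whatever sub-trees remain become the clusters.
labels = fcluster(Z, t=1.5, criterion="distance")
print("number of clusters at this level:", labels.max())
```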

Disadvantages

  • Hierarchical clustering does not scale well: computations are expensive, making it difficult to use with very large datasets.
  • All the approaches to calculating the similarity between clusters have their own disadvantages.
  • Once a decision is made to combine two clusters, it cannot be undone.
  • Involves a lot of arbitrary decisions (distance metric, linkage criterion).
  • Does not work very well with missing data.
  • Works poorly with mixed data types, again because of arbitrary decisions about distances.
  • Rarely provides the best solution: the result is easy to inspect only when two dimensions are present, so with high-dimensional data we may not even be aware that the solution is poor.
  • Different linkage measures suffer from one or more of the following problems:
    • Sensitivity to noise and outliers.
    • Difficulty handling clusters of different sizes.
    • A tendency to break large clusters.
    • Dependence of the final result on the order of the data.