Hierarchical Clustering in Python

Clustering is a technique for grouping similar data points together; the group of similar data points formed is known as a cluster. Often we have no labels for our data, which makes it difficult to draw insights and patterns from it. Unsupervised clustering techniques come into play in such situations. In hierarchical clustering, we construct a hierarchy of clusters.

Hierarchical clustering is an alternative to k-means clustering for identifying groups in a dataset. It does not require us to pre-specify the number of clusters, as the k-means approach does. It also has the added advantage over k-means of producing an attractive tree-based representation of the observations, called a dendrogram.

For example, when clustering countries, every country begins in a separate cluster; at each step, the two closest clusters are merged, and this continues until all countries are in a single cluster. This is "agglomerative" hierarchical clustering. An implementation of hierarchical clustering is provided in the SciPy package. Among other things, it allows us to build clusters from similarity matrices and make dendrogram plots.
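As a minimal sketch of this idea, the snippet below builds clusters of a handful of countries from a hypothetical pairwise distance matrix (the country names and distances are made up for illustration):

  import numpy as np
  from scipy.cluster.hierarchy import fcluster, linkage
  from scipy.spatial.distance import squareform

  countries = ["UK", "France", "Germany", "India", "China"]

  # Symmetric NxN distance matrix; smaller values mean "more similar".
  dist_matrix = np.array([
      [0.0, 1.0, 1.5, 6.0, 6.5],
      [1.0, 0.0, 1.2, 5.8, 6.3],
      [1.5, 1.2, 0.0, 5.5, 6.0],
      [6.0, 5.8, 5.5, 0.0, 1.1],
      [6.5, 6.3, 6.0, 1.1, 0.0],
  ])

  # linkage expects the condensed form, so convert with squareform.
  Z = linkage(squareform(dist_matrix), method="average")

  # Cut the tree into two flat clusters.
  labels = fcluster(Z, t=2, criterion="maxclust")
  print(dict(zip(countries, labels)))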

Hierarchical clustering can be divided into two main types: 

  • agglomerative
  • divisive 

Agglomerative clustering is good at identifying small clusters, while divisive hierarchical clustering is good at identifying large clusters.

Agglomerative clustering

It’s also known as AGNES (Agglomerative Nesting) and works in a bottom-up manner: each data point is initially considered a single-element cluster (leaf). At each step of the algorithm, the two clusters that are most similar are combined into a new, bigger cluster (node). This procedure is repeated until all points are members of one single big cluster (root). The result is a tree which can be plotted as a dendrogram.

The steps in hierarchical clustering are (a minimal sketch follows the list):

  1. Calculate the NxN distance (similarity) matrix, which contains the distance of each data point from every other point.
  2. Each item is first assigned to its own cluster, i.e. N clusters are formed.
  3. The clusters which are closest to each other are merged to form a single cluster.
  4. The same step of computing distances and merging the closest clusters is repeated until all points become part of a single cluster.
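These steps map almost directly onto SciPy. The sketch below runs them on a toy 2-D dataset (the data points are invented for illustration):

  import numpy as np
  from scipy.cluster.hierarchy import linkage
  from scipy.spatial.distance import pdist, squareform

  X = np.array([[1.0, 1.0], [1.5, 1.2], [5.0, 5.0], [5.2, 4.8], [9.0, 1.0]])

  # Step 1: the NxN distance matrix (pdist returns the condensed form).
  condensed = pdist(X, metric="euclidean")
  print(squareform(condensed))

  # Steps 2-4: linkage starts from N singleton clusters and repeatedly
  # merges the two closest; each row of Z records one merge as
  # [cluster_i, cluster_j, merge distance, size of new cluster].
  Z = linkage(condensed, method="single")
  print(Z)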

Divisive hierarchical clustering

It’s also known as DIANA (Divisive Analysis) and works in a top-down manner, the inverse order of AGNES. It begins at the root, where all objects are included in a single big cluster. At each iteration, the most heterogeneous cluster is divided into two. The process is repeated until every object is in its own single-element cluster.
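Neither SciPy nor scikit-learn ships DIANA itself. As a rough sketch of the top-down idea, scikit-learn's BisectingKMeans (available from version 1.1) starts from one big cluster and repeatedly bisects a cluster until the requested number of clusters is reached; the data below is made up for illustration:

  import numpy as np
  from sklearn.cluster import BisectingKMeans

  X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.3, 4.9], [9.0, 1.0]])

  # Top-down: split the whole dataset, then keep splitting one cluster
  # at a time until n_clusters is reached.
  model = BisectingKMeans(n_clusters=3, random_state=0)
  print(model.fit_predict(X))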

One of the biggest challenges with k-means is that we need to know the value of K beforehand. The hierarchical clustering algorithm does not have this restriction: it groups together the data points with similar characteristics, and the number of clusters can be chosen after the tree is built, as shown in the sketch below.
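A minimal sketch of this advantage, assuming the toy data and the arbitrary distance threshold below: because the whole tree is built first, we can pick the clusters afterwards by cutting the dendrogram at a chosen height instead of fixing K up front.

  import numpy as np
  from scipy.cluster.hierarchy import fcluster, linkage

  X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.3, 4.9], [9.0, 1.0]])

  Z = linkage(X, method="ward")  # linkage also accepts raw observations

  # Cut the tree wherever the merge distance exceeds 2.0 (an arbitrary
  # threshold for this toy data); no K was specified in advance.
  print(fcluster(Z, t=2.0, criterion="distance"))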

Plotting and creating Clusters

The sklearn.cluster module provides the AgglomerativeClustering class to perform clustering on the dataset.

As input arguments, it takes the number of clusters (n_clusters), affinity, which corresponds to the type of distance metric to use while creating clusters, and linkage ({“ward”, “complete”, “average”, “single”}, default=”ward”).

The linkage criterion determines which distance to use between the given sets of observations.
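A minimal usage sketch on toy data follows (the points are invented; note that in recent scikit-learn versions the affinity parameter has been renamed to metric):

  import numpy as np
  from sklearn.cluster import AgglomerativeClustering

  X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.3, 4.9], [9.0, 1.0]])

  # Ward linkage with the default Euclidean distance.
  model = AgglomerativeClustering(n_clusters=3, linkage="ward")
  labels = model.fit_predict(X)
  print(labels)  # cluster label assigned to each point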

Plotting Dendrogram

The scipy.cluster package contains the hierarchy module, which we’ll make use of to plot the dendrogram. The hierarchy module provides the dendrogram and linkage functions.

The linkage function takes the dataset and the method used to minimize distances (e.g. ward) as parameters, and returns a linkage matrix which, when passed to the dendrogram function, creates a dendrogram of the fitted data.
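Putting it together, here is a minimal sketch that plots a dendrogram for the same kind of toy data used above (points and leaf labels are made up for illustration):

  import matplotlib.pyplot as plt
  import numpy as np
  from scipy.cluster.hierarchy import dendrogram, linkage

  X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.3, 4.9], [9.0, 1.0]])

  # Build the linkage matrix with Ward's method, then plot the tree.
  Z = linkage(X, method="ward")
  dendrogram(Z, labels=["A", "B", "C", "D", "E"])
  plt.xlabel("Data point")
  plt.ylabel("Merge distance")
  plt.show()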