Clustering Interview Questions

Is validation required for clustering? If yes, why is it required?

Clustering algorithms tend to find clusters even in random data, so it is essential to validate that a non-random structure is actually present. It is also necessary to validate whether the number of clusters formed is appropriate.

Clusters are evaluated with or without an external reference to check how well the clustering fits the data. Evaluation is also used to compare clusterings and decide which of them is better.
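
For illustration, one common internal validation measure is the silhouette coefficient. The sketch below, using scikit-learn, compares the silhouette score of clearly structured data against uniformly random data (the synthetic data and the choice of three clusters are assumptions made for illustration):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import silhouette_score

    # Structured data: three well-separated blobs
    X_structured, _ = make_blobs(n_samples=300, centers=3, random_state=42)
    # Random data: uniform noise with no cluster structure
    X_random = np.random.RandomState(42).uniform(size=(300, 2))

    for name, X in [("structured", X_structured), ("random", X_random)]:
        labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
        # A high score indicates well-separated clusters; near zero suggests none
        print(name, silhouette_score(X, labels))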

What are the disadvantages of agglomerative hierarchical clustering?

Objective function: K-means has a global objective function (SSE), whereas hierarchical clustering has none; it only considers proximity locally when deciding which two clusters to merge.

Time and space complexity: agglomerative clustering requires O(n²) space for the proximity matrix and O(n³) time in the basic implementation, considerably more than K-means and, for large datasets, often prohibitive.

Final merging decisions: merges, once made by the algorithm, cannot be undone at a later point. Because of this, a sequence of locally optimal merges need not produce a globally optimal clustering. Note that some advanced approaches exist to overcome this problem.

What are the types of hierarchical clustering?

There are two types of hierarchical clustering: agglomerative clustering and divisive clustering.

Agglomerative clustering: initially, every data object is treated as its own cluster. At each step, the two nearest clusters are merged into a bigger cluster. Ultimately, a single cluster that encompasses all the data points remains.

Divisive clustering: the opposite of agglomerative clustering. All the data objects start in one single cluster, and at each step the algorithm splits a cluster. This repeats until only single data points remain, each forming a singleton cluster.
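
As a quick illustration, here is a minimal agglomerative example using SciPy (the toy data is an assumption for illustration; SciPy offers no divisive counterpart, so only the bottom-up variant is shown):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    # Two tight groups of points
    X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.9]])
    # Bottom-up merging: 'single' linkage fuses the two closest clusters at each step
    Z = linkage(X, method='single')
    # Cut the dendrogram to obtain two flat clusters
    labels = fcluster(Z, t=2, criterion='maxclust')
    print(labels)  # e.g. [1 1 2 2]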

Is K-means clustering suitable for all shapes and sizes of clusters?

K-means is not suitable for all shapes, sizes, and densities of clusters. If the natural clusters of a dataset differ greatly from a spherical shape, K-means will have great difficulty detecting them. It will also fail when the sizes or densities of the clusters differ by a large margin. This is mostly due to its use of SSE as the objective function, which favours compact, spherical clusters of similar size and density.
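
This failure mode is easy to reproduce. The sketch below (synthetic two-moons data, an assumption for illustration) shows K-means splitting two crescent-shaped clusters incorrectly:

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_moons
    from sklearn.metrics import adjusted_rand_score

    # Two interleaving half-moons: non-spherical natural clusters
    X, y_true = make_moons(n_samples=300, noise=0.05, random_state=42)
    y_pred = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

    # Agreement with the true labels is well below 1.0, i.e. the moons get cut apart
    print(adjusted_rand_score(y_true, y_pred))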

What is the objective function for measuring the quality of clustering in case of the K-means algorithm with Euclidean distance?

Sum of squared errors (SSE) is the objective function for K-means clustering with Euclidean distance. The Euclidean distance from each data point to its nearest centroid is squared, and these squared distances are summed to obtain the SSE. The aim of the algorithm is to minimize the SSE. Note that the SSE is computed over all the clusters formed by the algorithm.
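
The sketch below (synthetic data assumed for illustration) computes the SSE by hand and checks it against scikit-learn's inertia_ attribute, which stores the same quantity:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

    # SSE: squared Euclidean distance from each point to its nearest centroid, summed
    sse = ((X - km.cluster_centers_[km.labels_]) ** 2).sum()
    print(sse, km.inertia_)  # the two values agree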

How are outliers handled by the K-means algorithm?

How outliers should be handled differs from case to case: in some cases they provide very useful information, and in others they severely distort the results of the analysis. With that in mind, the issues that outliers cause for the K-means algorithm are described below.

In the presence of outliers, the centroids are no longer true representatives of their clusters, and the sum of squared errors (SSE) becomes very high. Outliers can also bond with small clusters, producing groupings that do not reflect the natural patterns in the data. For these reasons, outliers are usually removed before clustering the data.
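
One simple way to do this is a z-score filter, sketched below (the threshold of 3 standard deviations is a common convention, not a universal rule):

    import numpy as np

    def remove_outliers(X, z_thresh=3.0):
        # Keep rows whose features all lie within z_thresh standard deviations of the mean
        z = np.abs((X - X.mean(axis=0)) / X.std(axis=0))
        return X[(z < z_thresh).all(axis=1)]

    # X_clean = remove_outliers(X) can then be passed to K-means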

What are the different proximity functions or distance metrics used for the K-means algorithm?

Euclidean, Manhattan, cosine, and Bregman divergence are some of the proximity functions used with the K-means algorithm. Euclidean (usually squared) is the straight-line distance from a data point to the centroid. Manhattan is the absolute (L1) distance from a data point to the centroid. Cosine is the cosine distance from a data point to the cluster centroid. Bregman divergence is a class of distance measures that includes Euclidean, Mahalanobis, and cosine; essentially, it covers those measures for which the mean is the optimal centroid.
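
For reference, the first three can be computed with SciPy as sketched below (the point and centroid values are assumptions for illustration):

    from scipy.spatial.distance import euclidean, cityblock, cosine

    point = [1.0, 2.0, 3.0]
    centroid = [2.0, 2.0, 4.0]

    print(euclidean(point, centroid))   # L2 (straight-line) distance
    print(cityblock(point, centroid))   # L1 (Manhattan) distance
    print(cosine(point, centroid))      # cosine distance (1 - cosine similarity)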

Explain the steps of the K-means clustering algorithm. Mention the key steps that need to be followed and how the algorithm works.

The K-means algorithm works as follows (a minimal from-scratch sketch follows the list):

  • Select the initial centroids. The number of clusters (k) must be supplied by the user.
  • Assign each data point to its closest centroid.
  • Recalculate the centroid of each cluster and reassign the data points.
  • Repeat the previous step until convergence. Convergence is reached when no data point moves from one cluster to another, or equivalently, when the cluster centroids no longer change.
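
A minimal from-scratch sketch of these steps (a simplification for illustration; empty clusters and other edge cases are not handled):

    import numpy as np

    def kmeans(X, k, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        # Step 1: choose k initial centroids at random from the data
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            # Step 2: assign each point to its closest centroid
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Step 3: recompute each centroid as the mean of its assigned points
            new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            # Step 4: converged when the centroids no longer change
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        return labels, centroids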

K-Modes clustering: Bank Marketing Data

Download the data set from here. Some pointers before you proceed:

  • Use only the following columns: 'job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'day_of_week', 'poutcome', 'age', 'duration', 'euribor3m', where age, duration, and euribor3m are the numerical columns.
  • Convert all categorical columns to numeric using LabelEncoder().
  • Standardize all the columns before applying K-Prototypes clustering.
  • Remember that you also need to convert the final dataframe to a matrix before applying K-Prototypes.
  • First, run K-Prototypes with the number of clusters set to 5.
  • Keep in mind that the code may take some time to execute because of the many categorical variables, so be patient. A minimal sketch of this workflow follows the list.
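
A minimal sketch of the workflow above, assuming the kmodes package is installed and using 'bank.csv' as a placeholder for the downloaded file:

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder, StandardScaler
    from kmodes.kprototypes import KPrototypes

    cat_cols = ['job', 'marital', 'education', 'default', 'housing', 'loan',
                'contact', 'month', 'day_of_week', 'poutcome']
    num_cols = ['age', 'duration', 'euribor3m']

    df = pd.read_csv('bank.csv')[cat_cols + num_cols]  # placeholder file name
    # Convert categorical columns to numeric codes
    df[cat_cols] = df[cat_cols].apply(LabelEncoder().fit_transform)
    # Standardize the columns and obtain a plain matrix, as the pointers require
    X = StandardScaler().fit_transform(df)

    # The first len(cat_cols) columns are the categorical ones
    kp = KPrototypes(n_clusters=5, init='Huang', random_state=42)
    labels = kp.fit_predict(X, categorical=list(range(len(cat_cols))))
    print(kp.cost_)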

Q1: Does your final data set have any missing values in it? Please remember that this question must be answered after selecting the required columns stated in the pointers above.

  • True
  • False

Explanation: Use df.info() and check.

Q2: What is the average "Duration" before standardising the data? 

Answer: 258.285. Check your answer by running df.describe().

Q3: Run a loop to check the cost against the number of clusters ranging from 1 to 8 and identify a suitable number of clusters. (More than one answer may be correct.)

Answer: 4. The answer is subjective and depends on the business problem being solved. A sketch of the loop is shown below.
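
This sketch reuses X and cat_cols from the earlier workflow sketch (assumptions carried over from there):

    costs = []
    for k in range(1, 9):
        kp = KPrototypes(n_clusters=k, init='Huang', random_state=42)
        kp.fit_predict(X, categorical=list(range(len(cat_cols))))
        costs.append(kp.cost_)

    # Plot costs against k and look for an 'elbow' where the decrease levels off
    print(costs)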

Download the solution from here.