Introduction to K-Modes Clustering

K-modes is a technique for clustering categorical data. Clustering techniques can generally be classified into two groups: hierarchical and partitioning. Hierarchical algorithms can be further divided into bottom-up and top-down algorithms, and partitioning algorithms include k-means and k-modes.

K-means clustering works efficiently only for numerical datasets. It does not give proper results for categorical data, because categories have no proper spatial representation and distances between them are not meaningful. As a result, k-means fails to find patterns in categorical datasets. This is where k-modes clustering comes in.

The k-modes algorithm extends k-means to categorical data by replacing the means of clusters with modes, introducing a different dissimilarity measure, and updating the modes with a frequency-based method. It defines clusters based on the number of matching categories between data points. (This is in contrast to the more well-known k-means algorithm, which clusters numerical data based on Euclidean distance.)
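To make the idea of "number of matching categories" concrete, here is a minimal sketch of a matching dissimilarity in plain Python. The function name and the sample records are made up for illustration and are not part of the kmodes library.

```python
def matching_dissimilarity(a, b):
    """Count the attributes on which two categorical records differ."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of attributes")
    return sum(x != y for x, y in zip(a, b))


# Hypothetical records: (colour, size, shape)
p = ("red", "small", "round")
q = ("red", "large", "round")
r = ("blue", "large", "square")

print(matching_dissimilarity(p, q))  # 1: only the size differs
print(matching_dissimilarity(p, r))  # 3: every attribute differs
print(matching_dissimilarity(q, r))  # 2: colour and shape differ
```

A lower count means the two records agree on more attributes, so they are more likely to end up in the same cluster.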

Install kmodes using pip

kmodes can be installed using pip:

pip install kmodes

To upgrade to the latest version (recommended), run it like this:

pip install --upgrade kmodes

The k-modes algorithm accepts np.NaN values as missing values in the X matrix. However, users are strongly encouraged to fill in the missing data themselves in a way that makes sense for the problem at hand. This is especially important when there are many missing values.

The k-modes algorithm currently handles missing data as follows. When fitting the model, np.NaN values are encoded into their own category (let's call it "unknown values"). When predicting, the model treats any values in X that it has not seen during training, or that are missing, as members of the "unknown values" category. Simply put, the algorithm treats any missing / unseen data as matching with each other but mismatching with non-missing / seen data when determining similarity between points.
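The behaviour described above can be illustrated with a small encoder. This is only a sketch of the idea, not the library's internal code: the `UNKNOWN` code, the helper names, and the sample column are all assumptions made for the example.

```python
import math

UNKNOWN = -1  # hypothetical code for the "unknown values" category


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def fit_encoder(column):
    """Map each non-missing category seen during fitting to an integer code."""
    mapping = {}
    for value in column:
        if not is_missing(value) and value not in mapping:
            mapping[value] = len(mapping)
    return mapping


def encode(column, mapping):
    """Encode a column; missing and unseen values share the UNKNOWN code."""
    return [UNKNOWN if is_missing(v) else mapping.get(v, UNKNOWN) for v in column]


train = ["red", float("nan"), "blue", "red"]
mapping = fit_encoder(train)

print(encode(train, mapping))                             # [0, -1, 1, 0]
print(encode(["green", "blue", float("nan")], mapping))   # [-1, 1, -1]
```

Because "green" (unseen) and NaN (missing) receive the same code, they match each other and mismatch every category seen during fitting.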

The k-prototypes algorithm, a related method for data that mixes numerical and categorical variables, also accepts np.NaN values as missing values for the categorical variables, but does not accept missing values for the numerical variables. It is up to the user to come up with a way of handling these missing data that is appropriate for the problem at hand.

The k-modes implementation offers support for multiprocessing through the n_jobs parameter. It generally does not make sense to set more jobs than there are processor cores available on your system.

This potentially speeds up any execution with more than one initialization try (n_init > 1), which may help reduce the execution time for larger problems. Note that whether multiprocessing actually helps depends on your problem, so be sure to try it out first.

The k-modes clustering algorithm is an extension of the k-means clustering algorithm. The k-means algorithm is one of the most widely used centre-based partitional clustering algorithms. Huang extended k-means to k-modes in order to group categorical data.

The modifications made to k-means are:

(i) using a simple matching dissimilarity measure for categorical objects,

(ii) replacing means of clusters by modes, and

(iii) using a frequency-based method to update the modes.
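The three modifications can be sketched together in a few lines of plain Python. This is a minimal illustration, not the kmodes library's implementation: the function names, the `init_modes` parameter, and the toy records are assumptions made for the example, and the initial modes are passed in by hand instead of being chosen by an initialization scheme.

```python
from collections import Counter


def dissimilarity(a, b):
    # (i) simple matching: number of attributes on which the records differ
    return sum(x != y for x, y in zip(a, b))


def column_modes(records):
    # (ii) and (iii): a cluster is represented by its mode, built by taking
    # the most frequent category of each attribute among the cluster members
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))


def k_modes(data, init_modes, max_iter=20):
    """Alternate between assigning records and updating modes until stable."""
    modes = list(init_modes)
    k = len(modes)
    labels = None
    for _ in range(max_iter):
        # Assign each record to the nearest mode
        new_labels = [
            min(range(k), key=lambda j: dissimilarity(x, modes[j])) for x in data
        ]
        if new_labels == labels:
            break  # assignments no longer change
        labels = new_labels
        # Recompute each mode from its current members
        for j in range(k):
            members = [x for x, lab in zip(data, labels) if lab == j]
            if members:  # keep the old mode if a cluster becomes empty
                modes[j] = column_modes(members)
    return labels, modes


# Hypothetical records: (colour, size, shape)
data = [
    ("red", "small", "round"),
    ("red", "small", "square"),
    ("red", "large", "round"),
    ("blue", "large", "square"),
    ("blue", "large", "round"),
    ("blue", "small", "square"),
]

labels, modes = k_modes(data, init_modes=[data[0], data[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
print(modes)   # [('red', 'small', 'round'), ('blue', 'large', 'square')]
```

The loop has the same shape as k-means; only the distance function and the way cluster representatives are recomputed differ.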