Real-world data come in different types, such as numerical and categorical. k-Means is a commonly used clustering algorithm that scales efficiently to large data sets, but it only works on numerical data. To address this, Huang proposed k-Modes, an algorithm designed to cluster categorical data.
The k-Modes approach adapts the standard k-Means procedure to categorical data in three ways: it replaces the Euclidean distance with the simple matching dissimilarity measure, it represents cluster centers by modes instead of means, and it updates each mode with the most frequent categorical values in every iteration of the clustering process. This procedure guarantees that the clustering converges to a local minimum. The number of modes equals the number of clusters required, since the modes act as centroids. The simple matching dissimilarity corresponds to the Hamming distance from information theory: it counts the attributes on which two data points differ. The smaller the dissimilarity, the more similar the data points are. The mode of a cluster is built attribute by attribute from the most common value in that cluster; for a binary attribute it is either “1” or “0,” whichever is more common. The mode vector minimizes the sum of the distances between each object in the cluster and the cluster center.
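To make the dissimilarity measure and the mode concrete, here is a minimal Python sketch. The function names `matching_dissimilarity` and `cluster_mode`, as well as the toy records, are illustrative assumptions and are not taken from Huang's work or from any particular library.

```python
from collections import Counter


def matching_dissimilarity(a, b):
    """Simple matching dissimilarity: number of attributes on which a and b differ."""
    return sum(x != y for x, y in zip(a, b))


def cluster_mode(records):
    """Attribute-wise mode: the most frequent value in each column of the cluster."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*records))


# Hypothetical cluster of three categorical records (color, size, shape).
cluster = [
    ("red", "small", "round"),
    ("red", "large", "round"),
    ("blue", "small", "round"),
]

mode = cluster_mode(cluster)
# Sum of dissimilarities between each record and the mode of the cluster.
total_cost = sum(matching_dissimilarity(record, mode) for record in cluster)
print(mode, total_cost)
```

In this toy cluster the mode is built column by column, so it does not have to coincide with any single record, although here it happens to match the first one.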
- Randomly select K observations to serve as the initial cluster modes.
- Calculate the dissimilarity between each observation and each mode, and assign the observation to its closest cluster.
- Repeat the assignment until all objects are assigned to clusters.
- Then select a new mode for each cluster and compare it with the previous mode. If any mode has changed, go back to the assignment step; otherwise, stop.
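The steps above can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not a reference implementation: the names `k_modes`, `dissimilarity`, and `mode_of`, the `max_iter` and `seed` parameters, the rule for keeping the old mode when a cluster becomes empty, and the sample data are all choices made for this example.

```python
import random
from collections import Counter


def dissimilarity(a, b):
    """Number of attributes on which two records differ."""
    return sum(x != y for x, y in zip(a, b))


def mode_of(records):
    """Most frequent value of each attribute across the given records."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))


def k_modes(data, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Randomly pick k observations as the initial modes.
    modes = rng.sample(data, k)
    labels = [0] * len(data)
    for _ in range(max_iter):
        # Assign every observation to the cluster with the closest mode.
        labels = [
            min(range(k), key=lambda j: dissimilarity(row, modes[j]))
            for row in data
        ]
        # Recompute the mode of each cluster from its current members.
        new_modes = []
        for j in range(k):
            members = [row for row, lab in zip(data, labels) if lab == j]
            # Assumption: an empty cluster simply keeps its previous mode.
            new_modes.append(mode_of(members) if members else modes[j])
        # Stop when no mode changed; otherwise repeat the assignment.
        if new_modes == modes:
            break
        modes = new_modes
    return modes, labels


# Hypothetical categorical data set with three attributes per record.
data = [
    ("a", "x", "p"), ("a", "x", "q"), ("a", "y", "p"),
    ("b", "z", "r"), ("b", "z", "s"), ("b", "w", "r"),
]
modes, labels = k_modes(data, k=2)
print(modes, labels)
```

Because the initial modes are drawn at random, different seeds can lead to different local minima, so in practice it is common to run the procedure several times and keep the clustering with the lowest total dissimilarity.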