Introduction to Clustering in Machine Learning

Clustering is a technique for finding similarity groups in data, called clusters. I.e., it groups data instances that are similar to (near) each other in one cluster and data instances that are very different (far away) from each other into different clusters. Clustering is often called an unsupervised learning task as no class values denoting an a priori grouping of the data instances are given, which is the case in supervised learning. Due to historical reasons, clustering is often considered synonymous with unsupervised learning. In fact, association rule mining is also unsupervised.

  • The process of finding groups in data.
  • The process of dividing the data into homogeneous groups.
  • The process of dividing the data into groups, where points within each group are close (or similar) to each other.
  • The process of dividing the data into groups, where points within each group are close (or similar) to each other, and where points of different groups are far (or dissimilar) from each other. The process of dividing the feature space into regions with relatively high density of points, separated by regions with relatively low density of points.

Clustering Quality

The quality of a clustering result depends on the algorithm, the distance function, and the application.

  1. Inter-clusters distance = maximized
  2. Intra-clusters distance = minimized

Clustering Methods

The clustering methods are broadly divided into Hard clustering (data point belongs to only one group) and Soft Clustering (data points can belong to another group also).

  • Hard clustering : each object belongs to a cluster or not
  • Soft clustering : each object belongs to each cluster to a certain degree 

Clustering techniques 

1.) Partitioning Clustering : algorithm typically determines all clusters at once, but can also be used as divisive algorithms in Hierarchical clustering 

  1. Centroid 
  2. Model based 
  3. Graph Theoretic 
  4. Spectral 

2.) Hierarchical Clustering:  A Hierarchical clustering method find successive cluster using previous established clusters.

  • Agglomerative hierarchical clustering : In agglomerative or bottom-up clustering method we assign each observation to its own cluster. Then, compute the similarity (e.g., distance) between each of the clusters and join the two most similar clusters.
  • Divisive Hierarchical clustering : In divisive or top-down clustering method we assign all of the observations to a single cluster and then partition the cluster to two least similar clusters.

3.) Bayesian Clustering  algorithms try to generate a posteriori distribution over the collection of all partitions of the data.

  1. Decision Based 
  2. Non-parametric 

    Clustering Algorithms: 

    1. Hierarchical vs. Partitional Methods: Hierarchical clustering algorithms induce on the data a clustering structure parameterized by a similarity parameter. Once the learning phase ends, the user can then obtain immediately different data clusterings by specifying different values of the similarity index. Partitional methods essentially produce one partition of the data into clusters.
    2. Agglomerative vs. Divisive Methods: Agglomerative methods start by assigning each sample to its own cluster, and proceed by merging clusters. Divisive methods start by assigning all the samples to a unique claster, and proceed by splitting clusters.
    3. Monothetic vs. Polythetic Methods: Monothetic methods learn clusters using one feature at a time. Polythetic methods use collections of features (e.g., by combining them linearly). The vast majority of clustering methods are polythetic, although people have devised (mostly unsuccessful) monothetic methods.
    4. Hard vs. Fuzzy: In hard clustering each sample belong to one and only one cluster (once a strategy of how to assign samples that lie on cluster boundaries have been decided.) In fuzzy clustering, samples have different degrees of membership to different clusters. In general, membership functions are normalized so that the sum of the degrees of membership of a point to clusters sum to 1. For example, say that we cluster skies into sunny and cloudy. A sky that is 30% covered by clouds could be characterized as a 70% sunny - 30% cloudy sky.
    5. Deterministic vs. Probabilistic Clusters: If clusters are deterministic, a point either belongs to a cluster or does not belong to it. If clusters are probabilistic, a point belongs to a certain cluster with a certain probability. An example of probabilistic clusters are the components of a Gaussian mixture. Since each component is Gaussian, it has infinite size, and overlaps all the other components of the mixture. In fuzzy clusters, a point can belong to multiple clusters. Also, probabilities and fuzzy membership functions are manipulated differently: probabilities rely on sums and products (the probability of the union of disjoint events is the product of the probabilities, the probability of the intersection of independent events is the product of the probabilities, etc. etc.), while usually the operation applied to fuzzy membership functions are the maximum and the minimum.
    6. Deterministic vs. Stochastic Algorithms: If all the steps in the clustering algorithm are deterministic, the method is deterministic. Some methods use randomized steps, and fall in the other category.
    7. Incremental vs. Non-Incremental: Some methods need all the sample points from the very beginning. Other can be initialized with fewer samples, and incrementally refined. The latter methods are sometimes better suited for large datasets.
    8. K-Means algorithm: It classifies the dataset by dividing the samples into different clusters of equal variances. The number of clusters must be specified in this algorithm.  
    9. Mean-shift algorithm: Mean-shift algorithm tries to find the dense areas in the smooth density of data points. It is an example of a centroid-based model, that works on updating the candidates for centroid to be the center of the points within a given region.
    10. DBSCAN Algorithm: It stands for Density-Based Spatial Clustering of Applications with Noise. It is an example of a density-based model similar to the mean-shift, but with some remarkable advantages. In this algorithm, the areas of high density are separated by the areas of low density. Because of this, the clusters can be found in any arbitrary shape.
    11. Expectation-Maximization Clustering using GMM: This algorithm can be used as an alternative for the k-means algorithm or for those cases where K-means can be failed. In GMM, it is assumed that the data points are Gaussian distributed.
    12. Affinity Propagation: Each data point sends a message between the pair of data points until convergence. 


    Let us see some real-life examples

    Example 1: groups people of similar sizes together to make “small”, “medium” and “large” T-Shirts. Tailor-made for each person: too expensive. One size fits all: does not fit all. 

    Example 2: In marketing, segment customers according to their similarities To do targeted marketing

    Example 3: Given a collection of text documents, we want to organize them according to their content similarities, To produce a topic hierarchy

    In fact, clustering is one of the most utilized data mining techniques. It has a long history, and used in almost every field, e.g., medicine, psychology, botany, sociology, biology, archeology, marketing, insurance, libraries, etc.


    Clustering has a long history and still is in active research

    • There are a huge number of clustering algorithms, among them: Density based algorithm, Sub-space clustering, Scale-up methods, Neural networks based methods, Fuzzy clustering, Co-clustering …
    • More are still coming every year
    • Clustering is hard to evaluate, but very useful in practice 
    • Clustering is highly application dependent (and to some extent subjective)
    • Competitive learning in neuronal networks performs clustering analysis of the input data