Objective function: SSE is the objective function for K-means. In contrast, hierarchical clustering has no global objective function; it considers proximity only locally when deciding which two clusters to merge.
Time and space complexity: The time and space complexity of agglomerative clustering is higher than that of K-means, since the standard algorithm stores a proximity matrix whose size grows with the square of the number of data points. For large data sets, this can be prohibitive.
Final merging decisions: Once the algorithm has merged two clusters, the decision cannot be undone at a later point. Because of this, a locally optimal merge is not guaranteed to lead to a globally optimal clustering. Note that some advanced approaches are available to overcome this problem.
There are two types of hierarchical clustering: agglomerative clustering and divisive clustering.
Agglomerative clustering: In this algorithm, every data object is initially treated as a cluster of its own. In each step, the two nearest clusters fuse to form a bigger cluster. This repeats until a single cluster, which encompasses all the data points, remains.
Divisive clustering: This is the opposite of agglomerative clustering. Here, all the data objects initially belong to one single cluster. In each step, the algorithm splits a cluster. This repeats until only single data points remain, which are considered singleton clusters.
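The agglomerative procedure described above can be sketched in a few lines of plain Python. This is only a minimal illustration: it assumes single linkage (the distance between two clusters is the smallest distance between their members), which the notes do not specify, and the 1-D data and the names single_link_distance and agglomerative are made up for the example.

```python
from itertools import combinations


def single_link_distance(a, b, points):
    """Smallest pairwise distance between members of clusters a and b."""
    return min(abs(points[i] - points[j]) for i in a for j in b)


def agglomerative(points):
    """Merge the two nearest clusters until one remains; return the merge history."""
    # Every data point starts as its own cluster (stored as a tuple of indices).
    clusters = [(i,) for i in range(len(points))]
    history = []
    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest single-link distance.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: single_link_distance(pair[0], pair[1], points))
        clusters.remove(a)
        clusters.remove(b)
        merged = tuple(sorted(a + b))
        clusters.append(merged)
        history.append(merged)  # merges are final and never revisited
    return history


points = [1.0, 2.0, 6.0, 7.0, 15.0]  # made-up 1-D data
merges = agglomerative(points)
print(merges)
```

Each entry in the history is the cluster formed at that step, and the last entry contains every point. A divisive algorithm would run in the opposite direction, starting from that last cluster and splitting it step by step.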
Handling of outliers differs from case to case. In some cases, outliers provide very useful information, and in others, they severely affect the results of the analysis. Having said that, let’s look at some of the issues that outliers cause in the K-means algorithm.
In the presence of outliers, the centroids will not be a true representation of their clusters, and the sum of squared errors (SSE) will be very high. Outliers may also form small clusters of their own or pull in a few nearby points, which may not reflect the natural cluster patterns in the data. For these reasons, outliers need to be removed before proceeding with clustering on the data.
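The effect on the centroid and on the SSE can be checked with a small numerical example. The sketch below uses made-up 1-D values and hypothetical helper names (centroid, sse); it is only meant to show the direction and rough size of the effect.

```python
def centroid(values):
    """Mean of the cluster members."""
    return sum(values) / len(values)


def sse(values):
    """Sum of squared distances of the members from their centroid."""
    c = centroid(values)
    return sum((v - c) ** 2 for v in values)


cluster = [10.0, 11.0, 12.0, 13.0, 14.0]  # made-up, tightly packed cluster
with_outlier = cluster + [100.0]           # same cluster plus one outlier

print(centroid(cluster), sse(cluster))
print(centroid(with_outlier), sse(with_outlier))
```

A single outlier moves the centroid away from every genuine member of the cluster and increases the SSE by several orders of magnitude.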
The K-means algorithm is as follows:
Select the initial centroids. The number of centroids (K) must be given by the user.
Assign each data point to its closest centroid.
Recalculate the centroid of each cluster and assign the data points again.
Repeat the same procedure until convergence. Convergence is achieved when no data point is reassigned from one cluster to another, or when the centroids of the clusters no longer change.
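The steps above can be sketched in plain Python as follows. This is a minimal illustration rather than a production implementation: it assumes Euclidean distance and picks the initial centroids at random from the data unless they are passed in, and the function names, the init parameter and the sample points are all made up for the example.

```python
import math
import random


def closest(point, centroids):
    """Index of the centroid nearest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def kmeans(points, k, init=None, seed=0, max_iter=100):
    """Return (labels, centroids) for a list of equal-length tuples."""
    rng = random.Random(seed)
    # Select initial centroids: use the ones supplied, else k random data points.
    centroids = list(init) if init is not None else rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assign every point to its closest centroid.
        new_labels = [closest(p, centroids) for p in points]
        # Convergence: no point changed cluster since the last pass.
        if new_labels == labels:
            break
        labels = new_labels
        # Recalculate each centroid as the mean of its members.
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


# Made-up 2-D data with two visibly separate groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
labels, centroids = kmeans(points, k=2, init=[(1.0, 1.0), (1.0, 2.0)])
print(labels, centroids)
```

Even though both initial centroids are deliberately placed in the same group here, the repeated assign-and-recalculate passes move one of them to the other group before the assignments stop changing.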
Download the data set from here. Some pointers before you proceed:
Use only the following columns: 'job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'day_of_week', 'poutcome', 'age', 'duration', 'euribor3m', where age, duration and euribor3m are the numerical columns.
Convert all categorical columns to numeric by using LabelEncoder()
Standardize all the columns before using K-Prototype clustering
Remember that you also need to convert the final dataframe to a matrix for applying K-Prototype.
First check K-prototype with the number of clusters as 5.
Please keep in mind that the code may take some time to execute because there are many categorical variables, so be patient.
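The pointers above could be followed roughly as in the sketch below. It is an illustration under stated assumptions, not the reference solution: the toy rows are made up so that the block runs without the real file, the helper names prepare_matrix and fit_kprototypes are hypothetical, and the K-Prototype step assumes the third-party kmodes package is installed (it is imported lazily and not called here).

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

CATEGORICAL = ['job', 'marital', 'education', 'default', 'housing', 'loan',
               'contact', 'month', 'day_of_week', 'poutcome']
NUMERICAL = ['age', 'duration', 'euribor3m']


def prepare_matrix(df):
    """Select the 13 columns, label-encode, standardise, return a NumPy matrix."""
    data = df[CATEGORICAL + NUMERICAL].copy()
    for col in CATEGORICAL:
        # Convert each categorical column to integer codes.
        data[col] = LabelEncoder().fit_transform(data[col].astype(str))
    # StandardScaler returns a NumPy array, so this is already a matrix.
    return StandardScaler().fit_transform(data)


def fit_kprototypes(matrix, n_clusters=5):
    """Assumption: requires the third-party `kmodes` package."""
    from kmodes.kprototypes import KPrototypes
    model = KPrototypes(n_clusters=n_clusters, init='Cao', random_state=42)
    # The categorical columns come first in the matrix, at positions 0..9.
    labels = model.fit_predict(matrix, categorical=list(range(len(CATEGORICAL))))
    return model, labels


# Made-up rows, only to show the expected shape of the input.
toy = pd.DataFrame({
    'job': ['admin.', 'technician', 'services', 'admin.'],
    'marital': ['married', 'single', 'married', 'divorced'],
    'education': ['high.school', 'university.degree', 'basic.9y', 'high.school'],
    'default': ['no', 'no', 'unknown', 'no'],
    'housing': ['yes', 'no', 'yes', 'no'],
    'loan': ['no', 'no', 'yes', 'no'],
    'contact': ['cellular', 'telephone', 'cellular', 'cellular'],
    'month': ['may', 'jun', 'may', 'nov'],
    'day_of_week': ['mon', 'tue', 'wed', 'mon'],
    'poutcome': ['nonexistent', 'failure', 'success', 'nonexistent'],
    'age': [35, 42, 29, 51],
    'duration': [120, 300, 95, 410],
    'euribor3m': [4.857, 4.961, 1.313, 4.120],
})
matrix = prepare_matrix(toy)
print(matrix.shape)
```

On the real data, the matrix returned by prepare_matrix would be passed to fit_kprototypes, and the fitted model's cost_ attribute could then be compared across different numbers of clusters.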
Q1: Does your final data set have any missing values? Remember to answer this question only after selecting the required columns listed in the pointers above.
True
False
Explanation: Use df.info() and check.
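Besides reading the non-null counts from df.info(), the check can be made explicit with isnull(). The snippet below is a small sketch on made-up frames; has_missing_values is a hypothetical helper name, and on the real data the column list would be the 13 columns from the pointers.

```python
import pandas as pd


def has_missing_values(df, columns):
    """True if any of the selected columns contains a missing value."""
    return bool(df[columns].isnull().values.any())


# Made-up frames to show both outcomes.
complete = pd.DataFrame({'age': [30, 41], 'job': ['admin.', 'services']})
incomplete = pd.DataFrame({'age': [30, None], 'job': ['admin.', 'services']})

print(has_missing_values(complete, ['age', 'job']))
print(has_missing_values(incomplete, ['age', 'job']))
```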
Q2: What is the average 'duration' before standardising the data?
Answer: 258.285. Check your answer by running df.describe()
Q3: Run a loop to check the cost against the number of clusters, ranging from 1 to 8, and identify a suitable number of clusters. (More than one answer may be correct.)
Answer: 4. The answer is subjective and based on the business problem we are trying to solve.