A Practical Guide on K-Means Clustering (2024)

  1. What is KMeans?
    a. Python implementation
  2. Things to know before using KMeans
    a. K-Means cannot handle non-globular structure
    b. K-Means is sensitive to outliers
    c. Should you scale your data before using KMeans?
  3. How do you measure the performance of your model?

Note — The datasets used in the article are generated using sklearn’s make_blobs and make_moons methods
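For reference, here is a sketch of how such toy datasets can be generated; the sample sizes, noise level, and random seeds below are illustrative choices, not the article’s exact parameters.

from sklearn.datasets import make_blobs, make_moons

# globular data: isotropic Gaussian blobs (parameters are illustrative)
X_blobs, _ = make_blobs(n_samples=1000, centers=3, cluster_std=1.0, random_state=42)

# non-globular data: two interleaving half-moons
X_moons, _ = make_moons(n_samples=1000, noise=0.05, random_state=42)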

What is KMeans?

K-Means divides the dataset into k clusters (k is a hyper-parameter) using an iterative optimization strategy. Each cluster is represented by a centre, and a point belongs to the cluster whose centre is closest to it. For simplicity, assume that the centres are randomly initialized.

The goal of the model is to minimize the total SSE over the k clusters by shifting their centres. The SSE (Sum of Squared Errors) of a cluster is the sum of squared distances between its centre and the points assigned to it.
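Concretely, for a cluster j with centre c_j:

SSE(cluster j) = sum over points x in cluster j of ||x - c_j||^2

and the quantity K-Means minimizes is the sum of these SSEs over all k clusters.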

Using calculus, one can prove that the best way to minimize the SSE is to move the cluster centre to the centroid (=average) of all points in the cluster. We re-assign the points to the clusters based on the same closest centre strategy with the updated centres.

Repeat the process until the centres barely move, i.e., until the algorithm converges.

K-Means summary
------------------------------
X -> dataset of N points and M features ((N,M) matrix)
k -> number of clusters (hyper-parameter)
centers -> k randomly initialized centers ((k,M) matrix)

for i in n_steps:
    - assign each point to the cluster with the nearest center;
      the points in the jth cluster are denoted by X_j
    - for j in [1, 2, ..., k]:
          updated_centers[j] = mean(X_j)

    - if distance(updated_centers, centers) is small:
          exit
      else:
          centers = updated_centers
          continue

Python Implementation
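The listing below is a minimal NumPy sketch of the algorithm summarized above; the function name, parameters, and random-initialization strategy are illustrative, and in practice you would usually reach for sklearn.cluster.KMeans.

import numpy as np

def kmeans(X, k, n_steps=100, tol=1e-6, seed=0):
    """Minimal K-Means on an (N, M) array X with k clusters (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # randomly pick k points of the dataset as the initial centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_steps):
        # assign each point to the cluster with the nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)  # (N, k)
        labels = dists.argmin(axis=1)
        # move each center to the centroid (mean) of the points assigned to it
        updated_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # stop once the centers barely move
        if np.linalg.norm(updated_centers - centers) < tol:
            centers = updated_centers
            break
        centers = updated_centers
    return centers, labels

Calling centers, labels = kmeans(X_blobs, k=3) on the blob data sketched earlier would return the fitted centres and each point’s cluster index.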

Things to know before using KMeans

Most ML models make assumptions about the data they are fitted on, and it is essential to check these before inferring anything from a trained model. For K-Means, here they are —

K-Means cannot handle non-globular structure

Datasets can exhibit any number of patterns that can be interpreted visually. The job of a clustering algorithm is to capture this information. Different algorithms use different strategies: prototype-based algorithms like K-Means use a centroid as a reference (=prototype) for each cluster, while density-based algorithms like DBSCAN use the density of data points to form clusters.

Consider the two datasets below —

Globular (globe-like)

[Figure: globular dataset]

Non-Globular

[Figure: non-globular dataset]

K-Means would capture better structural semantics for the globular data. This is evident from how K-Means is fitted. Recall that K-Means does the following:

Each cluster has a centroid. A point belongs to the cluster with the closest centroid. K-Means minimizes the total SSE by iteratively moving the centroids.

In a way, K-Means works by creating hard partitions of the dataset, which act as the cluster boundaries. For example —

[Figure: five K-Means clusters and the hard partition between them]

In this example, with five clusters, the central one (C1) is flanked by the other four on all sides. That four-sided enclosure is the decision boundary that determines whether a point is placed in cluster 1 or not, based on the closest centroid.

For a new point, the cluster assignment depends solely on its distance from the centroids. The position of the point does not matter as long as it is closest to a particular centroid. This works well in a globular structure. Recall that the boundary of a circle is the locus of points at a fixed distance from the centre.

For contrast, let’s look at how K-Means works out in a non-globular setting.

[Figure: K-Means clusters on the non-globular dataset]

K-Means is sensitive to outliers

Since K-Means is a distance-based algorithm, it is susceptible to outliers. At every update step, the centroids are re-computed by averaging the points in each cluster, and averages, as we know, are sensitive to outliers. For example —

[Figure: K-Means clusters pulled off-centre by an outlier]

How to solve this issue?

Remember that outliers contribute disproportionately to the update step. You can spot them by plotting the density of centroid-to-point distances. In the above example, the distribution of distances from center2 to the points in cluster2 looks like —

[Figure: distribution of distances from center2 to the points in cluster2]

You can remove this point from the dataset to get the actual clusters.
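As a sketch of this idea, assuming X holds the dataset; the two-cluster setting and the 99th-percentile cutoff below are illustrative choices, not a rule from the article.

import numpy as np
from sklearn.cluster import KMeans

# fit K-Means, then compute each point's distance to its assigned centroid
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
dists = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

# points far in the tail of this distribution are outlier candidates;
# the cutoff is an illustrative choice and should be tuned per dataset
cutoff = np.percentile(dists, 99)
X_clean = X[dists <= cutoff]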

Another way is to increase the number of clusters, hoping that the outliers form a cluster of their own. In the case above, if we set the number of clusters to 3, we get —

[Figure: clustering with three clusters; the outlier forms its own cluster]

This also shows up in the elbow plot (discussed later in the article). Adding the third cluster drastically reduces the sum of intracluster variances, so one would notice a sharp drop in the plot.

Should you scale your data before using KMeans?

Many ML algorithms benefit from scaling the features using methods like min-max scaling or standard scaling, where the ‘benefit’ is measured by an improvement in the evaluation metric.

How does scaling affect KMeans? How do we know if it would be good or not? Let’s understand what scaling does to our model.

Suppose we have two features, X1 and X2, where X1 ranges from -1 to 1 and X2 from -100 to 100. While computing the intracluster variances, X2 will contribute far more to the SSE than X1, so the model can reduce the SSE mostly by minimizing the contribution from X2. This won’t happen if you use standard scaling, where each feature is transformed as —

X_transformed = (X-X_mean)/X_std_dev
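With scikit-learn, the transformation is a one-liner; the cluster count below is an illustrative choice, and X is assumed to be the raw feature matrix.

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# standard-scale each feature to zero mean and unit variance before clustering
X_scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)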

Let’s look at examples!

Scaled features —

[Figure: clusters on scaled features]

Same data, but features are unscaled —

[Figure: clusters on unscaled features]

Do you want a small set of features to dominate your cluster analysis, or do you want all of your features to contribute equally?

How do you measure the performance of your model?

The unavailability of labels limits what you can say about a clustering model, but all is not lost. In clustering, we mainly want to look at how well the clusters are formed, although the definition of ‘well’ is not very precise.

Ideally, we want the clusters to be well separated, large but dense.

You can get the largest cluster by setting k (the number of clusters) to one. You can get the densest clusters by using a large k, creating many tiny micro-clusters. In both cases, we lose any insightful information we might want to infer from the clusters.

Luckily, we can measure largeness, density and separation (a small sketch for the first two follows the definitions below; separation is covered by the silhouette score later on).

Largeness  --> number of points in the cluster.
Density    --> average distance between two points in a cluster (smaller means denser).
Separation --> overlap between two clusters.
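A minimal sketch for the first two measures, assuming X and labels come from a fitted model as in the earlier snippets:

import numpy as np
from scipy.spatial.distance import pdist

for j in np.unique(labels):
    members = X[labels == j]
    size = len(members)                # largeness: number of points in the cluster
    avg_dist = pdist(members).mean()   # density proxy: mean pairwise distance (O(n^2); sample large clusters)
    print(f"cluster {j}: {size} points, mean pairwise distance {avg_dist:.3f}")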

Consider a KMeans model being fitted on a dataset. How do we decide the number of clusters? Remember that KMeans minimizes the SSE across clusters. As we increase k, our SSE across clusters would decrease. At k=number of points in the dataset, the SSE will be 0 for all clusters because each point will be its own cluster and its own centroid. However, this is not useful. Therefore, we want to increase k, but only to a point where the drop in SSE on further increment is marginal —

[Figure: elbow plot of SSE vs number of clusters]

We see that the drop in SSE after k=5 is marginal. Therefore, we may choose 5 clusters.
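A sketch of how such an elbow plot can be produced with scikit-learn; the range of k and the random seed are illustrative, and X is assumed to be the dataset.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# SSE across clusters (scikit-learn exposes it as inertia_) for a range of k
ks = range(1, 11)
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, sse, marker="o")
plt.xlabel("number of clusters (k)")
plt.ylabel("SSE (inertia)")
plt.show()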

Are our clusters large enough? Let’s check —

cluster index    number of points
---------------------------------
      0                5001
      1                5001
      2                9952
      3                5047
      4                4999

Seems so. Each cluster has a considerable number of points. It is tempting to choose k=5 and move on in life. However, we haven’t yet checked the degree of separation: you can have large, dense clusters that still overlap, and we don’t want that.

Let me introduce you to the silhouette score. It is a metric that tells you how much overlap exists between clusters. Ranging between -1 and 1, the larger the score, the smaller the overlap. It is calculated for each point p in the dataset by computing two measures —

  1. a = the average distance from p to all the other points in its own cluster (the average intra-cluster distance).
  2. b = the average distance from p to the points of a cluster that p does not belong to. If there are N clusters, we get N-1 such averages; take the minimum of these and call it b.

silhouette score for p = (b - a) / max(a, b)

To get a score on a cluster level, average the scores of each point in the cluster.
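scikit-learn exposes both the overall score and the per-point scores; below is a sketch of the cluster-level averaging described above, where k=5 and the seed are illustrative and X is assumed to be the dataset.

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)

overall = silhouette_score(X, km.labels_)          # mean score over all points
per_point = silhouette_samples(X, km.labels_)      # one score per point
per_cluster = {j: per_point[km.labels_ == j].mean() for j in range(5)}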

Let’s look at it intuitively. If two clusters overlap, many of the overlapping points will have a lower b (and vice versa for well-separated clusters), and a lower b means a lower score. If a=0, the score is 1; this only happens when there is a single point in the cluster. If the expression above is unclear, refer to the visualization below —

[Figure: visualization of the silhouette score computation]

Plot silhouette score vs number of clusters —

[Figure: silhouette score vs number of clusters]
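Such a plot can be produced by sweeping over k; k=1 is excluded because the silhouette score needs at least two clusters, and the range and seed below are illustrative.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

ks = range(2, 11)
scores = [silhouette_score(X, KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)) for k in ks]

plt.plot(ks, scores, marker="o")
plt.xlabel("number of clusters (k)")
plt.ylabel("mean silhouette score")
plt.show()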

We can now be reasonably confident that k=5 yields good clusters in terms of size, density, and separation. Note that these measures focus on the distribution of points in the embedding space; the semantics of the clusters depend on the application, for example, inspecting clusters of documents for topic dominance.

K-Means is a powerful tool for digging into a complex dataset and finding patterns using Euclidean distances. Make sure you follow the above guidelines to avoid any inaccuracies. Hope you found them useful.

If you liked the article, do subscribe to my list and hit the follow button on your right. It would mean a lot to me!

Good day :)
