K Means Clustering with Python (2024)

In this article, we will cover the basics of K Means Clustering and implement it in Python using the popular machine learning library Scikit-learn.

K Means Clustering is an unsupervised machine learning algorithm. It takes unlabeled data and divides it into groups (clusters) based on the patterns in the data.

AudreyBu once said:

The objective of K-means is simple: group similar data points together and discover underlying patterns. To achieve this objective, K-means looks for a fixed number of clusters (k) in a dataset.

In order to explain the working of the K Means algorithm, let's assume we have some data plotted out using a scatter graph.

After examining the data, we can see that it can be divided into two separate groups, or clusters. K Means will likewise divide the data into two clusters and mark a boundary for each one. Whenever a new data point is fed to the model, it checks which boundary the point falls within and returns that cluster's name or number.

First, we choose the number of clusters (k); in this case, k = 2. We then pick 2 points at random to act as our initial cluster centroids. (A cluster centroid is the center point of a cluster.)

There are two main steps in K Means Clustering:

Cluster Assignment Step: In this step, each data point is assigned to the cluster of the centroid nearest to it.
Move Centroid Step: In this step, we will compute the mean of all data points in a cluster and move the centroid of that cluster to that mean position.

We will repeat the above two steps until one of the following conditions is true:

Our centroids stop changing their positions.
The maximum number of iterations is reached.

Our data is now arranged into clusters.
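To make the two steps and the stopping conditions concrete, here is a minimal from-scratch sketch in NumPy. The function name k_means, its parameters, and the toy points are our own illustrative choices, and leaving an empty cluster's centroid where it is counts as a simplification; this is not how Scikit-learn implements the algorithm internally.

```python
import numpy as np

def k_means(points, k, max_iters=100, seed=0):
    """Minimal K Means sketch (hypothetical helper): returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Initialise the centroids by picking k distinct data points at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iters):  # stopping condition 2: maximum number of iterations
        # Cluster assignment step: label each point with its nearest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Move centroid step: move each centroid to the mean of its points.
        # An empty cluster keeps its previous centroid (a simplification).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stopping condition 1: the centroids have stopped moving.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Two well-separated toy groups (made-up coordinates)
points = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
                   [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]])
centroids, labels = k_means(points, k=2)
print(labels)
print(centroids)
```

Which group ends up as cluster 0 and which as cluster 1 depends on the random initialisation, so only the grouping itself is meaningful, not the label numbers.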

To choose the number of clusters that suits our data, we can use the Elbow Method. The basic idea is to plot the cost (error, also called distortion) against the number of clusters (k). As k increases, each cluster contains fewer data points, and those points lie closer to their centroid, so the average distortion decreases. The Elbow Point is the value of k beyond which the distortion stops dropping sharply and the curve begins to flatten.
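As a rough sketch of the Elbow Method, we can fit K Means for several values of k and record the distortion of each fit. Here we assume Scikit-learn's inertia_ attribute (the sum of squared distances from each point to its nearest centroid) as the cost; the range of k values and the synthetic dataset are arbitrary choices made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data generated around 4 centers (illustrative settings)
X, _ = make_blobs(n_samples=200, n_features=2, centers=4,
                  cluster_std=1.8, random_state=101)

inertias = {}
for k in range(1, 10):
    # n_init=10 reruns K Means from 10 random starts and keeps the best fit
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(f"k={k}: distortion={inertia:.1f}")
```

Plotting these values against k (for example with plt.plot(list(inertias), list(inertias.values()))) gives the elbow curve; the elbow is the k after which the printed values stop falling quickly.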

Sometimes the curve is smooth, with no clear elbow point, which makes it hard to choose the value of k. In that case, we rely on domain knowledge and continued experimentation to determine the value of k.

Now we have a good understanding of how the K Means Clustering algorithm works. Next, we will apply K Means to a dataset to build a clearer intuition. For this, we will use Python's popular machine learning library, Scikit-learn.

Scikit-learn (also known as sklearn) is a machine learning library for Python. It includes various classification, regression, and clustering algorithms, such as support vector machines (SVM), random forests, gradient boosting, k-means, and DBSCAN, and is designed to work with Python libraries such as NumPy, Pandas, and SciPy.

K Means Clustering is a very straightforward and easy-to-use algorithm. With the help of the Scikit-learn library, implementing and using it has become quite easy. Now, let's start using sklearn.

Importing important libraries in Python

import seaborn as sns
import matplotlib.pyplot as plt

Creating Artificial Data

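from sklearn.datasets import make_blobs
# make_blobs returns a tuple: data[0] holds the points (200 x 2), data[1] the true cluster labels
data = make_blobs(n_samples=200, n_features=2, centers=4, cluster_std=1.8, random_state=101)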

Visualizing our Data

plt.scatter(data[0][:,0],data[0][:,1],c=data[1],cmap='rainbow')

Creating Clusters

from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=4)
kmeans.fit(data[0])

After importing KMeans from sklearn.cluster, we create an object kmeans of the KMeans class. One thing you may find odd is that we have specified n_clusters=4. This is because we specified centers=4 while creating the data, so we know that this data should have 4 clusters and can set the value manually. If we didn't know the number of centers, we would have to use the Elbow Method to determine the correct number of clusters.

Moving forward, the kmeans.fit(data[0]) call analyzes the data, forms the clusters, and fits the centroid of every cluster to its appropriate position.

Now in order to check the position of our centroids, we can use the following code.

print(kmeans.cluster_centers_)

It will print a (4, 2) array, with one row per cluster giving the position of that cluster's centroid.
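Once the centroids are fitted, the model can also tell us which cluster a new data point falls into, as described earlier. The sketch below assumes the same synthetic data as above and uses two made-up points; it then checks by hand that predict returns the index of the nearest centroid.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, n_features=2, centers=4,
                  cluster_std=1.8, random_state=101)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# Two hypothetical new observations (coordinates chosen arbitrarily)
new_points = np.array([[0.0, 0.0], [-9.0, -6.0]])
predicted = kmeans.predict(new_points)
print(predicted)

# Manual check: index of the closest centroid for each new point
distances = np.linalg.norm(
    new_points[:, None, :] - kmeans.cluster_centers_[None, :, :], axis=2
)
nearest = distances.argmin(axis=1)
print(nearest)
```

Both printed arrays should agree, because assigning a new point to a cluster is the same nearest-centroid rule used in the Cluster Assignment Step.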

Comparing Original dataset VS after applying K Means

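# Left: points colored by the K Means cluster labels; right: colored by the true labels
f, (ax1, ax2) = plt.subplots(1, 2, sharey=True, figsize=(10, 6))
ax1.set_title('K Means')
ax1.scatter(data[0][:, 0], data[0][:, 1], c=kmeans.labels_, cmap='rainbow')
ax2.set_title('Original')
# Color by data[1] so the two plots can be compared; the colors need not match,
# because K Means numbers its clusters arbitrarily
ax2.scatter(data[0][:, 0], data[0][:, 1], c=data[1], cmap='rainbow')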
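The plots give a visual comparison only. If we also want a number, one option (our suggestion, not part of the walkthrough above) is the adjusted Rand index from sklearn.metrics, which compares two groupings while ignoring how the clusters happen to be numbered. This is only possible here because the artificial data comes with true labels; real unlabeled data would not have them.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X, true_labels = make_blobs(n_samples=200, n_features=2, centers=4,
                            cluster_std=1.8, random_state=101)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# 1.0 means the two groupings are identical; values near 0 mean chance-level agreement
score = adjusted_rand_score(true_labels, kmeans.labels_)
print(f"Adjusted Rand index: {score:.2f}")
```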

CONGRATULATIONS! We have successfully implemented K Means Clustering on our dataset.

So far, we have learned what the K Means Clustering algorithm is, how it works, and how to choose the value of k. We have also implemented K Means on a dataset using Python's popular machine learning library, Scikit-learn.