Tutorial for K Means Clustering in Python Sklearn

  • Veer Kumar
  • Last Updated On February 27, 2022
  • Python


Introduction:

In this tutorial, we will learn how to apply K-means clustering with the Sklearn library. We will first have a brief overview of what is meant by clustering, followed by an understanding of what the K-means algorithm is. Then we will go through the working principle of the K-means algorithm, after which we shall implement an end-to-end example in which we use this algorithm to perform customer segmentation on the ‘Mall_Customers.csv’ dataset. We will then look at two methods, namely the Elbow and Silhouette methods, by which we can determine the optimum number of clusters within a given dataset.

What is Clustering?

Clustering is the task of segmenting a set of data into distinct groups such that the data points in the same group bear similar characteristics, as opposed to data points that lie in other groups/clusters. Our main objective here is to segregate data points having similar characteristics and assign them to unique clusters.

  • The points present in the same cluster should have similar properties
  • The points present in different clusters should be as dissimilar as possible

What is K-Means Algorithm?

K-Means Clustering comes under the category of Unsupervised Machine Learning algorithms, which group an unlabeled dataset into distinct clusters. K defines the number of pre-defined clusters that need to be created; for instance, if K=2, there will be 2 clusters, and similarly for K=3 there will be three clusters. The primary goal while implementing K-means is to define K clusters such that the total within-cluster variation (or error) is minimum.

The cluster center is the arithmetic mean of all the data points that belong to that cluster. The squared distance between each point and its cluster center is its contribution to the variation. The goal of K-means clustering is to ascertain these K clusters and their centers while minimizing the total error.
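
In other words, given clusters C_1, …, C_k with centers μ_1, …, μ_k, K-Means seeks the assignment of points to clusters that minimizes the total within-cluster sum of squares, Σ_j Σ_{x ∈ C_j} ||x − μ_j||².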

[Figure: illustration of K-Means clusters and their centroids]

How does the K-Means Algorithm Work?

The steps of the working principle that governs the K-Means algorithm are listed below:

Step-1: To decide the number of clusters, we select an appropriate value of K.

Step-2: Choose K random points as the initial centroids.

Step-3: Each data point is assigned to its nearest centroid, forming K clusters.

Step-4: Calculate the variance and position a new centroid for every cluster.

Step-5: Repeat the 3rd step, i.e. reassign every data point to its new nearest centroid.

Step-6: If any reassignment has occurred, go back to step-4; otherwise execution finishes.

Step-7: Finally, the model is ready.
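
To make these steps concrete, here is a minimal NumPy sketch of this loop (an illustration only; Sklearn's implementation adds smarter initialization such as k-means++ and handles edge cases like empty clusters):

import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=0):
    """Minimal K-Means on an (n_samples, n_features) array X."""
    rng = np.random.default_rng(seed)
    # Step 2: choose K random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Steps 3/5: assign every point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the mean of its assigned points
        # (assumes no cluster becomes empty)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 6: stop once no centroid moves, i.e. no reassignment occurred
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids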

Example of K Means Clustering in Python Sklearn

We can easily implement K-Means clustering in Python with the KMeans() function of the sklearn.cluster module. For this example, we will use the Mall Customer dataset to segment the customers into clusters based on their Age, Annual Income, Spending Score, etc.

Import Libraries

Let us import the libraries that we will require.

In[1]:

from sklearn.cluster import KMeans
from sklearn import preprocessing
import sklearn.cluster as cluster
import sklearn.metrics as metrics
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline

Load Dataset

Let us load the dataset into a dataframe and take a look at some of its rows. Also, with the shape attribute, it can be seen that the dataset has 200 rows and 4 columns.

In[2]:

df = pd.read_csv(r"Mall_Customers.csv")
df.head()

Out[2]:

    Gender  Age  Annual Income (k$)  Spending Score (1-100)
0   Male    19   15                  39
1   Male    21   15                  81
2   Female  20   16                  6
3   Female  23   16                  77
4   Female  31   17                  40

In [3]:

df.shape

Out[3]:

(200, 4)

Objective

Customer segmentation deals with grouping customers together based on common patterns in their attributes. To keep the example simple and to visualize the clustering on a 2-D graph, we will use only two attributes: Annual Income and Spending Score. Don’t worry, later we will also show you how you can use more than 2 attributes for clustering and still visualize the results in 2-D with the help of Principal Component Analysis (PCA).

Apply Feature Scaling

Clustering algorithms like K-means require feature scaling of the data as part of data preprocessing to produce good results. This is because clustering techniques rely on distance calculations between the data points; without scaling, a feature measured in large units (e.g. income in thousands) would dominate a feature measured in small units. Hence it is important to bring data of different units onto a common scale.

  • For more details, you may read the following article –

Sklearn Feature Scaling with StandardScaler, MinMaxScaler, RobustScaler and MaxAbsScaler

We have used MinMaxScaler for our example, as shown below; it transforms each feature as x' = (x − min) / (max − min), mapping it to the [0, 1] range. In the new dataframe df_scale it can be seen that both attributes, Annual Income and Spending Score, are normalized.

In [4]:

scaler = MinMaxScaler()
scale = scaler.fit_transform(df[['Annual Income (k$)','Spending Score (1-100)']])
df_scale = pd.DataFrame(scale, columns = ['Annual Income (k$)','Spending Score (1-100)'])
df_scale.head(5)

Out[4]:

    Annual Income (k$)  Spending Score (1-100)
0   0.000000            0.387755
1   0.000000            0.816327
2   0.008197            0.051020
3   0.008197            0.775510
4   0.016393            0.397959

Applying Kmeans with 2 Clusters (K=2)

Let us see how to apply K-Means in Sklearn to group the scaled dataset into 2 clusters (0 and 1). The output shows the cluster (0 or 1) assigned to each data point in the dataset.

In [5]:

km = KMeans(n_clusters=2)
y_predicted = km.fit_predict(df_scale[['Annual Income (k$)','Spending Score (1-100)']])
y_predicted

Out[5]:

array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

To find the centroids of the two clusters, the cluster_centers_ attribute can be used as shown below.

In [6]:

km.cluster_centers_

Out[6]:

array([[0.36929553, 0.31163817], [0.37861485, 0.73950929]])
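
Since the model was fit on the scaled data, these centers lie in the [0, 1] range. As a small sketch (assuming the scaler fitted above is still in scope), they can be mapped back to the original units:

# Convert the scaled centroids back to k$ / spending-score units
print(scaler.inverse_transform(km.cluster_centers_))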

Finally, let us visualize the results. From the graph, it is evident that there is scope for the data to be grouped into more than 2 clusters. But how do we know how many clusters? Let us understand this in the next section.

In [7]:

df['Clusters'] = km.labels_
sns.scatterplot(x="Spending Score (1-100)", y="Annual Income (k$)", hue='Clusters', data=df, palette='viridis')

Out[7]:

<AxesSubplot:xlabel='Spending Score (1-100)', ylabel='Annual Income (k$)'>

[Figure: Scatter plot of the K=2 clustering result]

Finding the Optimum Number of Clusters in K Means

The tricky part with K-Means clustering is that you do not know in advance how many clusters the given data can be divided into (hence it is an unsupervised learning algorithm). This can be done by trial and error, but let us see more systematic techniques.

i) Elbow Method with Within-Cluster-Sum of Squared Error (WCSS)

The Elbow Method is a popular technique for determining the optimal number of clusters. Here, we calculate the Within-Cluster Sum of Squared Errors (WCSS) for various values of K and choose the K at which the WCSS stops diminishing sharply. In the plot of WCSS versus K, this can be observed as an elbow.

  • The Squared Error for a data point is the square of the distance of the point from its cluster center.
  • The WCSS score is the sum of these Squared Errors over all data points.
  • Distance metrics like the Euclidean Distance or the Manhattan Distance can be used.

Continuing with our example, we calculate the WCSS for each value of K from 2 to 11.

In [8]:

K = range(2, 12)
wss = []
for k in K:
    kmeans = cluster.KMeans(n_clusters=k)
    kmeans = kmeans.fit(df_scale)
    wss_iter = kmeans.inertia_
    wss.append(wss_iter)

Let us now plot the WCSS vs K graph. It can be seen below that there is an elbow bend at K=5, i.e. it is the point after which WCSS does not diminish much with an increase in the value of K.

In [9]:

plt.xlabel('K')
plt.ylabel('Within-Cluster-Sum of Squared Errors (WSS)')
plt.plot(K, wss)

Out[9]:

[Figure: Elbow plot of WCSS vs K, with a bend at K=5]

ii) The Silhouette Method

The silhouette value measures how similar a data point is to its own cluster compared to other clusters. It ranges between +1 and -1, and higher values denote better clustering.
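
For a single point, the silhouette value is s = (b − a) / max(a, b), where a is the mean distance from the point to the other points of its own cluster and b is the mean distance from the point to the points of the nearest neighbouring cluster. A value close to +1 means the point sits well inside its cluster, while a negative value suggests it may have been assigned to the wrong cluster.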

Below we calculate the Silhouette Score for K=2 to 12, and it can be seen that the maximum value occurs at K=5. This is in line with the elbow method.

In [10]:

import sklearn.cluster as cluster
import sklearn.metrics as metrics

for i in range(2, 13):
    labels = cluster.KMeans(n_clusters=i, random_state=200).fit(df_scale).labels_
    print("Silhouette score for k(clusters) = " + str(i) + " is "
          + str(metrics.silhouette_score(df_scale, labels, metric="euclidean",
                                         sample_size=1000, random_state=200)))

Out[10]:

Silhouette score for k(clusters) = 2 is 0.33340205479521
Silhouette score for k(clusters) = 3 is 0.4514909309424474
Silhouette score for k(clusters) = 4 is 0.49620078745146784
Silhouette score for k(clusters) = 5 is 0.5594854531227246
Silhouette score for k(clusters) = 6 is 0.5380652777999084
Silhouette score for k(clusters) = 7 is 0.43787929453711455
Silhouette score for k(clusters) = 8 is 0.43074523601514214
Silhouette score for k(clusters) = 9 is 0.4421331695270676
Silhouette score for k(clusters) = 10 is 0.44563877983976935
Silhouette score for k(clusters) = 11 is 0.44293254541345917
Silhouette score for k(clusters) = 12 is 0.4427512711673661
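
As an optional sketch (not part of the original notebook), the same scores can be collected into a list and plotted, which makes the peak at K=5 easy to see:

# Recompute the silhouette scores and plot them against K
ks = list(range(2, 13))
scores = []
for k in ks:
    labels = cluster.KMeans(n_clusters=k, random_state=200).fit(df_scale).labels_
    scores.append(metrics.silhouette_score(df_scale, labels, metric="euclidean"))

plt.xlabel('K')
plt.ylabel('Silhouette score')
plt.plot(ks, scores)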

Applying Kmeans with 5 Clusters (K=5)

Now that we have identified that the optimum value of K is 5, let us fit the model with 5 clusters.

In [11]:

# We will use 2 variables for this example
kmeans = cluster.KMeans(n_clusters=5, init="k-means++")
kmeans = kmeans.fit(df[['Annual Income (k$)','Spending Score (1-100)']])

Finally, let us plot the graph with K=5 clusters, and we can see that the K-Means result now looks good.

In [12]:

df['Clusters'] = kmeans.labels_
sns.scatterplot(x="Spending Score (1-100)", y="Annual Income (k$)", hue='Clusters', data=df, palette='viridis')

Out[12]:

[Figure: Scatter plot of the K=5 clustering result]
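
A fitted KMeans model can also assign a cluster to unseen data with its predict() method. As a quick sketch, consider a hypothetical new customer with an annual income of 90 k$ and a spending score of 20 (values chosen purely for illustration):

# Hypothetical new customer (illustrative values only)
new_customer = pd.DataFrame({'Annual Income (k$)': [90],
                             'Spending Score (1-100)': [20]})
print(kmeans.predict(new_customer))  # index of the nearest cluster centroid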

K Means Clustering in Python Sklearn with Principal Component Analysis

In the above example, we used only two attributes to perform clustering because it is easier for us to visualize the results in a 2-D graph. We cannot visualize anything beyond 3 attributes in 3-D, and in real-world scenarios there can be hundreds of attributes. So how can we visualize the clustering results?

Well, it can be done by applying principal component analysis (PCA) on the dataset to reduce its dimensionality to two components while preserving as much of the information as possible. Clustering can then be applied to this transformed dataset and visualized in a 2-D plot. Moreover, PCA can also help to avoid the curse of dimensionality.

  • For more details on PCA, you can read the following article –

Complete Tutorial of PCA in Python Sklearn with Example

So let us see this practically below where we will use 3 attributes on the same dataset.

Load Dataset

Let us again load the dataset into the dataframe as before.

In [13]:

df = pd.read_csv(r"Mall_Customers.csv")
df.head()

Out[13]:

    Gender  Age  Annual Income (k$)  Spending Score (1-100)
0   Male    19   15                  39
1   Male    21   15                  81
2   Female  20   16                  6
3   Female  23   16                  77
4   Female  31   17                  40

Apply Feature Scaling

This time we are applying feature scaling on our desired columns Age, Annual Income and Spending Score.

In [14]:

scaler = MinMaxScaler()
scale = scaler.fit_transform(df[['Age','Annual Income (k$)','Spending Score (1-100)']])
df_scale = pd.DataFrame(scale, columns = ['Age','Annual Income (k$)','Spending Score (1-100)'])
df_scale.head(5)

Out[14]:

    Age       Annual Income (k$)  Spending Score (1-100)
0   0.019231  0.000000            0.387755
1   0.057692  0.000000            0.816327
2   0.038462  0.008197            0.051020
3   0.096154  0.008197            0.775510
4   0.250000  0.016393            0.397959

Applying PCA

Now let us reduce the dimensionality of the dataset to two components.

In [15]:

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
principalComponents = pca.fit_transform(df_scale)
pca_df = pd.DataFrame(data=principalComponents, columns=['principal component 1', 'principal component 2'])
pca_df.head()

Out[15]:

    principal component 1  principal component 2
0   -0.192221               0.319683
1   -0.458175              -0.018152
2    0.052562               0.551854
3   -0.402357              -0.014239
4   -0.031648               0.155578
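
As a quick sanity check (not in the original notebook), the explained_variance_ratio_ attribute of the fitted PCA object reports how much of the original variance the two components retain:

# Share of the original variance captured by each principal component
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())  # total variance kept by the 2 components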

Finding Optimum Value of K

i) Elbow Method with Within-Cluster-Sum of Squared Error (WCSS)

Let us again use the elbow method with the Within-Cluster Sum of Squared Errors (WCSS) to determine the optimum value of K. From the graph, it looks like there is a bend between K=5 and K=6.

In [16]:

K = range(2, 12)
wss = []
for k in K:
    kmeans = cluster.KMeans(n_clusters=k)
    kmeans = kmeans.fit(pca_df)
    wss_iter = kmeans.inertia_
    wss.append(wss_iter)

In [17]:

plt.xlabel('K')
plt.ylabel('Within-Cluster-Sum of Squared Errors (WSS)')
plt.plot(K, wss)

Out[17]:

[Figure: Elbow plot of WCSS vs K for the PCA-transformed data, with a bend between K=5 and K=6]

ii) The Silhouette Method

Using the Silhouette method, it can be seen that apart from the trivial K=2 case the Silhouette value peaks at K=5, which agrees with the elbow plot above. Hence it can be concluded that the dataset can be segmented well with 5 clusters.

In[18]:

import sklearn.cluster as cluster
import sklearn.metrics as metrics

for i in range(2, 12):
    labels = cluster.KMeans(n_clusters=i, random_state=200).fit(pca_df).labels_
    print("Silhouette score for k(clusters) = " + str(i) + " is "
          + str(metrics.silhouette_score(pca_df, labels, metric="euclidean",
                                         sample_size=1000, random_state=200)))

Out[18]:

Silhouette score for k(clusters) = 2 is 0.4736269407502857
Silhouette score for k(clusters) = 3 is 0.44839082753844756
Silhouette score for k(clusters) = 4 is 0.43785291876777566
Silhouette score for k(clusters) = 5 is 0.45130680489606634
Silhouette score for k(clusters) = 6 is 0.4507847568968469
Silhouette score for k(clusters) = 7 is 0.4458795480456887
Silhouette score for k(clusters) = 8 is 0.4132957148795121
Silhouette score for k(clusters) = 9 is 0.4170428610065107
Silhouette score for k(clusters) = 10 is 0.4309783655094101
Silhouette score for k(clusters) = 11 is 0.42535265774570674

Applying Kmeans with 5 Clusters (K=5)

K-Means clustering is applied with the Sklearn KMeans() function by passing n_clusters=5.

In [19]:

kmeans = cluster.KMeans(n_clusters=5)
kmeans = kmeans.fit(pca_df)

On visualizing the clustering it can be seen that it has produced good results.

In [20]:

pca_df['Clusters'] = kmeans.labels_
sns.scatterplot(x="principal component 1", y="principal component 2", hue='Clusters', data=pca_df, palette='viridis')

Out[20]:

[Figure: K=5 clustering result visualized on the two principal components]
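
As an aside, the scaling, PCA, and clustering steps can also be chained together with Sklearn's Pipeline. Below is a compact sketch that is equivalent in spirit to the cells above (a random_state is added here for reproducibility; it is not in the original cells):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Chain scaling -> PCA -> K-Means into a single estimator
pipe = Pipeline([
    ('scale', MinMaxScaler()),
    ('pca', PCA(n_components=2)),
    ('kmeans', KMeans(n_clusters=5, random_state=200)),
])
labels = pipe.fit_predict(df[['Age', 'Annual Income (k$)', 'Spending Score (1-100)']])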

Conclusion

We hope you liked our tutorial and now better understand how to implement K-means clustering using Sklearn (Scikit-Learn) in Python. Here, we have illustrated an end-to-end example of using a dataset to build a K-means clustering model for customer segmentation.

References:

  • https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
  • https://realpython.com/k-means-clustering-python/
  • https://www.kaggle.com/khotijahs1/k-means-clustering-of-iris-dataset
  • https://heartbeat.fritz.ai/k-means-clustering-using-sklearn-and-python-4a054d67b187

