Gianluca Turcatel
K-Means Clustering From Scratch
Introduction
K-Means clustering is an unsupervised machine learning algorithm that seeks to group similar data points together. It aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (the cluster center, or centroid). For instance, if we are given the observations shown in the figure below, we would like K-Means clustering to group the data into 4 distinct clusters:
Figure 1. Toy dataset.
Distance metrics
To know which cluster an observation is closest to (and thus belongs to), we need a distance metric. The distance metrics normally utilized are based on the Minkowski distance:
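$$D(a, b) = \left( \sum_{i=1}^{n} |a_i - b_i|^p \right)^{1/p}$$

where a = (a1, …, an) and b = (b1, …, bn) are two observations with n features, and p ≥ 1.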
The Minkowski distance is a generalization of the Euclidean distance and the Manhattan distance. When p = 1, we obtain the Manhattan distance (aka L1 distance metric) between the two points:
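$$D_{L1}(a, b) = \sum_{i=1}^{n} |a_i - b_i|$$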
When p = 2, we obtain the Euclidean distance (aka L2 distance):
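$$D_{L2}(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$$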
Finally, when p = ∞, we obtain the Chebyshev distance, which is equivalent to the maximum among all |ai - bi|:
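$$D_{L\infty}(a, b) = \max_{i} |a_i - b_i|$$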
The figure below depicts how the three distances differ for 3 random data points on a 2D Cartesian graph (i.e. 2 features):
Figure 2. Comparison of the 3 distance metrics on a toy dataset.
The Euclidean distance is the most commonly used metric. However, with very high-dimensional data the Manhattan distance should be utilized to reduce the effect of the curse of dimensionality.
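To make the three metrics concrete, here is a small NumPy sketch; the function names and the two example points are mine, not part of the article's linked code:

```python
import numpy as np

def minkowski_distance(a, b, p=2):
    """Minkowski distance of order p (p=1: Manhattan, p=2: Euclidean)."""
    return np.sum(np.abs(a - b) ** p) ** (1 / p)

def chebyshev_distance(a, b):
    """Chebyshev (L-infinity) distance: the largest coordinate-wise difference."""
    return np.max(np.abs(a - b))

a = np.array([1.0, 5.0])
b = np.array([4.0, 1.0])

print(minkowski_distance(a, b, p=1))  # Manhattan: |1-4| + |5-1| = 7.0
print(minkowski_distance(a, b, p=2))  # Euclidean: sqrt(3**2 + 4**2) = 5.0
print(chebyshev_distance(a, b))       # Chebyshev: max(3, 4) = 4.0
```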
Number of clusters
One of the most critical steps of the K-Means clustering algorithm is picking the proper number of clusters. There are three ways to determine it:
1 - Rule of thumb. Set the number of clusters to the square root of half the number of observations. For instance, with 500 data points, the number of clusters would be sqrt(500/2) ≈ 16.
2 - Use another unsupervised method. Another clustering algorithm, similar to K-Means, could be used to tell us the most appropriate number of clusters. However, this strategy is seldom employed.
3 - Elbow method. This is the most commonly used method to determine the number of clusters, and it leverages a statistical metric of homogeneity called WCSS (Within-Cluster Sum of Squares). Clustering implies grouping similar objects together, and WCSS decreases as the number of clusters increases. If we grouped all the data points of Figure 1 into one single cluster, homogeneity would be the lowest (and WCSS the highest); on the other hand, if each observation became its own cluster, WCSS would be zero (the highest homogeneity possible). The optimal number of clusters lives somewhere between these two extreme scenarios. The elbow method comes to the rescue: WCSS drops quickly as the number of clusters increases but eventually evens out, creating an elbow-like shape in the plot. At the elbow, the optimal number of clusters has been reached. Beyond the elbow, adding more clusters barely improves the overall cluster homogeneity, meaning the data is being unnecessarily subdivided.
WCSS is a function of the number of clusters and is given by:
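$$WCSS = \sum_{j=1}^{k} \sum_{x_i \in \text{cluster } j} \lVert x_i - C_j \rVert^2$$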
where Cj is the center of cluster j. Given the following observations and centroids:
Figure 3. Calculation of WCSS.
where A and B belong to cluster C1; C, D and E belong to cluster C2; C1 has coordinates (4.5, 5.5); C2 has coordinates (13, 6.3). WCSS is then the sum of the squared distances of A and B from C1, plus the squared distances of C, D and E from C2.
Note that each cluster's center (aka centroid) is located at the center (mean) of the data points assigned to it.
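As a sketch, WCSS could be computed in NumPy as follows; the function name and its arguments are illustrative, not taken from the linked code:

```python
import numpy as np

def wcss(X, labels, centroids):
    """Within-Cluster Sum of Squares: squared distance of every point to its assigned centroid."""
    total = 0.0
    for j, centroid in enumerate(centroids):
        cluster_points = X[labels == j]                    # observations assigned to cluster j
        total += np.sum((cluster_points - centroid) ** 2)  # sum of squared Euclidean distances
    return total
```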
The figure below depicts an ideal plot of WCSS as a function of the number of clusters. WCSS drops sharply until n = 4, but for n > 4 the decrease in WCSS is barely noticeable: the optimal number of clusters in this example is therefore 4.
Figure 4. Elbow method.
Centroids Initialization
One last thing to discuss before we start coding is how to initialize the position of the clusters' centers. There are a few methods available:
1 - Random uniform initialization. Each cluster center is randomly positioned within the data space, between the minimum and maximum values of each feature. This is the simplest, but not the best, way to initialize the centers: if they are not properly spread throughout the entire data space, clustering may be sub-optimal or, potentially, wrong.
2 - Assign centroids to data points. k data points are randomly chosen to be the centroids. This is a little better than option 1, but can still lead to bad clustering.
3 - k-means++. This method tries to uniformly spread the centroids as far from one another as possible, covering the data space as much as possible. The intuition behind this approach is that a homogeneous spread of the centroids is a good thing. The approach goes like this: the first centroid is randomly picked among all observations; each subsequent centroid is assigned to a data point chosen at random with probability proportional to that point's squared distance to the closest already-assigned centroid. In other words, the further a data point is from its closest already-assigned centroid, the higher the chance that point will become a centroid (a sketch of this procedure is shown right after this list).
4 - Naive Sharding. This is a deterministic method that evenly splits the data into k shards based on the value of the row-wise sum of all the features. For each shard, the centroid is given by the mean of its features. More details on Naive Sharding and its implementation can be found HERE.
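Here is a minimal sketch of the k-means++ initialization described in option 3; the function name is hypothetical and the linked code may differ in the details:

```python
import numpy as np

def kmeans_plus_plus_init(X, k, seed=None):
    """Pick k initial centroids from the rows of X with the k-means++ scheme."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # First centroid: a data point chosen uniformly at random.
    centroids = [X[rng.integers(n)]]
    for _ in range(k - 1):
        # Squared distance of every point to its closest already-chosen centroid.
        sq_dist = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        # Sample the next centroid with probability proportional to that squared distance.
        probs = sq_dist / sq_dist.sum()
        centroids.append(X[rng.choice(n, p=probs)])
    return np.array(centroids)
```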
The algorithm
We have all the ingredients to code the K-Means clustering algorithm. K-Means clustering is an iterative algorithm: it terminates when the centers of the clusters stop moving or the max number of iterations is reached.
The pseudocode is the following:
STEP1: Select the number of clusters
STEP2: Initialize the centroids' position
STEP3: Calculate the distance between every data point and every centroid
STEP4: Assign each data point to the closest centroid
STEP5: Update the center of each cluster based on the data points just assigned to it in STEP4
REPEAT 3-5 until the centroids are not moving anymore (or the max number of iterations is reached)
RETURN centroids' location, cluster-observation assignment map, WCSS
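Below is a minimal NumPy sketch of the pseudocode above. It is not the author's implementation (that is linked below), and for simplicity it initializes the centroids by picking k random data points (option 2 of the previous section):

```python
import numpy as np

def k_means(X, k, max_iters=100, tol=1e-6, seed=None):
    """Minimal k-means sketch: returns centroids, cluster assignments and the final WCSS."""
    rng = np.random.default_rng(seed)
    # STEP2: initialize the centroids (here: k randomly chosen data points).
    centroids = X[rng.choice(X.shape[0], size=k, replace=False)]
    for _ in range(max_iters):
        # STEP3: Euclidean distance between every data point and every centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # STEP4: assign each data point to the closest centroid.
        labels = np.argmin(distances, axis=1)
        # STEP5: move each centroid to the mean of the points assigned to it.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop when the centroids have (almost) stopped moving.
        if np.allclose(new_centroids, centroids, atol=tol):
            centroids = new_centroids
            break
        centroids = new_centroids
    wcss = np.sum((X - centroids[labels]) ** 2)
    return centroids, labels, wcss
```

A call like centroids, labels, wcss = k_means(X, 4) returns the three pieces listed in the pseudocode.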
The full code can be found HERE. Running the code on the data in Figure 1 with the number of clusters set to 4 leads to the following result:
Figure 5. K-Means clustering results.
In Figure 5, the centroids are positioned at the center of each group of data points, and the final clustering is aligned with our expectations and with how the data was generated. As a sanity check, let's confirm with the elbow method that 4 is the optimal number of clusters:
Figure 6. Elbow method on the observed data.
The elbow is located at 4 clusters, therefore everything checks out.
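For reference, an elbow curve like the one in Figure 6 can be produced with a simple sweep over the number of clusters, e.g. using the k_means sketch from the previous section (the plotting details are mine):

```python
import matplotlib.pyplot as plt

ks = range(1, 11)
wcss_values = []
for k in ks:
    _, _, wcss = k_means(X, k)   # X holds the observations from Figure 1
    wcss_values.append(wcss)

plt.plot(ks, wcss_values, marker="o")
plt.xlabel("Number of clusters")
plt.ylabel("WCSS")
plt.show()
```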
Scaling the data
The toy dataset in Figure 1 does not require data scaling because X1 and X2 have very similar ranges, so scaling would not change the final result. That's not the case for the data shown below:
Figure 7. Data that should be scaled
In Figure 7, X1 ranges from -3 to +2.5 while X2 ranges from -85 to 65. Applying K-Means clustering to the raw data would lead to this:
Figure 8. Data that should be normalized
which is nonsense. The algorithm failed because the centroid-observation distances were strongly biased towards X2, the feature with the largest absolute spread. Standardizing X1 and X2 fixes the issue and K-Means clustering succeeds:
Figure 9. Clustering succeeded after data standardization
Data was standardized to have a mean of 0 and a standard deviation of 1:
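$$z = \frac{x - \mu}{\sigma}$$

where μ and σ are the mean and standard deviation of each feature.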
K-Means clustering limitations
K-Means clustering suffers from several limitations:
1 - Results depend on the centroids' initialization. This drawback can be mitigated by running the algorithm several times and picking the result with the lowest WCSS.
2 - Outliers can pull clusters all over the place. Outliers could be small clusters of their own or just outliers, and they should be removed to avoid dragging the centroids around.
3 - Clusters of varying size and density. K-Means can fail when clustering data made of clusters of varying size and density.
Concluding remarks
We implemented K-Means clustering from scratch in Python. In the code I shared, centroids were randomly initialized (option 1 of the Centroids Initialization section), which, as stated, is prone to bad clustering. If you run my code several times, you may indeed witness that. Implementing k-means++ or naive sharding would prevent that from happening. I wanted to give everyone a heads-up before ending this article.