K-means clustering

K-means partitions \(n\) observations into \(K\) clusters by minimizing the total within-cluster variance. It alternates between assigning each point to its nearest centroid and recomputing the centroids until convergence. Simple, fast, and scalable to large datasets, K-means is the default starting point for clustering. Its main limitation: it assumes spherical clusters of similar size.

Objective function

K-means minimizes the within-cluster sum of squares (WCSS):

\[\min_{C_1,\ldots,C_K} \sum_{k=1}^K \sum_{\mathbf{x}_i \in C_k} \|\mathbf{x}_i - \boldsymbol{\mu}_k\|^2\]

where \(\boldsymbol{\mu}_k = \frac{1}{|C_k|}\sum_{\mathbf{x}_i \in C_k} \mathbf{x}_i\) is the centroid of cluster \(k\). This is equivalent to maximizing the between-cluster variance: a good clustering concentrates points within clusters and separates clusters from each other.

The optimization is NP-hard in general (exponential number of partitions to check), but the Lloyd’s algorithm finds a local minimum efficiently.

Lloyd’s algorithm

Initialization: place \(K\) centroids in the feature space (see K-means++ below).

Repeat until convergence:

Assignment step: assign each point to its nearest centroid:

\[c_i = \arg\min_{k} \|\mathbf{x}_i - \boldsymbol{\mu}_k\|^2\]

Update step: recompute each centroid as the mean of its assigned points:

\[\boldsymbol{\mu}_k = \frac{1}{|C_k|}\sum_{i: c_i=k} \mathbf{x}_i\]

Each step decreases or maintains the WCSS, so the algorithm always converges. However, it converges to a local minimum, not necessarily the global one.

Four panels showing K-means iterations from random initialization to convergence with centroids and cluster assignments

Diamonds are centroids. After iteration 1 the assignments are rough; by iteration 10 the algorithm has converged to the correct three clusters.

K-means++ initialization

Random initialization can lead to poor local minima: if two centroids start in the same cluster, that cluster will be split artificially. K-means++ selects initial centroids that are spread out:

  1. Choose the first centroid uniformly at random from the data.
  2. For each subsequent centroid: choose a point with probability proportional to its squared distance from the nearest already-chosen centroid.
  3. Repeat until \(K\) centroids are chosen.

This gives a \(O(\log K)\) approximation guarantee to the optimal WCSS in expectation, and in practice reduces the number of iterations needed and the probability of poor solutions. K-means++ is the default in most implementations.

Choosing K

Elbow method

Plot WCSS as a function of \(K\). WCSS always decreases as \(K\) increases (with \(K=n\) each point is its own cluster, WCSS=0). The “elbow” is the point where adding another cluster gives diminishing returns.

Silhouette score

For each point \(i\), let \(a_i\) be its mean distance to points in its own cluster and \(b_i\) be its mean distance to points in the nearest other cluster:

\[s_i = \frac{b_i - a_i}{\max(a_i, b_i)} \in [-1, 1]\]

\(s_i \approx 1\): well-clustered. \(s_i \approx 0\): on the boundary between clusters. \(s_i < 0\): probably in the wrong cluster. The mean silhouette score over all points is maximized at the best \(K\).

Two panels showing the elbow method and silhouette score vs K for choosing the optimal number of clusters

Both methods correctly identify \(K=3\) as optimal for this dataset: the elbow is clear and the silhouette is maximized at \(K=3\).

When K-means fails

K-means assumes clusters are spherical (Euclidean distance to centroid), roughly equal in size, and have similar density. It fails when these assumptions are violated.

Three panels showing K-means failing on elongated clusters, ring-shaped data and clusters of different sizes

For non-spherical or differently-sized clusters, use DBSCAN (density-based) or hierarchical clustering with appropriate linkage.

⚠️ K-means is sensitive to outliers and feature scale

A single outlier far from any cluster centroid will pull the centroid toward it, distorting the entire cluster. Standardize features before running K-means (otherwise features with large ranges dominate the distance), and consider removing or down-weighting outliers first.

Also: K-means is non-deterministic. Always run with multiple random starts (nstart=20 or more) and take the solution with the lowest WCSS. A single run may converge to a poor local minimum.

💡 K-means in R

# K-means with 20 random starts (best solution returned)
km <- kmeans(X, centers=3, nstart=20, iter.max=100)
km$cluster        # cluster assignments
km$centers        # centroid coordinates
km$tot.withinss   # total WCSS

# Elbow method
wcss <- sapply(1:10, function(k)
  kmeans(X, centers=k, nstart=20)$tot.withinss)
plot(1:10, wcss, type="b", xlab="K", ylab="WCSS")

# Silhouette score
library(cluster)
sil <- silhouette(km$cluster, dist(X))
mean(sil[,3])   # mean silhouette
plot(sil)

# Gap statistic
gap <- clusGap(X, FUN=kmeans, nstart=20, K.max=10, B=50)
plot(gap)