K-Means Clustering Explained
1. What is K-Means Clustering
K-Means is one of the most widely used unsupervised clustering algorithms. Its goal is to partition n data points into K clusters, where each point belongs to the cluster whose centroid (mean) is nearest. The name directly reveals its mechanism: "K" is the number of clusters, and "Means" indicates that each cluster is defined by the mean of its member points (the centroid).
Historical origin: The algorithm was first devised by Stuart Lloyd at Bell Labs in 1957 as a quantization technique for pulse-code modulation (PCM), though it was not published until 1982. Independently, Hugo Steinhaus described a similar partitioning idea in a 1956 paper. In 1967, James MacQueen coined the term "K-Means" in a paper that provided the first formal statistical analysis of the procedure. You will often see the algorithm referred to as Lloyd's algorithm.
Why is K-Means so popular? Three reasons: (1) the algorithm is conceptually simple and intuitive; (2) its time complexity is O(n * K * d * t) (n = samples, d = dimensions, t = iterations), making it fast even on large datasets; (3) it works well in many practical scenarios such as customer segmentation, image compression, and document clustering.
2. Algorithm Step by Step
Every step in K-Means has a clear mathematical motivation. Below we walk through the algorithm and explain WHY each step works.
Step 1: Initialize K Centroids
Why initialization matters: The K-Means objective function (WCSS) is non-convex, meaning it has multiple local optima. Different initial centroids lead the algorithm to converge to different local optima. If all initial centroids happen to fall in one region of the data, the algorithm may never discover the globally optimal clustering. This is precisely why Arthur and Vassilvitskii proposed K-Means++ in 2007 to systematically address this problem (see Section 5).
Step 2: Assign Each Point to the Nearest Centroid
Why this minimizes WCSS: With centroids held fixed, assigning each point to its nearest centroid is equivalent to minimizing each point's contribution to the total WCSS. Formally, for point x_i, choosing cluster k* = argmin_k ||x_i - mu_k||^2 minimizes the squared distance for that point under the current centroids. This step guarantees that WCSS does not increase.
Step 3: Update Centroids as the Cluster Mean
Why the mean -- derivative proof: With assignments held fixed, we seek the point c that minimizes the within-cluster SSE = sum_i ||x_i - c||^2. Taking the derivative with respect to c and setting it to zero: dSSE/dc = -2 * sum_i (x_i - c) = 0, which gives c = (1/n) * sum_i x_i -- the mean. Therefore the mean is the optimal representative point that minimizes squared error. This is why the algorithm is called K-"Means".
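The derivative proof can be checked numerically: because SSE(c) = SSE(mean) + n * ||c - mean||^2, perturbing the candidate centroid in any direction strictly increases the error. A minimal sketch on synthetic data:

```python
import numpy as np

# Numeric sanity check of the derivative proof: among candidate representative
# points, the cluster mean attains the lowest SSE. (Synthetic data.)
rng = np.random.default_rng(0)
points = rng.normal(size=(50, 2))

def sse(c):
    """Sum of squared distances from every point to candidate centroid c."""
    return float(((points - c) ** 2).sum())

mean = points.mean(axis=0)
# Perturbing away from the mean in any direction increases the SSE.
for direction in rng.normal(size=(10, 2)):
    assert sse(mean) < sse(mean + 0.1 * direction)
print(round(sse(mean), 3))
```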
Step 4: Repeat Until Convergence
Why convergence is guaranteed: In each iteration, Step 2 minimizes WCSS with fixed centroids (assignment phase), and Step 3 minimizes WCSS with fixed assignments (update phase). Therefore WCSS monotonically decreases (or stays the same) at every iteration. Additionally, there are finitely many possible assignments of n points to K clusters (at most K^n). Monotonic decrease + finite state space = guaranteed convergence. However, note that convergence is to a local optimum, not necessarily the global optimum.
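The monotone decrease is easy to observe by recording WCSS at every iteration. A minimal NumPy sketch on synthetic blobs (the data and seed are illustrative assumptions):

```python
import numpy as np

# Trace WCSS across Lloyd iterations: the assignment and update phases each
# can only lower it, so the recorded sequence never increases.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
               for c in [(0, 0), (5, 5), (0, 5)]])
K = 3
centroids = X[rng.choice(len(X), K, replace=False)]  # random initialization

history = []
for _ in range(20):
    # Assignment phase: nearest centroid under squared Euclidean distance.
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    history.append(float(dists[np.arange(len(X)), labels].sum()))
    # Update phase: move each centroid to the mean of its assigned points
    # (keep the old centroid if a cluster happens to be empty).
    centroids = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                          else centroids[k] for k in range(K)])

# WCSS never increases from one iteration to the next.
assert all(a >= b - 1e-9 for a, b in zip(history, history[1:]))
```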
3. Objective Function -- WCSS
K-Means optimizes the Within-Cluster Sum of Squares (WCSS), also called inertia:
WCSS (Within-Cluster Sum of Squares)
WCSS = sum_{k=1}^{K} sum_{i in C_k} ||x_i - mu_k||^2
where C_k is the set of points in cluster k, and mu_k is the centroid (mean) of cluster k.
Why this function: WCSS measures cluster compactness -- the total squared distance from each point to its assigned centroid. Lower WCSS means points are more tightly grouped around their centroids, indicating better clustering. Geometrically, minimizing WCSS is equivalent to minimizing the sum of pairwise squared distances within each cluster, scaled by cluster size.
NP-hard problem: Exactly minimizing WCSS has been proven NP-hard. Aloise, Deshpande, Hansen, and Popat (2009) proved it NP-hard even for K=2 in general dimension, and Mahajan, Nimbhorkar, and Varadarajan (2012) showed it remains NP-hard in the plane (2D) for general K. Thus Lloyd's algorithm is fundamentally a heuristic -- it does not guarantee the global optimum, but in practice it typically finds a solution that is good enough.
WCSS and Cluster Variance
WCSS = sum_k |C_k| * Var(C_k)
Minimizing WCSS is equivalent to minimizing the weighted sum of cluster variances. This also explains why K-Means tends to produce roughly equal-sized spherical clusters -- the weighting by cluster size penalizes overly large clusters.
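The identity WCSS = sum_k |C_k| * Var(C_k) can be confirmed numerically for any assignment, taking Var(C_k) as the biased per-cluster variance summed over dimensions. A quick check on synthetic data:

```python
import numpy as np

# Numerically confirm WCSS = sum_k |C_k| * Var(C_k) on an arbitrary assignment.
rng = np.random.default_rng(7)
X = rng.normal(size=(200, 3))
labels = rng.integers(0, 4, size=200)  # any assignment into K=4 clusters works

wcss = 0.0
weighted_var = 0.0
for k in range(4):
    cluster = X[labels == k]
    wcss += float(((cluster - cluster.mean(axis=0)) ** 2).sum())
    # Var(C_k): biased variance per dimension, summed over dimensions.
    weighted_var += len(cluster) * float(cluster.var(axis=0).sum())

assert np.isclose(wcss, weighted_var)
print(round(wcss, 3))
```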
4. Choosing K -- Three Methods
K is the only hyperparameter in K-Means, but also the most critical one. Choosing wrong leads to under-clustering (too few clusters, merging distinct structures) or over-clustering (too many, splitting natural groups). The three methods below each have a solid mathematical foundation.
4.1 Elbow Method
Core idea: diminishing marginal returns
Proposed by: Robert L. Thorndike described this idea in his 1953 paper "Who Belongs in the Family?"
Method: Run K-Means for K = 1, 2, ..., K_max and plot the K vs. WCSS curve. Look for the "elbow" point where the curve transitions from steep decline to shallow plateau.
Why it works: As K increases from 1 toward the true number of clusters, each additional cluster captures a distinct data structure, causing a large WCSS drop. Once K exceeds the true count, new clusters merely subdivide already-compact groups, yielding only marginal WCSS reduction -- this is the principle of diminishing returns. The elbow point is the optimal balance between explanatory power and model complexity.
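A minimal elbow-method sketch with sklearn; the synthetic dataset with 4 planted clusters and all parameter values below are illustrative assumptions:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 well-separated clusters.
X, _ = make_blobs(n_samples=500, centers=[[0, 0], [8, 0], [0, 8], [8, 8]],
                  cluster_std=1.0, random_state=0)

inertias = {}
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_  # inertia_ is sklearn's name for WCSS

for k, w in inertias.items():
    print(k, round(w, 1))  # the curve flattens sharply after K=4 -- the elbow
```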
4.2 Silhouette Score
Core idea: balancing cohesion and separation
Proposed by: Peter J. Rousseeuw in his 1987 paper "Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis."
Formula: For data point i, the silhouette coefficient is defined as:
s(i) = (b(i) - a(i)) / max(a(i), b(i))
where a(i) = mean distance from point i to other points in the same cluster (cohesion), b(i) = mean distance from point i to points in the nearest neighboring cluster (separation). s(i) ranges from -1 to 1.
Why it works: The silhouette score elegantly measures two critical properties simultaneously: (1) cohesion -- smaller a(i) means the point is tightly bound to its cluster; (2) separation -- larger b(i) means the point is far from other clusters. When s(i) is near 1, b(i) >> a(i), meaning the point is well-assigned; when s(i) is near -1, a(i) >> b(i), suggesting misassignment. Choose the K that maximizes the average silhouette score.
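Selecting K via the average silhouette score is a one-liner per candidate K in sklearn; the 4-blob synthetic dataset below is an illustrative assumption:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 4 clearly separated clusters.
X, _ = make_blobs(n_samples=500, centers=[[0, 0], [6, 0], [0, 6], [6, 6]],
                  cluster_std=0.7, random_state=1)

scores = {}
for k in range(2, 8):  # the silhouette is undefined for K=1
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # recovers the 4 planted clusters
```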
4.3 Gap Statistic
Core idea: comparison against a null reference
Proposed by: Robert Tibshirani, Guenther Walther, and Trevor Hastie in their 2001 paper "Estimating the Number of Clusters in a Data Set via the Gap Statistic."
Formula: Gap(K) = E*[log(W_K)] - log(W_K)
where W_K is the WCSS of the actual data clustered into K clusters, and E*[log(W_K)] is its expected value under a uniform random reference distribution (the null hypothesis), estimated via Monte Carlo simulation over multiple random reference datasets.
Why it works: A common problem with the elbow method and silhouette score is that they always suggest a "best K" even when the data has no real cluster structure. The Gap Statistic solves this by introducing a null hypothesis -- it asks: "Compared to completely structureless random data, does the current K provide a statistically significant clustering improvement?" Choose the smallest K where Gap(K) >= Gap(K+1) - s_{K+1} (where s is the standard error).
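A minimal gap-statistic sketch. The null reference is drawn uniformly from the data's bounding box (the simplest reference model in the paper); the function name, the dataset, and all parameter values are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def gap_statistic(X, k_max=6, n_refs=10, seed=0):
    rng = np.random.default_rng(seed)
    mins, maxs = X.min(axis=0), X.max(axis=0)
    gaps, s = [], []
    for k in range(1, k_max + 1):
        log_wk = np.log(KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_)
        # Monte Carlo estimate of E*[log(W_K)] under the uniform null reference.
        ref_logs = np.array([
            np.log(KMeans(n_clusters=k, n_init=10, random_state=seed)
                   .fit(rng.uniform(mins, maxs, size=X.shape)).inertia_)
            for _ in range(n_refs)])
        gaps.append(ref_logs.mean() - log_wk)
        s.append(ref_logs.std() * np.sqrt(1 + 1 / n_refs))  # simulation-corrected std error
    # Decision rule: smallest K with Gap(K) >= Gap(K+1) - s_{K+1}.
    for k in range(k_max - 1):
        if gaps[k] >= gaps[k + 1] - s[k + 1]:
            return k + 1, gaps
    return k_max, gaps

X, _ = make_blobs(n_samples=300, centers=[[0, 0], [6, 0], [3, 5]],
                  cluster_std=0.7, random_state=0)
k_hat, _ = gap_statistic(X)
print(k_hat)
```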
5. K-Means++ Initialization
Proposed by: David Arthur and Sergei Vassilvitskii in their 2007 paper "k-means++: The Advantages of Careful Seeding," which received the Best Paper Award at ACM-SIAM Symposium on Discrete Algorithms (SODA).
Why K-Means++ is needed: Standard K-Means random initialization has a serious flaw: if initial centroids happen to cluster in one region of the data space, the algorithm will almost certainly converge to a poor local optimum. Arthur and Vassilvitskii proved that the K-Means++ initialization guarantees that the expected WCSS is at most O(log K) times the optimal -- a theoretical competitive ratio guarantee far superior to the arbitrarily bad results that random initialization can produce.
K-Means++ Algorithm Steps
Step 1: Choose the first centroid uniformly at random from the data points.
Step 2: For each data point x, compute the shortest distance to any already-chosen centroid: D(x) = min_j ||x - c_j||.
Step 3: Choose the next centroid with probability P(x) = D(x)^2 / sum_i D(x_i)^2. Why squared distance as probability: This ensures that points farther from existing centroids are more likely to be selected as new centroids, spreading the centroids across the data space. Squaring (rather than linear distance) provides a stronger bias toward distant points.
Step 4: Repeat Steps 2-3 until K centroids have been chosen.
Step 5: Use these K centroids as initial values and run the standard K-Means algorithm.
Practical note: sklearn's KMeans uses init='k-means++' by default, so you automatically benefit from K-Means++ when using sklearn.
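The seeding steps above fit in a few lines of NumPy. A from-scratch sketch (the function name and synthetic blobs are illustrative assumptions):

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    # Step 1: pick the first centroid uniformly at random from the data.
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Step 2: D(x) = shortest distance from x to any chosen centroid.
        D = np.min([np.linalg.norm(X - c, axis=1) for c in centroids], axis=0)
        # Step 3: sample the next centroid with probability proportional to D(x)^2.
        probs = D ** 2 / (D ** 2).sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(100, 2))
               for c in [(0, 0), (10, 0), (0, 10)]])
centers = kmeans_pp_init(X, 3, rng)
print(centers)  # with high probability, one seed lands near each blob
```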
6. Limitations with Mathematical Reasoning
K-Means is simple and efficient, but it has several fundamental limitations, each with a mathematical root cause.
6.1 Can Only Discover Spherical (Convex) Clusters
Why: Euclidean distance + mean = spherical
K-Means uses Euclidean distance for assignment and the mean as the centroid. The equidistant surface of Euclidean distance is a hypersphere, and the mean minimizes squared error. Each cluster is therefore defined as a Voronoi region -- the set of all points closest to that centroid. Voronoi region boundaries are always hyperplanes (linear), so K-Means can never capture non-convex cluster shapes like crescents or rings. For non-convex clusters, use DBSCAN (density-based) or spectral clustering (graph-based).
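The crescent failure mode is easy to reproduce with sklearn's make_moons; the eps and min_samples values below are illustrative choices for this dataset:

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved crescents: K-Means' linear Voronoi boundaries cut them
# apart, while density-based DBSCAN follows their shape.
X, y = make_moons(n_samples=400, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
db_labels = DBSCAN(eps=0.15, min_samples=5).fit_predict(X)

ari_km = adjusted_rand_score(y, km_labels)
ari_db = adjusted_rand_score(y, db_labels)
print(round(ari_km, 2), round(ari_db, 2))  # DBSCAN scores far higher here
```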
6.2 Sensitive to Outliers
Why: the mean is not a robust estimator
The centroid is defined as the mean of cluster members, and the mean is highly sensitive to extreme values. A single outlier can significantly shift the centroid position. Mathematically, the breakdown point of the mean is 1/n -- just one extreme value can move the mean arbitrarily far. Alternative: K-Medoids (PAM algorithm), proposed by Kaufman and Rousseeuw in their 1990 book "Finding Groups in Data." K-Medoids uses a medoid (the actual data point in the cluster that minimizes the sum of distances to all other points) instead of the mean, providing much greater robustness to outliers. The trade-off is higher computational cost: O(n^2) vs O(n).
6.3 Tendency Toward Equal-Sized Clusters
Why: WCSS penalizes large clusters via weighting
Recall that WCSS = sum_k |C_k| * Var(C_k). This formula weights each cluster's variance by its size |C_k|, so large clusters are penalized more heavily. The algorithm tends to split large clusters into smaller ones even when this does not reflect the data's true structure. When real clusters differ greatly in size, K-Means often "steals" points from large clusters to assign them to small clusters in order to reduce total WCSS.
6.4 K Must Be Specified in Advance
Why this is a problem
In many real-world scenarios, we do not know how many natural clusters exist in the data. Although the elbow method and silhouette score can help choose K, they require subjective judgment and may give misleading results on noisy data. Algorithms like DBSCAN and Mean Shift can automatically determine the number of clusters, but they have their own hyperparameters to tune.
7. Python Implementation
7.1 K-Means from Scratch
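A compact NumPy implementation of the algorithm from Section 2, offered as a sketch rather than production code (plain random initialization, no K-Means++ seeding, no restarts; the synthetic data is an illustrative assumption):

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-6, seed=0):
    """Lloyd's algorithm; returns labels, centroids, and the final WCSS."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points
        # (an empty cluster keeps its old centroid).
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.linalg.norm(new_centroids - centroids) < tol:  # converged
            break
        centroids = new_centroids
    wcss = float((dists.min(axis=1) ** 2).sum())
    return labels, centroids, wcss

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
               for c in [(0, 0), (5, 0), (0, 5)]])
labels, centroids, wcss = kmeans(X, 3)
print(centroids.round(2))
```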
7.2 sklearn in Practice
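A typical sklearn workflow -- scale, fit, inspect -- on synthetic data (the dataset and parameter values are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)
X = StandardScaler().fit_transform(X)  # scale first (see the FAQ)

km = KMeans(
    n_clusters=4,       # K
    init="k-means++",   # default seeding (Section 5)
    n_init=10,          # restarts; the run with the lowest WCSS is kept
    max_iter=300,
    random_state=0,
)
labels = km.fit_predict(X)

print(km.cluster_centers_.shape)  # (4, 2)
print(km.n_iter_)                 # iterations until convergence
```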
8. K-Means vs DBSCAN vs Hierarchical Clustering
Choosing a clustering algorithm depends on the characteristics of your data. Below is a detailed comparison across multiple dimensions with the design philosophy behind each algorithm.
| Dimension | K-Means | DBSCAN | Hierarchical |
|---|---|---|---|
| Proposed by | Lloyd (1957/1982) | Ester, Kriegel, Sander, Xu (1996), KDD conference | Multiple independent origins; Ward (1963) proposed minimum variance linkage |
| Core idea | Minimize within-cluster sum of squares (WCSS) | Density-based: core points, border points, noise points | Recursive merge/split, builds a dendrogram |
| Need to specify K? | Yes, must be set in advance | No, determined automatically. Requires eps (neighborhood radius) and min_samples | No, cut the dendrogram at any level for different K |
| Cluster shape | Spherical / convex only (Voronoi partitioning) | Arbitrary shapes (density-connected regions) | Depends on linkage: single-link finds chain-like clusters, complete-link favors spherical |
| Outlier handling | Poor -- outliers are forced into a cluster and shift centroids | Good -- low-density points are labeled as noise (-1) | Moderate -- depends on linkage; single-link is susceptible to chaining |
| Time complexity | O(n * K * d * t), near-linear | O(n log n) with KD-tree, O(n^2) worst case | O(n^2 log n) or O(n^3), impractical for large data |
| Scalability | Excellent, handles millions of points. Mini-Batch K-Means is even faster | Moderate, sensitive to eps choice, struggles in high dimensions (curse of dimensionality) | Poor, O(n^2) memory, usually limited to a few thousand samples |
| Best use case | Spherical clusters, large data, feature quantization, image compression | Arbitrary-shape clusters, noisy data, geospatial clustering | Small datasets, need hierarchical structure, gene expression analysis |
Why DBSCAN Does Not Need K
DBSCAN (Ester et al., 1996 KDD conference) defines three point types: (1) core points: at least min_samples neighbors within eps radius; (2) border points: within eps of a core point but with insufficient neighbors themselves; (3) noise points: neither core nor border. Clusters form automatically from density-reachable core points -- no need to specify K. DBSCAN stands for "Density-Based Spatial Clustering of Applications with Noise."
Why Hierarchical Dendrograms Are Useful
Agglomerative hierarchical clustering starts with each point as its own cluster and iteratively merges the most similar pair until all points form one cluster. This process produces a dendrogram that records the merge distance at each step. By cutting the dendrogram at different heights, you get clustering results at different granularities -- no need to pre-select K. Ward linkage (Ward, 1963) minimizes the variance increase at each merge, making it most consistent with K-Means' WCSS objective.
9. FAQ
Q: Do I need to standardize features before running K-Means?
A: Yes, strongly recommended. K-Means uses Euclidean distance to measure point-to-centroid distance, and Euclidean distance is highly sensitive to feature scale. For example, if feature A ranges over [0, 1000] and feature B over [0, 1], the distance is almost entirely dominated by feature A, rendering feature B irrelevant to clustering. Using StandardScaler (zero mean, unit variance) or MinMaxScaler ensures all features contribute comparably to distance computation.
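The dominance effect is visible in the per-feature variances, which determine each feature's contribution to squared distances. A small sketch with synthetic data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Feature A spans [0, 1000] and feature B spans [0, 1]: without scaling,
# squared distances are dominated by A's variance.
rng = np.random.default_rng(0)
X = np.hstack([rng.uniform(0, 1000, size=(200, 1)),   # feature A
               rng.uniform(0, 1, size=(200, 1))])     # feature B

X_scaled = StandardScaler().fit_transform(X)

print(X.var(axis=0).round(1))         # A's variance dwarfs B's
print(X_scaled.var(axis=0).round(1))  # both features now have unit variance
```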
Q: Why does sklearn run K-Means multiple times?
A: Since K-Means can converge to local optima, sklearn's KMeans runs the algorithm several times with different random seeds and keeps the result with the lowest WCSS (n_init, historically 10 by default; since version 1.4 the default is n_init='auto'). Increasing n_init improves the probability of finding a better solution but increases computation time linearly. For critical tasks, consider setting it to 20-50.
Q: Can K-Means handle categorical data?
A: Not directly. K-Means relies on Euclidean distance and the mean operation, which have no mathematical meaning for categorical data (e.g., what is the mean of "red" and "blue"?). For categorical data, use the K-Modes algorithm (Huang, 1998), which replaces the mean with the mode and Euclidean distance with Hamming distance. For mixed data (both numerical and categorical features), use the K-Prototypes algorithm.
Q: How can K-Means be sped up on very large datasets?
A: Mini-Batch K-Means (Sculley, 2010) uses only a random subset (mini-batch) of the data to update centroids at each iteration, rather than the full dataset. This makes it several to tens of times faster than standard K-Means on large datasets, at the cost of slightly lower clustering quality (typically within 1-3%). sklearn provides the MiniBatchKMeans class. When data exceeds 100,000 samples, the Mini-Batch version is recommended.
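A side-by-side comparison on the same synthetic data; batch_size and the dataset are illustrative choices:

```python
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

# Full-batch vs mini-batch K-Means on identical data.
X, _ = make_blobs(n_samples=20000, centers=5, random_state=0)

full = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
mini = MiniBatchKMeans(n_clusters=5, batch_size=1024, n_init=10, random_state=0).fit(X)

# The mini-batch WCSS is typically only slightly above the full-batch result.
print(round(mini.inertia_ / full.inertia_, 3))
```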
Q: How do I evaluate the quality of a clustering result?
A: When ground truth labels are unavailable (the common case in unsupervised learning), rely on internal evaluation metrics: (1) Silhouette Score -- closer to 1 is better, measures both within-cluster compactness and between-cluster separation; (2) Calinski-Harabasz Index -- ratio of between-cluster to within-cluster variance, higher is better; (3) Davies-Bouldin Index -- measures inter-cluster similarity, lower is better. When ground truth is available, use Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI) to evaluate agreement between clustering and true labels.