K-Means Clustering Explained
1. What is K-Means Clustering
K-Means is one of the most widely used unsupervised clustering algorithms. Its goal is to partition n data points into K clusters, where each point belongs to the cluster whose centroid (mean) is nearest. The name directly reveals its mechanism: "K" is the number of clusters, and "Means" indicates that each cluster is defined by the mean of its member points (the centroid).
Historical origin: The algorithm was first devised by Stuart Lloyd at Bell Labs in 1957 as a quantization technique for pulse-code modulation (PCM), though it was not published until 1982. Independently, Hugo Steinhaus described a similar partitioning idea in a 1956 paper. In 1967, James MacQueen coined the term "K-Means" in a paper that provided the first formal statistical analysis of the procedure. You will often see the algorithm referred to as Lloyd's algorithm.
Why is K-Means so popular? Three reasons: (1) the algorithm is conceptually simple and intuitive; (2) its time complexity is O(n * K * d * t) (n = samples, d = dimensions, t = iterations), making it fast even on large datasets; (3) it works well in many practical scenarios such as customer segmentation, image compression, and document clustering.
2. Algorithm Step by Step
Every step in K-Means has a clear mathematical motivation. Below we walk through the algorithm and explain WHY each step works.
Step 1: Initialize K Centroids
Why initialization matters: The K-Means objective function (WCSS) is non-convex, meaning it has multiple local optima. Different initial centroids lead the algorithm to converge to different local optima. If all initial centroids happen to fall in one region of the data, the algorithm may never discover the globally optimal clustering. This is precisely why Arthur and Vassilvitskii proposed K-Means++ in 2007 to systematically address this problem (see Section 5).
Step 2: Assign Each Point to the Nearest Centroid
Why this minimizes WCSS: With centroids held fixed, assigning each point to its nearest centroid is equivalent to minimizing each point's contribution to the total WCSS. Formally, for point x_i, choosing cluster k* = argmin_k ||x_i - mu_k||^2 minimizes the squared distance for that point under the current centroids. This step guarantees that WCSS does not increase.
Step 3: Update Centroids as the Cluster Mean
Why the mean -- derivative proof: With assignments held fixed, we seek the point c that minimizes the within-cluster SSE = sum_i ||x_i - c||^2. Taking the derivative with respect to c and setting it to zero: dSSE/dc = -2 * sum_i (x_i - c) = 0, which gives c = (1/n) * sum_i x_i -- the mean. Therefore the mean is the optimal representative point that minimizes squared error. This is why the algorithm is called K-"Means".
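The derivative proof can be checked numerically: because SSE(c) = SSE(mean) + n * ||c - mean||^2, perturbing the candidate centroid in any direction strictly increases the error. A minimal sketch on synthetic data:

```python
import numpy as np

# Numeric sanity check of the derivative proof: among candidate representative
# points, the cluster mean attains the lowest SSE. (Synthetic data.)
rng = np.random.default_rng(0)
points = rng.normal(size=(50, 2))

def sse(c):
    """Sum of squared distances from every point to candidate centroid c."""
    return float(((points - c) ** 2).sum())

mean = points.mean(axis=0)
# Perturbing away from the mean in any direction increases the SSE.
for direction in rng.normal(size=(10, 2)):
    assert sse(mean) < sse(mean + 0.1 * direction)
print(round(sse(mean), 3))
```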
Step 4: Repeat Until Convergence
Why convergence is guaranteed: In each iteration, Step 2 minimizes WCSS with fixed centroids (assignment phase), and Step 3 minimizes WCSS with fixed assignments (update phase). Therefore WCSS monotonically decreases (or stays the same) at every iteration. Additionally, there are finitely many possible assignments of n points to K clusters (at most K^n). Monotonic decrease + finite state space = guaranteed convergence. However, note that convergence is to a local optimum, not necessarily the global optimum.
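The monotone decrease is easy to observe by recording WCSS at every iteration. A minimal NumPy sketch on synthetic blobs (the data and seed are illustrative assumptions):

```python
import numpy as np

# Trace WCSS across Lloyd iterations: the assignment and update phases each
# can only lower it, so the recorded sequence never increases.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
               for c in [(0, 0), (5, 5), (0, 5)]])
K = 3
centroids = X[rng.choice(len(X), K, replace=False)]  # random initialization

history = []
for _ in range(20):
    # Assignment phase: nearest centroid under squared Euclidean distance.
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    history.append(float(dists[np.arange(len(X)), labels].sum()))
    # Update phase: move each centroid to the mean of its assigned points
    # (keep the old centroid if a cluster happens to be empty).
    centroids = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                          else centroids[k] for k in range(K)])

# WCSS never increases from one iteration to the next.
assert all(a >= b - 1e-9 for a, b in zip(history, history[1:]))
```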
3. Objective Function -- WCSS
K-Means optimizes the Within-Cluster Sum of Squares (WCSS), also called inertia:
WCSS (Within-Cluster Sum of Squares)
WCSS = sum_{k=1}^{K} sum_{i in C_k} ||x_i - mu_k||^2
where C_k is the set of points in cluster k, and mu_k is the centroid (mean) of cluster k.
Why this function: WCSS measures cluster compactness -- the total squared distance from each point to its assigned centroid. Lower WCSS means points are more tightly grouped around their centroids, indicating better clustering. Geometrically, minimizing WCSS is equivalent to minimizing the sum of pairwise squared distances within each cluster, scaled by cluster size.
NP-hard problem: Exactly minimizing WCSS has been proven NP-hard. Aloise, Deshpande, Hansen, and Popat (2009) proved it NP-hard even for K=2 in general dimension, and Mahajan, Nimbhorkar, and Varadarajan (2012) showed it remains NP-hard in the plane (2D) for general K. Thus Lloyd's algorithm is fundamentally a heuristic -- it does not guarantee the global optimum, but in practice it typically finds a solution that is good enough.
WCSS and Cluster Variance
WCSS = sum_k |C_k| * Var(C_k)
Minimizing WCSS is equivalent to minimizing the weighted sum of cluster variances. This also explains why K-Means tends to produce roughly equal-sized spherical clusters -- the weighting by cluster size penalizes overly large clusters.
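The identity WCSS = sum_k |C_k| * Var(C_k) can be confirmed numerically for any assignment, taking Var(C_k) as the biased per-cluster variance summed over dimensions. A quick check on synthetic data:

```python
import numpy as np

# Numerically confirm WCSS = sum_k |C_k| * Var(C_k) on an arbitrary assignment.
rng = np.random.default_rng(7)
X = rng.normal(size=(200, 3))
labels = rng.integers(0, 4, size=200)  # any assignment into K=4 clusters works

wcss = 0.0
weighted_var = 0.0
for k in range(4):
    cluster = X[labels == k]
    wcss += float(((cluster - cluster.mean(axis=0)) ** 2).sum())
    # Var(C_k): biased variance per dimension, summed over dimensions.
    weighted_var += len(cluster) * float(cluster.var(axis=0).sum())

assert np.isclose(wcss, weighted_var)
print(round(wcss, 3))
```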
4. Choosing K -- Three Methods
K is the only hyperparameter in K-Means, but also the most critical one. Choosing wrong leads to under-clustering (too few clusters, merging distinct structures) or over-clustering (too many, splitting natural groups). The three methods below each have a solid mathematical foundation.
4.1 Elbow Method
Core idea: diminishing marginal returns
Proposed by: Robert L. Thorndike described this idea in his 1953 paper "Who Belongs in the Family?"
Method: Run K-Means for K = 1, 2, ..., K_max and plot the K vs. WCSS curve. Look for the "elbow" point where the curve transitions from steep decline to shallow plateau.
Why it works: As K increases from 1 toward the true number of clusters, each additional cluster captures a distinct data structure, causing a large WCSS drop. Once K exceeds the true count, new clusters merely subdivide already-compact groups, yielding only marginal WCSS reduction -- this is the principle of diminishing returns. The elbow point is the optimal balance between explanatory power and model complexity.
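A minimal elbow-method sketch with sklearn; the synthetic dataset with 4 planted clusters and all parameter values below are illustrative assumptions:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 well-separated clusters.
X, _ = make_blobs(n_samples=500, centers=[[0, 0], [8, 0], [0, 8], [8, 8]],
                  cluster_std=1.0, random_state=0)

inertias = {}
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_  # inertia_ is sklearn's name for WCSS

for k, w in inertias.items():
    print(k, round(w, 1))  # the curve flattens sharply after K=4 -- the elbow
```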
4.2 Silhouette Score
Core idea: balancing cohesion and separation
Proposed by: Peter J. Rousseeuw in his 1987 paper "Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis."
Formula: For data point i, the silhouette coefficient is defined as:
s(i) = (b(i) - a(i)) / max(a(i), b(i))
where a(i) = mean distance from point i to other points in the same cluster (cohesion), b(i) = mean distance from point i to points in the nearest neighboring cluster (separation). s(i) ranges from -1 to 1.
Why it works: The silhouette score elegantly measures two critical properties simultaneously: (1) cohesion -- smaller a(i) means the point is tightly bound to its cluster; (2) separation -- larger b(i) means the point is far from other clusters. When s(i) is near 1, b(i) >> a(i), meaning the point is well-assigned; when s(i) is near -1, a(i) >> b(i), suggesting misassignment. Choose the K that maximizes the average silhouette score.
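Selecting K via the average silhouette score is a one-liner per candidate K in sklearn; the 4-blob synthetic dataset below is an illustrative assumption:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 4 clearly separated clusters.
X, _ = make_blobs(n_samples=500, centers=[[0, 0], [6, 0], [0, 6], [6, 6]],
                  cluster_std=0.7, random_state=1)

scores = {}
for k in range(2, 8):  # the silhouette is undefined for K=1
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # recovers the 4 planted clusters
```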
4.3 Gap Statistic
Core idea: comparison against a null reference
Proposed by: Robert Tibshirani, Guenther Walther, and Trevor Hastie in their 2001 paper "Estimating the Number of Clusters in a Data Set via the Gap Statistic."
Formula: Gap(K) = E*[log(W_K)] - log(W_K)
where W_K is the WCSS of the actual data clustered into K clusters, and E*[log(W_K)] is its expected value under a uniform random reference distribution (the null hypothesis), estimated via Monte Carlo simulation over multiple random reference datasets.
Why it works: A common problem with the elbow method and silhouette score is that they always suggest a "best K" even when the data has no real cluster structure. The Gap Statistic solves this by introducing a null hypothesis -- it asks: "Compared to completely structureless random data, does the current K provide a statistically significant clustering improvement?" Choose the smallest K where Gap(K) >= Gap(K+1) - s_{K+1} (where s is the standard error).
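A minimal gap-statistic sketch. The null reference is drawn uniformly from the data's bounding box (the simplest reference model in the paper); the function name, the dataset, and all parameter values are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def gap_statistic(X, k_max=6, n_refs=10, seed=0):
    rng = np.random.default_rng(seed)
    mins, maxs = X.min(axis=0), X.max(axis=0)
    gaps, s = [], []
    for k in range(1, k_max + 1):
        log_wk = np.log(KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_)
        # Monte Carlo estimate of E*[log(W_K)] under the uniform null reference.
        ref_logs = np.array([
            np.log(KMeans(n_clusters=k, n_init=10, random_state=seed)
                   .fit(rng.uniform(mins, maxs, size=X.shape)).inertia_)
            for _ in range(n_refs)])
        gaps.append(ref_logs.mean() - log_wk)
        s.append(ref_logs.std() * np.sqrt(1 + 1 / n_refs))  # simulation-corrected std error
    # Decision rule: smallest K with Gap(K) >= Gap(K+1) - s_{K+1}.
    for k in range(k_max - 1):
        if gaps[k] >= gaps[k + 1] - s[k + 1]:
            return k + 1, gaps
    return k_max, gaps

X, _ = make_blobs(n_samples=300, centers=[[0, 0], [6, 0], [3, 5]],
                  cluster_std=0.7, random_state=0)
k_hat, _ = gap_statistic(X)
print(k_hat)
```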
5. K-Means++ Initialization
Proposed by: David Arthur and Sergei Vassilvitskii in their 2007 paper "k-means++: The Advantages of Careful Seeding," which received the Best Paper Award at ACM-SIAM Symposium on Discrete Algorithms (SODA).
Why K-Means++ is needed: Standard K-Means random initialization has a serious flaw: if initial centroids happen to cluster in one region of the data space, the algorithm will almost certainly converge to a poor local optimum. Arthur and Vassilvitskii proved that the K-Means++ initialization guarantees that the expected WCSS is at most O(log K) times the optimal -- a theoretical competitive ratio guarantee far superior to the arbitrarily bad results that random initialization can produce.
K-Means++ Algorithm Steps
Step 1: Choose the first centroid uniformly at random from the data points.
Step 2: For each data point x, compute the shortest distance to any already-chosen centroid: D(x) = min_j ||x - c_j||.
Step 3: Choose the next centroid with probability P(x) = D(x)^2 / sum_i D(x_i)^2. Why squared distance as probability: This ensures that points farther from existing centroids are more likely to be selected as new centroids, spreading the centroids across the data space. Squaring (rather than linear distance) provides a stronger bias toward distant points.
Step 4: Repeat Steps 2-3 until K centroids have been chosen.
Step 5: Use these K centroids as initial values and run the standard K-Means algorithm.
Practical note: sklearn's KMeans uses init='k-means++' by default, so you automatically benefit from K-Means++ when using sklearn.
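The seeding steps above fit in a few lines of NumPy. A from-scratch sketch (the function name and synthetic blobs are illustrative assumptions):

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    # Step 1: pick the first centroid uniformly at random from the data.
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Step 2: D(x) = shortest distance from x to any chosen centroid.
        D = np.min([np.linalg.norm(X - c, axis=1) for c in centroids], axis=0)
        # Step 3: sample the next centroid with probability proportional to D(x)^2.
        probs = D ** 2 / (D ** 2).sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(100, 2))
               for c in [(0, 0), (10, 0), (0, 10)]])
centers = kmeans_pp_init(X, 3, rng)
print(centers)  # with high probability, one seed lands near each blob
```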
6. Limitations with Mathematical Reasoning
K-Means is simple and efficient, but it has several fundamental limitations, each with a mathematical root cause.
6.1 Can Only Discover Spherical (Convex) Clusters
Why: Euclidean distance + mean = spherical
K-Means uses Euclidean distance for assignment and the mean as the centroid. The equidistant surface of Euclidean distance is a hypersphere, and the mean minimizes squared error. Each cluster is therefore defined as a Voronoi region -- the set of all points closest to that centroid. Voronoi region boundaries are always hyperplanes (linear), so K-Means can never capture non-convex cluster shapes like crescents or rings. For non-convex clusters, use DBSCAN (density-based) or spectral clustering (graph-based).
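The crescent failure mode is easy to reproduce with sklearn's make_moons; the eps and min_samples values below are illustrative choices for this dataset:

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved crescents: K-Means' linear Voronoi boundaries cut them
# apart, while density-based DBSCAN follows their shape.
X, y = make_moons(n_samples=400, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
db_labels = DBSCAN(eps=0.15, min_samples=5).fit_predict(X)

ari_km = adjusted_rand_score(y, km_labels)
ari_db = adjusted_rand_score(y, db_labels)
print(round(ari_km, 2), round(ari_db, 2))  # DBSCAN scores far higher here
```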
6.2 Sensitive to Outliers
Why: the mean is not a robust estimator
The centroid is defined as the mean of cluster members, and the mean is highly sensitive to extreme values. A single outlier can significantly shift the centroid position. Mathematically, the breakdown point of the mean is 1/n -- just one extreme value can move the mean arbitrarily far. Alternative: K-Medoids (PAM algorithm), proposed by Kaufman and Rousseeuw in their 1990 book "Finding Groups in Data." K-Medoids uses a medoid (the actual data point in the cluster that minimizes the sum of distances to all other points) instead of the mean, providing much greater robustness to outliers. The trade-off is higher computational cost: O(n^2) vs O(n).
6.3 Tendency Toward Equal-Sized Clusters
Why: WCSS penalizes large clusters via weighting
Recall that WCSS = sum_k |C_k| * Var(C_k). This formula weights each cluster's variance by its size |C_k|, so large clusters are penalized more heavily. The algorithm tends to split large clusters into smaller ones even when this does not reflect the data's true structure. When real clusters differ greatly in size, K-Means often "steals" points from large clusters to assign them to small clusters in order to reduce total WCSS.
6.4 K Must Be Specified in Advance
Why this is a problem
In many real-world scenarios, we do not know how many natural clusters exist in the data. Although the elbow method and silhouette score can help choose K, they require subjective judgment and may give misleading results on noisy data. Algorithms like DBSCAN and Mean Shift can automatically determine the number of clusters, but they have their own hyperparameters to tune.
7. Python Implementation
7.1 K-Means from Scratch
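A compact NumPy implementation of the algorithm from Section 2, offered as a sketch rather than production code (plain random initialization, no K-Means++ seeding, no restarts; the synthetic data is an illustrative assumption):

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-6, seed=0):
    """Lloyd's algorithm; returns labels, centroids, and the final WCSS."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points
        # (an empty cluster keeps its old centroid).
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.linalg.norm(new_centroids - centroids) < tol:  # converged
            break
        centroids = new_centroids
    wcss = float((dists.min(axis=1) ** 2).sum())
    return labels, centroids, wcss

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
               for c in [(0, 0), (5, 0), (0, 5)]])
labels, centroids, wcss = kmeans(X, 3)
print(centroids.round(2))
```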
7.2 sklearn in Practice
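A typical sklearn workflow -- scale, fit, inspect -- on synthetic data (the dataset and parameter values are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)
X = StandardScaler().fit_transform(X)  # scale first (see the FAQ)

km = KMeans(
    n_clusters=4,       # K
    init="k-means++",   # default seeding (Section 5)
    n_init=10,          # restarts; the run with the lowest WCSS is kept
    max_iter=300,
    random_state=0,
)
labels = km.fit_predict(X)

print(km.cluster_centers_.shape)  # (4, 2)
print(km.n_iter_)                 # iterations until convergence
```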
8. K-Means vs DBSCAN vs Hierarchical Clustering
Choosing a clustering algorithm depends on the characteristics of your data. Below is a detailed comparison across multiple dimensions with the design philosophy behind each algorithm.
| Dimension | K-Means | DBSCAN | Hierarchical |
|---|---|---|---|
| Proposed by | Lloyd (1957/1982) | Ester, Kriegel, Sander, Xu (1996), KDD conference | Multiple independent origins; Ward (1963) proposed minimum variance linkage |
| Core idea | Minimize within-cluster sum of squares (WCSS) | Density-based: core points, border points, noise points | Recursive merge/split, builds a dendrogram |
| Need to specify K? | Yes, must be set in advance | No, determined automatically. Requires eps (neighborhood radius) and min_samples | No, cut the dendrogram at any level for different K |
| Cluster shape | Spherical / convex only (Voronoi partitioning) | Arbitrary shapes (density-connected regions) | Depends on linkage: single-link finds chain-like clusters, complete-link favors spherical |
| Outlier handling | Poor -- outliers are forced into a cluster and shift centroids | Good -- low-density points are labeled as noise (-1) | Moderate -- depends on linkage; single-link is susceptible to chaining |
| Time complexity | O(n * K * d * t), near-linear | O(n log n) with KD-tree, O(n^2) worst case | O(n^2 log n) or O(n^3), impractical for large data |
| Scalability | Excellent, handles millions of points. Mini-Batch K-Means is even faster | Moderate, sensitive to eps choice, struggles in high dimensions (curse of dimensionality) | Poor, O(n^2) memory, usually limited to a few thousand samples |
| Best use case | Spherical clusters, large data, feature quantization, image compression | Arbitrary-shape clusters, noisy data, geospatial clustering | Small datasets, need hierarchical structure, gene expression analysis |
Why DBSCAN Does Not Need K
DBSCAN (Ester et al., 1996 KDD conference) defines three point types: (1) core points: at least min_samples neighbors within eps radius; (2) border points: within eps of a core point but with insufficient neighbors themselves; (3) noise points: neither core nor border. Clusters form automatically from density-reachable core points -- no need to specify K. DBSCAN stands for "Density-Based Spatial Clustering of Applications with Noise."
Why Hierarchical Dendrograms Are Useful
Agglomerative hierarchical clustering starts with each point as its own cluster and iteratively merges the most similar pair until all points form one cluster. This process produces a dendrogram that records the merge distance at each step. By cutting the dendrogram at different heights, you get clustering results at different granularities -- no need to pre-select K. Ward linkage (Ward, 1963) minimizes the variance increase at each merge, making it most consistent with K-Means' WCSS objective.
9. FAQ
Q: Do I need to standardize features before running K-Means?
A: Yes, strongly recommended. K-Means uses Euclidean distance to measure point-to-centroid distance, and Euclidean distance is highly sensitive to feature scale. For example, if feature A ranges over [0, 1000] and feature B over [0, 1], the distance is almost entirely dominated by feature A, rendering feature B irrelevant to clustering. Using StandardScaler (zero mean, unit variance) or MinMaxScaler ensures all features contribute comparably to distance computation.
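The dominance effect is visible in the per-feature variances, which determine each feature's contribution to squared distances. A small sketch with synthetic data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Feature A spans [0, 1000] and feature B spans [0, 1]: without scaling,
# squared distances are dominated by A's variance.
rng = np.random.default_rng(0)
X = np.hstack([rng.uniform(0, 1000, size=(200, 1)),   # feature A
               rng.uniform(0, 1, size=(200, 1))])     # feature B

X_scaled = StandardScaler().fit_transform(X)

print(X.var(axis=0).round(1))         # A's variance dwarfs B's
print(X_scaled.var(axis=0).round(1))  # both features now have unit variance
```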
Q: Why does sklearn run K-Means multiple times?
A: Since K-Means can converge to local optima, sklearn's KMeans runs the algorithm several times with different random seeds and keeps the result with the lowest WCSS (n_init, historically 10 by default; since version 1.4 the default is n_init='auto'). Increasing n_init improves the probability of finding a better solution but increases computation time linearly. For critical tasks, consider setting it to 20-50.
Q: Can K-Means handle categorical data?
A: Not directly. K-Means relies on Euclidean distance and the mean operation, which have no mathematical meaning for categorical data (e.g., what is the mean of "red" and "blue"?). For categorical data, use the K-Modes algorithm (Huang, 1998), which replaces the mean with the mode and Euclidean distance with Hamming distance. For mixed data (both numerical and categorical features), use the K-Prototypes algorithm.
Q: How can K-Means be sped up on very large datasets?
A: Mini-Batch K-Means (Sculley, 2010) uses only a random subset (mini-batch) of the data to update centroids at each iteration, rather than the full dataset. This makes it several to tens of times faster than standard K-Means on large datasets, at the cost of slightly lower clustering quality (typically within 1-3%). sklearn provides the MiniBatchKMeans class. When data exceeds 100,000 samples, the Mini-Batch version is recommended.
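A side-by-side comparison on the same synthetic data; batch_size and the dataset are illustrative choices:

```python
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

# Full-batch vs mini-batch K-Means on identical data.
X, _ = make_blobs(n_samples=20000, centers=5, random_state=0)

full = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
mini = MiniBatchKMeans(n_clusters=5, batch_size=1024, n_init=10, random_state=0).fit(X)

# The mini-batch WCSS is typically only slightly above the full-batch result.
print(round(mini.inertia_ / full.inertia_, 3))
```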
Q: How do I evaluate the quality of a clustering result?
A: When ground truth labels are unavailable (the common case in unsupervised learning), rely on internal evaluation metrics: (1) Silhouette Score -- closer to 1 is better, measures both within-cluster compactness and between-cluster separation; (2) Calinski-Harabasz Index -- ratio of between-cluster to within-cluster variance, higher is better; (3) Davies-Bouldin Index -- measures inter-cluster similarity, lower is better. When ground truth is available, use Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI) to evaluate agreement between clustering and true labels.