KNN Algorithm Explained

1. What is KNN

K-Nearest Neighbors (KNN) was formally introduced by Thomas Cover and Peter Hart in their 1967 paper "Nearest Neighbor Pattern Classification". It is one of the most intuitive algorithms in machine learning: given an unknown sample, find the K closest samples in the training set and use their labels to vote on the prediction.

KNN is called a "lazy learner". Why "lazy"? Because KNN performs zero computation during the training phase -- it simply stores the training data in memory. All computation (distance calculation, sorting, voting) is deferred to prediction time. This contrasts with "eager learners" (like logistic regression and SVM) that build explicit model parameters during training.

Why this is a strength: No training time; the model instantly adapts to new data (just append to the training set); decision boundaries can be arbitrarily nonlinear, unconstrained by model assumptions.
Why this is a weakness: Prediction time scales linearly with training set size (must compute distance to every training sample), O(n*d); must store all training data, causing high memory usage; cannot extract interpretable rules or feature importance.
Cover-Hart Theorem (1967): As the number of training samples n approaches infinity, the error rate of the 1-NN classifier is no more than twice the Bayes optimal error rate. This theorem explains why such a simple algorithm has strong theoretical guarantees -- it shows that with sufficient data, the nearest neighbor method approaches theoretical optimality.

2. How KNN Works

The KNN workflow can be decomposed into the following steps, each with a clear mathematical rationale:

  1. Store training data: Save all training samples (x_i, y_i) in memory. Why no preprocessing? Because KNN assumes the prediction data follows the same distribution as training data, and preserving raw data maximizes information retention.
  2. Compute distances: Given a query point x_q, compute its distance d(x_q, x_i) to every training sample x_i. Why use distance as a similarity measure? Because in the feature space, similar samples should be geometrically close -- this is KNN's core assumption (local consistency assumption).
  3. Sort and select neighbors: Sort by distance in ascending order and select the K nearest samples. Why not just pick the single nearest one? Because a single sample is susceptible to noise (high variance), and voting across multiple neighbors smooths out noise.
  4. Vote or average: Classification: the most frequent class among the K neighbors becomes the prediction (majority vote). Regression: average the target values of the K neighbors. Why majority vote? Because in a local region, the most common class is most likely the true class (an intuitive version of the maximum likelihood principle).
  5. Output prediction: Return the vote result or average value.
Classification: y_pred = argmax_c SUM_{i in N_K(x_q)} I(y_i = c)
Regression:     y_pred = (1/K) * SUM_{i in N_K(x_q)} y_i

where N_K(x_q) is the set of K nearest neighbors of x_q, and I(.) is the indicator function, returning 1 when the condition is true and 0 otherwise.
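The vote and average in steps 4-5 can be sketched on a toy set of neighbors. The labels and target values below are made up purely for illustration; assume the K = 5 nearest neighbors have already been found:

```python
import numpy as np
from collections import Counter

# Hypothetical K = 5 nearest neighbors of a query point
neighbor_labels = ["cat", "dog", "cat", "cat", "dog"]    # classification labels
neighbor_targets = np.array([2.0, 3.0, 2.5, 2.0, 3.5])   # regression targets

# Classification: majority vote among the K neighbors
y_class = Counter(neighbor_labels).most_common(1)[0][0]
print(y_class)   # cat (3 votes vs. 2)

# Regression: mean of the K neighbors' target values
y_reg = neighbor_targets.mean()
print(y_reg)     # 2.6
```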

3. Distance Metrics

The choice of distance metric directly determines "who is a neighbor," making it critical to KNN performance. Different metrics suit different data characteristics, and each exists for specific mathematical and practical reasons.

3.1 Euclidean Distance (L2)

d(x, y) = sqrt( SUM_{i=1}^{d} (x_i - y_i)^2 )

Why is it the default? Euclidean distance is the mathematical formalization of the intuitive "straight-line distance." It is rotationally invariant -- the distance between two points remains the same regardless of how the coordinate system is rotated. This means it has no directional bias and treats all dimensions equally. When features are on the same scale and equally important, Euclidean distance is the most natural choice.

Proposed by: Based on Euclidean geometry (circa 300 BC), adopted as the default in modern KNN by Cover & Hart (1967).

Best for: Continuous numerical features, standardized features, low-to-medium dimensionality.

3.2 Manhattan Distance (L1)

d(x, y) = SUM_{i=1}^{d} |x_i - y_i|

Why does it exist? In high-dimensional sparse data, Euclidean distance suffers from "distance concentration" -- all pairwise distances converge to similar values, destroying discriminative power. Aggarwal, Hinneburg, and Keim proved in their 2001 paper "On the Surprising Behavior of Distance Metrics in High Dimensional Space" that as dimensionality increases, L1 (Manhattan) distance preserves better contrast than L2 (Euclidean). The mathematical reason is that L1 has more uniform sensitivity to per-dimension differences, unlike L2 where a few large differences dominate due to squaring.

Best for: High-dimensional sparse data, grid-like path problems, features with large value ranges where squaring amplification is undesirable.

3.3 Minkowski Distance (Lp)

d(x, y) = ( SUM_{i=1}^{d} |x_i - y_i|^p )^(1/p)

  p = 1 reduces to Manhattan distance
  p = 2 reduces to Euclidean distance
  p -> inf reduces to Chebyshev distance (the maximum difference across dimensions)

Why does it exist? Minkowski distance is the unified generalization of L1 and L2, proposed by Hermann Minkowski (1896). The parameter p provides a continuous knob to adjust between L1 and L2 behavior. Smaller p values are more "democratic" across dimensions (each contributes similarly), while larger p values let the dimension with the largest difference dominate. Choosing the optimal p via cross-validation can tailor the metric to the specific data distribution.

Best for: When you are unsure whether L1 or L2 is better -- treat p as a tunable hyperparameter.

3.4 Cosine Distance

cosine_similarity(x, y) = (x . y) / (||x|| * ||y||)
cosine_distance(x, y) = 1 - cosine_similarity(x, y)

Why does it exist? In text classification (TF-IDF vectors) and recommendation systems, what matters is the direction of two vectors, not their magnitude. For example, a document mentioning "machine learning" 10 times and another mentioning it 100 times may have the same topic -- they just differ in length. Cosine distance ignores vector magnitude (it is magnitude-invariant), comparing only direction. This captures "topic similarity" rather than "quantity similarity," which is why cosine distance is the standard choice in NLP.

Proposed by: Developed in the information retrieval field, popularized by Gerard Salton (1971) in the vector space model.

Best for: Text classification, document similarity, recommendation systems, high-dimensional sparse vectors.

3.5 Distance Metric Comparison

Metric         | Properties                                                   | Best For                                      | Weakness
Euclidean (L2) | Rotationally invariant; influenced by large per-dimension differences | Continuous features, low-D, standardized data | Contrast degrades in high-D
Manhattan (L1) | Uniform per-dimension contribution; more robust to outliers  | High-D sparse data, many discrete features    | Not rotationally invariant
Minkowski (Lp) | Tunable p parameter balances L1/L2 behavior                  | When metric tuning is needed                  | Extra hyperparameter adds complexity
Cosine         | Direction-only; ignores magnitude                            | Text/NLP, recommendation systems              | Unsuitable when magnitude is meaningful
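To make the contrasts in the table concrete, here is a small sketch computing all four metrics on a hypothetical pair of vectors. Note that y = 2x, so the cosine distance is essentially 0 (same direction) even though the L1/L2 distances are not:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])   # same direction as x, twice the magnitude

euclidean = np.sqrt(np.sum((x - y) ** 2))           # L2: sqrt(1 + 4 + 9)
manhattan = np.sum(np.abs(x - y))                   # L1: 1 + 2 + 3
p = 3
minkowski = np.sum(np.abs(x - y) ** p) ** (1 / p)   # Lp with p = 3
cos_sim = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
cosine_dist = 1 - cos_sim                           # ~0: y is just a scaled x

print(f"L2={euclidean:.3f}  L1={manhattan:.3f}  "
      f"L3={minkowski:.3f}  cosine={cosine_dist:.3f}")
```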

4. Choosing K

Choosing K is the most critical hyperparameter decision in KNN, directly controlling the bias-variance tradeoff.

4.1 Bias-Variance Tradeoff Explained

Small K (e.g., K=1) -- Low bias / High variance: The decision boundary tightly follows every training point, capturing fine local patterns. Why low bias? Because the model almost perfectly fits the training data (1-NN has zero training error). Why high variance? Because a small perturbation in the data (adding or removing one sample) can significantly change the decision boundary. The model is oversensitive to noise and prone to overfitting.

Large K (e.g., K=n) -- High bias / Low variance: The decision boundary becomes extremely smooth. Why high bias? Because the model over-averages, ignoring local structure (in the extreme case K=n, every query predicts the same class -- the majority class). Why low variance? Because voting over many neighbors makes predictions very stable, unaffected by individual samples.

Total Error = Bias^2 + Variance + Irreducible Noise

The optimal K minimizes total error -- the sweet spot between bias and variance.

4.2 Why Use Odd K for Binary Classification

In binary classification, if K is even (e.g., K=4), you may get a 2-vs-2 tie, forcing the algorithm to use a tiebreaker rule (such as random selection or nearest-neighbor preference), which introduces nondeterminism. Using odd K mathematically guarantees no ties (since an odd number cannot be evenly split), making predictions deterministic.

Note: In multiclass problems, ties can still occur with odd K (e.g., 3 classes each receiving 1 vote with K=3), so the odd-K recommendation applies primarily to binary classification.

4.3 Cross-Validation for K Selection

The most reliable approach in practice is to use k-fold cross-validation (note: lowercase k here differs from the KNN parameter K) to select the optimal K. A common procedure is to test odd values in the range K = 1, 3, 5, 7, ..., sqrt(n), selecting the K with the highest validation accuracy.

Rule of thumb: sqrt(n) is a reasonable upper bound for K (where n is the number of training samples). Why? Because if K is much larger than sqrt(n), the neighborhood spans too much of the data space, destroying locality.

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
import numpy as np

k_range = range(1, 32, 2)  # odd K values: 1, 3, 5, ..., 31
scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    cv_scores = cross_val_score(knn, X_train, y_train, cv=5, scoring='accuracy')
    scores.append(cv_scores.mean())

best_k = list(k_range)[np.argmax(scores)]
print(f"Best K = {best_k}, Accuracy = {max(scores):.4f}")

5. Curse of Dimensionality

The Curse of Dimensionality is the core reason KNN fails on high-dimensional data. The term was coined by Richard Bellman in 1961 in his book "Adaptive Control Processes."

5.1 Why KNN Fails in High Dimensions

Core problem: As dimensionality d increases, distances between data points converge to similar values, making the difference between "nearest neighbor" and "farthest neighbor" negligible. Why does this happen?

Mathematical explanation: Consider n points uniformly distributed in a d-dimensional unit hypercube [0,1]^d. The expected Euclidean distance between two random points grows as E[dist] ~ sqrt(d/6), while its standard deviation stays Std[dist] ~ O(1). Therefore:

Distance contrast = Std[dist] / E[dist] ~ O(1/sqrt(d))

As d -> inf, the contrast approaches 0: all pairwise distances become nearly identical.

When nearest and farthest neighbors are nearly equidistant, the concept of "neighbor" based on distance loses meaning -- KNN's core assumption (nearby points share labels) no longer holds.
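This concentration effect is easy to observe empirically. The sketch below, using made-up uniform data, measures the relative gap between the nearest and farthest point from a random query as d grows; the gap shrinks dramatically:

```python
import numpy as np

rng = np.random.default_rng(0)
contrasts = {}
for d in [2, 10, 100, 1000]:
    X = rng.random((1000, d))   # 1000 points uniform in the unit hypercube
    q = rng.random(d)           # a random query point
    dists = np.sqrt(((X - q) ** 2).sum(axis=1))
    # Relative gap between the farthest and nearest point
    contrasts[d] = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:4d}  relative contrast = {contrasts[d]:.3f}")
```

In low dimensions the nearest point is many times closer than the farthest; by d = 1000 they are nearly equidistant, which is exactly why "nearest neighbor" stops being informative.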

5.2 Hughes Phenomenon

Gordon Hughes (1968) discovered a counterintuitive phenomenon in his paper "On the Mean Accuracy of Statistical Pattern Recognizers": with a fixed number of training samples, classifier performance first improves then degrades as feature dimensionality increases. Why?

Adding dimensions initially provides more information, but beyond a critical point the high-dimensional space becomes extremely sparse (data points cannot adequately cover the feature space), and the model begins fitting noise dimensions rather than useful signal. The required number of training samples grows exponentially with dimensionality: if n samples achieve a given density in d_0 dimensions, roughly n^(d/d_0) samples are needed to match that density in d dimensions.

Example: 10 samples cover [0,1] in 1D, 100 samples needed to cover [0,1]^2 in 2D, 10^10 samples needed to cover [0,1]^10 in 10D!

5.3 Solution: Dimensionality Reduction Before KNN

PCA (Principal Component Analysis) + KNN is the classic combination strategy. Why does PCA help? PCA projects data onto the directions of maximum variance via linear transformation, eliminating redundant and noisy dimensions while preserving the most informative components. After reduction, data density increases and distance metrics become meaningful again.

from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# PCA + KNN pipeline: standardize, reduce dimensions, then KNN
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=0.95)),  # retain 95% of the variance
    ('knn', KNeighborsClassifier(n_neighbors=5))
])
pipe.fit(X_train, y_train)
print(f"Accuracy after PCA: {pipe.score(X_test, y_test):.4f}")

Other reduction methods: t-SNE (visualization only, not suitable for KNN prediction), UMAP (faster and preserves global structure), feature selection (e.g., mutual information, L1 regularization).

6. Feature Scaling

Feature scaling is not optional for KNN -- it is a mandatory preprocessing step. Why?

Core reason: KNN is distance-based, and differing scales across features cause some features to unfairly dominate the distance calculation. For example, "age" ranges [0, 100] while "annual income" ranges [0, 1,000,000] -- without scaling, Euclidean distance is almost entirely determined by income, and age has negligible influence.

Unscaled: d = sqrt((30-25)^2 + (500000-100000)^2) = sqrt(25 + 160,000,000,000) ~ 400,000
  The age difference of 5 is drowned in a distance of 400,000!
Scaled:   d = sqrt((0.5-0.3)^2 + (0.6-0.1)^2) = sqrt(0.04 + 0.25) ~ 0.54
  Both features contribute proportionally.
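The same effect can be demonstrated in code. The sketch below uses three hypothetical (age, annual income) rows; before scaling, the income column alone decides which neighbor is "closest", even across a 45-year age gap:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical (age, annual income) rows
X = np.array([[25.0, 100_000.0],
              [30.0, 500_000.0],
              [70.0, 110_000.0]])

# Raw distances from person 0: income dominates, so the 70-year-old with a
# similar income looks ~40x closer than the similarly-aged 30-year-old
d_raw_01 = np.linalg.norm(X[0] - X[1])
d_raw_02 = np.linalg.norm(X[0] - X[2])
print(d_raw_01, d_raw_02)

# After scaling each feature to [0, 1], both features contribute comparably
Xs = MinMaxScaler().fit_transform(X)
d_s_01 = np.linalg.norm(Xs[0] - Xs[1])
d_s_02 = np.linalg.norm(Xs[0] - Xs[2])
print(d_s_01, d_s_02)
```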

Scaling Methods Compared

Method         | Formula                 | Best For                  | Why Choose It
StandardScaler | (x - mean) / std        | Normally distributed data | Preserves distribution shape; centers to mean 0, std 1
MinMaxScaler   | (x - min) / (max - min) | Bounded data              | Maps all features to [0, 1]; good when there are no outliers
RobustScaler   | (x - median) / IQR      | Data with outliers        | Based on median and IQR, robust to outliers

7. Python Implementation

7.1 KNN from Scratch (20 Lines)

The following code demonstrates the core logic of a KNN classifier, helping you understand each step of the algorithm:

import numpy as np
from collections import Counter

class KNNClassifier:
    def __init__(self, k=5):
        self.k = k

    def fit(self, X, y):
        # "Lazy" learning: just store the data, no computation
        self.X_train = np.array(X)
        self.y_train = np.array(y)
        return self

    def predict(self, X):
        X = np.array(X)
        return np.array([self._predict_one(x) for x in X])

    def _predict_one(self, x):
        # Compute Euclidean distance to all training samples
        dists = np.sqrt(np.sum((self.X_train - x) ** 2, axis=1))
        # Select the indices of the K nearest neighbors
        k_idx = np.argsort(dists)[:self.k]
        # Majority vote among the neighbors' labels
        k_labels = self.y_train[k_idx]
        most_common = Counter(k_labels).most_common(1)
        return most_common[0][0]

7.2 Using sklearn

from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report
from sklearn.datasets import load_iris

# Load data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Feature scaling (mandatory for KNN!)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# Grid search for the optimal K, distance metric, and weighting scheme
param_grid = {
    'n_neighbors': [3, 5, 7, 9, 11],
    'metric': ['euclidean', 'manhattan', 'minkowski'],
    'weights': ['uniform', 'distance']  # uniform vs. distance-weighted voting
}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring='accuracy')
grid.fit(X_train_s, y_train)

print(f"Best params: {grid.best_params_}")
print(f"CV accuracy: {grid.best_score_:.4f}")
print(f"Test accuracy: {grid.score(X_test_s, y_test):.4f}")
print(classification_report(y_test, grid.predict(X_test_s)))

8. Weighted KNN

S.A. Dudani proposed distance-weighted KNN in his 1976 paper "The Distance-Weighted k-Nearest-Neighbor Rule."

Why is weighting needed? In standard KNN, all K neighbors vote with equal weight -- a neighbor at distance 0.1 has the same influence as one at distance 9.9. This is intuitively wrong: closer neighbors should be more trustworthy (because the closer the distance, the higher the probability of sharing the same label).

Weighting scheme: The most common weight function is the inverse of distance: w_i = 1/d(x_q, x_i). This way, a neighbor at distance 0.1 has 100 times the weight of one at distance 10.

Weighted vote:       y_pred = argmax_c SUM_{i in N_K} w_i * I(y_i = c)
Weighted regression: y_pred = SUM_{i in N_K} w_i * y_i / SUM_{i in N_K} w_i

where w_i = 1 / d(x_q, x_i) (inverse distance weight)

Benefits of weighted KNN: (1) More robust to K selection -- even if some distant neighbors are included, their low weights limit their impact; (2) smoother decision boundaries; (3) in sklearn, simply set weights='distance'.
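A minimal sketch of the inverse-distance vote. The helper name `weighted_vote` and the `eps` guard against division by zero (when a neighbor coincides exactly with the query point) are this sketch's own choices, not part of Dudani's formulation:

```python
import numpy as np

def weighted_vote(distances, labels, eps=1e-12):
    # Inverse-distance weights; eps avoids division by zero at distance 0
    w = 1.0 / (np.asarray(distances, dtype=float) + eps)
    labels = np.asarray(labels)
    # Sum the weights per class and return the heaviest class
    scores = {c: w[labels == c].sum() for c in np.unique(labels)}
    return max(scores, key=scores.get)

# Two very close "A" neighbors outvote three distant "B" neighbors
dists = [0.1, 0.2, 2.0, 2.5, 3.0]
labels = ["A", "A", "B", "B", "B"]
print(weighted_vote(dists, labels))   # A (an unweighted vote would say B)
```

Here class A gets weight 1/0.1 + 1/0.2 = 15 while class B gets only about 1.23, flipping the unweighted 2-vs-3 outcome.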

9. When to Use / Not Use KNN

Condition                          | Use KNN? | Reason
Small dataset (<10K samples)       | Yes      | Prediction time is acceptable; KNN is competitive on small datasets
Low-dimensional features (<20D)    | Yes      | Distance metrics are effective; avoids curse of dimensionality
Nonlinear decision boundaries      | Yes      | KNN automatically adapts to arbitrarily shaped boundaries
Rapid prototyping needed           | Yes      | No training required; immediately usable
Data updated frequently            | Yes      | New data can be appended to the training set; no retraining needed
Large dataset (>100K samples)      | No       | Prediction time O(n*d) is too slow (unless using KD-Tree/Ball Tree)
High-dimensional features (>50D)   | No       | Curse of dimensionality renders distances meaningless
Interpretability required          | No       | KNN cannot output feature importance or rules
Real-time prediction needed        | No       | Each prediction requires scanning the entire training set
Features contain heavy noise       | No       | Noisy dimensions corrupt distance calculations

10. KNN vs Other Algorithms

Aspect             | KNN                 | SVM                 | Decision Tree            | Logistic Regression
Training time      | O(1) (no training)  | O(n^2) ~ O(n^3)     | O(n*d*log n)             | O(n*d*iterations)
Prediction time    | O(n*d)              | O(sv*d)             | O(tree depth)            | O(d)
Decision boundary  | Arbitrary nonlinear | Kernel-dependent    | Axis-aligned rectangles  | Linear
Needs scaling      | Yes (mandatory)     | Yes                 | No                       | Recommended
Interpretability   | Low                 | Low                 | High                     | High
High-D performance | Poor                | Good (kernel trick) | Moderate                 | Good

11. FAQ

Q: Can KNN be used for regression?

Yes. KNN regression works identically -- after finding K nearest neighbors, instead of majority vote, it takes the mean (or distance-weighted mean) of their target values. Sklearn provides KNeighborsRegressor. KNN regression works well when the target has strong local relationships with features, but it is sensitive to outliers.
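A minimal KNeighborsRegressor sketch on made-up 1-D data (the linear target y = 2x plus noise is purely illustrative):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(42)
X = np.linspace(0, 10, 50).reshape(-1, 1)    # 50 points on [0, 10]
y = 2 * X.ravel() + rng.normal(0, 0.1, 50)   # y = 2x plus a little noise

# Distance-weighted mean of the 5 nearest neighbors' targets
reg = KNeighborsRegressor(n_neighbors=5, weights="distance")
reg.fit(X, y)
pred = reg.predict([[5.0]])[0]
print(pred)   # close to 2 * 5 = 10
```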

Q: Can KNN's time complexity be optimized?

Yes. Brute-force search is O(n*d), but spatial index structures can accelerate it: KD-Tree (Bentley, 1975) reduces queries to O(d*log n) in low dimensions (d < 20); Ball Tree (Omohundro, 1989) works better in medium-to-high dimensions. Sklearn's algorithm='auto' automatically selects the best method. For very large datasets, approximate nearest neighbor methods like FAISS (Facebook) or Annoy (Spotify) can be used.
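KD-Tree and Ball Tree are exact methods, so they return the same neighbors as brute force, only faster. The sketch below (on random made-up data) checks that by running the same query under each of sklearn's `algorithm` options:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.random((1000, 5))    # low-dimensional data, where trees shine
q = rng.random((1, 5))       # a single query point

results = {}
for algo in ["brute", "kd_tree", "ball_tree"]:
    nn = NearestNeighbors(n_neighbors=5, algorithm=algo).fit(X)
    _, idx = nn.kneighbors(q)
    results[algo] = list(idx[0])
    print(algo, results[algo])
```

All three searches return identical neighbor indices; only the query cost differs. Approximate methods like FAISS or Annoy trade a little of this exactness for much larger speedups.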

Q: How does KNN handle class imbalance?

With class imbalance, majority-class samples are more likely to appear in the neighborhood, drowning out the minority class. Solutions: (1) use distance weighting (weights='distance') so closer neighbors have more influence; (2) oversample the minority class (SMOTE); (3) reduce K to focus on local structure; (4) modify the voting rule to use weighted proportions rather than absolute counts.

Q: What is the difference between KNN and K-Means?

Despite both having K in the name, they are completely different algorithms. KNN is supervised learning (classification/regression) where K is the number of neighbors, requiring no training. K-Means is unsupervised learning (clustering) where K is the number of clusters, optimizing cluster centers iteratively. KNN's K is a prediction-time hyperparameter; K-Means' K is a training-time hyperparameter.

Q: Why is KNN's training error always 0 when K=1?

When K=1, each training sample's nearest neighbor is itself (distance 0), so the predicted label always equals the true label, giving zero training error. This is why training error is a misleading metric for KNN -- it is always optimistically low. You must use cross-validation or a held-out test set to evaluate true performance.
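This is easy to verify on the Iris dataset: scoring a 1-NN model on its own training data always yields perfect accuracy, which says nothing about generalization.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# K=1: each training point's nearest neighbor is itself (distance 0),
# so training accuracy is exactly 1.0 -- an uninformative number
knn1 = KNeighborsClassifier(n_neighbors=1).fit(X, y)
acc = knn1.score(X, y)
print(acc)   # 1.0
```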