KNN Algorithm Explained in Detail
Table of Contents
1. What is KNN
2. How KNN Works
3. Distance Metrics
4. Choosing K
5. Curse of Dimensionality
6. Feature Scaling
7. Python Implementation
8. Weighted KNN
9. When to Use / Not Use KNN
10. KNN vs Other Algorithms
11. FAQ
1. What is KNN
K-Nearest Neighbors (KNN) was formally introduced by Thomas Cover and Peter Hart in their 1967 paper "Nearest Neighbor Pattern Classification". It is one of the most intuitive algorithms in machine learning: given an unknown sample, find the K closest samples in the training set and use their labels to vote on the prediction.
KNN is called a "lazy learner". Why "lazy"? Because KNN performs zero computation during the training phase -- it simply stores the training data in memory. All computation (distance calculation, sorting, voting) is deferred to prediction time. This contrasts with "eager learners" (like logistic regression and SVM) that build explicit model parameters during training.
2. How KNN Works
The KNN workflow can be decomposed into the following steps, each with a clear mathematical rationale:
- Store training data: Save all training samples (x_i, y_i) in memory. Why no preprocessing? Because KNN assumes the prediction data follows the same distribution as training data, and preserving raw data maximizes information retention.
- Compute distances: Given a query point x_q, compute its distance d(x_q, x_i) to every training sample x_i. Why use distance as a similarity measure? Because in the feature space, similar samples should be geometrically close -- this is KNN's core assumption (local consistency assumption).
- Sort and select neighbors: Sort by distance in ascending order and select the K nearest samples. Why not just pick the single nearest one? Because a single sample is susceptible to noise (high variance), and voting across multiple neighbors smooths out noise.
- Vote or average: Classification: the most frequent class among the K neighbors becomes the prediction (majority vote). Regression: average the target values of the K neighbors. Why majority vote? Because in a local region, the most common class is most likely the true class (an intuitive version of the maximum likelihood principle).
- Output prediction: Return the vote result or average value.
3. Distance Metrics
The choice of distance metric directly determines "who is a neighbor," making it critical to KNN performance. Different metrics suit different data characteristics, and each exists for specific mathematical and practical reasons.
3.1 Euclidean Distance (L2)
Why is it the default? Euclidean distance is the mathematical formalization of the intuitive "straight-line distance." It is rotationally invariant -- the distance between two points remains the same regardless of how the coordinate system is rotated. This means it has no directional bias and treats all dimensions equally. When features are on the same scale and equally important, Euclidean distance is the most natural choice.
Proposed by: Based on Euclidean geometry (circa 300 BC), adopted as the default in modern KNN by Cover & Hart (1967).
Best for: Continuous numerical features, standardized features, low-to-medium dimensionality.
3.2 Manhattan Distance (L1)
Why does it exist? In high-dimensional sparse data, Euclidean distance suffers from "distance concentration" -- all pairwise distances converge to similar values, destroying discriminative power. Aggarwal, Hinneburg, and Keim proved in their 2001 paper "On the Surprising Behavior of Distance Metrics in High Dimensional Space" that as dimensionality increases, L1 (Manhattan) distance preserves better contrast than L2 (Euclidean). The mathematical reason is that L1 has more uniform sensitivity to per-dimension differences, unlike L2 where a few large differences dominate due to squaring.
Best for: High-dimensional sparse data, grid-like path problems, features with large value ranges where squaring amplification is undesirable.
3.3 Minkowski Distance (Lp)
Why does it exist? Minkowski distance is the unified generalization of L1 and L2, proposed by Hermann Minkowski (1896). The parameter p provides a continuous knob to adjust between L1 and L2 behavior. Smaller p values are more "democratic" across dimensions (each contributes similarly), while larger p values let the dimension with the largest difference dominate. Choosing the optimal p via cross-validation can tailor the metric to the specific data distribution.
Best for: When you are unsure whether L1 or L2 is better -- treat p as a tunable hyperparameter.
3.4 Cosine Distance
Why does it exist? In text classification (TF-IDF vectors) and recommendation systems, what matters is the direction of two vectors, not their magnitude. For example, a document mentioning "machine learning" 10 times and another mentioning it 100 times may have the same topic -- they just differ in length. Cosine distance ignores vector magnitude (it is magnitude-invariant), comparing only direction. This captures "topic similarity" rather than "quantity similarity," which is why cosine distance is the standard choice in NLP.
Proposed by: Developed in the information retrieval field, popularized by Gerard Salton (1971) in the vector space model.
Best for: Text classification, document similarity, recommendation systems, high-dimensional sparse vectors.
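To make the differences concrete, here is a minimal sketch (using NumPy and SciPy; the two sample vectors are made up purely for illustration) that computes all four metrics for the same pair of points:

```python
import numpy as np
from scipy.spatial import distance

# Two illustrative feature vectors (values are arbitrary)
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

print(distance.euclidean(a, b))        # L2: sqrt(3^2 + 2^2 + 0^2) ≈ 3.606
print(distance.cityblock(a, b))        # L1 (Manhattan): |3| + |2| + |0| = 5
print(distance.minkowski(a, b, p=3))   # Lp with p=3: larger p lets the biggest gap dominate
print(distance.cosine(a, b))           # 1 - cos(angle): ignores vector magnitude
```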
3.5 Distance Metric Comparison
| Metric | Properties | Best For | Weakness |
|---|---|---|---|
| Euclidean (L2) | Rotationally invariant; influenced by large per-dimension differences | Continuous features, low-D, standardized data | Contrast degrades in high-D |
| Manhattan (L1) | Uniform per-dimension contribution; more robust to outliers | High-D sparse data, many discrete features | Not rotationally invariant |
| Minkowski (Lp) | Tunable p parameter balances L1/L2 behavior | When metric tuning is needed | Extra hyperparameter adds complexity |
| Cosine | Direction-only; ignores magnitude | Text/NLP, recommendation systems | Unsuitable when magnitude is meaningful |
4. Choosing K
Choosing K is the most critical hyperparameter decision in KNN, directly controlling the bias-variance tradeoff.
4.1 Bias-Variance Tradeoff Explained
Small K (e.g., K=1) -- Low bias / High variance: The decision boundary tightly follows every training point, capturing fine local patterns. Why low bias? Because the model almost perfectly fits the training data (1-NN has zero training error). Why high variance? Because a small perturbation in the data (adding or removing one sample) can significantly change the decision boundary. The model is oversensitive to noise and prone to overfitting.
Large K (e.g., K=n) -- High bias / Low variance: The decision boundary becomes extremely smooth. Why high bias? Because the model over-averages, ignoring local structure (in the extreme case K=n, every query predicts the same class -- the majority class). Why low variance? Because voting over many neighbors makes predictions very stable, unaffected by individual samples.
4.2 Why Use Odd K for Binary Classification
In binary classification, if K is even (e.g., K=4), you may get a 2-vs-2 tie, forcing the algorithm to use a tiebreaker rule (such as random selection or nearest-neighbor preference), which introduces nondeterminism. Using odd K mathematically guarantees no ties (since an odd number cannot be evenly split), making predictions deterministic.
Note: In multiclass problems, ties can still occur with odd K (e.g., 3 classes each receiving 1 vote with K=3), so the odd-K recommendation applies primarily to binary classification.
4.3 Cross-Validation for K Selection
The most reliable approach in practice is to use k-fold cross-validation (note: lowercase k here differs from the KNN parameter K) to select the optimal K. A common procedure is to test odd values in the range K = 1, 3, 5, 7, ..., sqrt(n), selecting the K with the highest validation accuracy.
Rule of thumb: sqrt(n) is a reasonable upper bound for K (where n is the number of training samples). Why? Because if K is much larger than sqrt(n), the neighborhood spans too much of the data space, destroying locality.
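A minimal sketch of this procedure with scikit-learn (the dataset and fold count are illustrative; in practice you would also scale features first, per Section 6):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
n = len(X)

# Try odd K values from 1 up to roughly sqrt(n)
candidate_ks = range(1, int(np.sqrt(n)) + 1, 2)
scores = {
    k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    for k in candidate_ks
}
best_k = max(scores, key=scores.get)
print(f"best K = {best_k}, CV accuracy = {scores[best_k]:.3f}")
```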
5. Curse of Dimensionality
The Curse of Dimensionality is the core reason KNN fails on high-dimensional data. The term was coined by Richard Bellman in 1961 in his book "Adaptive Control Processes."
5.1 Why KNN Fails in High Dimensions
Core problem: As dimensionality d increases, distances between data points converge to similar values, making the difference between "nearest neighbor" and "farthest neighbor" negligible. Why does this happen?
Mathematical explanation: Consider n points uniformly distributed in a d-dimensional unit hypercube [0,1]^d. The expected Euclidean distance between two random points grows as E[dist] ~ sqrt(d/6), while its standard deviation stays Std[dist] ~ O(1). Therefore the relative spread Std[dist] / E[dist] ~ O(1/sqrt(d)) shrinks toward zero as d increases, so all pairwise distances pile up around the same value.
When nearest and farthest neighbors are nearly equidistant, the concept of "neighbor" based on distance loses meaning -- KNN's core assumption (nearby points share labels) no longer holds.
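A quick simulation makes this concentration effect visible (a rough sketch; the exact numbers depend on the random seed):

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    X = rng.random((1000, d))        # 1000 points uniform in [0,1]^d
    q = rng.random(d)                # one query point
    dists = np.linalg.norm(X - q, axis=1)
    # Relative contrast: how much farther is the farthest point than the nearest?
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:5d}  contrast={contrast:.2f}")
# The contrast shrinks as d grows: in high dimensions all points become "equally far away".
```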
5.2 Hughes Phenomenon
Gordon Hughes (1968) discovered a counterintuitive phenomenon in his paper "On the Mean Accuracy of Statistical Pattern Recognizers": with a fixed number of training samples, classifier performance first improves then degrades as feature dimensionality increases. Why?
Adding dimensions initially provides more information, but beyond a critical point the high-dimensional space becomes extremely sparse (data points cannot adequately cover the feature space), and the model begins fitting noise dimensions rather than useful signal. The required number of training samples grows exponentially with dimensionality: if n_0 samples give adequate coverage in d_0 dimensions, maintaining the same data density in d dimensions requires on the order of n_0^(d/d_0) samples.
5.3 Solution: Dimensionality Reduction Before KNN
PCA (Principal Component Analysis) + KNN is the classic combination strategy. Why does PCA help? PCA projects data onto the directions of maximum variance via linear transformation, eliminating redundant and noisy dimensions while preserving the most informative components. After reduction, data density increases and distance metrics become meaningful again.
Other reduction methods: t-SNE (visualization only, not suitable for KNN prediction), UMAP (faster and preserves global structure), feature selection (e.g., mutual information, L1 regularization).
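A minimal PCA + KNN pipeline sketch with scikit-learn (the dataset, number of components, and K are placeholders you would tune for your own data):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)   # 64-dimensional digit images
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale -> reduce 64D to 20D -> classify with KNN
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=20),
    KNeighborsClassifier(n_neighbors=5),
)
model.fit(X_train, y_train)
print(f"test accuracy: {model.score(X_test, y_test):.3f}")
```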
6. Feature Scaling
Feature scaling is not optional for KNN -- it is a mandatory preprocessing step. Why?
Core reason: KNN is distance-based, and differing scales across features cause some features to unfairly dominate the distance calculation. For example, "age" ranges [0, 100] while "annual income" ranges [0, 1,000,000] -- without scaling, Euclidean distance is almost entirely determined by income, and age has negligible influence.
Scaling Methods Compared
| Method | Formula | Best For | Why Choose It |
|---|---|---|---|
| StandardScaler | (x - mean) / std | Normally distributed data | Preserves distribution shape, centers to mean 0, std 1 |
| MinMaxScaler | (x - min) / (max - min) | Bounded data | Maps all features to [0,1], good when no outliers |
| RobustScaler | (x - median) / IQR | Data with outliers | Based on median and IQR, robust to outliers |
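The age/income example above can be reproduced with a tiny sketch (the toy samples below are invented for illustration): without scaling, the income column decides who the nearest neighbor is; after StandardScaler, age matters again.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy samples: [age, annual income] -- values are illustrative
X = np.array([[25,  40_000],
              [60,  41_000],
              [26, 120_000]])
query = np.array([[27, 45_000]])

def nearest(X, q):
    return int(np.argmin(np.linalg.norm(X - q, axis=1)))

print(nearest(X, query))   # unscaled: income dominates -> picks sample 1 (age 60)

scaler = StandardScaler().fit(X)
print(nearest(scaler.transform(X), scaler.transform(query)))  # scaled: sample 0 (age 25) is nearest
```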
7. Python Implementation
7.1 KNN from Scratch (20 Lines)
The following code demonstrates the core logic of a KNN classifier, helping you understand each step of the algorithm:
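Below is a minimal NumPy sketch of such a classifier (the class and method names are illustrative, not a library API):

```python
import numpy as np
from collections import Counter

class SimpleKNN:
    """A minimal KNN classifier sketch: Euclidean distance, majority vote."""

    def __init__(self, k=5):
        self.k = k

    def fit(self, X, y):
        # "Lazy learning": just memorize the training data
        self.X = np.asarray(X, dtype=float)
        self.y = np.asarray(y)
        return self

    def predict(self, X_query):
        preds = []
        for x in np.asarray(X_query, dtype=float):
            dists = np.linalg.norm(self.X - x, axis=1)   # distance to every training point
            nearest = np.argsort(dists)[: self.k]        # indices of the K closest samples
            votes = Counter(self.y[nearest])             # majority vote among their labels
            preds.append(votes.most_common(1)[0][0])
        return np.array(preds)
```

Usage is the familiar pattern: `SimpleKNN(k=3).fit(X_train, y_train).predict(X_test)`.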
7.2 Using sklearn
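A typical usage sketch with scikit-learn (dataset and hyperparameters are illustrative; scaling is included in the pipeline because of Section 6):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Scale first, then KNN with K=5 (tune K via cross-validation, per Section 4.3)
clf = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
clf.fit(X_train, y_train)
print(f"test accuracy: {clf.score(X_test, y_test):.3f}")
```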
8. Weighted KNN
S.A. Dudani proposed distance-weighted KNN in his 1976 paper "The Distance-Weighted k-Nearest-Neighbor Rule."
Why is weighting needed? In standard KNN, all K neighbors vote with equal weight -- a neighbor at distance 0.1 has the same influence as one at distance 9.9. This is intuitively wrong: closer neighbors should be more trustworthy (because the closer the distance, the higher the probability of sharing the same label).
Weighting scheme: The most common weight function is the inverse of distance: w_i = 1/d(x_q, x_i). This way, a neighbor at distance 0.1 has 100 times the weight of one at distance 10.
Benefits of weighted KNN: (1) More robust to K selection -- even if some distant neighbors are included, their low weights limit their impact; (2) smoother decision boundaries; (3) in sklearn, simply set weights='distance'.
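In scikit-learn this is a one-parameter change; a minimal sketch comparing uniform and distance weighting (the dataset and K are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

for weights in ("uniform", "distance"):
    model = make_pipeline(
        StandardScaler(),
        KNeighborsClassifier(n_neighbors=7, weights=weights),
    )
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"weights={weights!r}: CV accuracy = {score:.3f}")
```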
9. When to Use / Not Use KNN
| Condition | Use KNN? | Reason |
|---|---|---|
| Small dataset (<10K samples) | Yes | Prediction time is acceptable; KNN is competitive on small datasets |
| Low-dimensional features (<20D) | Yes | Distance metrics are effective; avoids curse of dimensionality |
| Nonlinear decision boundaries | Yes | KNN automatically adapts to arbitrarily shaped boundaries |
| Rapid prototyping needed | Yes | No training required; immediately usable |
| Data updated frequently | Yes | New data can be appended to the training set; no retraining needed |
| Large dataset (>100K samples) | No | Prediction time O(n*d) is too slow (unless using KD-Tree/Ball Tree) |
| High-dimensional features (>50D) | No | Curse of dimensionality renders distances meaningless |
| Interpretability required | No | KNN cannot output feature importance or rules |
| Real-time prediction needed | No | Each prediction requires scanning the entire training set |
| Features contain heavy noise | No | Noisy dimensions corrupt distance calculations |
10. KNN vs Other Algorithms
| Dimension | KNN | SVM | Decision Tree | Logistic Regression |
|---|---|---|---|---|
| Training time | O(1) (no training) | O(n^2) ~ O(n^3) | O(n*d*log n) | O(n*d*iterations) |
| Prediction time | O(n*d) | O(sv*d) | O(tree depth) | O(d) |
| Decision boundary | Arbitrary nonlinear | Kernel-dependent | Axis-aligned rectangles | Linear |
| Needs scaling | Yes (mandatory) | Yes | No | Recommended |
| Interpretability | Low | Low | High | High |
| High-D performance | Poor | Good (kernel trick) | Moderate | Good |
11. FAQ
11.1 Can KNN be used for regression?
Yes. KNN regression works identically -- after finding K nearest neighbors, instead of majority vote, it takes the mean (or distance-weighted mean) of their target values. Sklearn provides KNeighborsRegressor. KNN regression works well when the target has strong local relationships with features, but it is sensitive to outliers.
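A minimal KNeighborsRegressor sketch (the synthetic 1-D data is invented for illustration):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)   # noisy sine target

reg = KNeighborsRegressor(n_neighbors=5, weights="distance").fit(X, y)
print(reg.predict([[2.5]]))   # prediction ≈ sin(2.5) ≈ 0.60
```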
11.2 Can KNN be made fast enough for large datasets?
Yes. Brute-force search is O(n*d), but spatial index structures can accelerate it: KD-Tree (Bentley, 1975) reduces queries to O(d*log n) in low dimensions (d < 20); Ball Tree (Omohundro, 1989) works better in medium-to-high dimensions. Sklearn's algorithm='auto' automatically selects the best method. For very large datasets, approximate nearest neighbor methods like FAISS (Facebook) or Annoy (Spotify) can be used.
11.3 How does class imbalance affect KNN, and what can be done about it?
With class imbalance, majority-class samples are more likely to appear in the neighborhood, drowning out the minority class. Solutions: (1) use distance weighting (weights='distance') so closer neighbors have more influence; (2) oversample the minority class (SMOTE); (3) reduce K to focus on local structure; (4) modify the voting rule to use weighted proportions rather than absolute counts.
11.4 What is the difference between KNN and K-Means?
Despite both having K in the name, they are completely different algorithms. KNN is supervised learning (classification/regression) where K is the number of neighbors, requiring no training. K-Means is unsupervised learning (clustering) where K is the number of clusters, optimizing cluster centers iteratively. KNN's K is a prediction-time hyperparameter; K-Means' K is a training-time hyperparameter.
11.5 Why is the training error of 1-NN always zero?
When K=1, each training sample's nearest neighbor is itself (distance 0), so the predicted label always equals the true label, giving zero training error. This is why training error is a misleading metric for KNN -- it is always optimistically low. You must use cross-validation or a held-out test set to evaluate true performance.
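This is easy to verify (any dataset without conflicting duplicate points will do; iris is used here purely for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print(knn.score(X, y))   # 1.0 -- every training point's nearest neighbor is itself
```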