SVM: Support Vector Machines Explained
1. What is SVM (Support Vector Machine)
A Support Vector Machine (SVM) is a powerful supervised learning algorithm primarily used for classification tasks (though it also supports regression via SVR). The core idea is to find an optimal hyperplane in feature space that maximizes the margin between different classes.
History and Naming
SVM's theoretical foundation traces back to Vladimir Vapnik and Alexei Chervonenkis, who proposed the theory of linear classifiers in 1963. They studied how to control a model's generalization ability given finite samples, laying the groundwork for statistical learning theory. However, the practical breakthrough came in 1992 when Bernhard Boser, Isabelle Guyon, and Vladimir Vapnik published their landmark paper introducing the kernel trick into the maximum-margin classifier, enabling SVM to handle nonlinear problems.
The name "support vectors" comes from the algorithm's geometric essence: in the final trained model, only the data points closest to the decision boundary actually participate in defining that boundary -- these critical points are called support vectors. They "support" the decision boundary's position, like poles holding up a tent. Removing any support vector would change the decision boundary, while removing other data points would have no effect on the model.
2. Mathematical Foundation
2.1 The Hyperplane Equation
In d-dimensional space, a hyperplane is a (d-1)-dimensional subspace defined by:
w · x + b = 0
Where w is the normal vector (determines the hyperplane's orientation), b is the bias term (determines the distance from the origin), and x is the input feature vector. For binary classification, the hyperplane divides space into two halves: w · x + b > 0 for the positive class and w · x + b < 0 for the negative class.
2.2 Why Maximize the Margin? -- Vapnik's Structural Risk Minimization
There are infinitely many hyperplanes that can separate two classes. Why does SVM choose the one with the largest margin? This is not an arbitrary design choice -- it has deep theoretical justification.
2.3 Deriving the Margin
The distance from any point x to the hyperplane w · x + b = 0 is:
|w · x + b| / ||w||
We adopt the convention that support vectors satisfy |w · x + b| = 1 (this can always be achieved by rescaling w and b). Then the distance from a support vector to the hyperplane is 1/||w||, and the total margin between support vectors on both sides is:
2/||w||
Maximizing the margin 2/||w|| is equivalent to minimizing ||w||. For mathematical convenience (avoiding the square root in derivatives), we minimize ½||w||². The final optimization problem:
min(w,b) ½||w||²
s.t. yi(w · xi + b) ≥ 1, ∀i
The constraint yi(w · xi + b) ≥ 1 ensures every sample is correctly classified and at least 1/||w|| from the hyperplane. This is a convex quadratic optimization problem that can be efficiently solved via Lagrangian duality.
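As a worked sketch (the standard derivation, not spelled out in the text), the primal above and the dual obtained by introducing Lagrange multipliers αi can be written as:

```latex
% Hard-margin primal
\min_{w,\,b}\ \tfrac{1}{2}\lVert w\rVert^{2}
\quad \text{s.t.}\quad y_i\,(w \cdot x_i + b) \ge 1,\ \forall i

% Lagrangian dual: only inner products x_i \cdot x_j appear,
% which is what later makes the kernel trick possible
\max_{\alpha}\ \sum_{i} \alpha_i
  - \tfrac{1}{2} \sum_{i}\sum_{j} \alpha_i \alpha_j\, y_i y_j\, (x_i \cdot x_j)
\quad \text{s.t.}\quad \sum_{i} \alpha_i y_i = 0,\ \ \alpha_i \ge 0
```

In the soft-margin case the multipliers are additionally capped at C (0 ≤ αi ≤ C), which is the bound noted in Section 5.1.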
3. Hard vs Soft Margin
3.1 Hard-Margin SVM
The formulation above is the hard-margin SVM -- it requires all samples to be correctly classified and satisfy the margin constraint. But real-world data is almost never perfectly linearly separable:
Fatal flaws of hard margin: (1) If the data is not linearly separable, the optimization problem has no solution; (2) Even if linearly separable, individual noise points or outliers can severely distort the decision boundary, causing overfitting.
3.2 Soft-Margin SVM (Cortes & Vapnik, 1995)
Corinna Cortes and Vladimir Vapnik published their groundbreaking paper "Support-vector networks" in 1995, introducing slack variables ξi to allow some samples to violate the margin constraint. This was the key step that transformed SVM from a theoretical tool into a practical algorithm.
min(w,b,ξ) ½||w||² + C Σi ξi
s.t. yi(w · xi + b) ≥ 1 - ξi, ξi ≥ 0, ∀i
3.3 The C Parameter Trade-off
Large C: Harshly penalizes misclassifications → smaller margin, complex boundary → prone to overfitting (sensitive to noise)
Small C: Tolerates misclassifications → larger margin, simpler boundary → prone to underfitting (ignores data patterns)
Mathematically, C acts like the inverse of a regularization parameter. Larger C means weaker regularization; smaller C means stronger regularization.
4. The Kernel Trick
The kernel trick is SVM's most elegant and important idea. It solves a fundamental problem: real-world data is often not linearly separable, but may become linearly separable in a higher-dimensional space.
4.1 Why Do We Need Kernels?
Key insight: A nonlinearly separable problem in low-dimensional space, when mapped to a sufficiently high-dimensional space, is almost certainly linearly separable (Cover's Theorem, 1965). For example, the 2D XOR problem can be separated by a plane in 3D space.
The problem: Directly computing the high-dimensional mapping φ(x) is prohibitively expensive -- if mapping to a million-dimensional space, storage and computation are impractical.
The kernel genius: Through the dual formulation, SVM's optimization and prediction only require computing inner products between data points: φ(xi) · φ(xj). A kernel function K(xi, xj) can compute this inner product directly in the original space, without ever explicitly mapping to the high-dimensional space.
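To make the trick concrete, here is a small numerical check (a hypothetical degree-2 example, not from the text): the polynomial kernel K(x, y) = (x · y)² computed in the original 2-D space equals the inner product of an explicit 3-D feature map, so the mapping never has to be materialized.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for a 2-D vector: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def poly_kernel(x, y):
    """Degree-2 polynomial kernel, computed entirely in the original 2-D space."""
    return np.dot(x, y) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])

lhs = poly_kernel(x, y)        # kernel value, computed in 2 dimensions
rhs = np.dot(phi(x), phi(y))   # same value, computed in 3 dimensions
assert np.isclose(lhs, rhs)
```

The same identity is what lets the dual optimization swap xi · xj for K(xi, xj) without ever touching the high-dimensional space.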
4.2 Historical Background
The idea of kernel methods was first proposed by Aizerman, Braverman, and Rozonoer in 1964 in their work on potential functions. In 1992, Boser, Guyon, and Vapnik combined kernels with the maximum-margin classifier to create the modern SVM. Their key insight: the dual problem's objective function contains only inner products of input vectors, which can be replaced by kernel functions.
4.3 Mercer's Theorem
Not every function of two vectors is a valid kernel. Mercer's theorem gives the criterion: a symmetric function K(xi, xj) corresponds to an inner product φ(xi) · φ(xj) in some feature space if and only if, for every finite set of points, the Gram matrix [K(xi, xj)] is positive semi-definite. This condition is what keeps the dual optimization convex.
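A quick numerical illustration of Mercer's condition (random data and γ = 0.5 are arbitrary choices here): the RBF kernel's Gram matrix is positive semi-definite, so all its eigenvalues are nonnegative up to floating-point rounding.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))  # 20 random points in 3-D
gamma = 0.5                   # arbitrary kernel width for the demo

# Pairwise squared distances, then the RBF Gram matrix K_ij = exp(-gamma * ||xi - xj||^2)
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
K = np.exp(-gamma * sq_dists)

eigvals = np.linalg.eigvalsh(K)
assert eigvals.min() > -1e-10  # positive semi-definite, as Mercer requires
```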
4.4 Common Kernels Explained
Linear Kernel
K(xi, xj) = xi · xj
Why it exists: When data is linearly separable or nearly so, no high-dimensional mapping is needed. The linear kernel is equivalent to using no kernel, giving the best computational efficiency.
Best for: High-dimensional sparse data (e.g., text classification with tens of thousands of features), or when the number of features far exceeds the number of samples. Joachims (1998) demonstrated that linear SVM excels at text classification.
RBF / Gaussian Kernel (Radial Basis Function)
K(xi, xj) = exp(-γ ||xi - xj||²)
Why it's powerful: The RBF kernel implicitly maps data to an infinite-dimensional feature space. This can be seen via Taylor expansion: factoring exp(-γ||x-y||²) = exp(-γ||x||²) exp(-γ||y||²) exp(2γ x · y), the Taylor series of the last factor contains polynomial terms of every degree, each corresponding to a dimension, hence an infinite-dimensional space. By Cover's theorem, data is almost certainly linearly separable in infinite dimensions.
The γ parameter: γ controls the influence radius of each sample. Large γ → each sample only influences nearby regions → complex boundary (overfitting risk). Small γ → each sample influences a wide area → smooth boundary (underfitting risk).
Default choice: Hsu, Chang & Lin (2003) recommend: when unsure which kernel to use, try RBF first. It is flexible enough and has only two hyperparameters (C, γ) to tune.
Polynomial Kernel
K(xi, xj) = (γ xi · xj + r)^d
Why it exists: When interaction relationships between features are important (i.e., cross-feature effects matter), the polynomial kernel can explicitly model d-th order feature interactions without manually engineering cross features. The degree d controls the nonlinearity level.
Best for: NLP (bag-of-words interactions), certain structured features in image recognition. Less commonly used than RBF in practice because it has more hyperparameters (d, γ, r) and worse numerical stability.
Sigmoid Kernel
K(xi, xj) = tanh(γ xi · xj + r)
Why it exists: The sigmoid kernel derives from the neural network activation function tanh. An SVM with a sigmoid kernel is in some sense equivalent to a two-layer perceptron neural network. It exists primarily for historical reasons -- to establish connections between neural networks and SVMs.
Caveat: The sigmoid kernel does not always satisfy Mercer's condition (it is a valid kernel only for certain values of γ and r), so it is rarely used in practice.
5. Hyperparameter Tuning
Hsu, Chang & Lin (2003), in their paper cited over 20,000 times -- "A Practical Guide to Support Vector Classification" -- established the standard methodology for SVM hyperparameter tuning that remains the most authoritative reference to this day.
5.1 The C Parameter
Mathematical meaning: C is the regularization parameter in the soft-margin optimization. In the dual problem, C is the upper bound on the Lagrange multipliers αi (0 ≤ αi ≤ C).
Intuition: C answers the question: "How much are you willing to pay to correctly classify every single training sample?"
Tuning range: Hsu et al. recommend searching in exponential steps from 2^-5 to 2^15, i.e., C ∈ {2^-5, 2^-3, ..., 2^13, 2^15}.
5.2 The γ Parameter (RBF Kernel)
Mathematical meaning: γ = 1/(2σ²), where σ is the standard deviation of the Gaussian function. γ defines the "width" of the Gaussian kernel.
Intuition: γ answers the question: "How far does each training sample's influence reach?"
Tuning range: Recommended search range: 2^-15 to 2^3, i.e., γ ∈ {2^-15, 2^-13, ..., 2^1, 2^3}.
5.3 Grid Search + Cross-Validation
1. Preprocessing: Scale features to [-1, 1] or [0, 1] (SVM is scale-sensitive because distance computation is affected by feature magnitude)
2. Try RBF kernel first
3. Coarse search: Exponential grid for C and γ + cross-validation
4. Fine search: Refine around the best region from coarse search
5. Train the final model on the full training set with the best (C, γ)
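The five steps above can be sketched with scikit-learn (the dataset and the coarse grid spacing are illustrative choices; the exponent ranges follow the Hsu et al. recommendation):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Step 1: scale to [0, 1]; step 2: RBF kernel; step 3: coarse exponential grid + CV
param_grid = {
    "svc__C": 2.0 ** np.arange(-5, 16, 4),       # 2^-5 ... 2^15, coarse steps
    "svc__gamma": 2.0 ** np.arange(-15, 4, 4),   # 2^-15 ... 2^1, coarse steps
}
pipe = make_pipeline(MinMaxScaler(), SVC(kernel="rbf"))
search = GridSearchCV(pipe, param_grid, cv=5)

# Step 5: refit=True (the default) retrains on the full set with the best (C, gamma)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Step 4 (the fine search) would repeat the same call with a narrower grid centered on `search.best_params_`.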
6. Python Implementation
6.1 Basic SVM Classification
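A minimal sketch of SVM classification with scikit-learn (the dataset, split, and hyperparameters are illustrative choices, not prescribed by the text):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Scale first: SVM is distance-based and sensitive to feature magnitudes
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X_train, y_train)

print(f"Test accuracy: {clf.score(X_test, y_test):.3f}")
# Only the support vectors define the boundary (Section 1)
print(f"Support vectors: {clf.n_support_.sum()} of {len(X_train)} samples")
```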
6.2 Kernel Comparison Experiment
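An illustrative comparison of the four kernels from Section 4.4 on a nonlinear toy problem (make_moons; the sample size and noise level are arbitrary choices):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Two interleaving half-circles: not linearly separable in 2-D
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

results = {}
for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    pipe = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    results[kernel] = cross_val_score(pipe, X, y, cv=5).mean()
    print(f"{kernel:8s} mean CV accuracy = {results[kernel]:.3f}")
```

On this kind of data the RBF kernel typically comes out ahead of the linear kernel, consistent with the guidance in Section 4.4.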
7. Multi-class SVM
SVM is inherently a binary classifier. For multi-class problems (K classes), there are two classic strategies:
One-vs-One (OvO)
Method: Train one SVM for every pair of classes, yielding K(K-1)/2 classifiers. At prediction time, each classifier votes, and the class with the most votes wins.
Why OvO: Each sub-problem uses only two classes' data, so training sets are small and training is fast. Although there are many classifiers, each trains quickly. Sklearn uses OvO by default because Hsu & Lin (2002) found experimentally that OvO is typically as good as other methods and trains faster.
One-vs-All / One-vs-Rest (OvA / OvR)
Method: Train one "this class vs. all others" SVM per class, yielding K classifiers. At prediction, choose the class whose decision function value is largest.
Why OvA: Fewer classifiers (K vs. K(K-1)/2), and each classifier uses all training data. The downside is worse class imbalance (1 class vs. all the rest).
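A small sketch of the classifier counts for both strategies on a 10-class problem (digits; wrapping LinearSVC is an illustrative choice): OvO trains K(K-1)/2 = 45 models, OvR trains K = 10.

```python
from sklearn.datasets import load_digits
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = load_digits(return_X_y=True)  # 10 classes

ovo = OneVsOneClassifier(LinearSVC(max_iter=10000)).fit(X, y)
ovr = OneVsRestClassifier(LinearSVC(max_iter=10000)).fit(X, y)

print(len(ovo.estimators_))  # 10 * 9 / 2 = 45 pairwise classifiers
print(len(ovr.estimators_))  # 10 one-vs-rest classifiers
```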
8. Advantages & Limitations
1. Strong generalization: Based on Structural Risk Minimization (SRM), not just Empirical Risk Minimization. Margin maximization directly minimizes the upper bound on generalization error.
2. Effective in high dimensions: Even when features far exceed samples (e.g., genomics: d=20,000, n=100), SVM works well because it relies on support vectors, not all data points.
3. Flexible via kernels: Kernel functions handle arbitrary nonlinearity without explicit high-dimensional mapping.
4. Global optimum: The optimization is convex -- no local optima issues (unlike neural networks).
5. Sparse solution: The final model depends only on support vectors, making it memory-efficient.
1. Slow on large datasets: Training complexity is O(n²) to O(n³) (standard QP solver), impractical beyond ~100K samples. Reason: the n×n kernel matrix must be computed.
2. Scale-sensitive: Distance-based, so features must be standardized first. Unlike tree models, SVM is not scale-invariant.
3. No native probabilities: SVM outputs decision function values, not probabilities. Platt Scaling (Platt, 1999) can convert to probabilities, but it's post-hoc and adds computation.
4. Hyperparameter-sensitive: C and γ choices significantly impact performance and require careful tuning.
5. Limited interpretability: Nonlinear-kernel SVMs are essentially black boxes -- you cannot directly explain feature contributions as you can with decision trees or linear models.
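Point 3 above (no native probabilities) can be seen directly in scikit-learn: by default SVC exposes only decision values, and `probability=True` enables Platt scaling, fitted via internal cross-validation at extra cost (the dataset here is an illustrative choice).

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

clf = SVC(kernel="rbf", probability=True, random_state=0).fit(X, y)
proba = clf.predict_proba(X[:5])

print(proba.shape)        # (5, 3): one calibrated probability per class
print(proba.sum(axis=1))  # each row sums to 1
```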
9. SVM vs Other Algorithms
| Dimension | SVM | Logistic Regression | Random Forest | KNN | Neural Networks |
|---|---|---|---|---|---|
| Theory | Structural Risk Minimization | Maximum Likelihood | Bagging + Random Subspace | Lazy Learning | Universal Approximation |
| Nonlinearity | Via kernels | Manual feature engineering | Natively nonlinear | Natively nonlinear | Natively nonlinear |
| Training speed | O(n²~n³) slow | O(nd) fast | O(n·d·log n·T) med | None needed | Depends on arch |
| Prediction speed | O(sv·d) med | O(d) fast | O(T·log n) med | O(nd) slow | O(d) fast |
| Small-data perf | Excellent | Good | Medium | Good | Poor |
| Large-data perf | Limited by compute | Good | Excellent | Limited by compute | Excellent |
| Feature scaling | Required | Recommended | Not needed | Required | Required |
| Interpretability | Linear kernel only | High | Medium (importances) | High (neighbors) | Low |
| Probabilities | Indirect (Platt) | Direct | Direct | Direct | Direct (Softmax) |
| Best scenario | Small-medium data, high-dim, nonlinear | Linear, need probs | Large data, mixed features | Small data, local structure | Massive data, complex patterns |
10. FAQ
Why is SVM slow on large datasets, and what are the alternatives?
Standard SVM uses a quadratic programming (QP) solver that requires computing and storing the n×n kernel matrix, with time complexity O(n²) to O(n³) and memory O(n²). This is computationally infeasible for 100K+ samples. Alternatives: (1) Use LinearSVC (based on liblinear, O(nd) training for linear kernel); (2) Use SGD approximation (sklearn's SGDClassifier(loss='hinge')); (3) Use Nystroem or RBF Sampler to approximate kernel mapping, then train a linear model; (4) Use GPU-accelerated implementations like ThunderSVM.
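Alternatives (2) and (3) can be sketched as follows (the synthetic dataset and `n_components` value are illustrative choices):

```python
from sklearn.datasets import make_classification
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

# (2) Linear SVM trained by stochastic gradient descent on the hinge loss
sgd = make_pipeline(StandardScaler(), SGDClassifier(loss="hinge", random_state=0))
sgd_score = sgd.fit(X, y).score(X, y)
print("SGD hinge:", round(sgd_score, 3))

# (3) Approximate the RBF feature map with Nystroem, then train a linear model;
#     avoids ever forming the n x n kernel matrix
nys = make_pipeline(
    StandardScaler(),
    Nystroem(kernel="rbf", n_components=100, random_state=0),
    SGDClassifier(loss="hinge", random_state=0),
)
nys_score = nys.fit(X, y).score(X, y)
print("Nystroem + linear:", round(nys_score, 3))
```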
How do I choose between a linear kernel and the RBF kernel?
Rules of thumb: (1) If features >> samples (e.g., text classification with d=10,000, n=1,000), use linear -- the data is already in a high-dimensional space, so a nonlinear mapping is unnecessary and may overfit; (2) If samples >> features and the data is nonlinear, use RBF; (3) If unsure, try RBF + grid search first (Hsu et al. 2003 recommendation).
Can SVM be used for regression?
Yes. Support Vector Regression (SVR, Drucker et al. 1997) uses an epsilon-insensitive loss function: loss is only incurred when the prediction deviates from the true value by more than epsilon. This creates a tube-shaped regression band where errors within the tube are ignored. In sklearn: SVR(kernel='rbf', C=1.0, epsilon=0.1).
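A minimal SVR sketch on synthetic 1-D data (the noise level and epsilon are illustrative choices): points fitted inside the epsilon tube contribute zero loss and do not become support vectors.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)  # noisy sine curve

svr = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(X, y)
r2 = svr.score(X, y)

print("R^2:", round(r2, 3))
print("Support vectors:", len(svr.support_))  # only points on or outside the tube
```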
What does it mean if there are too many support vectors?
If the number of support vectors approaches the training set size (e.g., exceeds 50%), it typically indicates: (1) C is too small, making the model too lenient; (2) The kernel choice is poor and cannot effectively separate the data; (3) The data itself is very noisy or has heavily overlapping classes. Ideally, support vectors should be a small fraction of the training set. Fewer support vectors means a sparser, better-generalizing model.
Why must features be scaled before training an SVM?
SVM is distance-based (margin = 2/||w||, kernel functions use ||xi - xj||). If one feature's scale is far larger than others (e.g., "age 0-100" vs. "income 0-1,000,000"), distance calculations will be dominated by the large-scale feature, and the small-scale feature's information is effectively ignored. After standardization, each feature contributes equally to distances. This differs from tree models, which split on feature value ranks and are scale-invariant.
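The effect can be demonstrated on synthetic data (the blown-up feature simulates a hypothetical "income"-like column; the exact scores will vary): inflating one feature's scale distorts the unscaled model's distances, while a pipeline with StandardScaler restores balanced contributions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=5, random_state=0)
X_bad = X.copy()
X_bad[:, 0] *= 10_000  # this feature now dominates every distance computation

raw = SVC(kernel="rbf")
scaled = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

raw_acc = cross_val_score(raw, X_bad, y, cv=5).mean()
scaled_acc = cross_val_score(scaled, X_bad, y, cv=5).mean()
print("unscaled:", round(raw_acc, 3))
print("scaled:  ", round(scaled_acc, 3))
```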