Linear Regression Explained in Detail

What is Linear Regression

Linear regression is one of the most fundamental and important algorithms in statistics and machine learning. The core idea is simple: find the "best fit line" that describes the linear relationship between independent variables (X) and a dependent variable (y), enabling predictions of continuous numerical values.

Intuitive Understanding

Imagine you have a set of "house area vs. price" data points scattered on a 2D plane. Linear regression finds a straight line that minimizes the total "distance" from all data points to the line. With this line, given any area, you can predict the corresponding price — that is the essence of linear regression.

When to Use It

Linear regression is used to predict continuous values (not categorical labels). Typical applications include: house price prediction, sales forecasting, temperature prediction, stock trend analysis, and modeling the relationship between ad spend and revenue.

Although "linear" sounds simple, linear regression is the cornerstone of many advanced models. Understanding its mathematical principles greatly helps when learning logistic regression, neural networks, SVMs, and more. Mastering linear regression is the first key to mastering machine learning.

Mathematical Foundation

Simple Linear Regression

With one independent variable x and one dependent variable y, the model is:

y = wx + b

Where w (weight/slope) represents how much y changes for each unit increase in x; b (bias/intercept) represents the value of y when x = 0.
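To make w and b concrete, here is a tiny numeric sketch (the parameter values are made up for illustration):

```python
# Hypothetical parameters: slope w = 3, intercept b = 4
w, b = 3.0, 4.0

def predict(x):
    """y = wx + b"""
    return w * x + b

print(predict(0))   # 4.0  -> at x = 0, y equals the intercept b
print(predict(10))  # 34.0 -> each unit increase in x adds w = 3 to y
```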

Multiple Linear Regression

When there are multiple independent variables, the model extends to:

y = w₁x₁ + w₂x₂ + w₃x₃ + ... + wₙxₙ + b

In matrix form, this can be expressed concisely as:

Y = Xβ + ε

Where:

- Y = [y₁, y₂, ..., yₙ]ᵀ — the n×1 target vector
- X — the n×(p+1) design matrix, whose first column is all 1s; row i is [1, xᵢ₁, xᵢ₂, ..., xᵢₚ]
- β = [b, w₁, w₂, ..., wₚ]ᵀ — the parameter vector
- ε = [ε₁, ε₂, ..., εₙ]ᵀ — the error vector

Cost Function: Mean Squared Error (MSE)

We need a metric to measure "how well the line fits." The most common is Mean Squared Error (MSE):

MSE = (1/n) Σ(yᵢ - ŷᵢ)² = (1/n) Σ(yᵢ - (wxᵢ + b))²

Where yᵢ is the actual value, ŷᵢ is the predicted value, and n is the number of samples.

The smaller the MSE, the better the fit. The training goal is to find the parameters w and b that minimize MSE.
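As a quick sanity check, MSE takes only a couple of NumPy lines (the numbers below are toy values, not from a real dataset):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0])   # actual values yᵢ
y_pred = np.array([2.5, 5.0, 8.0])   # predicted values ŷᵢ

# errors: 0.5, 0.0, -1.0 -> squared: 0.25, 0.0, 1.0 -> mean = 1.25/3
mse = np.mean((y_true - y_pred) ** 2)
print(round(float(mse), 4))  # 0.4167
```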

Normal Equation — Closed-Form Solution

By taking the derivative of MSE and setting it to zero, we can directly obtain the closed-form solution:

β = (XᵀX)⁻¹ Xᵀy

The normal equation computes the optimal solution in one step but requires matrix inversion. When the number of features is large (e.g., 10,000+), the computational complexity of matrix inversion is O(p³), making gradient descent more efficient.

Gradient Descent — Iterative Optimization

Gradient descent iteratively adjusts parameters to minimize the cost function. The update rule at each iteration is:

w = w - α * (∂MSE/∂w)
b = b - α * (∂MSE/∂b)

Where:

- ∂MSE/∂w = (-2/n) Σ xᵢ(yᵢ - ŷᵢ)
- ∂MSE/∂b = (-2/n) Σ (yᵢ - ŷᵢ)
- α is the learning rate, controlling the step size of each update

Four Key Assumptions of Linear Regression

For linear regression to produce reliable results, four core assumptions must hold. Violating them doesn't mean the model is useless, but it affects the reliability of parameter estimates and accuracy of confidence intervals.

1. Linearity

A linear relationship exists between independent and dependent variables. This can be checked with scatter plots. If the relationship is nonlinear (e.g., quadratic), polynomial features can be added.

2. Independence

Observations are independent of each other; residuals show no autocorrelation. This is easily violated with time series data. The Durbin-Watson test can detect this (a value near 2 indicates no autocorrelation).
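The Durbin-Watson statistic is simple enough to compute directly; this sketch (statistic only, no significance tables) contrasts independent residuals with a random walk, which is strongly autocorrelated:

```python
import numpy as np

def durbin_watson(resid):
    """DW = Σ(eₜ - eₜ₋₁)² / Σ eₜ²; values near 2 indicate no lag-1 autocorrelation."""
    resid = np.asarray(resid)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

rng = np.random.default_rng(42)
independent = rng.normal(size=500)   # independent residuals
autocorr = np.cumsum(independent)    # random walk: strong positive autocorrelation

print(f"independent residuals:    DW = {durbin_watson(independent):.2f}")  # close to 2
print(f"autocorrelated residuals: DW = {durbin_watson(autocorr):.2f}")     # close to 0
```

The same statistic is available as `statsmodels.stats.stattools.durbin_watson` if statsmodels is installed.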

3. Homoscedasticity

The variance of residuals remains constant across all levels of predicted values. If residuals form a "funnel shape" (variance increases with predicted values), heteroscedasticity is present. This can be mitigated by log-transforming y or using weighted least squares.
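A crude numerical version of the "funnel" check, on simulated residuals whose noise scale grows with the prediction (all numbers are synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
y_pred = np.linspace(1, 10, 400)            # sorted predicted values
residuals = rng.normal(scale=0.2 * y_pred)  # noise scale grows with the prediction

# Compare residual spread in the lower vs upper half of predictions
low, high = residuals[:200], residuals[200:]
print(f"std(low half)  = {low.std():.2f}")
print(f"std(high half) = {high.std():.2f}")  # clearly larger: heteroscedastic
```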

4. Normality of Errors

Residuals should approximately follow a normal distribution (mean = 0). This can be verified with Q-Q plots or the Shapiro-Wilk test. With large samples (n > 30), the Central Limit Theorem reduces the impact of this assumption.
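With SciPy available, the Shapiro-Wilk test is one call; shown here on simulated Gaussian residuals:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
residuals = rng.normal(loc=0.0, scale=1.0, size=200)  # simulated residuals

stat, p_value = stats.shapiro(residuals)
print(f"Shapiro-Wilk: W = {stat:.4f}, p = {p_value:.4f}")
# Interpretation: p < 0.05 would suggest the residuals are not normal
```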

Python Implementation from Scratch

Implementing gradient descent linear regression with NumPy, without any ML library:

```python
import numpy as np


class LinearRegressionGD:
    """Linear Regression from scratch using Gradient Descent"""

    def __init__(self, learning_rate=0.01, n_iterations=1000):
        self.lr = learning_rate
        self.n_iter = n_iterations
        self.weights = None
        self.bias = None
        self.cost_history = []  # Track cost at each iteration

    def fit(self, X, y):
        n_samples, n_features = X.shape
        # Initialize parameters to zero
        self.weights = np.zeros(n_features)
        self.bias = 0

        for i in range(self.n_iter):
            # Forward pass: compute predictions
            y_pred = np.dot(X, self.weights) + self.bias

            # Compute cost (MSE)
            cost = np.mean((y - y_pred) ** 2)
            self.cost_history.append(cost)

            # Compute gradients
            dw = (-2 / n_samples) * np.dot(X.T, (y - y_pred))
            db = (-2 / n_samples) * np.sum(y - y_pred)

            # Update parameters
            self.weights -= self.lr * dw
            self.bias -= self.lr * db
        return self

    def predict(self, X):
        return np.dot(X, self.weights) + self.bias

    def score(self, X, y):
        """Compute R² coefficient of determination"""
        y_pred = self.predict(X)
        ss_res = np.sum((y - y_pred) ** 2)
        ss_tot = np.sum((y - np.mean(y)) ** 2)
        return 1 - (ss_res / ss_tot)


# ===== Usage Example =====
# Generate synthetic data
np.random.seed(42)
X = 2 * np.random.rand(100, 1)  # 100 samples, 1 feature
y = 4 + 3 * X.squeeze() + np.random.randn(100) * 0.5  # y = 3x + 4 + noise

# Feature scaling (speeds up convergence)
X_mean, X_std = X.mean(axis=0), X.std(axis=0)
X_scaled = (X - X_mean) / X_std

# Train the model
model = LinearRegressionGD(learning_rate=0.1, n_iterations=500)
model.fit(X_scaled, y)

print(f"R² = {model.score(X_scaled, y):.4f}")
print(f"Final MSE: {model.cost_history[-1]:.4f}")
```

Normal Equation Implementation

```python
def normal_equation(X, y):
    """Normal equation closed-form: beta = (X^T X)^(-1) X^T y"""
    # Add bias column (column of 1s)
    X_b = np.c_[np.ones((X.shape[0], 1)), X]
    # Compute optimal parameters
    beta = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
    return beta  # beta[0] = intercept, beta[1:] = weights


beta = normal_equation(X, y)
print(f"Intercept: {beta[0]:.4f}, Weight: {beta[1]:.4f}")
# Output close to 4.0 and 3.0 (our ground truth)
```

Sklearn Implementation

In practice, use scikit-learn's LinearRegression. Internally it solves the least-squares problem with an SVD-based solver rather than gradient descent or an explicit inversion of XᵀX, which gives better numerical stability:

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import numpy as np

# Prepare data
np.random.seed(42)
X = np.random.rand(200, 3)  # 200 samples, 3 features
y = 5 + 2*X[:, 0] - 3*X[:, 1] + 1.5*X[:, 2] + np.random.randn(200) * 0.3

# Train/test split (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Inspect learned parameters
print(f"Intercept (b): {model.intercept_:.4f}")
print(f"Coefficients (w): {model.coef_}")  # Should be close to [2.0, -3.0, 1.5]

# Predict on test set
y_pred = model.predict(X_test)

# Evaluate the model
print(f"\nR² = {r2_score(y_test, y_pred):.4f}")
print(f"MSE = {mean_squared_error(y_test, y_pred):.4f}")
print(f"MAE = {mean_absolute_error(y_test, y_pred):.4f}")
```

Sklearn LinearRegression Key Points

- Default fit_intercept=True, handles intercept automatically
- Feature scaling is NOT required (uses SVD internally, not gradient descent)
- No hyperparameters to tune, no regularization (use Ridge/Lasso for that)
- Suitable for moderate number of features (features < 10,000)

Regularization: Ridge / Lasso / ElasticNet

Regularization prevents overfitting by adding a penalty term to the cost function. It's especially important when there are many features or multicollinearity exists.

Ridge Regression (L2 Regularization)

Cost = MSE + α * Σ(wᵢ²)

α controls the regularization strength. The L2 penalty shrinks weights toward small values but not to zero.

```python
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Ridge requires feature scaling
ridge_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('ridge', Ridge(alpha=1.0))  # alpha = regularization strength
])
ridge_pipe.fit(X_train, y_train)
print(f"Ridge R²: {ridge_pipe.score(X_test, y_test):.4f}")

# Use cross-validation to find the optimal alpha
ridge_cv = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0, 100.0], cv=5)
ridge_cv.fit(X_train, y_train)
print(f"Best alpha: {ridge_cv.alpha_}")
```

Lasso Regression (L1 Regularization)

Cost = MSE + α * Σ|wᵢ|

The L1 penalty can shrink some weights to exactly zero, achieving feature selection.

```python
from sklearn.linear_model import Lasso, LassoCV

# Lasso Regression
lasso = Pipeline([
    ('scaler', StandardScaler()),
    ('lasso', Lasso(alpha=0.1))
])
lasso.fit(X_train, y_train)
print(f"Lasso R²: {lasso.score(X_test, y_test):.4f}")

# See which features were selected (non-zero weights)
lasso_model = lasso.named_steps['lasso']
for i, coef in enumerate(lasso_model.coef_):
    status = "kept" if abs(coef) > 1e-6 else "dropped"
    print(f"  Feature {i}: w={coef:.4f} ({status})")
```

ElasticNet (L1 + L2)

Cost = MSE + α * [ρ * Σ|wᵢ| + (1-ρ) * Σ(wᵢ²)]

ρ (l1_ratio) controls the mix of L1 and L2: ρ = 1 equals Lasso, ρ = 0 equals Ridge.

```python
from sklearn.linear_model import ElasticNet, ElasticNetCV

# Cross-validate both alpha and l1_ratio
enet_cv = ElasticNetCV(
    l1_ratio=[0.1, 0.5, 0.7, 0.9, 0.95, 1.0],
    alphas=[0.001, 0.01, 0.1, 1.0],
    cv=5
)
enet_cv.fit(X_train, y_train)
print(f"Best alpha: {enet_cv.alpha_:.4f}")
print(f"Best l1_ratio: {enet_cv.l1_ratio_:.2f}")
print(f"ElasticNet R²: {enet_cv.score(X_test, y_test):.4f}")
```

How to Choose a Regularization Method

| Method | When to Use | Feature Selection |
|---|---|---|
| Ridge (L2) | All features likely useful; just want to shrink weights | No |
| Lasso (L1) | Suspect many features are irrelevant; want automatic selection | Yes |
| ElasticNet | Many features with groups of highly correlated features | Yes (grouped) |

Feature Engineering

Feature engineering is the key step to improving linear regression performance. Good features matter more than complex models.

Polynomial Features

When data has a nonlinear relationship, create polynomial features to let the linear model fit curves:

```python
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

# Original features: [x₁, x₂]
# After degree=2: [1, x₁, x₂, x₁², x₁x₂, x₂²]
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X_train)
print(f"Original features: {X_train.shape[1]}")
print(f"Polynomial features: {X_poly.shape[1]}")

# Train linear regression with polynomial features
poly_model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    Ridge(alpha=1.0)  # Polynomial features overfit easily, add regularization
)
poly_model.fit(X_train, y_train)
print(f"Polynomial R²: {poly_model.score(X_test, y_test):.4f}")
```

Feature Scaling

```python
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Standardization: mean=0, std=1 (recommended for linear regression)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)

# Min-Max scaling: scale to [0, 1] range
minmax = MinMaxScaler()
X_minmax = minmax.fit_transform(X_train)

# Important: use the training scaler on the test set (don't re-fit)
X_test_scaled = scaler.transform(X_test)  # transform only, no fit!
```

One-Hot Encoding

Linear regression only handles numerical features. Categorical features must be encoded:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer

# Example data
df = pd.DataFrame({
    'area': [80, 120, 60, 200],
    'rooms': [2, 3, 1, 4],
    'city': ['Beijing', 'Shanghai', 'Beijing', 'Shenzhen'],
    'price': [300, 500, 200, 800]
})

# Automatically handle numerical + categorical features
preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), ['area', 'rooms']),
    ('cat', OneHotEncoder(drop='first'), ['city'])  # drop='first' avoids the dummy variable trap
])
X = preprocessor.fit_transform(df[['area', 'rooms', 'city']])
print(f"Encoded feature dimensions: {X.shape}")
```

Model Evaluation Metrics

| Metric | Formula | Interpretation |
|---|---|---|
| R² | 1 - SS_res / SS_tot | How much variance the model explains. 1 = perfect, 0 = same as predicting the mean, negative = worse than the mean |
| MSE | (1/n) Σ(yᵢ - ŷᵢ)² | Average squared error; penalizes large errors more heavily |
| RMSE | √MSE | Same unit as y, more intuitive. RMSE = 5 means the average error is about 5 units |
| MAE | (1/n) Σ\|yᵢ - ŷᵢ\| | Average absolute error; robust to outliers |
| Adjusted R² | 1 - (1-R²)(n-1)/(n-p-1) | R² adjusted for the number of features; prevents inflating R² by adding useless features |

```python
from sklearn.metrics import (
    r2_score, mean_squared_error, mean_absolute_error
)
import numpy as np


def evaluate_regression(y_true, y_pred, n_features=None):
    """Print all regression evaluation metrics"""
    r2 = r2_score(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_true, y_pred)

    print(f"R²   = {r2:.4f}")
    print(f"MSE  = {mse:.4f}")
    print(f"RMSE = {rmse:.4f}")
    print(f"MAE  = {mae:.4f}")

    if n_features is not None:
        n = len(y_true)
        adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
        print(f"Adjusted R² = {adj_r2:.4f}")
    return {"r2": r2, "mse": mse, "rmse": rmse, "mae": mae}


# Usage example
metrics = evaluate_regression(y_test, y_pred, n_features=X_train.shape[1])
```

How to Judge Model Quality?

- R² > 0.9: Excellent (common in physics/engineering)
- R² 0.7-0.9: Good (common in business/social science)
- R² 0.5-0.7: Fair, may need more features or nonlinear models
- R² < 0.5: Poor, linearity assumption may not hold
- Key: Compare metrics between training and test sets. Large gap indicates overfitting.
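A minimal sketch of that train/test comparison, on synthetic data that genuinely is linear (names and numbers here are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 2, size=(200, 1))
y = 4 + 3 * X[:, 0] + rng.normal(scale=0.5, size=200)  # linear signal + mild noise

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)

r2_train = r2_score(y_tr, model.predict(X_tr))
r2_test = r2_score(y_te, model.predict(X_te))
print(f"train R² = {r2_train:.3f}, test R² = {r2_test:.3f}")
# Similar train and test scores: no sign of overfitting
```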

Visualization

Scatter Plot + Regression Line

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

# Generate example data
np.random.seed(42)
X = 2 * np.random.rand(80, 1)
y = 4 + 3 * X.squeeze() + np.random.randn(80) * 0.8

# Train model
model = LinearRegression().fit(X, y)

# Create plot
fig, ax = plt.subplots(figsize=(8, 5))
ax.scatter(X, y, alpha=0.6, color='#6c63ff', label='Data points')

# Draw regression line
X_line = np.linspace(0, 2, 100).reshape(-1, 1)
y_line = model.predict(X_line)
ax.plot(X_line, y_line, color='#ff6b6b', linewidth=2,
        label=f'y = {model.coef_[0]:.2f}x + {model.intercept_:.2f}')

ax.set_xlabel('X')
ax.set_ylabel('y')
ax.set_title('Linear Regression Fit')
ax.legend()
plt.tight_layout()
plt.savefig('regression_fit.png', dpi=150)
plt.show()
```

Residual Plot (Check Model Assumptions)

```python
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
y_pred = model.predict(X)
residuals = y - y_pred

# Residuals vs Predicted
axes[0].scatter(y_pred, residuals, alpha=0.6, color='#6c63ff')
axes[0].axhline(y=0, color='#ff6b6b', linestyle='--')
axes[0].set_xlabel('Predicted')
axes[0].set_ylabel('Residuals')
axes[0].set_title('Residuals vs Predicted')

# Residual histogram (check normality)
axes[1].hist(residuals, bins=20, color='#6c63ff', alpha=0.7, edgecolor='white')
axes[1].set_xlabel('Residual')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Residual Distribution')

plt.tight_layout()
plt.savefig('residual_analysis.png', dpi=150)
plt.show()
# Ideal: residual plot shows no pattern, histogram is approximately normal
```

Learning Curve (Check Overfitting/Underfitting)

```python
from sklearn.model_selection import learning_curve

train_sizes, train_scores, val_scores = learning_curve(
    LinearRegression(), X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 10),
    scoring='r2'
)

fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(train_sizes, train_scores.mean(axis=1), 'o-', label='Training R²')
ax.plot(train_sizes, val_scores.mean(axis=1), 'o-', label='Validation R²')
ax.fill_between(train_sizes,
                train_scores.mean(axis=1) - train_scores.std(axis=1),
                train_scores.mean(axis=1) + train_scores.std(axis=1),
                alpha=0.1)
ax.fill_between(train_sizes,
                val_scores.mean(axis=1) - val_scores.std(axis=1),
                val_scores.mean(axis=1) + val_scores.std(axis=1),
                alpha=0.1)
ax.set_xlabel('Training Set Size')
ax.set_ylabel('R² Score')
ax.set_title('Learning Curve')
ax.legend()
plt.tight_layout()
plt.show()
```

Common Pitfalls & Solutions

| Pitfall | Symptoms | Solution |
|---|---|---|
| Multicollinearity | Large, unstable weights; signs may contradict intuition | Calculate VIF (Variance Inflation Factor); remove or combine features with VIF > 10; use Ridge regularization |
| Overfitting | High training R² but low test R²; features > samples | Reduce features; add regularization (Ridge/Lasso); get more training data; use cross-validation |
| Underfitting | Both training and test R² are low | Add polynomial or interaction features; consider nonlinear models; do more feature engineering |
| Outlier impact | Individual extreme values severely skew the regression line | Detect and handle outliers with IQR or Z-score; use RANSAC or Huber robust regression |
| Data leakage | Abnormally high test R² (> 0.99); performance drops in production | Ensure test data is unseen during training; fit the scaler on the training set only |
| Ignoring feature scaling | Gradient descent converges slowly or not at all; regularization behaves unexpectedly | Use StandardScaler; Ridge/Lasso/ElasticNet require standardization |

Multicollinearity Detection Code

```python
from statsmodels.stats.outliers_influence import variance_inflation_factor
import pandas as pd


def check_vif(X, feature_names):
    """Calculate VIF for each feature"""
    vif_data = pd.DataFrame()
    vif_data['Feature'] = feature_names
    vif_data['VIF'] = [
        variance_inflation_factor(X, i) for i in range(X.shape[1])
    ]
    vif_data = vif_data.sort_values('VIF', ascending=False)
    print(vif_data.to_string(index=False))
    # VIF > 10: severe multicollinearity, consider removing
    # VIF 5-10: moderate multicollinearity, worth investigating
    # VIF < 5:  acceptable
    return vif_data
```

Robust Regression for Outliers

```python
from sklearn.linear_model import RANSACRegressor, HuberRegressor

# RANSAC: Random Sample Consensus, ignores outliers
ransac = RANSACRegressor(random_state=42)
ransac.fit(X_train, y_train)
inlier_mask = ransac.inlier_mask_
print(f"Inliers: {inlier_mask.sum()}/{len(inlier_mask)}")

# Huber: squared loss for small errors, absolute loss for large errors
huber = HuberRegressor(epsilon=1.35)
huber.fit(X_train, y_train)
print(f"Huber R²: {huber.score(X_test, y_test):.4f}")
```

When to Use / Not Use Linear Regression

Use Linear Regression When

- Small to medium data
- Interpretability needed
- Linear relationship
- As a baseline model
- Fast training needed
- Good after feature engineering

Don't Use Linear Regression When

- Classification problem
- Strong nonlinear relationship
- Many outliers
- Complex feature interactions
- Image/text/sequence data
- Features >> samples

Comparison with Other Regression Algorithms

| Algorithm | Strengths | Weaknesses | Complexity |
|---|---|---|---|
| Linear Regression | Simple, fast, interpretable | Can only fit linear relationships | O(np²) |
| Decision Tree | Handles nonlinearity, no scaling needed | Overfits easily, unstable | O(np log n) |
| Random Forest | Robust, handles nonlinearity well | Slower, less interpretable | O(k · np log n) |
| XGBoost / LightGBM | High accuracy, handles complex patterns | Needs tuning, can overfit | O(k · np log n) |
| Neural Network | Fits any complex pattern; good with large data | Needs lots of data and compute | Depends on architecture |

Practical advice: Always start with linear regression as your baseline model on any regression task. If R² is already sufficient (>0.8), you may not need a more complex model. Complex models don't always beat simple ones — always start simple.

Frequently Asked Questions (FAQ)

Q1: What is the difference between linear regression and logistic regression?
Linear regression predicts continuous values (e.g., house prices) with unbounded output; logistic regression predicts class probabilities (e.g., buy or not), with output mapped to [0, 1] via the Sigmoid function. Despite the name "regression," logistic regression is fundamentally a classification algorithm. They also use different cost functions: MSE for linear regression, Cross-Entropy for logistic regression.
Q2: Why is my R² negative?
A negative R² means your model is worse than simply predicting the mean value. Common causes: (1) Model not properly trained (e.g., gradient descent didn't converge); (2) Large distribution shift between train and test sets; (3) Strong nonlinear relationship but you used a linear model; (4) Feature scaling issues. Debug: ensure training R² is positive, then investigate test set issues.
Q3: Does linear regression require feature scaling?
It depends. If using the normal equation or sklearn's LinearRegression (which uses SVD internally), scaling is NOT needed — results are identical. However, scaling IS required when: (1) Using gradient descent — different feature scales cause slow or no convergence; (2) Using Ridge/Lasso/ElasticNet — regularization penalizes unstandardized features unevenly; (3) Comparing feature importances via weights — weights are only comparable after standardization.
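A small demonstration of that equivalence, on synthetic features with wildly different scales (the data and coefficients are made up):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3)) * np.array([1.0, 100.0, 0.01])  # very different scales
y = X @ np.array([2.0, -0.03, 50.0]) + rng.normal(scale=0.1, size=100)

raw = LinearRegression().fit(X, y)
X_scaled = StandardScaler().fit_transform(X)
scaled = LinearRegression().fit(X_scaled, y)

# Fitted values agree to numerical precision; only the coefficients rescale
print(np.allclose(raw.predict(X), scaled.predict(X_scaled)))  # True
```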
Q4: Can linear regression be used for time series forecasting?
Yes, but beware of the independence assumption. Time series residuals typically exhibit autocorrelation, violating the independence assumption. Solutions: (1) Use lag features, taking previous N steps as input features; (2) Add time-related features (month, day of week, holidays, etc.); (3) Consider dedicated time series models like ARIMA, Prophet, or LSTM. Linear regression works as a quick baseline, but specialized models usually perform better.
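A minimal lag-feature sketch with pandas (the series values are made up); each row pairs the current value with its recent history, turning forecasting into an ordinary regression problem:

```python
import pandas as pd

# Hypothetical daily sales series
s = pd.Series([10, 12, 13, 15, 16, 18], name="sales")
df = pd.DataFrame({
    "lag_1": s.shift(1),  # value one step back
    "lag_2": s.shift(2),  # value two steps back
    "y": s,
}).dropna()               # the first rows lack a full history

print(df)
```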
Q5: How do I choose the alpha value for Ridge and Lasso?
Use cross-validation for automatic selection. Sklearn provides RidgeCV and LassoCV with built-in cross-validation to search for optimal alpha. Recommended approach: (1) Set a log-uniform list of candidates like [0.001, 0.01, 0.1, 1, 10, 100]; (2) Use 5-fold or 10-fold cross-validation; (3) Plot the CV score curve and pick the alpha at the elbow point. Alpha too small = almost no regularization (may overfit), alpha too large = too much regularization (may underfit).