What is Linear Regression
Linear regression is one of the most fundamental and important algorithms in statistics and machine learning. The core idea is simple: find the "best fit line" that describes the linear relationship between independent variables (X) and a dependent variable (y), enabling predictions of continuous numerical values.
Intuitive Understanding
Imagine you have a set of "house area vs. price" data points scattered on a 2D plane. Linear regression finds the straight line that minimizes the total squared vertical distance from the data points to the line. With that line, given any area you can predict the corresponding price; that is the essence of linear regression.
When to Use It
Linear regression is used to predict continuous values (not categorical labels). Typical applications include: house price prediction, sales forecasting, temperature prediction, stock trend analysis, and modeling the relationship between ad spend and revenue.
Although "linear" sounds simple, linear regression is the cornerstone of many advanced models. Understanding its mathematical principles greatly helps when learning logistic regression, neural networks, SVMs, and more. Mastering linear regression is the first key to mastering machine learning.
Mathematical Foundation
Simple Linear Regression
With one independent variable x and one dependent variable y, the model is:
y = wx + b
Where w (weight/slope) represents how much y changes for each unit increase in x; b (bias/intercept) represents the value of y when x = 0.
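For this one-variable case, w and b have a well-known closed form: w = cov(x, y) / var(x) and b = ȳ − w·x̄. A minimal NumPy sketch (the data points here are made up for illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1])  # roughly y = 2x + 1

# Least-squares estimates: w = cov(x, y) / var(x), b = mean(y) - w * mean(x)
w = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - w * x.mean()
print(w, b)  # 1.99 and about 1.05
```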
Multiple Linear Regression
When there are multiple independent variables, the model extends to:
y = w₁x₁ + w₂x₂ + w₃x₃ + ... + wₚxₚ + b
In matrix form, this can be expressed concisely as:
Y = Xβ + ε
Where:
Y = [y₁, y₂, ..., yₙ]ᵀ: the n×1 target vector
X: the n×(p+1) design matrix; row i is [1, xᵢ₁, xᵢ₂, ..., xᵢₚ], where the leading 1 absorbs the intercept
β = [b, w₁, w₂, ..., wₚ]ᵀ: the (p+1)×1 parameter vector
ε = [ε₁, ε₂, ..., εₙ]ᵀ: the n×1 error vector
Cost Function: Mean Squared Error (MSE)
We need a metric to measure "how well the line fits." The most common is Mean Squared Error (MSE):
MSE = (1/n) Σ(yᵢ - ŷᵢ)²
= (1/n) Σ(yᵢ - (wxᵢ + b))²
Where yᵢ is the actual value, ŷᵢ the predicted value, and n the number of samples.
The smaller the MSE, the better the fit. The training goal is to find the parameters w and b that minimize MSE.
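To make the cost concrete, here is a tiny hand-rolled comparison of two candidate lines on made-up data; the line closer to the data yields the smaller MSE:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.1, 6.0, 8.2])  # roughly y = 2x

def mse(w, b):
    """Mean squared error of the line y = wx + b on the data above."""
    y_pred = w * x + b
    return np.mean((y - y_pred) ** 2)

print(mse(2.0, 0.0))  # good fit: small MSE
print(mse(1.0, 0.0))  # poor fit: much larger MSE
```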
Normal Equation — Closed-Form Solution
By setting the gradient of the squared-error cost to zero, ∂/∂β ‖y − Xβ‖² = −2Xᵀ(y − Xβ) = 0, we obtain the normal equation XᵀXβ = Xᵀy, whose solution is the closed form:
β = (XᵀX)⁻¹ Xᵀy
The normal equation computes the optimal solution in one step but requires matrix inversion. When the number of features is large (e.g., 10,000+), the computational complexity of matrix inversion is O(p³), making gradient descent more efficient.
Gradient Descent — Iterative Optimization
Gradient descent iteratively adjusts parameters to minimize the cost function. The update rule at each iteration is:
w = w - α * (∂MSE/∂w)
b = b - α * (∂MSE/∂b)
Where:
∂MSE/∂w = (-2/n) Σ xᵢ(yᵢ - ŷᵢ)
∂MSE/∂b = (-2/n) Σ (yᵢ - ŷᵢ)
α = learning rate, controls the step size of each update
Four Key Assumptions of Linear Regression
For linear regression to produce reliable results, four core assumptions must hold. Violating them doesn't mean the model is useless, but it affects the reliability of parameter estimates and accuracy of confidence intervals.
1. Linearity
A linear relationship exists between independent and dependent variables. This can be checked with scatter plots. If the relationship is nonlinear (e.g., quadratic), polynomial features can be added.
2. Independence
Observations are independent of each other; residuals show no autocorrelation. This is easily violated with time series data. The Durbin-Watson test can detect this (a value near 2 indicates no autocorrelation).
3. Homoscedasticity
The variance of residuals remains constant across all levels of predicted values. If residuals form a "funnel shape" (variance increases with predicted values), heteroscedasticity is present. This can be mitigated by log-transforming y or using weighted least squares.
4. Normality of Errors
Residuals should approximately follow a normal distribution (mean = 0). This can be verified with Q-Q plots or the Shapiro-Wilk test. With large samples (n > 30), the Central Limit Theorem reduces the impact of this assumption.
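A quick sketch of two of these checks with NumPy and SciPy (the Durbin-Watson statistic is computed by hand here; statsmodels also provides it; the synthetic data is made up for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 2 * x + 1 + rng.normal(0, 1, 200)

# Fit a line and compute residuals
w, b = np.polyfit(x, y, 1)
residuals = y - (w * x + b)

# Independence: Durbin-Watson statistic (near 2 => no autocorrelation)
dw = np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)
print(f"Durbin-Watson: {dw:.2f}")

# Normality: Shapiro-Wilk (p > 0.05 => no evidence against normality)
stat, p = stats.shapiro(residuals)
print(f"Shapiro-Wilk p = {p:.3f}")
```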
Python Implementation from Scratch
Implementing gradient descent linear regression with NumPy, without any ML library:
import numpy as np

class LinearRegressionGD:
    """Linear Regression from scratch using Gradient Descent"""

    def __init__(self, learning_rate=0.01, n_iterations=1000):
        self.lr = learning_rate
        self.n_iter = n_iterations
        self.weights = None
        self.bias = None
        self.cost_history = []  # Track cost at each iteration

    def fit(self, X, y):
        n_samples, n_features = X.shape
        # Initialize parameters to zero
        self.weights = np.zeros(n_features)
        self.bias = 0
        for i in range(self.n_iter):
            # Forward pass: compute predictions
            y_pred = np.dot(X, self.weights) + self.bias
            # Compute cost (MSE)
            cost = np.mean((y - y_pred) ** 2)
            self.cost_history.append(cost)
            # Compute gradients
            dw = (-2 / n_samples) * np.dot(X.T, (y - y_pred))
            db = (-2 / n_samples) * np.sum(y - y_pred)
            # Update parameters
            self.weights -= self.lr * dw
            self.bias -= self.lr * db
        return self

    def predict(self, X):
        return np.dot(X, self.weights) + self.bias

    def score(self, X, y):
        """Compute R² coefficient of determination"""
        y_pred = self.predict(X)
        ss_res = np.sum((y - y_pred) ** 2)
        ss_tot = np.sum((y - np.mean(y)) ** 2)
        return 1 - (ss_res / ss_tot)
# ===== Usage Example =====
# Generate synthetic data
np.random.seed(42)
X = 2 * np.random.rand(100, 1) # 100 samples, 1 feature
y = 4 + 3 * X.squeeze() + np.random.randn(100) * 0.5 # y = 3x + 4 + noise
# Feature scaling (speeds up convergence)
X_mean, X_std = X.mean(axis=0), X.std(axis=0)
X_scaled = (X - X_mean) / X_std
# Train the model
model = LinearRegressionGD(learning_rate=0.1, n_iterations=500)
model.fit(X_scaled, y)
print(f"R² = {model.score(X_scaled, y):.4f}")
print(f"Final MSE: {model.cost_history[-1]:.4f}")
Normal Equation Implementation
def normal_equation(X, y):
    """Normal equation closed-form: beta = (X^T X)^(-1) X^T y"""
    # Add bias column (column of 1s)
    X_b = np.c_[np.ones((X.shape[0], 1)), X]
    # Compute optimal parameters
    beta = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
    return beta  # beta[0] = intercept, beta[1:] = weights
beta = normal_equation(X, y)
print(f"Intercept: {beta[0]:.4f}, Weight: {beta[1]:.4f}")
# Output close to 4.0 and 3.0 (our ground truth)
Sklearn Implementation
In practice, use scikit-learn's LinearRegression. Rather than forming the normal equation explicitly, it solves the least-squares problem with an SVD-based solver, which is more numerically stable:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import numpy as np
# Prepare data
np.random.seed(42)
X = np.random.rand(200, 3) # 200 samples, 3 features
y = 5 + 2*X[:, 0] - 3*X[:, 1] + 1.5*X[:, 2] + np.random.randn(200) * 0.3
# Train/test split (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Train the model
model = LinearRegression()
model.fit(X_train, y_train)
# Inspect learned parameters
print(f"Intercept (b): {model.intercept_:.4f}")
print(f"Coefficients (w): {model.coef_}")
# Should be close to [2.0, -3.0, 1.5]
# Predict on test set
y_pred = model.predict(X_test)
# Evaluate the model
print(f"\nR² = {r2_score(y_test, y_pred):.4f}")
print(f"MSE = {mean_squared_error(y_test, y_pred):.4f}")
print(f"MAE = {mean_absolute_error(y_test, y_pred):.4f}")
Sklearn LinearRegression Key Points
- Default fit_intercept=True, handles intercept automatically
- Feature scaling is NOT required (uses SVD internally, not gradient descent)
- No hyperparameters to tune, no regularization (use Ridge/Lasso for that)
- Suitable for moderate number of features (features < 10,000)
Regularization: Ridge / Lasso / ElasticNet
Regularization prevents overfitting by adding a penalty term to the cost function. It's especially important when there are many features or multicollinearity exists.
Ridge Regression (L2 Regularization)
Cost = MSE + α * Σ(wᵢ²)
α controls regularization strength. L2 penalty shrinks weights toward small values but not to zero.
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
# Ridge requires feature scaling
ridge_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('ridge', Ridge(alpha=1.0))  # alpha = regularization strength
])
ridge_pipe.fit(X_train, y_train)
print(f"Ridge R²: {ridge_pipe.score(X_test, y_test):.4f}")
# Use cross-validation to find optimal alpha
from sklearn.linear_model import RidgeCV
ridge_cv = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0, 100.0], cv=5)
ridge_cv.fit(X_train, y_train)
print(f"Best alpha: {ridge_cv.alpha_}")
Lasso Regression (L1 Regularization)
Cost = MSE + α * Σ|wᵢ|
L1 penalty can shrink some weights to exactly zero, achieving feature selection.
from sklearn.linear_model import Lasso, LassoCV
# Lasso Regression
lasso = Pipeline([
    ('scaler', StandardScaler()),
    ('lasso', Lasso(alpha=0.1))
])
lasso.fit(X_train, y_train)
print(f"Lasso R²: {lasso.score(X_test, y_test):.4f}")
# See which features were selected (non-zero weights)
lasso_model = lasso.named_steps['lasso']
for i, coef in enumerate(lasso_model.coef_):
    status = "kept" if abs(coef) > 1e-6 else "dropped"
    print(f"  Feature {i}: w={coef:.4f} ({status})")
ElasticNet (L1 + L2)
Cost = MSE + α * [ρ * Σ|wᵢ| + (1-ρ) * Σ(wᵢ²)]
ρ (l1_ratio) controls the mix of L1 and L2. ρ=1 equals Lasso, ρ=0 equals Ridge.
from sklearn.linear_model import ElasticNet, ElasticNetCV
# Cross-validate both alpha and l1_ratio
enet_cv = ElasticNetCV(
    l1_ratio=[0.1, 0.5, 0.7, 0.9, 0.95, 1.0],
    alphas=[0.001, 0.01, 0.1, 1.0],
    cv=5
)
enet_cv.fit(X_train, y_train)
print(f"Best alpha: {enet_cv.alpha_:.4f}")
print(f"Best l1_ratio: {enet_cv.l1_ratio_:.2f}")
print(f"ElasticNet R²: {enet_cv.score(X_test, y_test):.4f}")
How to Choose a Regularization Method
| Method | When to Use | Feature Selection |
|--------|-------------|-------------------|
| Ridge (L2) | All features likely useful, just want to shrink weights | No |
| Lasso (L1) | Suspect many features are irrelevant, want automatic selection | Yes |
| ElasticNet | Many features with groups of highly correlated features | Yes (grouped) |
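A small synthetic sketch of the difference (the data and coefficients are made up, and the features are already on a common scale, so the usual scaling step is skipped): only the first two features carry signal, and Lasso zeroes out the rest while Ridge merely shrinks them:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
# Only the first two features matter; the other three are pure noise
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.1, 200)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("Ridge:", np.round(ridge.coef_, 3))  # all non-zero, shrunk
print("Lasso:", np.round(lasso.coef_, 3))  # irrelevant features driven to exactly 0
```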
Feature Engineering
Feature engineering is the key step to improving linear regression performance. Good features matter more than complex models.
Polynomial Features
When data has a nonlinear relationship, create polynomial features to let the linear model fit curves:
from sklearn.preprocessing import PolynomialFeatures
# Original features: [x₁, x₂]
# After degree=2: [1, x₁, x₂, x₁², x₁x₂, x₂²]
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X_train)
print(f"Original features: {X_train.shape[1]}")
print(f"Polynomial features: {X_poly.shape[1]}")
# Train linear regression with polynomial features
from sklearn.pipeline import make_pipeline
poly_model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    Ridge(alpha=1.0)  # Polynomial features overfit easily, add regularization
)
poly_model.fit(X_train, y_train)
print(f"Polynomial R²: {poly_model.score(X_test, y_test):.4f}")
Feature Scaling
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# Standardization: mean=0, std=1 (recommended for linear regression)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)
# Min-Max scaling: scale to [0, 1] range
minmax = MinMaxScaler()
X_minmax = minmax.fit_transform(X_train)
# Important: use the training scaler on test set (don't re-fit)
X_test_scaled = scaler.transform(X_test) # transform only, no fit!
One-Hot Encoding
Linear regression only handles numerical features. Categorical features must be encoded:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
# Example data
df = pd.DataFrame({
    'area': [80, 120, 60, 200],
    'rooms': [2, 3, 1, 4],
    'city': ['Beijing', 'Shanghai', 'Beijing', 'Shenzhen'],
    'price': [300, 500, 200, 800]
})

# Automatically handle numerical + categorical features
preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), ['area', 'rooms']),
    ('cat', OneHotEncoder(drop='first'), ['city'])
    # drop='first' avoids the dummy variable trap
])
X = preprocessor.fit_transform(df[['area', 'rooms', 'city']])
print(f"Encoded feature dimensions: {X.shape}")
Model Evaluation Metrics
| Metric | Formula | Interpretation |
|--------|---------|----------------|
| R² | 1 - SS_res / SS_tot | How much variance the model explains: 1 = perfect, 0 = same as predicting the mean, negative = worse than the mean |
| MSE | (1/n) Σ(yᵢ - ŷᵢ)² | Average squared error; penalizes large errors more heavily |
| RMSE | √MSE | Same unit as y, more intuitive: RMSE = 5 means the typical error is about 5 units |
| MAE | (1/n) Σ\|yᵢ - ŷᵢ\| | Average absolute error; robust to outliers |
| Adjusted R² | 1 - (1-R²)(n-1)/(n-p-1) | R² adjusted for the number of features; prevents inflating R² by adding useless features |
from sklearn.metrics import (
    r2_score, mean_squared_error, mean_absolute_error
)
import numpy as np

def evaluate_regression(y_true, y_pred, n_features=None):
    """Print all regression evaluation metrics"""
    r2 = r2_score(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_true, y_pred)
    print(f"R² = {r2:.4f}")
    print(f"MSE = {mse:.4f}")
    print(f"RMSE = {rmse:.4f}")
    print(f"MAE = {mae:.4f}")
    if n_features is not None:
        n = len(y_true)
        adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
        print(f"Adjusted R² = {adj_r2:.4f}")
    return {"r2": r2, "mse": mse, "rmse": rmse, "mae": mae}
# Usage example
metrics = evaluate_regression(y_test, y_pred, n_features=X_train.shape[1])
How to Judge Model Quality?
- R² > 0.9: Excellent (common in physics/engineering)
- R² 0.7-0.9: Good (common in business/social science)
- R² 0.5-0.7: Fair, may need more features or nonlinear models
- R² < 0.5: Poor, linearity assumption may not hold
- Key: Compare metrics between training and test sets. Large gap indicates overfitting.
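The train-vs-test comparison can be sketched in a few lines; this example deliberately overfits a small made-up dataset with a needlessly high-degree polynomial, so the gap between the two scores is large:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (40, 1))
y = 0.5 * X[:, 0] ** 2 + rng.normal(0, 1, 40)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# Degree 15 on 20 training points: the model memorizes noise
model = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
model.fit(X_tr, y_tr)
print(f"Train R²: {model.score(X_tr, y_tr):.3f}")  # high
print(f"Test  R²: {model.score(X_te, y_te):.3f}")  # much lower => overfitting
```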
Visualization
Scatter Plot + Regression Line
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
# Generate example data
np.random.seed(42)
X = 2 * np.random.rand(80, 1)
y = 4 + 3 * X.squeeze() + np.random.randn(80) * 0.8
# Train model
model = LinearRegression().fit(X, y)
# Create plot
fig, ax = plt.subplots(figsize=(8, 5))
ax.scatter(X, y, alpha=0.6, color='#6c63ff', label='Data points')
# Draw regression line
X_line = np.linspace(0, 2, 100).reshape(-1, 1)
y_line = model.predict(X_line)
ax.plot(X_line, y_line, color='#ff6b6b', linewidth=2,
        label=f'y = {model.coef_[0]:.2f}x + {model.intercept_:.2f}')
ax.set_xlabel('X')
ax.set_ylabel('y')
ax.set_title('Linear Regression Fit')
ax.legend()
plt.tight_layout()
plt.savefig('regression_fit.png', dpi=150)
plt.show()
Residual Plot (Check Model Assumptions)
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
y_pred = model.predict(X)
residuals = y - y_pred
# Residuals vs Predicted
axes[0].scatter(y_pred, residuals, alpha=0.6, color='#6c63ff')
axes[0].axhline(y=0, color='#ff6b6b', linestyle='--')
axes[0].set_xlabel('Predicted')
axes[0].set_ylabel('Residuals')
axes[0].set_title('Residuals vs Predicted')
# Residual histogram (check normality)
axes[1].hist(residuals, bins=20, color='#6c63ff', alpha=0.7, edgecolor='white')
axes[1].set_xlabel('Residual')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Residual Distribution')
plt.tight_layout()
plt.savefig('residual_analysis.png', dpi=150)
plt.show()
# Ideal: residual plot shows no pattern, histogram is approximately normal
Learning Curve (Check Overfitting/Underfitting)
from sklearn.model_selection import learning_curve
train_sizes, train_scores, val_scores = learning_curve(
    LinearRegression(), X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 10),
    scoring='r2'
)
fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(train_sizes, train_scores.mean(axis=1), 'o-', label='Training R²')
ax.plot(train_sizes, val_scores.mean(axis=1), 'o-', label='Validation R²')
ax.fill_between(train_sizes,
                train_scores.mean(axis=1) - train_scores.std(axis=1),
                train_scores.mean(axis=1) + train_scores.std(axis=1), alpha=0.1)
ax.fill_between(train_sizes,
                val_scores.mean(axis=1) - val_scores.std(axis=1),
                val_scores.mean(axis=1) + val_scores.std(axis=1), alpha=0.1)
ax.set_xlabel('Training Set Size')
ax.set_ylabel('R² Score')
ax.set_title('Learning Curve')
ax.legend()
plt.tight_layout()
plt.show()
Common Pitfalls & Solutions
| Pitfall | Symptoms | Solution |
|---------|----------|----------|
| Multicollinearity | Large, unstable weights; signs may contradict intuition | Calculate VIF (Variance Inflation Factor); remove/combine features with VIF > 10; use Ridge regularization |
| Overfitting | High training R² but low test R²; features > samples | Reduce features; add regularization (Ridge/Lasso); get more training data; use cross-validation |
| Underfitting | Both training and test R² are low | Add polynomial or interaction features; consider nonlinear models; do more feature engineering |
| Outlier Impact | Individual extreme values severely skew the regression line | Detect and handle outliers with IQR or Z-score; use RANSAC or Huber robust regression |
| Data Leakage | Abnormally high test R² (>0.99), performance drops in production | Ensure test data is unseen during training; fit scaler on training set only |
| Ignoring Feature Scaling | Gradient descent doesn't converge or converges slowly; regularization behaves unexpectedly | Use StandardScaler; Ridge/Lasso/ElasticNet require standardization |
Multicollinearity Detection Code
from statsmodels.stats.outliers_influence import variance_inflation_factor
import pandas as pd
def check_vif(X, feature_names):
    """Calculate VIF for each feature"""
    vif_data = pd.DataFrame()
    vif_data['Feature'] = feature_names
    vif_data['VIF'] = [
        variance_inflation_factor(X, i) for i in range(X.shape[1])
    ]
    vif_data = vif_data.sort_values('VIF', ascending=False)
    print(vif_data.to_string(index=False))
    # VIF > 10: severe multicollinearity, consider removing
    # VIF 5-10: moderate multicollinearity, worth investigating
    # VIF < 5: acceptable
    return vif_data
Robust Regression for Outliers
from sklearn.linear_model import RANSACRegressor, HuberRegressor
# RANSAC: Random Sample Consensus, ignores outliers
ransac = RANSACRegressor(random_state=42)
ransac.fit(X_train, y_train)
inlier_mask = ransac.inlier_mask_
print(f"Inliers: {inlier_mask.sum()}/{len(inlier_mask)}")
# Huber: Squared loss for small errors, absolute loss for large errors
huber = HuberRegressor(epsilon=1.35)
huber.fit(X_train, y_train)
print(f"Huber R²: {huber.score(X_test, y_test):.4f}")
When to Use / Not Use Linear Regression
Use Linear Regression When
- Small to medium data
- Interpretability needed
- Linear relationship
- As a baseline model
- Fast training needed
- Good after feature engineering
Don't Use Linear Regression When
- Classification problem
- Strong nonlinear relationship
- Many outliers
- Complex feature interactions
- Image/text/sequence data
- Features >> samples
Comparison with Other Regression Algorithms
| Algorithm | Strengths | Weaknesses | Complexity |
|-----------|-----------|------------|------------|
| Linear Regression | Simple, fast, interpretable | Can only fit linear relationships | O(np²) |
| Decision Tree | Handles nonlinearity, no scaling needed | Overfits easily, unstable | O(np log n) |
| Random Forest | Robust, handles nonlinearity well | Slower, less interpretable | O(k · np log n) |
| XGBoost / LightGBM | High accuracy, handles complex patterns | Needs tuning, can overfit | O(k · np log n) |
| Neural Network | Any complex pattern, good with large data | Needs lots of data and compute | Depends on architecture |
Practical advice: Always start with linear regression as your baseline model on any regression task. If R² is already sufficient (>0.8), you may not need a more complex model. Complex models don't always beat simple ones — always start simple.
Frequently Asked Questions (FAQ)
Q1: What is the difference between linear regression and logistic regression?
Linear regression predicts continuous values (e.g., house prices) with unbounded output; logistic regression predicts class probabilities (e.g., buy or not), with output mapped to [0, 1] via the Sigmoid function. Despite the name "regression," logistic regression is fundamentally a classification algorithm. They also use different cost functions: MSE for linear regression, Cross-Entropy for logistic regression.
Q2: Why is my R² negative?
A negative R² means your model is worse than simply predicting the mean value. Common causes: (1) Model not properly trained (e.g., gradient descent didn't converge); (2) Large distribution shift between train and test sets; (3) Strong nonlinear relationship but you used a linear model; (4) Feature scaling issues. Debug: ensure training R² is positive, then investigate test set issues.
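A two-line demonstration that R² really can go negative; these predictions are farther from the truth than the constant mean prediction (2.0) would be:

```python
from sklearn.metrics import r2_score

y_true = [1.0, 2.0, 3.0]
# Worse than just predicting the mean of y_true (2.0)
y_pred = [3.0, 3.0, 3.0]

r2 = r2_score(y_true, y_pred)
print(r2)  # -1.5
```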
Q3: Does linear regression require feature scaling?
It depends. If using the normal equation or sklearn's LinearRegression (which uses SVD internally), scaling is NOT needed — results are identical. However, scaling IS required when: (1) Using gradient descent — different feature scales cause slow or no convergence; (2) Using Ridge/Lasso/ElasticNet — regularization penalizes unstandardized features unevenly; (3) Comparing feature importances via weights — weights are only comparable after standardization.
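Point (1) can be shown with a hand-rolled gradient-descent loop on two made-up features of very different scales (the learning rates here were picked to illustrate the effect): unscaled, the largest stable learning rate is tiny and the cost barely moves; after standardization, the same number of steps drives the cost essentially to zero:

```python
import numpy as np

rng = np.random.default_rng(1)
# Two features on wildly different scales
X = np.column_stack([rng.uniform(0, 1, 100), rng.uniform(0, 1000, 100)])
y = X @ np.array([2.0, 0.003]) + 5

def gd_final_mse(X, y, lr, n_iter=200):
    """Plain batch gradient descent; returns the final MSE."""
    w = np.zeros(X.shape[1])
    b = 0.0
    n = len(y)
    for _ in range(n_iter):
        err = y - (X @ w + b)
        w += lr * (2 / n) * (X.T @ err)
        b += lr * (2 / n) * err.sum()
    return np.mean((y - (X @ w + b)) ** 2)

# Unscaled: learning rates much above ~1e-6 diverge, and 1e-6 barely moves
raw = gd_final_mse(X, y, lr=1e-6)
# Standardized: lr=0.1 is stable and converges within a few dozen steps
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
scaled = gd_final_mse(Xs, y, lr=0.1)
print(f"MSE unscaled: {raw:.3f}, scaled: {scaled:.6f}")
```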
Q4: Can linear regression be used for time series forecasting?
Yes, but beware of the independence assumption. Time series residuals typically exhibit autocorrelation, violating the independence assumption. Solutions: (1) Use lag features, taking previous N steps as input features; (2) Add time-related features (month, day of week, holidays, etc.); (3) Consider dedicated time series models like ARIMA, Prophet, or LSTM. Linear regression works as a quick baseline, but specialized models usually perform better.
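A minimal sketch of the lag-feature idea from point (1), using a hypothetical daily sales series (numbers made up): each row's inputs are the previous days' values, and the rows left incomplete by shifting are dropped before fitting any regressor:

```python
import pandas as pd

# Hypothetical daily sales series
df = pd.DataFrame({"sales": [10, 12, 13, 15, 14, 16, 18]})

# Lag features: the previous 1 and 2 days become input columns
df["lag1"] = df["sales"].shift(1)
df["lag2"] = df["sales"].shift(2)

# Drop rows with NaN lags, then lag1/lag2 can feed a linear regression
df = df.dropna()
print(df)
```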
Q5: How do I choose the alpha value for Ridge and Lasso?
Use cross-validation for automatic selection. Sklearn provides RidgeCV and LassoCV with built-in cross-validation to search for optimal alpha. Recommended approach: (1) Set a log-uniform list of candidates like [0.001, 0.01, 0.1, 1, 10, 100]; (2) Use 5-fold or 10-fold cross-validation; (3) Plot the CV score curve and pick the alpha at the elbow point. Alpha too small = almost no regularization (may overfit), alpha too large = too much regularization (may underfit).