What is Linear Regression
Linear regression is one of the most fundamental and important algorithms in statistics and machine learning. The core idea is simple: find the "best fit line" that describes the linear relationship between independent variables (X) and a dependent variable (y), enabling predictions of continuous numerical values.
Intuitive Understanding
Imagine you have a set of "house area vs. price" data points scattered on a 2D plane. Linear regression finds the straight line that minimizes the total squared vertical distance from the data points to the line. With that line, given any area you can predict the corresponding price; that is the essence of linear regression.
When to Use It
Linear regression is used to predict continuous values (not categorical labels). Typical applications include: house price prediction, sales forecasting, temperature prediction, stock trend analysis, and modeling the relationship between ad spend and revenue.
Although "linear" sounds simple, linear regression is the cornerstone of many advanced models. Understanding its mathematical principles greatly helps when learning logistic regression, neural networks, SVMs, and more. Mastering linear regression is the first key to mastering machine learning.
Mathematical Foundation
Simple Linear Regression
With one independent variable x and one dependent variable y, the model is:
y = wx + b
Where w (weight/slope) represents how much y changes for each unit increase in x; b (bias/intercept) represents the value of y when x = 0.
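For this one-variable case, w and b have a well-known closed form: w = cov(x, y) / var(x) and b = ȳ − w·x̄. A minimal NumPy sketch (the data points here are made up for illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1])  # roughly y = 2x + 1

# Least-squares estimates: w = cov(x, y) / var(x), b = mean(y) - w * mean(x)
w = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - w * x.mean()
print(w, b)  # 1.99 and about 1.05
```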
Multiple Linear Regression
When there are multiple independent variables, the model extends to:
y = w₁x₁ + w₂x₂ + w₃x₃ + ... + wₚxₚ + b
In matrix form, this can be expressed concisely as:
Y = Xβ + ε
Where:
Y = [y₁, y₂, ..., yₙ]ᵀ: the n×1 target vector
X: the n×(p+1) design matrix; row i is [1, xᵢ₁, xᵢ₂, ..., xᵢₚ], where the leading 1 absorbs the intercept
β = [b, w₁, w₂, ..., wₚ]ᵀ: the (p+1)×1 parameter vector
ε = [ε₁, ε₂, ..., εₙ]ᵀ: the n×1 error vector
Cost Function: Mean Squared Error (MSE)
We need a metric to measure "how well the line fits." The most common is Mean Squared Error (MSE):
MSE = (1/n) Σ(yᵢ - ŷᵢ)²
= (1/n) Σ(yᵢ - (wxᵢ + b))²
Where yᵢ is the actual value, ŷᵢ the predicted value, and n the number of samples.
The smaller the MSE, the better the fit. The training goal is to find the parameters w and b that minimize MSE.
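To make the cost concrete, here is a tiny hand-rolled comparison of two candidate lines on made-up data; the line closer to the data yields the smaller MSE:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.1, 6.0, 8.2])  # roughly y = 2x

def mse(w, b):
    """Mean squared error of the line y = wx + b on the data above."""
    y_pred = w * x + b
    return np.mean((y - y_pred) ** 2)

print(mse(2.0, 0.0))  # good fit: small MSE
print(mse(1.0, 0.0))  # poor fit: much larger MSE
```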
Normal Equation — Closed-Form Solution
By setting the gradient of the squared-error cost to zero, ∂/∂β ‖y − Xβ‖² = −2Xᵀ(y − Xβ) = 0, we obtain the normal equation XᵀXβ = Xᵀy, whose solution is the closed form:
β = (XᵀX)⁻¹ Xᵀy
The normal equation computes the optimal solution in one step but requires matrix inversion. When the number of features is large (e.g., 10,000+), the computational complexity of matrix inversion is O(p³), making gradient descent more efficient.
Gradient Descent — Iterative Optimization
Gradient descent iteratively adjusts parameters to minimize the cost function. The update rule at each iteration is:
w = w - α * (∂MSE/∂w)
b = b - α * (∂MSE/∂b)
Where:
∂MSE/∂w = (-2/n) Σ xᵢ(yᵢ - ŷᵢ)
∂MSE/∂b = (-2/n) Σ (yᵢ - ŷᵢ)
α = learning rate, controls the step size of each update
Four Key Assumptions of Linear Regression
For linear regression to produce reliable results, four core assumptions must hold. Violating them doesn't mean the model is useless, but it affects the reliability of parameter estimates and accuracy of confidence intervals.
1. Linearity
A linear relationship exists between independent and dependent variables. This can be checked with scatter plots. If the relationship is nonlinear (e.g., quadratic), polynomial features can be added.
2. Independence
Observations are independent of each other; residuals show no autocorrelation. This is easily violated with time series data. The Durbin-Watson test can detect this (a value near 2 indicates no autocorrelation).
3. Homoscedasticity
The variance of residuals remains constant across all levels of predicted values. If residuals form a "funnel shape" (variance increases with predicted values), heteroscedasticity is present. This can be mitigated by log-transforming y or using weighted least squares.
4. Normality of Errors
Residuals should approximately follow a normal distribution (mean = 0). This can be verified with Q-Q plots or the Shapiro-Wilk test. With large samples (n > 30), the Central Limit Theorem reduces the impact of this assumption.
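A quick sketch of two of these checks with NumPy and SciPy (the Durbin-Watson statistic is computed by hand here; statsmodels also provides it; the synthetic data is made up for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 2 * x + 1 + rng.normal(0, 1, 200)

# Fit a line and compute residuals
w, b = np.polyfit(x, y, 1)
residuals = y - (w * x + b)

# Independence: Durbin-Watson statistic (near 2 => no autocorrelation)
dw = np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)
print(f"Durbin-Watson: {dw:.2f}")

# Normality: Shapiro-Wilk (p > 0.05 => no evidence against normality)
stat, p = stats.shapiro(residuals)
print(f"Shapiro-Wilk p = {p:.3f}")
```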
Python Implementation from Scratch
Implementing gradient descent linear regression with NumPy, without any ML library:
import numpy as np

class LinearRegressionGD:
    """Linear Regression from scratch using Gradient Descent"""

    def __init__(self, learning_rate=0.01, n_iterations=1000):
        self.lr = learning_rate
        self.n_iter = n_iterations
        self.weights = None
        self.bias = None
        self.cost_history = []  # Track cost at each iteration

    def fit(self, X, y):
        n_samples, n_features = X.shape
        # Initialize parameters to zero
        self.weights = np.zeros(n_features)
        self.bias = 0
        for i in range(self.n_iter):
            # Forward pass: compute predictions
            y_pred = np.dot(X, self.weights) + self.bias
            # Compute cost (MSE)
            cost = np.mean((y - y_pred) ** 2)
            self.cost_history.append(cost)
            # Compute gradients
            dw = (-2 / n_samples) * np.dot(X.T, (y - y_pred))
            db = (-2 / n_samples) * np.sum(y - y_pred)
            # Update parameters
            self.weights -= self.lr * dw
            self.bias -= self.lr * db
        return self

    def predict(self, X):
        return np.dot(X, self.weights) + self.bias

    def score(self, X, y):
        """Compute R² coefficient of determination"""
        y_pred = self.predict(X)
        ss_res = np.sum((y - y_pred) ** 2)
        ss_tot = np.sum((y - np.mean(y)) ** 2)
        return 1 - (ss_res / ss_tot)
# ===== Usage Example =====
# Generate synthetic data
np.random.seed(42)
X = 2 * np.random.rand(100, 1) # 100 samples, 1 feature
y = 4 + 3 * X.squeeze() + np.random.randn(100) * 0.5 # y = 3x + 4 + noise
# Feature scaling (speeds up convergence)
X_mean, X_std = X.mean(axis=0), X.std(axis=0)
X_scaled = (X - X_mean) / X_std
# Train the model
model = LinearRegressionGD(learning_rate=0.1, n_iterations=500)
model.fit(X_scaled, y)
print(f"R² = {model.score(X_scaled, y):.4f}")
print(f"Final MSE: {model.cost_history[-1]:.4f}")
Normal Equation Implementation
def normal_equation(X, y):
    """Normal equation closed-form: beta = (X^T X)^(-1) X^T y"""
    # Add bias column (column of 1s)
    X_b = np.c_[np.ones((X.shape[0], 1)), X]
    # Compute optimal parameters
    beta = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
    return beta  # beta[0] = intercept, beta[1:] = weights
beta = normal_equation(X, y)
print(f"Intercept: {beta[0]:.4f}, Weight: {beta[1]:.4f}")
# Output close to 4.0 and 3.0 (our ground truth)
Sklearn Implementation
In practice, use scikit-learn's LinearRegression. Rather than forming the normal equation explicitly, it solves the least-squares problem with an SVD-based solver, which is more numerically stable:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import numpy as np
# Prepare data
np.random.seed(42)
X = np.random.rand(200, 3) # 200 samples, 3 features
y = 5 + 2*X[:, 0] - 3*X[:, 1] + 1.5*X[:, 2] + np.random.randn(200) * 0.3
# Train/test split (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Train the model
model = LinearRegression()
model.fit(X_train, y_train)
# Inspect learned parameters
print(f"Intercept (b): {model.intercept_:.4f}")
print(f"Coefficients (w): {model.coef_}")
# Should be close to [2.0, -3.0, 1.5]
# Predict on test set
y_pred = model.predict(X_test)
# Evaluate the model
print(f"\nR² = {r2_score(y_test, y_pred):.4f}")
print(f"MSE = {mean_squared_error(y_test, y_pred):.4f}")
print(f"MAE = {mean_absolute_error(y_test, y_pred):.4f}")
Sklearn LinearRegression Key Points
- Default fit_intercept=True, handles intercept automatically
- Feature scaling is NOT required (uses SVD internally, not gradient descent)
- No hyperparameters to tune, no regularization (use Ridge/Lasso for that)
- Suitable for moderate number of features (features < 10,000)
Regularization: Ridge / Lasso / ElasticNet
Regularization prevents overfitting by adding a penalty term to the cost function. It's especially important when there are many features or multicollinearity exists.
Ridge Regression (L2 Regularization)
Cost = MSE + α * Σ(wᵢ²)
α controls regularization strength. L2 penalty shrinks weights toward small values but not to zero.
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
# Ridge requires feature scaling
ridge_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('ridge', Ridge(alpha=1.0))  # alpha = regularization strength
])
ridge_pipe.fit(X_train, y_train)
print(f"Ridge R²: {ridge_pipe.score(X_test, y_test):.4f}")
# Use cross-validation to find optimal alpha
from sklearn.linear_model import RidgeCV
ridge_cv = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0, 100.0], cv=5)
ridge_cv.fit(X_train, y_train)
print(f"Best alpha: {ridge_cv.alpha_}")
Lasso Regression (L1 Regularization)
Cost = MSE + α * Σ|wᵢ|
L1 penalty can shrink some weights to exactly zero, achieving feature selection.
from sklearn.linear_model import Lasso, LassoCV
# Lasso Regression
lasso = Pipeline([
    ('scaler', StandardScaler()),
    ('lasso', Lasso(alpha=0.1))
])
lasso.fit(X_train, y_train)
print(f"Lasso R²: {lasso.score(X_test, y_test):.4f}")
# See which features were selected (non-zero weights)
lasso_model = lasso.named_steps['lasso']
for i, coef in enumerate(lasso_model.coef_):
    status = "kept" if abs(coef) > 1e-6 else "dropped"
    print(f"  Feature {i}: w={coef:.4f} ({status})")
ElasticNet (L1 + L2)
Cost = MSE + α * [ρ * Σ|wᵢ| + (1-ρ) * Σ(wᵢ²)]
ρ (l1_ratio) controls the mix of L1 and L2. ρ=1 equals Lasso, ρ=0 equals Ridge.
from sklearn.linear_model import ElasticNet, ElasticNetCV
# Cross-validate both alpha and l1_ratio
enet_cv = ElasticNetCV(
    l1_ratio=[0.1, 0.5, 0.7, 0.9, 0.95, 1.0],
    alphas=[0.001, 0.01, 0.1, 1.0],
    cv=5
)
enet_cv.fit(X_train, y_train)
print(f"Best alpha: {enet_cv.alpha_:.4f}")
print(f"Best l1_ratio: {enet_cv.l1_ratio_:.2f}")
print(f"ElasticNet R²: {enet_cv.score(X_test, y_test):.4f}")
How to Choose a Regularization Method
| Method | When to Use | Feature Selection |
|--------|-------------|-------------------|
| Ridge (L2) | All features likely useful, just want to shrink weights | No |
| Lasso (L1) | Suspect many features are irrelevant, want automatic selection | Yes |
| ElasticNet | Many features with groups of highly correlated features | Yes (grouped) |
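A small synthetic sketch of the difference (the data and coefficients are made up, and the features are already on a common scale, so the usual scaling step is skipped): only the first two features carry signal, and Lasso zeroes out the rest while Ridge merely shrinks them:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
# Only the first two features matter; the other three are pure noise
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.1, 200)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("Ridge:", np.round(ridge.coef_, 3))  # all non-zero, shrunk
print("Lasso:", np.round(lasso.coef_, 3))  # irrelevant features driven to exactly 0
```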
Feature Engineering
Feature engineering is the key step to improving linear regression performance. Good features matter more than complex models.
Polynomial Features
When data has a nonlinear relationship, create polynomial features to let the linear model fit curves:
from sklearn.preprocessing import PolynomialFeatures
# Original features: [x₁, x₂]
# After degree=2: [1, x₁, x₂, x₁², x₁x₂, x₂²]
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X_train)
print(f"Original features: {X_train.shape[1]}")
print(f"Polynomial features: {X_poly.shape[1]}")
# Train linear regression with polynomial features
from sklearn.pipeline import make_pipeline
poly_model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    Ridge(alpha=1.0)  # Polynomial features overfit easily, add regularization
)
poly_model.fit(X_train, y_train)
print(f"Polynomial R²: {poly_model.score(X_test, y_test):.4f}")
Feature Scaling
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# Standardization: mean=0, std=1 (recommended for linear regression)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)
# Min-Max scaling: scale to [0, 1] range
minmax = MinMaxScaler()
X_minmax = minmax.fit_transform(X_train)
# Important: use the training scaler on test set (don't re-fit)
X_test_scaled = scaler.transform(X_test) # transform only, no fit!
One-Hot Encoding
Linear regression only handles numerical features. Categorical features must be encoded:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
# Example data
df = pd.DataFrame({
    'area': [80, 120, 60, 200],
    'rooms': [2, 3, 1, 4],
    'city': ['Beijing', 'Shanghai', 'Beijing', 'Shenzhen'],
    'price': [300, 500, 200, 800]
})

# Automatically handle numerical + categorical features
preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), ['area', 'rooms']),
    ('cat', OneHotEncoder(drop='first'), ['city'])
    # drop='first' avoids the dummy variable trap
])
X = preprocessor.fit_transform(df[['area', 'rooms', 'city']])
print(f"Encoded feature dimensions: {X.shape}")
Model Evaluation Metrics
| Metric | Formula | Interpretation |
|--------|---------|----------------|
| R² | 1 - SS_res / SS_tot | How much variance the model explains: 1 = perfect, 0 = same as predicting the mean, negative = worse than the mean |
| MSE | (1/n) Σ(yᵢ - ŷᵢ)² | Average squared error; penalizes large errors more heavily |
| RMSE | √MSE | Same unit as y, more intuitive: RMSE = 5 means the typical error is about 5 units |
| MAE | (1/n) Σ\|yᵢ - ŷᵢ\| | Average absolute error; robust to outliers |
| Adjusted R² | 1 - (1-R²)(n-1)/(n-p-1) | R² adjusted for the number of features; prevents inflating R² by adding useless features |
from sklearn.metrics import (
    r2_score, mean_squared_error, mean_absolute_error
)
import numpy as np

def evaluate_regression(y_true, y_pred, n_features=None):
    """Print all regression evaluation metrics"""
    r2 = r2_score(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_true, y_pred)
    print(f"R² = {r2:.4f}")
    print(f"MSE = {mse:.4f}")
    print(f"RMSE = {rmse:.4f}")
    print(f"MAE = {mae:.4f}")
    if n_features is not None:
        n = len(y_true)
        adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
        print(f"Adjusted R² = {adj_r2:.4f}")
    return {"r2": r2, "mse": mse, "rmse": rmse, "mae": mae}
# Usage example
metrics = evaluate_regression(y_test, y_pred, n_features=X_train.shape[1])
How to Judge Model Quality?
- R² > 0.9: Excellent (common in physics/engineering)
- R² 0.7-0.9: Good (common in business/social science)
- R² 0.5-0.7: Fair, may need more features or nonlinear models
- R² < 0.5: Poor, linearity assumption may not hold
- Key: Compare metrics between training and test sets. Large gap indicates overfitting.
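The train-vs-test comparison can be sketched in a few lines; this example deliberately overfits a small made-up dataset with a needlessly high-degree polynomial, so the gap between the two scores is large:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (40, 1))
y = 0.5 * X[:, 0] ** 2 + rng.normal(0, 1, 40)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# Degree 15 on 20 training points: the model memorizes noise
model = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
model.fit(X_tr, y_tr)
print(f"Train R²: {model.score(X_tr, y_tr):.3f}")  # high
print(f"Test  R²: {model.score(X_te, y_te):.3f}")  # much lower => overfitting
```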
Visualization
Scatter Plot + Regression Line
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
# Generate example data
np.random.seed(42)
X = 2 * np.random.rand(80, 1)
y = 4 + 3 * X.squeeze() + np.random.randn(80) * 0.8
# Train model
model = LinearRegression().fit(X, y)
# Create plot
fig, ax = plt.subplots(figsize=(8, 5))
ax.scatter(X, y, alpha=0.6, color='#6c63ff', label='Data points')
# Draw regression line
X_line = np.linspace(0, 2, 100).reshape(-1, 1)
y_line = model.predict(X_line)
ax.plot(X_line, y_line, color='#ff6b6b', linewidth=2,
        label=f'y = {model.coef_[0]:.2f}x + {model.intercept_:.2f}')
ax.set_xlabel('X')
ax.set_ylabel('y')
ax.set_title('Linear Regression Fit')
ax.legend()
plt.tight_layout()
plt.savefig('regression_fit.png', dpi=150)
plt.show()
Residual Plot (Check Model Assumptions)
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
y_pred = model.predict(X)
residuals = y - y_pred
# Residuals vs Predicted
axes[0].scatter(y_pred, residuals, alpha=0.6, color='#6c63ff')
axes[0].axhline(y=0, color='#ff6b6b', linestyle='--')
axes[0].set_xlabel('Predicted')
axes[0].set_ylabel('Residuals')
axes[0].set_title('Residuals vs Predicted')
# Residual histogram (check normality)
axes[1].hist(residuals, bins=20, color='#6c63ff', alpha=0.7, edgecolor='white')
axes[1].set_xlabel('Residual')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Residual Distribution')
plt.tight_layout()
plt.savefig('residual_analysis.png', dpi=150)
plt.show()
# Ideal: residual plot shows no pattern, histogram is approximately normal
Learning Curve (Check Overfitting/Underfitting)
from sklearn.model_selection import learning_curve
train_sizes, train_scores, val_scores = learning_curve(
    LinearRegression(), X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 10),
    scoring='r2'
)
fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(train_sizes, train_scores.mean(axis=1), 'o-', label='Training R²')
ax.plot(train_sizes, val_scores.mean(axis=1), 'o-', label='Validation R²')
ax.fill_between(train_sizes,
                train_scores.mean(axis=1) - train_scores.std(axis=1),
                train_scores.mean(axis=1) + train_scores.std(axis=1), alpha=0.1)
ax.fill_between(train_sizes,
                val_scores.mean(axis=1) - val_scores.std(axis=1),
                val_scores.mean(axis=1) + val_scores.std(axis=1), alpha=0.1)
ax.set_xlabel('Training Set Size')
ax.set_ylabel('R² Score')
ax.set_title('Learning Curve')
ax.legend()
plt.tight_layout()
plt.show()
Common Pitfalls & Solutions
| Pitfall | Symptoms | Solution |
|---------|----------|----------|
| Multicollinearity | Large, unstable weights; signs may contradict intuition | Calculate VIF (Variance Inflation Factor); remove/combine features with VIF > 10; use Ridge regularization |
| Overfitting | High training R² but low test R²; features > samples | Reduce features; add regularization (Ridge/Lasso); get more training data; use cross-validation |
| Underfitting | Both training and test R² are low | Add polynomial or interaction features; consider nonlinear models; do more feature engineering |
| Outlier Impact | Individual extreme values severely skew the regression line | Detect and handle outliers with IQR or Z-score; use RANSAC or Huber robust regression |
| Data Leakage | Abnormally high test R² (>0.99), performance drops in production | Ensure test data is unseen during training; fit scaler on training set only |
| Ignoring Feature Scaling | Gradient descent doesn't converge or converges slowly; regularization behaves unexpectedly | Use StandardScaler; Ridge/Lasso/ElasticNet require standardization |
Multicollinearity Detection Code
from statsmodels.stats.outliers_influence import variance_inflation_factor
import pandas as pd
def check_vif(X, feature_names):
    """Calculate VIF for each feature"""
    vif_data = pd.DataFrame()
    vif_data['Feature'] = feature_names
    vif_data['VIF'] = [
        variance_inflation_factor(X, i) for i in range(X.shape[1])
    ]
    vif_data = vif_data.sort_values('VIF', ascending=False)
    print(vif_data.to_string(index=False))
    # VIF > 10: severe multicollinearity, consider removing
    # VIF 5-10: moderate multicollinearity, worth investigating
    # VIF < 5: acceptable
    return vif_data
Robust Regression for Outliers
from sklearn.linear_model import RANSACRegressor, HuberRegressor
# RANSAC: Random Sample Consensus, ignores outliers
ransac = RANSACRegressor(random_state=42)
ransac.fit(X_train, y_train)
inlier_mask = ransac.inlier_mask_
print(f"Inliers: {inlier_mask.sum()}/{len(inlier_mask)}")
# Huber: Squared loss for small errors, absolute loss for large errors
huber = HuberRegressor(epsilon=1.35)
huber.fit(X_train, y_train)
print(f"Huber R²: {huber.score(X_test, y_test):.4f}")
When to Use / Not Use Linear Regression
Use Linear Regression When
- Small to medium data
- Interpretability needed
- Linear relationship
- As a baseline model
- Fast training needed
- Good after feature engineering
Don't Use Linear Regression When
- Classification problem
- Strong nonlinear relationship
- Many outliers
- Complex feature interactions
- Image/text/sequence data
- Features >> samples
Comparison with Other Regression Algorithms
| Algorithm | Strengths | Weaknesses | Complexity |
|-----------|-----------|------------|------------|
| Linear Regression | Simple, fast, interpretable | Can only fit linear relationships | O(np²) |
| Decision Tree | Handles nonlinearity, no scaling needed | Overfits easily, unstable | O(np log n) |
| Random Forest | Robust, handles nonlinearity well | Slower, less interpretable | O(k · np log n) |
| XGBoost / LightGBM | High accuracy, handles complex patterns | Needs tuning, can overfit | O(k · np log n) |
| Neural Network | Any complex pattern, good with large data | Needs lots of data and compute | Depends on architecture |
Practical advice: Always start with linear regression as your baseline model on any regression task. If R² is already sufficient (>0.8), you may not need a more complex model. Complex models don't always beat simple ones — always start simple.
Frequently Asked Questions (FAQ)
Q1: What is the difference between linear regression and logistic regression?
Linear regression predicts continuous values (e.g., house prices) with unbounded output; logistic regression predicts class probabilities (e.g., buy or not), with output mapped to [0, 1] via the Sigmoid function. Despite the name "regression," logistic regression is fundamentally a classification algorithm. They also use different cost functions: MSE for linear regression, Cross-Entropy for logistic regression.
Q2: Why is my R² negative?
A negative R² means your model is worse than simply predicting the mean value. Common causes: (1) Model not properly trained (e.g., gradient descent didn't converge); (2) Large distribution shift between train and test sets; (3) Strong nonlinear relationship but you used a linear model; (4) Feature scaling issues. Debug: ensure training R² is positive, then investigate test set issues.
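A two-line demonstration that R² really can go negative; these predictions are farther from the truth than the constant mean prediction (2.0) would be:

```python
from sklearn.metrics import r2_score

y_true = [1.0, 2.0, 3.0]
# Worse than just predicting the mean of y_true (2.0)
y_pred = [3.0, 3.0, 3.0]

r2 = r2_score(y_true, y_pred)
print(r2)  # -1.5
```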
Q3: Does linear regression require feature scaling?
It depends. If using the normal equation or sklearn's LinearRegression (which uses SVD internally), scaling is NOT needed — results are identical. However, scaling IS required when: (1) Using gradient descent — different feature scales cause slow or no convergence; (2) Using Ridge/Lasso/ElasticNet — regularization penalizes unstandardized features unevenly; (3) Comparing feature importances via weights — weights are only comparable after standardization.
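Point (1) can be shown with a hand-rolled gradient-descent loop on two made-up features of very different scales (the learning rates here were picked to illustrate the effect): unscaled, the largest stable learning rate is tiny and the cost barely moves; after standardization, the same number of steps drives the cost essentially to zero:

```python
import numpy as np

rng = np.random.default_rng(1)
# Two features on wildly different scales
X = np.column_stack([rng.uniform(0, 1, 100), rng.uniform(0, 1000, 100)])
y = X @ np.array([2.0, 0.003]) + 5

def gd_final_mse(X, y, lr, n_iter=200):
    """Plain batch gradient descent; returns the final MSE."""
    w = np.zeros(X.shape[1])
    b = 0.0
    n = len(y)
    for _ in range(n_iter):
        err = y - (X @ w + b)
        w += lr * (2 / n) * (X.T @ err)
        b += lr * (2 / n) * err.sum()
    return np.mean((y - (X @ w + b)) ** 2)

# Unscaled: learning rates much above ~1e-6 diverge, and 1e-6 barely moves
raw = gd_final_mse(X, y, lr=1e-6)
# Standardized: lr=0.1 is stable and converges within a few dozen steps
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
scaled = gd_final_mse(Xs, y, lr=0.1)
print(f"MSE unscaled: {raw:.3f}, scaled: {scaled:.6f}")
```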
Q4: Can linear regression be used for time series forecasting?
Yes, but beware of the independence assumption. Time series residuals typically exhibit autocorrelation, violating the independence assumption. Solutions: (1) Use lag features, taking previous N steps as input features; (2) Add time-related features (month, day of week, holidays, etc.); (3) Consider dedicated time series models like ARIMA, Prophet, or LSTM. Linear regression works as a quick baseline, but specialized models usually perform better.
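A minimal sketch of the lag-feature idea from point (1), using a hypothetical daily sales series (numbers made up): each row's inputs are the previous days' values, and the rows left incomplete by shifting are dropped before fitting any regressor:

```python
import pandas as pd

# Hypothetical daily sales series
df = pd.DataFrame({"sales": [10, 12, 13, 15, 14, 16, 18]})

# Lag features: the previous 1 and 2 days become input columns
df["lag1"] = df["sales"].shift(1)
df["lag2"] = df["sales"].shift(2)

# Drop rows with NaN lags, then lag1/lag2 can feed a linear regression
df = df.dropna()
print(df)
```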
Q5: How do I choose the alpha value for Ridge and Lasso?
Use cross-validation for automatic selection. Sklearn provides RidgeCV and LassoCV with built-in cross-validation to search for optimal alpha. Recommended approach: (1) Set a log-uniform list of candidates like [0.001, 0.01, 0.1, 1, 10, 100]; (2) Use 5-fold or 10-fold cross-validation; (3) Plot the CV score curve and pick the alpha at the elbow point. Alpha too small = almost no regularization (may overfit), alpha too large = too much regularization (may underfit).