Random Forest Explained
Table of Contents
- 1. What is Random Forest
- 2. Why Random Forest Works — Mathematical Argument
- 3. Step-by-Step Algorithm
- 4. OOB Error — Free Cross-Validation
- 5. Feature Importance
- 6. Hyperparameter Tuning
- 7. Python Implementation
- 8. RF vs Decision Tree vs XGBoost vs SVM
- 9. Limitations
- 10. Related Guides & Tools
- 11. FAQ
1. What is Random Forest
Random Forest is a classic ensemble learning algorithm published by Leo Breiman in 2001 in the Machine Learning journal (Breiman, "Random Forests", Machine Learning, 45(1):5-32, 2001). It remains one of the most widely used ML algorithms in both Kaggle competitions and production systems.
Origin of the Name
The name "Random Forest" precisely describes the algorithm's core idea:
1. Sample randomness: Each tree is trained on a Bootstrap sample (random sampling with replacement), meaning each tree sees a slightly different subset of the data.
2. Feature randomness: At each node split, only a random subset of features is considered as candidates, rather than all features.
"Forest" refers to the ensemble of many decision trees combined into a single model. A single tree may be unstable (high variance), but the collective vote (classification) or average (regression) of hundreds of trees is remarkably robust. This is the mathematical realization of the "wisdom of crowds" phenomenon.
Breiman noted in his paper that Random Forest combines two earlier ideas: Bagging (Bootstrap Aggregating, which Breiman himself proposed in 1996) and the random subspace method (Ho, 1998). He proved that this combination produces a strong classifier with a bounded generalization error that converges as the number of trees increases -- without overfitting.
2. Why Random Forest Works — Mathematical Argument
2.1 Bias-Variance Tradeoff
Understanding Random Forest begins with the bias-variance decomposition. For any model, the expected prediction error decomposes as:
Bias-Variance Decomposition
E[(y - f̂(x))²] = Bias²(f̂) + Var(f̂) + σ²_noise
Why does this matter? An unpruned decision tree perfectly fits the training data (low bias) but is highly sensitive to which specific data points are in the training set (high variance). This is the essence of overfitting. Random Forest's core strategy: keep each tree's low bias (no pruning) while drastically reducing variance through ensembling.
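The variance gap between a single unpruned tree and a forest can be seen empirically. The sketch below (synthetic regression task; all dataset details are arbitrary and invented for illustration) refits both models on many resampled training sets and compares how much their prediction at a fixed point fluctuates:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

x_query = np.array([[0.5]])          # fixed query point
tree_preds, forest_preds = [], []
for seed in range(30):
    # A fresh noisy training set for each repetition
    r = np.random.default_rng(seed)
    X = r.uniform(0, 1, size=(200, 1))
    y = np.sin(4 * X.ravel()) + r.normal(0, 0.3, 200)
    tree_preds.append(
        DecisionTreeRegressor(random_state=0).fit(X, y).predict(x_query)[0])
    forest_preds.append(
        RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y).predict(x_query)[0])

# The single tree's prediction jumps around with the training set;
# the forest's prediction is far more stable (lower variance)
print(f"single-tree prediction variance: {np.var(tree_preds):.4f}")
print(f"forest prediction variance:      {np.var(forest_preds):.4f}")
```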
2.2 Bagging — Why Averaging Reduces Variance
Bagging (Bootstrap Aggregating) was proposed by Breiman in 1996 ("Bagging Predictors", Machine Learning, 24(2):123-140). The core idea is deceptively simple yet profound:
Variance Reduction of Independent Random Variables
If X₁, X₂, ..., X_B are i.i.d. random variables, each with variance σ², then:
Var(X̄) = Var((1/B)∑Xᵢ) = σ²/B
Why does this work? Averaging B independent predictors shrinks the variance by a factor of B. This is one of the most fundamental results in statistics. The problem is: we only have one dataset, so Bootstrap sampling is used to "simulate" multiple datasets from the single one we have.
However, the trees are trained on overlapping data: each Bootstrap sample contains only about 63.2% of the unique original points, so any two samples share a large fraction of the dataset in expectation. The trees are therefore not truly independent, and the simple σ²/B formula is overly optimistic.
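A quick NumPy simulation (illustrative sample sizes, not from the original paper) confirms the σ²/B reduction for truly independent predictors:

```python
import numpy as np

rng = np.random.default_rng(0)
B = 50            # number of "trees" (predictors) being averaged
sigma2 = 4.0      # variance of each individual predictor
n_trials = 100_000

# Each row: one experiment of B i.i.d. predictions with variance sigma2
preds = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(n_trials, B))
var_single = preds[:, 0].var()        # variance of one predictor (~ 4.0)
var_mean = preds.mean(axis=1).var()   # variance of the average (~ 4.0 / 50 = 0.08)

print(f"single predictor variance: {var_single:.3f}")
print(f"averaged variance:         {var_mean:.4f}")
```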
2.3 Feature Randomness — Why Reducing Tree Correlation is Critical
This is Breiman's key innovation. When trees have pairwise correlation ρ, the ensemble variance becomes:
Random Forest Variance Formula (Core Equation)
Var(RF) = ρσ² + (1-ρ)σ²/B
Where:
ρ = (weighted) pairwise correlation between any two trees' predictions. Higher ρ means more "homogeneous" trees and worse ensemble performance.
σ² = variance of a single tree's predictions.
B = number of trees.
Term-by-term explanation:
ρσ² — The irreducible component that CANNOT be eliminated by adding more trees. Even as B → ∞, this term persists. It represents the "common mistakes" all trees make due to their correlation.
(1-ρ)σ²/B — The reducible component that CAN be eliminated by adding more trees. As B grows large, this term approaches zero.
Why is feature randomness so critical? Without feature randomness (i.e., plain Bagging), if the dataset has one dominant feature, nearly every tree will select it for the root split, making trees highly correlated (ρ ≈ 1) and ensemble variance ≈ σ² -- almost no reduction. Feature randomness (considering only m random features per split) forces different trees to use different feature combinations, dramatically reducing ρ and hence the ρσ² term.
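Plugging illustrative numbers into Var(RF) = ρσ² + (1-ρ)σ²/B makes the irreducible floor visible (the ρ and B values here are arbitrary, chosen only for demonstration):

```python
def rf_variance(rho: float, B: int, sigma2: float = 1.0) -> float:
    """Ensemble variance of B trees with pairwise correlation rho."""
    return rho * sigma2 + (1 - rho) * sigma2 / B

for rho in (0.9, 0.3):
    for B in (10, 100, 10_000):
        print(f"rho={rho:.1f}  B={B:>5d}  Var(RF)={rf_variance(rho, B):.4f}")

# With rho=0.9, variance never drops below 0.9*sigma2 no matter how large B
# gets; with rho=0.3 the floor is 0.3*sigma2. Decorrelating the trees
# (feature randomness) is what lowers the achievable floor.
```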
2.4 Why No Pruning is Needed
In a single decision tree, pruning is essential to control variance (prevent overfitting). In Random Forest, however:
1. Each unpruned tree maintains low bias (captures complex patterns)
2. Bagging + feature randomness already effectively controls variance
3. Pruning would increase bias with marginal variance reduction (since ensembling is already handling variance)
Breiman's theoretical guarantee from the original paper: The generalization error upper bound of a random forest is ρ̄(1-s²)/s², where s is each tree's "strength" (a measure of accuracy). Unpruned trees have higher s, and as long as ρ̄ is sufficiently low (guaranteed by feature randomness), the overall error bound remains small.
3. Step-by-Step Algorithm
Below is the complete Random Forest construction process, with a "WHY" explanation at each step:
Step 1: Bootstrap Sampling
From the original dataset D (with n samples), draw n samples with replacement to form a Bootstrap dataset D*.
Why ~36.8% of samples are left out (OOB samples)
Each time we draw a sample, the probability that a specific sample xᵢ is NOT selected is (1 - 1/n). Over n draws, the probability of never being selected is:
P(not selected) = (1 - 1/n)^n
As n → ∞, by the definition of the natural constant e:
lim(n→∞) (1 - 1/n)^n = 1/e ≈ 0.3679
Therefore, approximately 36.8% of samples are absent from the Bootstrap dataset -- these are the OOB (Out-of-Bag) samples. Even for moderate n (e.g., n=100), the actual proportion is very close to 36.8%.
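The limit can be checked directly (a small sketch; the chosen values of n are arbitrary):

```python
import math

def oob_fraction(n: int) -> float:
    """P(a given sample never appears in a Bootstrap sample of size n)."""
    return (1 - 1 / n) ** n

for n in (10, 100, 1000, 100_000):
    print(f"n={n:>6d}  P(not selected)={oob_fraction(n):.4f}")
print(f"limit 1/e = {1 / math.e:.4f}")  # 0.3679
```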
Why use Bootstrap sampling? We have only one dataset but need to train many "different" trees. Bootstrap sampling provides a way to "simulate" multiple slightly different training sets from a single dataset, introducing sample-level randomness.
Step 2: Build Decision Tree (No Pruning)
At each node split:
2a. Randomly select m features (m < p) from the full set of p features as split candidates.
2b. Among these m features, choose the best feature and threshold for splitting (using Gini impurity or information gain).
2c. Recursively repeat for left and right child nodes until leaf nodes are pure or minimum sample count is reached.
Why m ≈ √p (classification) or m ≈ p/3 (regression)?
These are empirical rules of thumb established by Breiman through extensive experiments in his 2001 paper:
Classification m ≈ √p: Classification decision boundaries are typically driven by a few key features. A smaller m suffices to find useful features at each split while maximizing inter-tree diversity (low ρ).
Regression m ≈ p/3: Regression targets typically depend on complex combinations of more features. A larger m ensures each split has access to good-enough features, preventing individual tree quality from degrading too much (σ² becoming too large).
The fundamental tradeoff: Smaller m → lower ρ (good) but higher σ² (bad). Larger m → lower σ² (good) but higher ρ (bad). The optimal m is data-dependent; the above values are empirically the most robust defaults.
Step 3: Repeat B Times
Repeat Steps 1-2 to build B independent decision trees. Typical B values range from 100 to 500. Breiman noted in his paper that increasing B does not cause overfitting (from the variance formula ρσ² + (1-ρ)σ²/B, increasing B only shrinks the second term), but computational cost increases linearly.
Step 4: Aggregate Predictions
Classification: Each tree votes for a class; the class with the most votes wins (Majority Voting).
Regression: Each tree's numeric prediction is averaged to produce the final prediction (Averaging).
Why does majority voting/averaging work? Condorcet's Jury Theorem (1785) provides the intuition: if each "juror" (tree) has an accuracy above 50% and their judgments are independent, then as the number of jurors increases, the collective accuracy approaches 100%. Random Forest approximates the "independence" condition through feature randomness.
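The jury theorem can be verified with a short binomial calculation (a sketch; the 60% per-voter accuracy is an arbitrary illustrative value):

```python
from math import comb

def majority_accuracy(p: float, B: int) -> float:
    """P(majority of B independent voters, each correct w.p. p, is correct).
    Assumes odd B so there are no ties."""
    return sum(comb(B, k) * p**k * (1 - p)**(B - k)
               for k in range(B // 2 + 1, B + 1))

# With 60%-accurate independent voters, collective accuracy climbs toward 1
for B in (1, 11, 101, 501):
    print(f"B={B:>3d}  ensemble accuracy = {majority_accuracy(0.6, B):.4f}")
```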
4. OOB Error — Free Cross-Validation
OOB (Out-of-Bag) error is a built-in validation mechanism unique to Random Forest, proposed and validated by Breiman (2001).
4.1 How It Works
For each training sample xᵢ:
1. Identify all trees that did NOT use xᵢ in training (~36.8% × B trees).
2. Have those trees predict xᵢ and take the majority vote/average.
3. Compare the prediction to the true label.
4. Aggregate across all samples to get the OOB error rate.
4.2 Mathematical Proof of 36.8%
Why is each sample used for training by ~63.2% of trees?
Let the dataset have n samples. For tree k, the probability that sample xᵢ is included in its Bootstrap sample is:
P(xᵢ ∈ D*_k) = 1 - (1 - 1/n)^n ≈ 1 - 1/e ≈ 0.632
Therefore the probability of exclusion is ~0.368. With B trees in the forest, approximately 0.368B trees did not include xᵢ in their training set, making their predictions on xᵢ "honest" (never seen this sample) and suitable for estimating generalization error.
4.3 Why OOB Error Approximates Leave-One-Out CV
Breiman (2001) argued that the OOB estimate is, asymptotically, about as accurate as the leave-one-out cross-validation estimate. Intuitively: for each sample, the OOB prediction uses a "subset of models that never saw this sample," which is conceptually identical to LOO-CV's "train a model on all data except this one." The advantage of OOB is that it requires no additional computation -- it's a free byproduct of building the Random Forest.
Enabling OOB in sklearn:
```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=42)
rf.fit(X_train, y_train)
print(f"OOB accuracy: {rf.oob_score_:.4f}")
# No separate validation set needed -- the OOB score estimates generalization error
```
5. Feature Importance
Random Forest offers two methods for measuring feature importance. Understanding their principles and biases is crucial for correct interpretation.
5.1 MDI (Mean Decrease in Impurity) — Impurity-Based
How It Works
For each feature, compute the weighted sum of impurity decrease (Gini or information gain) across all nodes in all trees that use that feature for splitting.
MDI(feature j) = (1/B) ∑_trees ∑_nodes_using_j [N_t/N × ΔImpurity]
Advantage: Extremely fast (computed during tree construction). This is the default value returned by sklearn's feature_importances_ attribute.
Known bias (Strobl et al., 2007): MDI is systematically biased toward high-cardinality (many unique values) features. The reason: features with more unique values offer more possible split points, making them more likely to be selected for splitting even when they have no true relationship with the target. This was rigorously demonstrated by Strobl, Boulesteix, Zeileis & Hothorn in "Bias in random forest variable importance measures: Illustrations, sources and a solution" (BMC Bioinformatics, 2007).
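The cardinality bias is easy to reproduce on synthetic data (a sketch using sklearn; the dataset construction is invented purely for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 1000
informative = rng.integers(0, 2, n)                 # truly predictive binary feature
flip = (rng.random(n) < 0.1).astype(int)
y = informative ^ flip                              # label = feature with 10% noise
noise_binary = rng.integers(0, 2, n)                # pure noise, 2 unique values
noise_continuous = rng.random(n)                    # pure noise, ~1000 unique values
X = np.column_stack([informative, noise_binary, noise_continuous])

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
for name, imp in zip(["informative", "noise_binary", "noise_cont"],
                     rf.feature_importances_):
    print(f"{name:<12s} MDI = {imp:.3f}")
# The continuous noise feature typically receives a much higher MDI score
# than the binary noise feature, although both are unrelated to y --
# it simply offers far more candidate split points.
```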
5.2 Permutation Importance — Breiman 2001
How It Works
For each feature j, randomly shuffle (permute) its values and observe the drop in model prediction accuracy.
PI(feature j) = Accuracy_original - Accuracy_after_permuting_j
Why does this work? If a feature is important for predictions, shuffling it destroys its relationship with the target variable, causing accuracy to drop significantly. If the feature is unimportant, accuracy is barely affected.
Why it's superior to MDI:
1. Not affected by feature cardinality (it measures "how much worse do predictions get," independent of the number of unique values).
2. Can be computed on OOB data (Breiman's original version) or a separate test set, reducing overfitting risk.
3. Model-agnostic -- can be applied to any model, not just tree-based ones.
Caveat: Permutation importance has its own limitation -- when features are highly correlated, permuting one feature still leaves the correlated feature carrying partial information, leading to underestimation of importance (Strobl et al., 2008, BMC Bioinformatics).
sklearn usage example:
```python
from sklearn.inspection import permutation_importance

# Compute permutation importance on the test set (recommended)
result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42)
for i in result.importances_mean.argsort()[::-1]:
    if result.importances_mean[i] - 2 * result.importances_std[i] > 0:
        print(f"{feature_names[i]:<20s} "
              f"{result.importances_mean[i]:.4f} +/- {result.importances_std[i]:.4f}")
```
6. Hyperparameter Tuning
Below are the key Random Forest hyperparameters, each with a "WHY it matters" explanation and recommended ranges:
| Parameter | Purpose & Reasoning | Recommended Range |
|---|---|---|
| n_estimators | Number of trees B. Increasing B reduces variance (the (1-ρ)σ²/B term shrinks) and never causes overfitting, but with diminishing returns: going from 100→200 helps far more than 1000→1100, and more trees mean higher computational cost. | 100-500 (200 usually sufficient) |
| max_features | Number of features m considered per split. The core parameter controlling the ρ vs σ² tradeoff: smaller m → more diverse trees (low ρ) but weaker individual trees (high σ²), and vice versa. | Classification: "sqrt" (√p); Regression: "log2" or p/3 |
| max_depth | Maximum tree depth. Default None (unlimited) lets each tree grow fully to maintain low bias. Setting a limit increases bias but may help on very noisy data. | None (default) or 10-30 |
| min_samples_split | Minimum samples required to split a node. Increasing this is equivalent to mild pruning, preventing splits on noisy samples. | 2 (default) to 10 |
| min_samples_leaf | Minimum samples in a leaf node. Prevents leaves with only 1-2 samples. Larger values produce smoother predictions (lower variance) but may miss local patterns (higher bias). | 1 (default) to 10 |
| max_samples | Bootstrap sample proportion. Values below 1.0 train each tree on fewer samples, increasing randomness (lower ρ) but potentially increasing bias. | None (full Bootstrap) or 0.7-0.9 |
| class_weight | Class weights. For imbalanced datasets, "balanced" automatically sets weights to n_samples / (n_classes × n_samples_per_class), making misclassification of minority classes more costly. | None or "balanced" |
Recommended tuning strategy (in order of priority):
```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    'n_estimators': [100, 200, 300, 500],
    'max_features': ['sqrt', 'log2', 0.3, 0.5],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}

# RandomizedSearchCV is more efficient than GridSearchCV:
# Bergstra & Bengio (2012) showed that random search finds better
# hyperparameters under the same compute budget because it explores
# the parameter space more effectively
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_dist, n_iter=50, cv=5, scoring='f1_macro',
    random_state=42, n_jobs=-1
)
search.fit(X_train, y_train)
print(f"Best params: {search.best_params_}")
print(f"Best CV score: {search.best_score_:.4f}")
```
7. Python Implementation
7.1 sklearn in Practice
```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Load data
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Build Random Forest
rf = RandomForestClassifier(
    n_estimators=200,       # 200 trees
    max_features='sqrt',    # consider sqrt(p) features per split
    oob_score=True,         # enable OOB evaluation
    random_state=42,
    n_jobs=-1               # use all CPU cores
)
rf.fit(X_train, y_train)

# Evaluate
print(f"OOB Score: {rf.oob_score_:.4f}")
print(f"Test Score: {rf.score(X_test, y_test):.4f}")
print(classification_report(y_test, rf.predict(X_test),
                            target_names=data.target_names))
```
7.2 From-Scratch Core Concepts
Below is a simplified implementation to illustrate the core logic of Bagging + feature randomness:
```python
import numpy as np
from scipy.stats import mode
from sklearn.tree import DecisionTreeClassifier

class SimpleRandomForest:
    """Simplified Random Forest demonstrating the core ideas."""

    def __init__(self, n_estimators=100, max_features='sqrt', random_state=None):
        self.n_estimators = n_estimators
        self.max_features = max_features
        self.rng = np.random.RandomState(random_state)
        self.trees = []
        self.oob_score_ = None

    def fit(self, X, y):
        n_samples, n_features = X.shape
        # Assumes every class appears in each Bootstrap sample
        oob_predictions = np.zeros((n_samples, len(np.unique(y))))
        oob_counts = np.zeros(n_samples)
        for _ in range(self.n_estimators):
            # Step 1: Bootstrap sampling (with replacement)
            indices = self.rng.randint(0, n_samples, size=n_samples)
            oob_mask = np.ones(n_samples, dtype=bool)
            oob_mask[np.unique(indices)] = False
            # Step 2: Build decision tree (max_features provides feature randomness)
            tree = DecisionTreeClassifier(
                max_features=self.max_features,
                random_state=self.rng.randint(0, 2**31 - 1)
            )
            tree.fit(X[indices], y[indices])
            self.trees.append(tree)
            # Collect OOB predictions
            if oob_mask.any():
                oob_predictions[oob_mask] += tree.predict_proba(X[oob_mask])
                oob_counts[oob_mask] += 1
        # Compute OOB score over samples that were OOB at least once
        valid = oob_counts > 0
        oob_labels = np.argmax(oob_predictions[valid], axis=1)
        self.oob_score_ = np.mean(oob_labels == y[valid])
        return self

    def predict(self, X):
        # Step 4: Majority voting across all trees
        predictions = np.array([tree.predict(X) for tree in self.trees])
        return mode(predictions, axis=0, keepdims=False).mode
```
8. RF vs Decision Tree vs XGBoost vs SVM
| Dimension | Random Forest | Decision Tree | XGBoost | SVM |
|---|---|---|---|---|
| Core Idea | Parallel ensemble + dual randomness (Bagging) | Greedy recursive splitting | Sequential ensemble + gradient residual correction (Boosting) | Maximum margin classifier |
| Bias-Variance | Low bias + medium-low variance. WHY: No pruning keeps bias low; ensembling reduces variance | Low bias + high variance. WHY: Perfectly fits training data but is noise-sensitive | Very low bias + low variance. WHY: Boosting iteratively corrects bias; regularization controls variance | Depends on kernel choice and C |
| Overfitting Risk | Low. Adding trees never causes overfitting (Breiman 2001 theoretical guarantee) | High. Deep trees almost certainly overfit | Medium. Too many iterations cause overfitting (needs early stopping) | Medium. Large C causes overfitting |
| Interpretability | Medium (feature importance available, but can't visualize one tree) | High (decision path can be directly visualized) | Medium (SHAP values provide explanations) | Low (especially with RBF kernel) |
| Training Speed | Fast (fully parallelizable) | Fastest (single tree) | Slower (sequential construction required) | Slow (O(n²) to O(n³)) |
| Feature Scaling | Not required. WHY: Tree splits are threshold comparisons, invariant to feature scale | Not required (same reason) | Not required (same reason) | Required. WHY: Distance calculations are scale-dependent |
| Missing Values | Breiman's original algorithm handles them (proximity-based imputation), but sklearn's implementation has long required imputed input | Same as RF: depends on implementation | Native support (learns a default direction for missing values at each split) | Not supported; preprocessing required |
| Best Use Cases | General-purpose baseline; feature importance analysis; when robustness matters more than absolute peak performance | When interpretability is paramount; data exploration | Competition top scores; when peak accuracy is needed; tabular data | High-dimensional small-sample; text classification |
When to Choose Random Forest Over XGBoost?
1. Default-parameter robustness: Random Forest works excellently with default parameters (Fernandez-Delgado et al., 2014, JMLR found RF with defaults ranked among the top classifiers across 121 datasets), while XGBoost requires careful tuning.
2. Parallel training: RF trees are completely independent and perfectly parallelizable. XGBoost's Boosting is inherently sequential (though intra-tree construction can be parallelized).
3. No overfitting from more trees: Adding RF trees only helps or plateaus, never hurts. Too many XGBoost iterations cause overfitting.
4. Free OOB validation: RF doesn't require a separate validation set.
9. Limitations
A single decision tree can be directly visualized as a flowchart. Random Forest with hundreds of trees cannot be simply visualized. WHY: The final prediction is a vote across many trees; there is no single decision path to explain. While methods like SHAP (Lundberg & Lee, 2017, NeurIPS) can provide local explanations, the complexity is far greater than a single tree.
Storing 500 fully-grown decision trees requires significant memory, and prediction requires traversing all trees. WHY: Each unpruned tree may have thousands of nodes; 500 trees = millions of nodes. For latency-sensitive real-time systems, this can be a bottleneck.
Random Forest (and all tree models) can never extrapolate: for regression, predictions can never fall outside the range of training labels. WHY: Leaf node outputs are averages of training sample labels, which cannot exceed values seen during training. For example, if the highest house price in the training data is $5M, RF will never predict $6M. Linear models and neural networks do not have this limitation.
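A minimal sketch of the extrapolation limit (synthetic data invented for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(500, 1))
y_train = 3.0 * X_train.ravel()          # y = 3x, so max y in training ~ 30

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
pred = rf.predict([[20.0]])[0]
# Prediction stays near the training maximum (~30), far below the true 60
print(f"prediction at x=20: {pred:.1f}")
```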
When features p >> n (e.g., text TF-IDF features), selecting only √p features per split may rarely pick useful ones. WHY: The large number of irrelevant features dilutes the probability of useful features being selected. SVMs and regularized linear models typically perform better on such data.
Tree decision boundaries are axis-parallel rectangular regions. WHY: Each split examines only one feature's threshold, producing vertical or horizontal boundaries. For tasks requiring diagonal or curved boundaries (e.g., XOR problems), many trees are needed to approximate them, which is inefficient.
Default majority voting on severely imbalanced data tends to predict the majority class. WHY: Bagging preserves the original class ratio; the majority class still dominates most trees' training sets. Use class_weight="balanced" or combine with SMOTE and similar sampling techniques.
10. Related Guides & Tools
11. FAQ
Does adding more trees (n_estimators) cause overfitting?
No. Mathematically, in the variance formula Var(RF) = ρσ² + (1-ρ)σ²/B, increasing B only shrinks the second term (1-ρ)σ²/B while the first term ρσ² remains unchanged. Therefore generalization error monotonically decreases or plateaus, never rebounds. Breiman provided a theoretical convergence proof in Theorem 1 of his 2001 paper. However, this doesn't mean RF absolutely never overfits -- if max_depth is excessive and the data is very noisy, individual tree σ² will be large, and even ensembling cannot eliminate the ρσ² term.
What is the difference between Bagging and Random Forest?
Bagging (Breiman, 1996) uses only Bootstrap sampling to create diversity among trees; each node split still considers all features. Random Forest adds feature randomness on top of Bagging (only m random features per split). This additional randomness further reduces the pairwise correlation ρ between trees, making the ρσ² term in the variance formula smaller and improving generalization. Empirically, RF nearly always outperforms plain Bagging, especially when a few dominant features exist.
How does Random Forest handle missing values?
Breiman proposed two methods in his original paper: (1) Proximity-based imputation -- using the proximity matrix produced by RF (the frequency with which two samples land in the same leaf) to estimate missing values; (2) Median/mode imputation. Modern implementations (e.g., sklearn) require complete input and need preprocessing. However, LightGBM and XGBoost have built-in missing value handling: during splitting, they try assigning missing-value samples to both left and right children and choose the better direction.
Is the OOB score reliable enough to replace cross-validation?
The OOB score requires no additional computation -- it's a "free" generalization estimate. When the dataset is large enough (n > 1000), the OOB estimate is very reliable. With small datasets, however, each tree has fewer OOB samples (~0.368n), leading to higher variance in individual OOB predictions and a potentially unstable overall estimate; in such cases, 5-fold or 10-fold cross-validation is recommended. Additionally, if max_samples < 1.0 is used, the OOB proportion changes accordingly, which requires attention.
How many trees are enough?
Breiman recommends monitoring the OOB error curve as the number of trees increases. When the curve plateaus, adding more trees provides no benefit. Practical experience: 200-500 trees achieve near-optimal performance for most problems. You can plot a learning curve to confirm: if OOB error drops by less than 0.1% from 200 to 500 trees, 200 is sufficient. When computational resources are constrained, 100 trees typically provide good-enough results.