ML Algorithms Guide

What Is Machine Learning?

Machine Learning (ML) is a core branch of artificial intelligence that enables systems to learn patterns from data and make predictions or decisions without being explicitly programmed. ML is divided into three main paradigms: Supervised Learning (training with labeled data for classification and regression), Unsupervised Learning (discovering structure in unlabeled data via clustering and dimensionality reduction), and Reinforcement Learning (agents learning optimal policies through environment interaction). This guide focuses on the most widely used supervised and unsupervised algorithms, covering principles, code examples, pros/cons, and use cases to help you choose the right algorithm quickly.

Algorithm Selection Flowchart

Step 1: Do you have labeled data?
  |-- Yes → Supervised Learning
  |  |-- Need to predict a category? → Classification (Logistic Regression, Decision Tree, SVM, KNN...)
  |  |-- Need to predict a number? → Regression (Linear Regression, Ridge, SVR...)
  |-- No → Unsupervised Learning
     |-- Need to find groups? → Clustering (K-Means, DBSCAN...)
     |-- Need to reduce dimensions? → Dimensionality Reduction (PCA, t-SNE...)

Supervised Learning -- Classification

Logistic Regression

Despite its name, Logistic Regression is a classic binary classification algorithm. It maps a linear combination of features through the sigmoid function to produce a probability in [0,1]. The model is simple, highly interpretable, and trains extremely fast, making it an ideal baseline for classification tasks. With good feature engineering, it often delivers surprisingly strong results.
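The sigmoid mapping described above can be sketched in a few lines of NumPy. This is a toy illustration with made-up weights, not tied to any real dataset:

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Linear combination of features: z = w . x + b
w = np.array([0.8, -0.5])   # illustrative weights only
b = 0.1
x = np.array([2.0, 1.0])
p = sigmoid(w @ x + b)      # probability of the positive class
print(f"P(y=1|x) = {p:.4f}")
```

Training amounts to finding w and b that maximize the likelihood of the observed labels under this probability model.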

Pros: Fast training, interpretable, outputs probabilities, resists overfitting
Cons: Linear decision boundary only, needs manual feature engineering for complex patterns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")

When to use: Binary classification baseline, credit scoring, ad click prediction, feature effect analysis.

Decision Tree

Decision Trees recursively split data by feature values to build a tree structure, where each leaf node corresponds to a predicted class. They use information gain (ID3), gain ratio (C4.5), or Gini impurity (CART) to select optimal splits. Trees are intuitive and can be directly visualized, but they tend to overfit and typically need pruning or ensemble methods.
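As a concrete illustration of the CART criterion mentioned above, Gini impurity for a node can be computed directly from class proportions (a minimal sketch, not sklearn's internals):

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over class proportions p_k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

# A pure node has impurity 0; a 50/50 binary node has impurity 0.5
print(gini([0, 0, 0, 0]))   # -> 0.0
print(gini([0, 0, 1, 1]))   # -> 0.5
```

A split is chosen to minimize the weighted impurity of the resulting child nodes.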

Pros: Highly interpretable, no scaling needed, handles mixed feature types
Cons: Prone to overfitting, sensitive to data changes, axis-aligned boundaries
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(max_depth=5, min_samples_leaf=10)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")

When to use: Explainable business decisions, feature importance analysis, rule extraction.

Random Forest

Random Forest is the classic Bagging implementation: it trains multiple decision trees on random subsets of data and features, then aggregates predictions via majority voting. This effectively reduces variance and overfitting. Random Forest is one of the most widely used algorithms in practice -- it works well out of the box, is relatively insensitive to hyperparameter choices, and provides feature importance estimates.

Pros: High accuracy, resists overfitting, parallelizable, feature importance
Cons: Large model size, slower inference, less interpretable than single tree
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"Feature importances: {model.feature_importances_}")

When to use: General-purpose classification, feature selection, first algorithm to try without domain knowledge.

SVM (Support Vector Machine)

SVM finds the maximum-margin hyperplane that separates different classes. Using the kernel trick (RBF, polynomial), it can handle non-linearly separable problems in high-dimensional space. SVM excels on small-to-medium datasets with strong generalization, but scales poorly to large datasets and is sensitive to feature scaling.

Pros: Excellent in high dimensions, strong generalization, kernel trick for non-linearity
Cons: Slow on large data, scaling-sensitive, no direct probability output
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
model = SVC(kernel='rbf', C=1.0, gamma='scale')
model.fit(X_train_s, y_train)
y_pred = model.predict(X_test_s)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")

When to use: Text classification, image classification, bioinformatics, small-to-medium high-dimensional data.

K-Nearest Neighbors (KNN)

KNN is the most intuitive classification algorithm: for a new sample, it finds the K closest training examples and assigns the majority class. KNN is a lazy learner -- no training phase, but prediction requires scanning all training data, making it slow on large datasets. The choice of distance metric and K value significantly affects results.

Pros: No training needed, simple intuition, naturally multi-class
Cons: Slow prediction, curse of dimensionality, noise-sensitive, needs scaling
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=5, metric='minkowski')
model.fit(X_train_s, y_train)   # scaled features, as in the SVM section
y_pred = model.predict(X_test_s)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")

When to use: Recommendation prototypes, quick validation on small datasets, anomaly detection.

Naive Bayes

Naive Bayes is based on Bayes' theorem with the assumption of conditional independence between features. Despite this assumption rarely holding in reality, it performs remarkably well on text classification tasks (spam filtering, sentiment analysis). Training and prediction are extremely fast, and it handles high-dimensional sparse data well.

Pros: Extremely fast, great for sparse high-dimensional data, works with little data
Cons: Strong independence assumption, poor with continuous features
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=5000)
X_train_v = vectorizer.fit_transform(X_train_text)
X_test_v = vectorizer.transform(X_test_text)
model = MultinomialNB(alpha=1.0)
model.fit(X_train_v, y_train)
y_pred = model.predict(X_test_v)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")

When to use: Text classification, spam filtering, sentiment analysis, real-time classification.

Gradient Boosting (XGBoost / LightGBM)

Gradient Boosting trains weak learners (typically decision trees) sequentially, with each new model fitting the residuals of the previous round. XGBoost and LightGBM are two highly optimized implementations, dominating Kaggle competitions and industry applications. XGBoost uses second-order Taylor expansion; LightGBM uses histogram-based splits and GOSS sampling. This is the top-performing algorithm family for tabular data.

Pros: State-of-the-art accuracy, built-in regularization, handles missing values, feature importance
Cons: Many hyperparameters, slower training, can overfit small data
import xgboost as xgb

model = xgb.XGBClassifier(
    n_estimators=200,
    max_depth=6,
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42
)
model.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)
y_pred = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")

# LightGBM alternative
import lightgbm as lgb
model_lgb = lgb.LGBMClassifier(n_estimators=200, learning_rate=0.1)
model_lgb.fit(X_train, y_train)

When to use: Tabular data competitions, financial risk, recommendation ranking, CTR prediction.

Supervised Learning -- Regression

Linear Regression

Linear Regression fits a linear relationship between features and the target variable using ordinary least squares. It is the baseline model for regression problems -- simple, efficient, and highly interpretable. The model outputs coefficients for each feature, showing the direction and strength of their effect on the target.

Pros: Extremely fast, interpretable, solid mathematical foundation
Cons: Only fits linear relationships, sensitive to outliers
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(f"MSE: {mean_squared_error(y_test, y_pred):.4f}")
print(f"R2: {r2_score(y_test, y_pred):.4f}")
print(f"Coefficients: {model.coef_}")

When to use: Price prediction baseline, trend analysis, economics modeling.

Ridge / Lasso Regression

Ridge (L2 regularization) and Lasso (L1 regularization) add penalty terms to Linear Regression to prevent overfitting. Ridge shrinks all coefficients but never sets them to zero, ideal for multicollinearity. Lasso can shrink coefficients to exactly zero, performing automatic feature selection. ElasticNet combines both approaches.

Pros: Prevents overfitting, Lasso does feature selection, handles multicollinearity
Cons: Still limited to linear relationships, alpha parameter needs tuning
from sklearn.linear_model import Ridge, Lasso, ElasticNet

ridge = Ridge(alpha=1.0).fit(X_train, y_train)
lasso = Lasso(alpha=0.1).fit(X_train, y_train)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X_train, y_train)

# Lasso automatic feature selection: zero coefficients = dropped features
print(f"Lasso non-zero features: {(lasso.coef_ != 0).sum()}")

When to use: High-dimensional regression, feature selection, datasets with multicollinearity.

Decision Tree Regression

Decision Tree Regression uses the same recursive splitting strategy as classification trees, but outputs continuous values at leaf nodes (typically the mean of target values in that region). It captures non-linear relationships without feature scaling, but risks overfitting.

Pros: Captures non-linearity, no scaling needed, intuitive
Cons: Prone to overfitting, discontinuous predictions (step-like)
from sklearn.tree import DecisionTreeRegressor

model = DecisionTreeRegressor(max_depth=8, min_samples_leaf=20)
model.fit(X_train, y_train)
print(f"R2: {model.score(X_test, y_test):.4f}")

Random Forest Regression

Random Forest Regression ensembles multiple regression trees and averages their predictions. It inherits Random Forest's resistance to overfitting and is one of the most reliable choices for regression tasks, especially on medium-sized datasets.

Pros: High accuracy, robust, feature importance estimation
Cons: Cannot extrapolate (predictions stay within training range), large model
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=100, max_depth=12, random_state=42)
model.fit(X_train, y_train)
print(f"R2: {model.score(X_test, y_test):.4f}")

SVR (Support Vector Regression)

SVR extends SVM to regression by defining an epsilon-insensitive loss function that ignores errors within a tube around the prediction. With kernel tricks it can fit complex non-linear relationships, performing well on small-to-medium datasets.

Pros: Kernel trick for non-linearity, robust, good generalization
Cons: Slow on large data, requires scaling, difficult hyperparameter tuning
from sklearn.svm import SVR

model = SVR(kernel='rbf', C=100, epsilon=0.1)
model.fit(X_train_s, y_train)   # scaled features required
print(f"R2: {model.score(X_test_s, y_test):.4f}")

Unsupervised Learning -- Clustering

K-Means

K-Means is the most widely used clustering algorithm. It partitions data into K clusters by iteratively assigning samples to the nearest centroid and updating centroids. Simple and efficient, it works best with spherical cluster shapes. K must be specified in advance -- use the elbow method or silhouette score to choose.

Pros: Simple, efficient, scales to large datasets
Cons: Requires K upfront, only spherical clusters, sensitive to initialization and outliers
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Elbow method to choose K
inertias = []
for k in range(2, 11):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X_scaled)
    inertias.append(km.inertia_)

# Final model
model = KMeans(n_clusters=4, random_state=42, n_init=10)
labels = model.fit_predict(X_scaled)
print(f"Silhouette Score: {silhouette_score(X_scaled, labels):.4f}")

DBSCAN

DBSCAN (Density-Based Spatial Clustering) identifies clusters by defining core points, border points, and noise points. It automatically discovers clusters of arbitrary shape without requiring K, and naturally identifies outliers. The parameters eps (neighborhood radius) and min_samples heavily influence results.

Pros: No K required, finds arbitrary-shaped clusters, automatic outlier detection
Cons: Parameter-sensitive, struggles with varying density clusters
from sklearn.cluster import DBSCAN

model = DBSCAN(eps=0.5, min_samples=5)
labels = model.fit_predict(X_scaled)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = (labels == -1).sum()
print(f"Clusters: {n_clusters}, Noise points: {n_noise}")

Hierarchical Clustering

Hierarchical Clustering builds a tree-like structure (dendrogram) of clusters, either bottom-up (agglomerative) or top-down (divisive). The dendrogram allows intuitive selection of cluster count. No K is needed upfront, but O(n^3) complexity makes it unsuitable for large datasets.
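The merge tree behind the dendrogram is typically built with SciPy's linkage utilities. A minimal sketch, assuming X_scaled is a standardized feature matrix as in the other examples (random data stands in here so the snippet is self-contained):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(42)
X_scaled = rng.standard_normal((30, 2))  # stand-in for your scaled data

# Bottom-up (agglomerative) merge tree with Ward linkage
Z = linkage(X_scaled, method='ward')

# Cut the tree into at most 4 flat clusters (choosing K after the fact)
labels = fcluster(Z, t=4, criterion='maxclust')
print(f"Cluster sizes: {np.bincount(labels)[1:]}")

# For the dendrogram plot:
# from scipy.cluster.hierarchy import dendrogram; dendrogram(Z)
```

Because the whole tree is computed once, you can cut it at different heights to compare several cluster counts without refitting.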

Pros: No K required, dendrogram visualization, flexible distance metrics
Cons: High computational complexity, irreversible merges (agglomerative)
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

model = AgglomerativeClustering(n_clusters=4, linkage='ward')
labels = model.fit_predict(X_scaled)
print(f"Silhouette Score: {silhouette_score(X_scaled, labels):.4f}")

Dimensionality Reduction

PCA (Principal Component Analysis)

PCA projects data onto directions of maximum variance through orthogonal transformation, reducing feature dimensionality. It retains the most important information (components explaining the most variance) and is widely used for visualization, noise filtering, and speeding up downstream models. PCA is a linear method, best suited for data with linear structure.

Pros: Efficient, denoising, eliminates multicollinearity
Cons: Only captures linear structure, components are hard to interpret
from sklearn.decomposition import PCA
import numpy as np

pca = PCA(n_components=0.95)  # retain 95% of variance
X_reduced = pca.fit_transform(X_scaled)
print(f"Original dims: {X_scaled.shape[1]}")
print(f"Reduced dims: {X_reduced.shape[1]}")
print(f"Explained variance: {np.sum(pca.explained_variance_ratio_):.4f}")

t-SNE

t-SNE is a non-linear dimensionality reduction method, particularly strong at visualizing high-dimensional data in 2D or 3D. It preserves local similarity structure between data points. t-SNE is primarily used for exploratory data analysis and visualization -- it is not suitable for feature engineering or as a preprocessing step for downstream models.

Pros: Excellent visualizations, preserves local structure
Cons: Slow computation, non-deterministic, not for downstream modeling
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_2d = tsne.fit_transform(X_scaled)

# Typically followed by matplotlib visualization
# plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, cmap='tab10')

Algorithm Comparison Table

Algorithm           | Type           | Complexity    | Interpretability | Non-linear   | Best For
--------------------|----------------|---------------|------------------|--------------|---------------------
Logistic Regression | Classification | O(nd)         | High             | No           | Binary baseline
Decision Tree       | Classif/Regr   | O(nd log n)   | Very High        | Yes          | Rule extraction
Random Forest       | Classif/Regr   | O(knd log n)  | Medium           | Yes          | General purpose
SVM / SVR           | Classif/Regr   | O(n^2 ~ n^3)  | Low              | Yes (kernel) | Small-medium data
KNN                 | Classif/Regr   | O(nd) predict | High             | Yes          | Small data proto
Naive Bayes         | Classification | O(nd)         | Medium           | No           | Text classification
XGBoost/LGBM        | Classif/Regr   | O(knd log n)  | Medium           | Yes          | Tabular competitions
Linear Regression   | Regression     | O(nd^2)       | Very High        | No           | Regression baseline
Ridge / Lasso       | Regression     | O(nd^2)       | High             | No           | Regularized regr.
K-Means             | Clustering     | O(nkd)        | High             | No           | Customer segments
DBSCAN              | Clustering     | O(n log n)    | Medium           | Yes          | Anomaly detection
PCA                 | Dim. Reduction | O(nd^2)       | Low              | No           | Feature compression
t-SNE               | Dim. Reduction | O(n^2)        | Low              | Yes          | Data visualization

Evaluation Metrics

Classification Metrics

Accuracy: Correct predictions / Total samples. Accuracy = (TP + TN) / (TP + TN + FP + FN)
Best for balanced datasets.

Precision: True positives among predicted positives. Precision = TP / (TP + FP)
Use when false positive cost is high (e.g., spam filtering).

Recall: True positives among actual positives. Recall = TP / (TP + FN)
Use when false negative cost is high (e.g., disease diagnosis).

F1 Score: Harmonic mean of Precision and Recall. F1 = 2 * Precision * Recall / (Precision + Recall)
More meaningful than accuracy for imbalanced classes.

AUC-ROC: Area Under the ROC Curve, measuring classification ability across all thresholds. AUC = 1 is perfect, 0.5 is random
Threshold-independent, ideal for comparing models.

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, classification_report)

print(classification_report(y_test, y_pred))
y_prob = model.predict_proba(X_test)  # class probabilities, needed for AUC
print(f"AUC-ROC: {roc_auc_score(y_test, y_prob[:, 1]):.4f}")

Regression Metrics

MSE (Mean Squared Error): MSE = (1/n) * sum((y_true - y_pred)^2)
Penalizes large errors more heavily.

RMSE (Root Mean Squared Error): RMSE = sqrt(MSE)
Same unit as target variable, easier to interpret.

MAE (Mean Absolute Error): MAE = (1/n) * sum(|y_true - y_pred|)
Robust to outliers.

R-squared: R^2 = 1 - SS_res / SS_tot
Proportion of variance explained. 1.0 = perfect fit, 0 = predicting the mean.

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE: {mse:.4f} | RMSE: {rmse:.4f} | MAE: {mae:.4f} | R2: {r2:.4f}")

FAQ

Q: How do I choose between classification and regression?

If your target variable is a discrete category (e.g., "yes/no", "cat/dog/bird"), use classification. If it is a continuous number (e.g., "house price", "temperature"), use regression. While you can discretize continuous values for classification or encode categories as numbers for regression, this is generally not recommended.

Q: Random Forest vs. XGBoost -- which is better?

Neither is universally better. Random Forest is simpler, has fewer hyperparameters, and is more resistant to overfitting -- great for quick modeling. XGBoost/LightGBM typically achieves higher accuracy but requires more tuning. In practice, try both and compare via cross-validation. In Kaggle competitions, gradient boosting methods usually win.
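A minimal sketch of such a cross-validated comparison (sklearn's built-in GradientBoostingClassifier stands in for XGBoost here so the snippet runs without extra packages, and a synthetic dataset stands in for your data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data, stand-in for a real dataset
X, y = make_classification(n_samples=500, n_features=20, random_state=42)

candidates = [
    ("Random Forest", RandomForestClassifier(n_estimators=100, random_state=42)),
    ("Gradient Boosting", GradientBoostingClassifier(n_estimators=100, random_state=42)),
]
for name, model in candidates:
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.4f} +/- {scores.std():.4f}")
```

The same loop extends naturally to XGBClassifier or LGBMClassifier if those libraries are installed.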

Q: When do I need feature scaling (standardization/normalization)?

Distance-based algorithms (KNN, SVM, K-Means) and gradient-descent models (Logistic Regression, neural networks) need feature scaling. Tree-based models (Decision Tree, Random Forest, XGBoost) do not need scaling because they split on feature value ranks, not magnitudes.
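The effect can be demonstrated by feeding KNN features on wildly different scales, with and without standardization. A toy sketch on synthetic data (the exact scores depend on the generated dataset):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=5, random_state=0)
X[:, 0] *= 1000  # blow up one feature's scale so it dominates distances

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

raw = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr).score(X_te, y_te)
scaled = make_pipeline(StandardScaler(),
                       KNeighborsClassifier(n_neighbors=5)).fit(X_tr, y_tr).score(X_te, y_te)
print(f"KNN without scaling: {raw:.3f} | with scaling: {scaled:.3f}")
```

Putting the scaler inside a Pipeline also guarantees it is fit only on training folds, avoiding leakage during cross-validation.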

Q: Which algorithm should I use with small datasets?

For small datasets (<1000 samples): SVM (good generalization), Naive Bayes (effective with little data), KNN (simple and direct). Avoid deep learning and complex ensembles as they tend to overfit. Also use cross-validation instead of a simple train/test split for evaluation.
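On a small dataset, a single train/test split gives a noisy estimate; stratified K-fold cross-validation reuses every sample for both training and validation. A sketch on the 150-sample Iris set (SVM with scaling, per the recommendations above):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # 150 samples: a genuinely small dataset

model = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv)
print(f"5-fold accuracy: {scores.mean():.4f} +/- {scores.std():.4f}")
```

Reporting the mean and standard deviation across folds makes the uncertainty of the estimate visible, which a single split hides.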

Q: How do I handle class imbalance?

Common strategies: 1) Oversample the minority class (SMOTE); 2) Undersample the majority class; 3) Adjust class weights (class_weight='balanced'); 4) Use F1/AUC-ROC instead of accuracy as evaluation metric; 5) Use ensemble methods like BalancedRandomForest. The best strategy depends on data size and imbalance ratio.
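Strategy 3 is often the cheapest to try, since most sklearn classifiers accept class_weight='balanced'. A sketch on a synthetic 9:1 imbalanced set, reporting F1 rather than accuracy per strategy 4:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic data with a 9:1 class imbalance
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.9, 0.1], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          stratify=y, random_state=42)

plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
balanced = LogisticRegression(max_iter=1000,
                              class_weight='balanced').fit(X_tr, y_tr)

print(f"F1 plain:    {f1_score(y_te, plain.predict(X_te)):.4f}")
print(f"F1 balanced: {f1_score(y_te, balanced.predict(X_te)):.4f}")
```

Reweighting raises the cost of misclassifying the minority class, so the balanced model typically trades some precision for substantially higher recall.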