ML Algorithms Guide
What Is Machine Learning?
Machine Learning (ML) is a core branch of artificial intelligence that enables systems to learn patterns from data and make predictions or decisions without being explicitly programmed. ML is divided into three main paradigms: Supervised Learning (training with labeled data for classification and regression), Unsupervised Learning (discovering structure in unlabeled data via clustering and dimensionality reduction), and Reinforcement Learning (agents learning optimal policies through environment interaction). This guide focuses on the most widely used supervised and unsupervised algorithms, covering principles, code examples, pros/cons, and use cases to help you choose the right algorithm quickly.
Algorithm Selection Flowchart
Is your data labeled?
|-- Yes → Supervised Learning
|   |-- Need to predict a category? → Classification (Logistic Regression, Decision Tree, SVM, KNN...)
|   |-- Need to predict a number? → Regression (Linear Regression, Ridge, SVR...)
|-- No → Unsupervised Learning
    |-- Need to find groups? → Clustering (K-Means, DBSCAN...)
    |-- Need to reduce dimensions? → Dimensionality Reduction (PCA, t-SNE...)
Supervised Learning -- Classification
Logistic Regression
Despite its name, Logistic Regression is a classic binary classification algorithm. It maps a linear combination of features through the sigmoid function to produce a probability in [0,1]. The model is simple, highly interpretable, and trains extremely fast, making it an ideal baseline for classification tasks. With good feature engineering, it often delivers surprisingly strong results.
When to use: Binary classification baseline, credit scoring, ad click prediction, feature effect analysis.
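The sigmoid mapping described above takes only a few lines of plain Python. This is a sketch of the forward pass, not a trained model; the weights, bias, and feature values are made-up for illustration:

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    """Logistic regression forward pass: sigmoid of the linear combination."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Illustrative parameters: a score of exactly 0 maps to probability 0.5,
# positive scores map above 0.5, negative scores below.
p = predict_proba(weights=[1.5, -2.0], bias=0.5, features=[1.0, 0.8])
```

In a real model the weights would come from maximizing the log-likelihood on training data; the decision boundary is wherever the score z crosses 0.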
Decision Tree
Decision Trees recursively split data by feature values to build a tree structure, where each leaf node corresponds to a predicted class. They use information gain (ID3), gain ratio (C4.5), or Gini impurity (CART) to select optimal splits. Trees are intuitive and can be directly visualized, but they tend to overfit and typically need pruning or ensemble methods.
When to use: Explainable business decisions, feature importance analysis, rule extraction.
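As a sketch of the CART criterion mentioned above, Gini impurity and the weighted impurity of a candidate split can be computed like this (the helper names are illustrative, not from any library):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_gini(values, labels, threshold):
    """Weighted Gini impurity after splitting on value <= threshold.

    The tree builder would evaluate this for every candidate threshold
    and keep the split with the lowest weighted impurity.
    """
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
```

A 50/50 class mix gives the maximum binary impurity of 0.5; a pure node gives 0.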
Random Forest
Random Forest is the classic Bagging implementation: it trains multiple decision trees on random subsets of data and features, then aggregates predictions via majority voting. This effectively reduces variance and overfitting. Random Forest is one of the most widely used algorithms in practice -- it works well out of the box, is relatively insensitive to hyperparameters, and provides feature importance estimates.
When to use: General-purpose classification, feature selection, first algorithm to try without domain knowledge.
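The two mechanisms described above, bootstrap sampling and majority voting, can be sketched in isolation. These are illustrative helpers, not a full forest (the per-tree training is omitted):

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) rows with replacement -- the 'bagging' in Bagging.

    Each tree in the forest is trained on a different such sample.
    """
    return [rng.choice(data) for _ in range(len(data))]

def majority_vote(predictions):
    """Aggregate per-tree class predictions into the forest's prediction."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)
sample = bootstrap_sample([1, 2, 3, 4, 5], rng)
vote = majority_vote(['cat', 'dog', 'cat'])
```

Because each bootstrap sample omits roughly a third of the rows, those "out-of-bag" rows give a free validation estimate in real implementations.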
SVM (Support Vector Machine)
SVM finds the maximum-margin hyperplane that separates different classes. Using the kernel trick (RBF, polynomial), it can handle non-linearly separable problems in high-dimensional space. SVM excels on small-to-medium datasets with strong generalization, but scales poorly to large datasets and is sensitive to feature scaling.
When to use: Text classification, image classification, bioinformatics, small-to-medium high-dimensional data.
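The RBF kernel mentioned above scores similarity that decays with squared distance, which is what lets the SVM separate classes non-linearly without ever computing high-dimensional coordinates. A minimal sketch (gamma is the usual width parameter):

```python
import math

def rbf_kernel(x, y, gamma=1.0):
    """RBF (Gaussian) kernel: exp(-gamma * ||x - y||^2).

    Identical points score 1.0; the score decays toward 0 with distance.
    Larger gamma makes the decay sharper (more local decision boundaries).
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)
```

The sensitivity to feature scaling noted above is visible here: a feature with a large numeric range dominates sq_dist, which is why inputs should be standardized first.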
K-Nearest Neighbors (KNN)
KNN is the most intuitive classification algorithm: for a new sample, it finds the K closest training examples and assigns the majority class. KNN is a lazy learner -- no training phase, but prediction requires scanning all training data, making it slow on large datasets. The choice of distance metric and K value significantly affects results.
When to use: Recommendation prototypes, quick validation on small datasets, anomaly detection.
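The whole algorithm fits in a few lines, which also makes the "no training phase" point above concrete: all the work happens at prediction time. A sketch with made-up toy data:

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify by majority vote among the k nearest training points."""
    by_distance = sorted(
        zip(train_X, train_y), key=lambda pair: math.dist(pair[0], query)
    )
    votes = [label for _, label in by_distance[:k]]
    return Counter(votes).most_common(1)[0][0]

# Toy data: two tight groups; the query sits next to the 'a' group.
label = knn_predict(
    [(0, 0), (0, 1), (5, 5), (6, 5)], ['a', 'a', 'b', 'b'], (0, 0.5)
)
```

The full scan in `sorted` is exactly the scalability problem noted above; production systems replace it with approximate nearest-neighbor indexes.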
Naive Bayes
Naive Bayes is based on Bayes' theorem with the assumption of conditional independence between features. Despite this assumption rarely holding in reality, it performs remarkably well on text classification tasks (spam filtering, sentiment analysis). Training and prediction are extremely fast, and it handles high-dimensional sparse data well.
When to use: Text classification, spam filtering, sentiment analysis, real-time classification.
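The spam-filtering use case above can be sketched with word counts, Laplace smoothing, and log probabilities (to avoid underflow). The training documents and function names here are made-up for illustration:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Fit class priors and per-class word counts for Laplace smoothing."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for doc, label in zip(docs, labels):
        for word in doc.split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab

def predict_nb(model, doc):
    """Pick the class maximizing log prior + sum of smoothed log likelihoods."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best, best_score = None, float('-inf')
    for label, count in class_counts.items():
        score = math.log(count / total)
        denom = sum(word_counts[label].values()) + len(vocab)  # add-one smoothing
        for word in doc.split():
            score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best, best_score = label, score
    return best

docs = ["free money now", "win cash free", "meeting at noon", "lunch at noon"]
labels = ["spam", "spam", "ham", "ham"]
model = train_nb(docs, labels)
```

Summing independent per-word log likelihoods is exactly the conditional independence assumption described above.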
Gradient Boosting (XGBoost / LightGBM)
Gradient Boosting trains weak learners (typically decision trees) sequentially, with each new model fitting the residuals of the previous round. XGBoost and LightGBM are two highly optimized implementations, dominating Kaggle competitions and industry applications. XGBoost uses second-order Taylor expansion; LightGBM uses histogram-based splits and GOSS sampling. This is the top-performing algorithm family for tabular data.
When to use: Tabular data competitions, financial risk, recommendation ranking, CTR prediction.
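The "each new model fits the residuals" loop above can be sketched with depth-1 trees (stumps) on one feature. This is a toy illustration of the boosting mechanism, not of XGBoost's or LightGBM's actual optimizations:

```python
def fit_stump(xs, residuals):
    """Fit a depth-1 regression tree (a stump) to the current residuals."""
    best = None
    for t in sorted(set(xs))[:-1]:  # last value would leave an empty right side
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x, t=t, lm=lm, rm=rm: lm if x <= t else rm

def predict(base, stumps, lr, x):
    """Base prediction plus the shrunken sum of all stump corrections."""
    return base + lr * sum(s(x) for s in stumps)

def gradient_boost(xs, ys, n_rounds=20, lr=0.5):
    """Each round fits a stump to the residuals of the current ensemble."""
    base = sum(ys) / len(ys)  # initial prediction: the global mean
    stumps = []
    for _ in range(n_rounds):
        preds = [predict(base, stumps, lr, x) for x in xs]
        residuals = [y - p for y, p in zip(ys, preds)]
        stumps.append(fit_stump(xs, residuals))
    return base, stumps

base, stumps = gradient_boost([1, 2, 3, 4], [1.0, 1.0, 3.0, 3.0])
```

With squared-error loss the residuals are the negative gradients, which is where the name comes from; the learning rate lr shrinks each correction so later rounds refine rather than overshoot.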
Supervised Learning -- Regression
Linear Regression
Linear Regression fits a linear relationship between features and the target variable using ordinary least squares. It is the baseline model for regression problems -- simple, efficient, and highly interpretable. The model outputs coefficients for each feature, showing the direction and strength of their effect on the target.
When to use: Price prediction baseline, trend analysis, economics modeling.
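For a single feature, the ordinary least squares fit mentioned above has a closed form. A minimal sketch on made-up data that lies exactly on a line:

```python
def fit_ols(xs, ys):
    """Simple (one-feature) OLS: slope = cov(x, y) / var(x)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return slope, intercept

# The data lies on y = 2x + 1, so the fit recovers slope 2, intercept 1.
slope, intercept = fit_ols([0, 1, 2, 3], [1, 3, 5, 7])
```

The slope is the interpretable coefficient the text refers to: the expected change in the target per unit change in the feature.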
Ridge / Lasso Regression
Ridge (L2 regularization) and Lasso (L1 regularization) add penalty terms to Linear Regression to prevent overfitting. Ridge shrinks all coefficients but never sets them to zero, ideal for multicollinearity. Lasso can shrink coefficients to exactly zero, performing automatic feature selection. ElasticNet combines both approaches.
When to use: High-dimensional regression, feature selection, datasets with multicollinearity.
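The shrinkage effect of the L2 penalty is easiest to see in the one-feature case, where ridge also has a closed form: alpha is simply added to the denominator of the OLS slope. A sketch on the same kind of made-up data:

```python
def fit_ridge(xs, ys, alpha):
    """One-feature ridge regression: the L2 penalty shrinks the slope."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / (sxx + alpha)  # alpha = 0 recovers plain OLS
    return slope, my - slope * mx

ols_slope, _ = fit_ridge([0, 1, 2, 3], [1, 3, 5, 7], alpha=0.0)
ridge_slope, _ = fit_ridge([0, 1, 2, 3], [1, 3, 5, 7], alpha=5.0)
```

As the text notes, ridge shrinks the coefficient toward zero but never exactly to zero; Lasso's L1 penalty has no such closed form and needs an iterative solver.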
Decision Tree Regression
Decision Tree Regression uses the same recursive splitting strategy as classification trees, but outputs continuous values at leaf nodes (typically the mean of target values in that region). It captures non-linear relationships without feature scaling, but risks overfitting.
Random Forest Regression
Random Forest Regression ensembles multiple decision regression trees and averages their predictions. It inherits Random Forest's resistance to overfitting and is one of the most reliable choices for regression tasks, especially on medium-sized datasets.
SVR (Support Vector Regression)
SVR extends SVM to regression by defining an epsilon-insensitive loss function that ignores errors within a tube around the prediction. With kernel tricks it can fit complex non-linear relationships, performing well on small-to-medium datasets.
Unsupervised Learning -- Clustering
K-Means
K-Means is the most widely used clustering algorithm. It partitions data into K clusters by iteratively assigning samples to the nearest centroid and updating centroids. Simple and efficient, it works best with spherical cluster shapes. K must be specified in advance -- use the elbow method or silhouette score to choose.
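The assign-then-update loop described above (Lloyd's algorithm) can be sketched in plain Python. The initial centroids here are hand-picked for illustration; real implementations use smarter initialization such as k-means++:

```python
import math

def kmeans(points, centroids, n_iters=10):
    """Lloyd's algorithm: assign points to the nearest centroid,
    then move each centroid to the mean of its assigned points."""
    for _ in range(n_iters):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Recompute each centroid; keep the old one if its cluster emptied.
        centroids = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids, clusters

# Two obvious made-up groups around (0, 0) and (10, 10).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, [(0, 0), (10, 10)])
```

Euclidean distance to a mean is what bakes in the spherical-cluster assumption the text mentions.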
DBSCAN
DBSCAN (Density-Based Spatial Clustering) identifies clusters by defining core points, border points, and noise points. It automatically discovers clusters of arbitrary shape without requiring K, and naturally identifies outliers. The parameters eps (neighborhood radius) and min_samples heavily influence results.
Hierarchical Clustering
Hierarchical Clustering builds a tree-like structure (dendrogram) of clusters, either bottom-up (agglomerative) or top-down (divisive). The dendrogram allows intuitive selection of cluster count. No K is needed upfront, but O(n^3) complexity makes it unsuitable for large datasets.
Dimensionality Reduction
PCA (Principal Component Analysis)
PCA projects data onto directions of maximum variance through orthogonal transformation, reducing feature dimensionality. It retains the most important information (components explaining the most variance) and is widely used for visualization, noise filtering, and speeding up downstream models. PCA is a linear method, best suited for data with linear structure.
t-SNE
t-SNE is a non-linear dimensionality reduction method, particularly strong at visualizing high-dimensional data in 2D or 3D. It preserves local similarity structure between data points. t-SNE is primarily used for exploratory data analysis and visualization -- it is not suitable for feature engineering or as a preprocessing step for downstream models.
Algorithm Comparison Table
| Algorithm | Type | Complexity | Interpretability | Non-linear | Best For |
|---|---|---|---|---|---|
| Logistic Regression | Classification | O(nd) | High | No | Binary baseline |
| Decision Tree | Classif/Regr | O(nd log n) | Very High | Yes | Rule extraction |
| Random Forest | Classif/Regr | O(knd log n) | Medium | Yes | General purpose |
| SVM / SVR | Classif/Regr | O(n^2 ~ n^3) | Low | Yes (kernel) | Small-medium data |
| KNN | Classif/Regr | O(nd) predict | High | Yes | Small data proto |
| Naive Bayes | Classification | O(nd) | Medium | No | Text classification |
| XGBoost/LGBM | Classif/Regr | O(knd log n) | Medium | Yes | Tabular competitions |
| Linear Regression | Regression | O(nd^2) | Very High | No | Regression baseline |
| Ridge / Lasso | Regression | O(nd^2) | High | No | Regularized regr. |
| K-Means | Clustering | O(nkd) | High | No | Customer segments |
| DBSCAN | Clustering | O(n log n) | Medium | Yes | Anomaly detection |
| PCA | Dim. Reduction | O(nd^2) | Low | No | Feature compression |
| t-SNE | Dim. Reduction | O(n^2) | Low | Yes | Data visualization |
Evaluation Metrics
Classification Metrics
Accuracy: Correct predictions / Total samples. Accuracy = (TP + TN) / (TP + TN + FP + FN)
Best for balanced datasets.
Precision: True positives among predicted positives. Precision = TP / (TP + FP)
Use when false positive cost is high (e.g., spam filtering).
Recall: True positives among actual positives. Recall = TP / (TP + FN)
Use when false negative cost is high (e.g., disease diagnosis).
F1 Score: Harmonic mean of Precision and Recall. F1 = 2 * Precision * Recall / (Precision + Recall)
More meaningful than accuracy for imbalanced classes.
AUC-ROC: Area Under the ROC Curve, measuring classification ability across all thresholds. AUC = 1 is perfect, 0.5 is random
Threshold-independent, ideal for comparing models.
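The four count-based metrics above derive from the same confusion-matrix entries, so they can be computed together. A sketch with made-up counts (note it divides by zero if a class is never predicted, which real libraries guard against):

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Illustrative imbalanced case: 90 negatives, 18 positives.
# Accuracy looks fine (0.88) while recall is poor (8/18), which is
# exactly why F1 is preferred on imbalanced classes.
acc, prec, rec, f1 = classification_metrics(tp=8, tn=80, fp=2, fn=10)
```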
Regression Metrics
MSE (Mean Squared Error): MSE = (1/n) * sum((y_true - y_pred)^2)
Penalizes large errors more heavily.
RMSE (Root Mean Squared Error): RMSE = sqrt(MSE)
Same unit as target variable, easier to interpret.
MAE (Mean Absolute Error): MAE = (1/n) * sum(|y_true - y_pred|)
Robust to outliers.
R-squared: R^2 = 1 - SS_res / SS_tot
Proportion of variance explained. 1.0 = perfect fit, 0 = predicting the mean.
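The four regression formulas above translate directly to code. A sketch on made-up predictions:

```python
import math

def regression_metrics(y_true, y_pred):
    """MSE, RMSE, MAE, and R-squared for paired predictions."""
    n = len(y_true)
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
    rmse = math.sqrt(mse)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    mean = sum(y_true) / n
    ss_tot = sum((t - mean) ** 2 for t in y_true)  # variance of the target
    r2 = 1 - mse * n / ss_tot                      # 1 - SS_res / SS_tot
    return mse, rmse, mae, r2

# Every prediction is off by 0.5, so MSE = 0.25, RMSE = MAE = 0.5.
mse, rmse, mae, r2 = regression_metrics([1, 2, 3, 4], [1.5, 2.5, 2.5, 3.5])
```

Note MSE and MAE agree here only because every error has the same magnitude; a single large error would inflate MSE much more than MAE, which is the outlier sensitivity the text describes.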
FAQ
How do I choose between classification and regression?
If your target variable is a discrete category (e.g., "yes/no", "cat/dog/bird"), use classification. If it is a continuous number (e.g., "house price", "temperature"), use regression. While you can discretize continuous values for classification or encode categories as numbers for regression, this is generally not recommended.
Should I use Random Forest or XGBoost?
Neither is universally better. Random Forest is simpler, has fewer hyperparameters, and is more resistant to overfitting -- great for quick modeling. XGBoost/LightGBM typically achieves higher accuracy but requires more tuning. In practice, try both and compare via cross-validation. In Kaggle competitions, gradient boosting methods usually win.
Which algorithms require feature scaling?
Distance-based algorithms (KNN, SVM, K-Means) and gradient-descent models (Logistic Regression, neural networks) need feature scaling. Tree-based models (Decision Tree, Random Forest, XGBoost) do not, because their splits depend only on the ordering of feature values, not their scale.
Which algorithm should I use for a small dataset?
For small datasets (<1000 samples): SVM (good generalization), Naive Bayes (effective with little data), KNN (simple and direct). Avoid deep learning and complex ensembles as they tend to overfit. Also use cross-validation instead of a simple train/test split for evaluation.
How do I handle imbalanced classes?
Common strategies: 1) Oversample the minority class (SMOTE); 2) Undersample the majority class; 3) Adjust class weights (class_weight='balanced'); 4) Use F1/AUC-ROC instead of accuracy as evaluation metric; 5) Use ensemble methods like BalancedRandomForest. The best strategy depends on data size and imbalance ratio.