ML Algorithms Guide
What Is Machine Learning?
Machine Learning (ML) is a core branch of artificial intelligence that enables systems to learn patterns from data and make predictions or decisions without being explicitly programmed. ML is divided into three main paradigms: Supervised Learning (training with labeled data for classification and regression), Unsupervised Learning (discovering structure in unlabeled data via clustering and dimensionality reduction), and Reinforcement Learning (agents learning optimal policies through environment interaction). This guide focuses on the most widely used supervised and unsupervised algorithms, covering principles, code examples, pros/cons, and use cases to help you choose the right algorithm quickly.
Algorithm Selection Flowchart
Is your data labeled?
|-- Yes → Supervised Learning
|   |-- Need to predict a category? → Classification (Logistic Regression, Decision Tree, SVM, KNN...)
|   |-- Need to predict a number? → Regression (Linear Regression, Ridge, SVR...)
|-- No → Unsupervised Learning
    |-- Need to find groups? → Clustering (K-Means, DBSCAN...)
    |-- Need to reduce dimensions? → Dimensionality Reduction (PCA, t-SNE...)
Supervised Learning -- Classification
Logistic Regression
Despite its name, Logistic Regression is a classic binary classification algorithm. It maps a linear combination of features through the sigmoid function to produce a probability in [0,1]. The model is simple, highly interpretable, and trains extremely fast, making it an ideal baseline for classification tasks. With good feature engineering, it often delivers surprisingly strong results.
When to use: Binary classification baseline, credit scoring, ad click prediction, feature effect analysis.
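The sigmoid mapping described above takes only a few lines of plain Python. This is a sketch of the forward pass, not a trained model; the weights, bias, and feature values are made-up for illustration:

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    """Logistic regression forward pass: sigmoid of the linear combination."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Illustrative parameters: a score of exactly 0 maps to probability 0.5,
# positive scores map above 0.5, negative scores below.
p = predict_proba(weights=[1.5, -2.0], bias=0.5, features=[1.0, 0.8])
```

In a real model the weights would come from maximizing the log-likelihood on training data; the decision boundary is wherever the score z crosses 0.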
Decision Tree
Decision Trees recursively split data by feature values to build a tree structure, where each leaf node corresponds to a predicted class. They use information gain (ID3), gain ratio (C4.5), or Gini impurity (CART) to select optimal splits. Trees are intuitive and can be directly visualized, but they tend to overfit and typically need pruning or ensemble methods.
When to use: Explainable business decisions, feature importance analysis, rule extraction.
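As a sketch of the CART criterion mentioned above, Gini impurity and the weighted impurity of a candidate split can be computed like this (the helper names are illustrative, not from any library):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_gini(values, labels, threshold):
    """Weighted Gini impurity after splitting on value <= threshold.

    The tree builder would evaluate this for every candidate threshold
    and keep the split with the lowest weighted impurity.
    """
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
```

A 50/50 class mix gives the maximum binary impurity of 0.5; a pure node gives 0.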
Random Forest
Random Forest is the classic Bagging implementation: it trains multiple decision trees on random subsets of data and features, then aggregates predictions via majority voting. This effectively reduces variance and overfitting. Random Forest is one of the most widely used algorithms in practice -- it works well out of the box, is relatively insensitive to hyperparameters, and provides feature importance estimates.
When to use: General-purpose classification, feature selection, first algorithm to try without domain knowledge.
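The two mechanisms described above, bootstrap sampling and majority voting, can be sketched in isolation. These are illustrative helpers, not a full forest (the per-tree training is omitted):

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) rows with replacement -- the 'bagging' in Bagging.

    Each tree in the forest is trained on a different such sample.
    """
    return [rng.choice(data) for _ in range(len(data))]

def majority_vote(predictions):
    """Aggregate per-tree class predictions into the forest's prediction."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)
sample = bootstrap_sample([1, 2, 3, 4, 5], rng)
vote = majority_vote(['cat', 'dog', 'cat'])
```

Because each bootstrap sample omits roughly a third of the rows, those "out-of-bag" rows give a free validation estimate in real implementations.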
SVM (Support Vector Machine)
SVM finds the maximum-margin hyperplane that separates different classes. Using the kernel trick (RBF, polynomial), it can handle non-linearly separable problems in high-dimensional space. SVM excels on small-to-medium datasets with strong generalization, but scales poorly to large datasets and is sensitive to feature scaling.
When to use: Text classification, image classification, bioinformatics, small-to-medium high-dimensional data.
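The RBF kernel mentioned above scores similarity that decays with squared distance, which is what lets the SVM separate classes non-linearly without ever computing high-dimensional coordinates. A minimal sketch (gamma is the usual width parameter):

```python
import math

def rbf_kernel(x, y, gamma=1.0):
    """RBF (Gaussian) kernel: exp(-gamma * ||x - y||^2).

    Identical points score 1.0; the score decays toward 0 with distance.
    Larger gamma makes the decay sharper (more local decision boundaries).
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)
```

The sensitivity to feature scaling noted above is visible here: a feature with a large numeric range dominates sq_dist, which is why inputs should be standardized first.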
K-Nearest Neighbors (KNN)
KNN is the most intuitive classification algorithm: for a new sample, it finds the K closest training examples and assigns the majority class. KNN is a lazy learner -- no training phase, but prediction requires scanning all training data, making it slow on large datasets. The choice of distance metric and K value significantly affects results.
When to use: Recommendation prototypes, quick validation on small datasets, anomaly detection.
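The whole algorithm fits in a few lines, which also makes the "no training phase" point above concrete: all the work happens at prediction time. A sketch with made-up toy data:

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify by majority vote among the k nearest training points."""
    by_distance = sorted(
        zip(train_X, train_y), key=lambda pair: math.dist(pair[0], query)
    )
    votes = [label for _, label in by_distance[:k]]
    return Counter(votes).most_common(1)[0][0]

# Toy data: two tight groups; the query sits next to the 'a' group.
label = knn_predict(
    [(0, 0), (0, 1), (5, 5), (6, 5)], ['a', 'a', 'b', 'b'], (0, 0.5)
)
```

The full scan in `sorted` is exactly the scalability problem noted above; production systems replace it with approximate nearest-neighbor indexes.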
Naive Bayes
Naive Bayes is based on Bayes' theorem with the assumption of conditional independence between features. Despite this assumption rarely holding in reality, it performs remarkably well on text classification tasks (spam filtering, sentiment analysis). Training and prediction are extremely fast, and it handles high-dimensional sparse data well.
When to use: Text classification, spam filtering, sentiment analysis, real-time classification.
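The spam-filtering use case above can be sketched with word counts, Laplace smoothing, and log probabilities (to avoid underflow). The training documents and function names here are made-up for illustration:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Fit class priors and per-class word counts for Laplace smoothing."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for doc, label in zip(docs, labels):
        for word in doc.split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab

def predict_nb(model, doc):
    """Pick the class maximizing log prior + sum of smoothed log likelihoods."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best, best_score = None, float('-inf')
    for label, count in class_counts.items():
        score = math.log(count / total)
        denom = sum(word_counts[label].values()) + len(vocab)  # add-one smoothing
        for word in doc.split():
            score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best, best_score = label, score
    return best

docs = ["free money now", "win cash free", "meeting at noon", "lunch at noon"]
labels = ["spam", "spam", "ham", "ham"]
model = train_nb(docs, labels)
```

Summing independent per-word log likelihoods is exactly the conditional independence assumption described above.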
Gradient Boosting (XGBoost / LightGBM)
Gradient Boosting trains weak learners (typically decision trees) sequentially, with each new model fitting the residuals of the previous round. XGBoost and LightGBM are two highly optimized implementations, dominating Kaggle competitions and industry applications. XGBoost uses second-order Taylor expansion; LightGBM uses histogram-based splits and GOSS sampling. This is the top-performing algorithm family for tabular data.
When to use: Tabular data competitions, financial risk, recommendation ranking, CTR prediction.
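The "each new model fits the residuals" loop above can be sketched with depth-1 trees (stumps) on one feature. This is a toy illustration of the boosting mechanism, not of XGBoost's or LightGBM's actual optimizations:

```python
def fit_stump(xs, residuals):
    """Fit a depth-1 regression tree (a stump) to the current residuals."""
    best = None
    for t in sorted(set(xs))[:-1]:  # last value would leave an empty right side
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x, t=t, lm=lm, rm=rm: lm if x <= t else rm

def predict(base, stumps, lr, x):
    """Base prediction plus the shrunken sum of all stump corrections."""
    return base + lr * sum(s(x) for s in stumps)

def gradient_boost(xs, ys, n_rounds=20, lr=0.5):
    """Each round fits a stump to the residuals of the current ensemble."""
    base = sum(ys) / len(ys)  # initial prediction: the global mean
    stumps = []
    for _ in range(n_rounds):
        preds = [predict(base, stumps, lr, x) for x in xs]
        residuals = [y - p for y, p in zip(ys, preds)]
        stumps.append(fit_stump(xs, residuals))
    return base, stumps

base, stumps = gradient_boost([1, 2, 3, 4], [1.0, 1.0, 3.0, 3.0])
```

With squared-error loss the residuals are the negative gradients, which is where the name comes from; the learning rate lr shrinks each correction so later rounds refine rather than overshoot.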
Supervised Learning -- Regression
Linear Regression
Linear Regression fits a linear relationship between features and the target variable using ordinary least squares. It is the baseline model for regression problems -- simple, efficient, and highly interpretable. The model outputs coefficients for each feature, showing the direction and strength of their effect on the target.
When to use: Price prediction baseline, trend analysis, economics modeling.
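For a single feature, the ordinary least squares fit mentioned above has a closed form. A minimal sketch on made-up data that lies exactly on a line:

```python
def fit_ols(xs, ys):
    """Simple (one-feature) OLS: slope = cov(x, y) / var(x)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return slope, intercept

# The data lies on y = 2x + 1, so the fit recovers slope 2, intercept 1.
slope, intercept = fit_ols([0, 1, 2, 3], [1, 3, 5, 7])
```

The slope is the interpretable coefficient the text refers to: the expected change in the target per unit change in the feature.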
Ridge / Lasso Regression
Ridge (L2 regularization) and Lasso (L1 regularization) add penalty terms to Linear Regression to prevent overfitting. Ridge shrinks all coefficients but never sets them to zero, ideal for multicollinearity. Lasso can shrink coefficients to exactly zero, performing automatic feature selection. ElasticNet combines both approaches.
When to use: High-dimensional regression, feature selection, datasets with multicollinearity.
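The shrinkage effect of the L2 penalty is easiest to see in the one-feature case, where ridge also has a closed form: alpha is simply added to the denominator of the OLS slope. A sketch on the same kind of made-up data:

```python
def fit_ridge(xs, ys, alpha):
    """One-feature ridge regression: the L2 penalty shrinks the slope."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / (sxx + alpha)  # alpha = 0 recovers plain OLS
    return slope, my - slope * mx

ols_slope, _ = fit_ridge([0, 1, 2, 3], [1, 3, 5, 7], alpha=0.0)
ridge_slope, _ = fit_ridge([0, 1, 2, 3], [1, 3, 5, 7], alpha=5.0)
```

As the text notes, ridge shrinks the coefficient toward zero but never exactly to zero; Lasso's L1 penalty has no such closed form and needs an iterative solver.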
Decision Tree Regression
Decision Tree Regression uses the same recursive splitting strategy as classification trees, but outputs continuous values at leaf nodes (typically the mean of target values in that region). It captures non-linear relationships without feature scaling, but risks overfitting.
Random Forest Regression
Random Forest Regression ensembles multiple decision regression trees and averages their predictions. It inherits Random Forest's resistance to overfitting and is one of the most reliable choices for regression tasks, especially on medium-sized datasets.
SVR (Support Vector Regression)
SVR extends SVM to regression by defining an epsilon-insensitive loss function that ignores errors within a tube around the prediction. With kernel tricks it can fit complex non-linear relationships, performing well on small-to-medium datasets.
Unsupervised Learning -- Clustering
K-Means
K-Means is the most widely used clustering algorithm. It partitions data into K clusters by iteratively assigning samples to the nearest centroid and updating centroids. Simple and efficient, it works best with spherical cluster shapes. K must be specified in advance -- use the elbow method or silhouette score to choose.
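The assign-then-update loop described above (Lloyd's algorithm) can be sketched in plain Python. The initial centroids here are hand-picked for illustration; real implementations use smarter initialization such as k-means++:

```python
import math

def kmeans(points, centroids, n_iters=10):
    """Lloyd's algorithm: assign points to the nearest centroid,
    then move each centroid to the mean of its assigned points."""
    for _ in range(n_iters):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Recompute each centroid; keep the old one if its cluster emptied.
        centroids = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids, clusters

# Two obvious made-up groups around (0, 0) and (10, 10).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, [(0, 0), (10, 10)])
```

Euclidean distance to a mean is what bakes in the spherical-cluster assumption the text mentions.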
DBSCAN
DBSCAN (Density-Based Spatial Clustering) identifies clusters by defining core points, border points, and noise points. It automatically discovers clusters of arbitrary shape without requiring K, and naturally identifies outliers. The parameters eps (neighborhood radius) and min_samples heavily influence results.
Hierarchical Clustering
Hierarchical Clustering builds a tree-like structure (dendrogram) of clusters, either bottom-up (agglomerative) or top-down (divisive). The dendrogram allows intuitive selection of cluster count. No K is needed upfront, but O(n^3) complexity makes it unsuitable for large datasets.
Dimensionality Reduction
PCA (Principal Component Analysis)
PCA projects data onto directions of maximum variance through orthogonal transformation, reducing feature dimensionality. It retains the most important information (components explaining the most variance) and is widely used for visualization, noise filtering, and speeding up downstream models. PCA is a linear method, best suited for data with linear structure.
t-SNE
t-SNE is a non-linear dimensionality reduction method, particularly strong at visualizing high-dimensional data in 2D or 3D. It preserves local similarity structure between data points. t-SNE is primarily used for exploratory data analysis and visualization -- it is not suitable for feature engineering or as a preprocessing step for downstream models.
Algorithm Comparison Table
| Algorithm | Type | Complexity | Interpretability | Non-linear | Best For |
|---|---|---|---|---|---|
| Logistic Regression | Classification | O(nd) | High | No | Binary baseline |
| Decision Tree | Classif/Regr | O(nd log n) | Very High | Yes | Rule extraction |
| Random Forest | Classif/Regr | O(knd log n) | Medium | Yes | General purpose |
| SVM / SVR | Classif/Regr | O(n^2 ~ n^3) | Low | Yes (kernel) | Small-medium data |
| KNN | Classif/Regr | O(nd) predict | High | Yes | Small data proto |
| Naive Bayes | Classification | O(nd) | Medium | No | Text classification |
| XGBoost/LGBM | Classif/Regr | O(knd log n) | Medium | Yes | Tabular competitions |
| Linear Regression | Regression | O(nd^2) | Very High | No | Regression baseline |
| Ridge / Lasso | Regression | O(nd^2) | High | No | Regularized regr. |
| K-Means | Clustering | O(nkd) | High | No | Customer segments |
| DBSCAN | Clustering | O(n log n) | Medium | Yes | Anomaly detection |
| PCA | Dim. Reduction | O(nd^2) | Low | No | Feature compression |
| t-SNE | Dim. Reduction | O(n^2) | Low | Yes | Data visualization |
Evaluation Metrics
Classification Metrics
Accuracy: Correct predictions / Total samples. Accuracy = (TP + TN) / (TP + TN + FP + FN)
Best for balanced datasets.
Precision: True positives among predicted positives. Precision = TP / (TP + FP)
Use when false positive cost is high (e.g., spam filtering).
Recall: True positives among actual positives. Recall = TP / (TP + FN)
Use when false negative cost is high (e.g., disease diagnosis).
F1 Score: Harmonic mean of Precision and Recall. F1 = 2 * Precision * Recall / (Precision + Recall)
More meaningful than accuracy for imbalanced classes.
AUC-ROC: Area Under the ROC Curve, measuring classification ability across all thresholds. AUC = 1 is perfect, 0.5 is random
Threshold-independent, ideal for comparing models.
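The four count-based metrics above derive from the same confusion-matrix entries, so they can be computed together. A sketch with made-up counts (note it divides by zero if a class is never predicted, which real libraries guard against):

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Illustrative imbalanced case: 90 negatives, 18 positives.
# Accuracy looks fine (0.88) while recall is poor (8/18), which is
# exactly why F1 is preferred on imbalanced classes.
acc, prec, rec, f1 = classification_metrics(tp=8, tn=80, fp=2, fn=10)
```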
Regression Metrics
MSE (Mean Squared Error): MSE = (1/n) * sum((y_true - y_pred)^2)
Penalizes large errors more heavily.
RMSE (Root Mean Squared Error): RMSE = sqrt(MSE)
Same unit as target variable, easier to interpret.
MAE (Mean Absolute Error): MAE = (1/n) * sum(|y_true - y_pred|)
Robust to outliers.
R-squared: R^2 = 1 - SS_res / SS_tot
Proportion of variance explained. 1.0 = perfect fit, 0 = predicting the mean.
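The four regression formulas above translate directly to code. A sketch on made-up predictions:

```python
import math

def regression_metrics(y_true, y_pred):
    """MSE, RMSE, MAE, and R-squared for paired predictions."""
    n = len(y_true)
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
    rmse = math.sqrt(mse)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    mean = sum(y_true) / n
    ss_tot = sum((t - mean) ** 2 for t in y_true)  # variance of the target
    r2 = 1 - mse * n / ss_tot                      # 1 - SS_res / SS_tot
    return mse, rmse, mae, r2

# Every prediction is off by 0.5, so MSE = 0.25, RMSE = MAE = 0.5.
mse, rmse, mae, r2 = regression_metrics([1, 2, 3, 4], [1.5, 2.5, 2.5, 3.5])
```

Note MSE and MAE agree here only because every error has the same magnitude; a single large error would inflate MSE much more than MAE, which is the outlier sensitivity the text describes.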
FAQ
How do I choose between classification and regression?
If your target variable is a discrete category (e.g., "yes/no", "cat/dog/bird"), use classification. If it is a continuous number (e.g., "house price", "temperature"), use regression. While you can discretize continuous values for classification or encode categories as numbers for regression, this is generally not recommended.
Should I use Random Forest or XGBoost?
Neither is universally better. Random Forest is simpler, has fewer hyperparameters, and is more resistant to overfitting -- great for quick modeling. XGBoost/LightGBM typically achieves higher accuracy but requires more tuning. In practice, try both and compare via cross-validation. In Kaggle competitions, gradient boosting methods usually win.
Which algorithms require feature scaling?
Distance-based algorithms (KNN, SVM, K-Means) and gradient-descent models (Logistic Regression, neural networks) need feature scaling. Tree-based models (Decision Tree, Random Forest, XGBoost) do not, because their splits depend only on the ordering of feature values, not their scale.
Which algorithm should I use for a small dataset?
For small datasets (<1000 samples): SVM (good generalization), Naive Bayes (effective with little data), KNN (simple and direct). Avoid deep learning and complex ensembles as they tend to overfit. Also use cross-validation instead of a simple train/test split for evaluation.
How do I handle imbalanced classes?
Common strategies: 1) Oversample the minority class (SMOTE); 2) Undersample the majority class; 3) Adjust class weights (class_weight='balanced'); 4) Use F1/AUC-ROC instead of accuracy as evaluation metric; 5) Use ensemble methods like BalancedRandomForest. The best strategy depends on data size and imbalance ratio.