PyTorch Cheat Sheet

Introduction

PyTorch is an open-source deep learning framework developed by Facebook AI Research (FAIR). Its core design philosophy is the dynamic computation graph (define-by-run), which allows flexible modification of network structure at runtime, greatly simplifying debugging and experimentation. PyTorch has become one of the most popular deep learning frameworks in both academic research and industrial deployment, widely used in computer vision, NLP, and reinforcement learning.

Installation

pip (CPU)

pip install torch torchvision torchaudio

pip (CUDA 12.1)

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

conda

conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia

Verify Installation

import torch
print(torch.__version__)
print(torch.cuda.is_available())

Tensors

Creating Tensors

import torch

x = torch.tensor([1, 2, 3])              # From list
z = torch.zeros(3, 4)                     # All zeros
o = torch.ones(2, 3)                      # All ones
r = torch.rand(2, 3)                      # Uniform [0,1)
n = torch.randn(2, 3)                     # Standard normal
a = torch.arange(0, 10, 2)                # Range: [0,2,4,6,8]
l = torch.linspace(0, 1, steps=5)         # Evenly spaced
e = torch.eye(3)                           # Identity matrix
f = torch.full((2, 3), 7)                 # Fill with constant
t = torch.empty(3, 3)                      # Uninitialized

Tensor Operations

x = torch.randn(2, 3, 4)

x.reshape(6, 4)           # Change shape
x.view(6, 4)              # Change shape (requires contiguous memory)
x.squeeze()                # Remove dims of size 1
x.unsqueeze(0)             # Add dim at position 0
x.permute(2, 0, 1)         # Reorder dimensions
x.transpose(0, 1)          # Transpose two dims
x.contiguous()             # Make contiguous in memory
x.flatten()                # Flatten to 1D

# Concatenation
torch.cat([a, b], dim=0)   # Concatenate along dim
torch.stack([a, b], dim=0)  # Stack along new dim

Math Operations

a = torch.randn(3, 4)
b = torch.randn(4, 5)

torch.matmul(a, b)         # Matrix multiplication  (3,5)
a @ b                       # Same as above
a.sum()                     # Sum all elements
a.sum(dim=1)                # Sum along dim 1
a.mean()                    # Mean
a.max()                     # Max value
a.min()                     # Min value
a.argmax(dim=1)             # Index of max
a.clamp(min=0)              # Clamp (ReLU effect)
torch.abs(a)                # Absolute value
torch.sqrt(a.abs())         # Square root

GPU Transfer & Gradients

# GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
x = torch.randn(3, 3).to(device)   # Move to GPU
x_cpu = x.cpu()                      # Move back to CPU
x_np = x.cpu().numpy()               # Convert to NumPy

# Gradient Tracking
x = torch.randn(3, requires_grad=True)
y = (x ** 2).sum()
y.backward()                          # Backpropagation
print(x.grad)                         # dy/dx = 2x

with torch.no_grad():                 # Disable gradients (inference)
    pred = model(input_data)

Neural Network Modules (nn.Module)

Common Layers

import torch.nn as nn

nn.Linear(in_features=784, out_features=256)  # Fully connected
nn.Conv2d(in_channels=3, out_channels=16,
          kernel_size=3, stride=1, padding=1)  # 2D convolution
nn.LSTM(input_size=128, hidden_size=256,
        num_layers=2, batch_first=True)        # LSTM
nn.Embedding(num_embeddings=10000,
             embedding_dim=300)                # Word embedding
nn.ConvTranspose2d(16, 3, kernel_size=3)       # Transposed conv

Activations & Regularization

nn.ReLU()                    # max(0, x)
nn.Sigmoid()                 # 1 / (1 + exp(-x))
nn.Tanh()                    # tanh(x)
nn.Softmax(dim=1)            # Softmax normalization
nn.LeakyReLU(0.01)           # Leaky ReLU
nn.GELU()                    # Gaussian Error Linear Unit

nn.Dropout(p=0.5)            # Random dropout during training
nn.BatchNorm2d(num_features=16)  # Batch normalization
nn.LayerNorm(normalized_shape=256)  # Layer normalization

nn.Sequential

model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(0.2),
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Linear(128, 10)
)

Custom Model (nn.Module)

class MyModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.3)
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.dropout(x)
        x = self.fc2(x)
        return x

model = MyModel(784, 256, 10)
print(model)                     # View architecture
sum(p.numel() for p in model.parameters())  # Total params

Training Loop

DataLoader Setup

from torch.utils.data import DataLoader, TensorDataset
import torchvision.transforms as transforms
from torchvision.datasets import MNIST

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])

train_dataset = MNIST(root='./data', train=True,
                      download=True, transform=transform)
train_loader = DataLoader(train_dataset, batch_size=64,
                          shuffle=True, num_workers=4)

# Custom dataset
class MyDataset(torch.utils.data.Dataset):
    def __init__(self, X, y):
        self.X = X
        self.y = y
    def __len__(self):
        return len(self.X)
    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]
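A quick sanity check of the custom Dataset above, using random tensors as stand-in data (the class is repeated here so the snippet is self-contained):

```python
import torch
from torch.utils.data import DataLoader, Dataset

class MyDataset(Dataset):
    def __init__(self, X, y):
        self.X = X
        self.y = y
    def __len__(self):
        return len(self.X)
    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

# Stand-in data: 100 samples, 784 features, 10 classes
X = torch.randn(100, 784)
y = torch.randint(0, 10, (100,))

loader = DataLoader(MyDataset(X, y), batch_size=32, shuffle=True)
xb, yb = next(iter(loader))   # xb: (32, 784), yb: (32,)
```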

Complete Training Loop

model = MyModel(784, 256, 10).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

num_epochs = 10
for epoch in range(num_epochs):
    model.train()
    running_loss = 0.0
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        data = data.view(data.size(0), -1)   # flatten

        optimizer.zero_grad()                 # Zero gradients
        output = model(data)                  # Forward pass
        loss = criterion(output, target)      # Compute loss
        loss.backward()                       # Backward pass
        optimizer.step()                      # Update params

        running_loss += loss.item()
    print(f"Epoch {epoch+1}/{num_epochs}, "
          f"Loss: {running_loss/len(train_loader):.4f}")

Validation Loop

model.eval()
correct = 0
total = 0
with torch.no_grad():
    for data, target in val_loader:
        data, target = data.to(device), target.to(device)
        data = data.view(data.size(0), -1)
        output = model(data)
        _, predicted = torch.max(output, 1)
        total += target.size(0)
        correct += (predicted == target).sum().item()

accuracy = 100 * correct / total
print(f"Validation Accuracy: {accuracy:.2f}%")

Save & Load Model

# Save (recommended: state_dict only)
torch.save(model.state_dict(), 'model_weights.pth')

# Load
model = MyModel(784, 256, 10)
model.load_state_dict(torch.load('model_weights.pth', map_location=device))
model.eval()

# Save entire model (architecture + weights; fragile across code changes)
torch.save(model, 'full_model.pth')
model = torch.load('full_model.pth', weights_only=False)  # required in PyTorch >= 2.6

# Save checkpoint (resume training)
torch.save({
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': loss,
}, 'checkpoint.pth')
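Resuming from such a checkpoint is the mirror image: restore the model weights, the optimizer state, and the epoch counter. A self-contained round trip, using a stand-in nn.Linear model:

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)                      # stand-in model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Save a checkpoint (same keys as above)
torch.save({
    'epoch': 3,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': 0.42,
}, 'checkpoint.pth')

# Resume: restore weights, optimizer state, and epoch counter
checkpoint = torch.load('checkpoint.pth')
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
start_epoch = checkpoint['epoch'] + 1        # continue from the next epoch
```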

Common Patterns

Image Classification (CNN)

class CNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, 256),   # assumes 32x32 input (halved twice by pooling)
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(256, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        x = self.classifier(x)
        return x

Text Classification (Embedding + LSTM)

class TextClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden_dim * 2, num_classes)

    def forward(self, x):
        x = self.embedding(x)              # (batch, seq, embed)
        _, (h, _) = self.lstm(x)              # h: (num_layers*2, batch, hidden)
        h = torch.cat([h[-2], h[-1]], dim=1)  # last layer fwd+bwd -> (batch, hidden*2)
        return self.fc(h)

Transfer Learning

import torchvision.models as models

# Load pretrained model
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze all parameters
for param in model.parameters():
    param.requires_grad = False

# Replace final layer
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Only train new layer
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)

Learning Rate Scheduling

from torch.optim.lr_scheduler import StepLR, CosineAnnealingLR, ReduceLROnPlateau

# Decay by 0.1 every 10 epochs
scheduler = StepLR(optimizer, step_size=10, gamma=0.1)

# Cosine annealing
scheduler = CosineAnnealingLR(optimizer, T_max=50)

# Reduce lr when metric plateaus
scheduler = ReduceLROnPlateau(optimizer, mode='min', patience=5)

# Call in training loop
for epoch in range(num_epochs):
    train(...)
    val_loss = validate(...)
    scheduler.step()          # StepLR / CosineAnnealing
    # scheduler.step(val_loss)  # ReduceLROnPlateau

Early Stopping

best_loss = float('inf')
patience = 5
counter = 0

for epoch in range(num_epochs):
    val_loss = validate(model, val_loader)

    if val_loss < best_loss:
        best_loss = val_loss
        counter = 0
        torch.save(model.state_dict(), 'best_model.pth')
    else:
        counter += 1
        if counter >= patience:
            print("Early stopping triggered")
            break

Mixed Precision Training

from torch.amp import autocast, GradScaler   # torch.cuda.amp is deprecated

scaler = GradScaler('cuda')

for data, target in train_loader:
    data, target = data.to(device), target.to(device)
    optimizer.zero_grad()

    with autocast('cuda'):              # Auto mixed precision
        output = model(data)
        loss = criterion(output, target)

    scaler.scale(loss).backward()       # Scale gradients
    scaler.step(optimizer)
    scaler.update()

Loss Functions

Loss Function       | Use Case                   | Code
--------------------|----------------------------|---------------------------
CrossEntropyLoss    | Multi-class classification | nn.CrossEntropyLoss()
MSELoss             | Regression                 | nn.MSELoss()
BCEWithLogitsLoss   | Binary classification      | nn.BCEWithLogitsLoss()
L1Loss              | Regression (MAE)           | nn.L1Loss()
NLLLoss             | With LogSoftmax            | nn.NLLLoss()
SmoothL1Loss        | Robust regression          | nn.SmoothL1Loss()
CosineEmbeddingLoss | Similarity learning        | nn.CosineEmbeddingLoss()
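One common pitfall worth a quick check: nn.CrossEntropyLoss expects raw logits and integer class indices, and applies LogSoftmax internally; adding your own Softmax before it harms training. The equivalence with LogSoftmax + NLLLoss can be verified directly:

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

logits = torch.randn(8, 10)           # raw scores, no softmax applied
targets = torch.randint(0, 10, (8,))  # class indices, not one-hot

loss = criterion(logits, targets)

# Equivalent formulation: LogSoftmax followed by NLLLoss
log_probs = nn.LogSoftmax(dim=1)(logits)
loss2 = nn.NLLLoss()(log_probs, targets)
# loss and loss2 agree up to floating-point error
```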

Optimizers

Optimizer | Best For                                         | Code
----------|--------------------------------------------------|------------------------------------------------
SGD       | Basic optimization, better with momentum         | optim.SGD(params, lr=0.01, momentum=0.9)
Adam      | General default choice                           | optim.Adam(params, lr=1e-3)
AdamW     | Adam with decoupled weight decay (Transformers)  | optim.AdamW(params, lr=1e-3, weight_decay=0.01)
RMSprop   | RNNs / non-stationary objectives                 | optim.RMSprop(params, lr=1e-3)
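All of these also accept parameter groups with per-group hyperparameters, a common pattern when fine-tuning (small lr for the backbone, larger lr for a new head). A sketch with a stand-in two-part model:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 20), nn.Linear(20, 2))  # stand-in model

# Group 0: "backbone" with a small lr; group 1: "head" with a larger lr
optimizer = torch.optim.AdamW([
    {'params': model[0].parameters(), 'lr': 1e-5},
    {'params': model[1].parameters(), 'lr': 1e-3},
], weight_decay=0.01)

lrs = [g['lr'] for g in optimizer.param_groups]   # [1e-05, 0.001]
```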

Debugging Tips

1. Shape Mismatch

# Print shapes in forward()
def forward(self, x):
    print(f"Input: {x.shape}")
    x = self.conv1(x)
    print(f"After conv1: {x.shape}")
    return x

A "RuntimeError: mat1 and mat2 shapes cannot be multiplied" usually means a Linear layer's in_features does not match the output size of the previous layer.

2. Exploding / Vanishing Gradients

# Gradient clipping
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Check gradients
for name, p in model.named_parameters():
    if p.grad is not None:
        print(f"{name}: grad norm = {p.grad.norm():.4f}")

3. Loss is NaN

Common causes: learning rate too high, log(0), or division by zero. Lower the learning rate, and add a small epsilon to log inputs: torch.log(x + 1e-8).
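Both remedies can be demonstrated in a few lines, along with anomaly detection for locating the op that first produces a bad value in backward:

```python
import torch

x = torch.tensor([0.0, 0.5, 1.0], requires_grad=True)

unsafe = torch.log(x)          # log(0) = -inf; poisons the loss
safe = torch.log(x + 1e-8)     # epsilon keeps every value finite

finite_unsafe = torch.isfinite(unsafe).all().item()  # False
finite_safe = torch.isfinite(safe).all().item()      # True

# Report the op that first produces NaN/inf in backward (slow; debugging only)
torch.autograd.set_detect_anomaly(True)
```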

4. CUDA Out of Memory

# Reduce batch size, or use gradient accumulation
accumulation_steps = 4
optimizer.zero_grad()
for i, (data, target) in enumerate(train_loader):
    data, target = data.to(device), target.to(device)
    loss = criterion(model(data), target)
    loss = loss / accumulation_steps     # average over accumulated steps
    loss.backward()
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

# Clear cache
torch.cuda.empty_cache()

5. Model Not Learning (Loss Stagnant)

Checklist: (1) Confirm optimizer.zero_grad() is called; (2) Confirm loss.backward() and optimizer.step() are called; (3) Ensure data labels are correct; (4) Overfit on a small subset first to verify model capacity.
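Step (4) in practice: a healthy model and training loop should drive the loss close to zero on a handful of samples; if it cannot, the bug is in the model or the update step, not the data volume. A minimal sketch with random stand-in data and a small hypothetical MLP:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(16, 20)                 # a tiny fixed batch
y = torch.randint(0, 3, (16,))

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 3))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
criterion = nn.CrossEntropyLoss()

for step in range(300):                 # overfit on purpose
    optimizer.zero_grad()
    loss = criterion(model(X), y)
    loss.backward()
    optimizer.step()

final_loss = loss.item()                # should be near zero on 16 samples
```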

FAQ

What is the difference between PyTorch and TensorFlow?

PyTorch uses dynamic computation graphs (define-by-run), making code more Pythonic and easier to debug. TensorFlow 2.x narrowed the gap with eager execution, but PyTorch is more popular in academia, while TensorFlow has a more mature production ecosystem (TFLite, TF Serving).

What is the difference between model.train() and model.eval()?

model.train() enables training-mode behavior for Dropout and BatchNorm. model.eval() disables Dropout and uses running mean/variance for BatchNorm. For inference, always call model.eval() along with torch.no_grad() to save memory.
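The difference is easy to observe with a bare Dropout layer: in train mode it zeroes random elements (and rescales the survivors by 1/(1-p)); in eval mode it is an identity op:

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()
train_out = drop(x)     # roughly half the entries zeroed, rest scaled to 2.0

drop.eval()
eval_out = drop(x)      # identity: output equals input exactly
```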

When should I use .view() vs .reshape()?

.view() requires the tensor to be contiguous in memory and will raise an error otherwise. .reshape() returns a view when possible, and copies data otherwise. Prefer .reshape() by default unless you need to ensure no data copy occurs.
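The failure mode is easy to reproduce: transpose makes a tensor non-contiguous, at which point .view() raises and .reshape() silently copies:

```python
import torch

x = torch.randn(3, 4)
t = x.transpose(0, 1)          # (4, 3); a non-contiguous view of x

# .view() refuses to work on non-contiguous memory
try:
    t.view(12)
    view_failed = False
except RuntimeError:
    view_failed = True

# .reshape() copies when necessary and succeeds
r = t.reshape(12)
```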

How do I choose the right learning rate?

Typically start with 1e-3 for Adam, 0.01 to 0.1 for SGD. Use the lr finder technique: gradually increase lr from a tiny value and observe loss changes, then pick the region with fastest descent. For fine-tuning in transfer learning, use smaller lr (1e-4 to 1e-5).
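A bare-bones lr range test looks like the sketch below (a simplified illustration of the idea; polished implementations exist in libraries such as fastai and PyTorch Lightning). One mini-batch per lr value, lr increased exponentially, loss recorded at each step; the model and data here are stand-ins:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 2)                       # stand-in model
criterion = nn.CrossEntropyLoss()
X, y = torch.randn(64, 10), torch.randint(0, 2, (64,))

optimizer = torch.optim.SGD(model.parameters(), lr=1e-6)
lrs, losses = [], []
lr = 1e-6
while lr < 1.0:
    for g in optimizer.param_groups:           # sweep the lr upward
        g['lr'] = lr
    optimizer.zero_grad()
    loss = criterion(model(X), y)
    loss.backward()
    optimizer.step()
    lrs.append(lr)
    losses.append(loss.item())
    lr *= 1.5                                  # exponential schedule
# Pick an lr from the region where the loss falls fastest (before it blows up)
```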

How do I train on multiple GPUs?

The simplest approach is nn.DataParallel (single node, multi-GPU), but DistributedDataParallel (DDP) is recommended for better communication efficiency. Launch DDP training with torchrun --nproc_per_node=4 train.py. For very large models, consider FSDP (Fully Sharded Data Parallel).