PyTorch Cheat Sheet
Introduction
PyTorch is an open-source deep learning framework developed by Facebook AI Research (FAIR). Its core design philosophy is the dynamic computation graph (define-by-run), which allows flexible modification of network structure at runtime, greatly simplifying debugging and experimentation. PyTorch has become one of the most popular deep learning frameworks in both academic research and industrial deployment, widely used in computer vision, NLP, and reinforcement learning.
Installation
pip (CPU)
pip install torch torchvision torchaudio
pip (CUDA 12.1)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
conda
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
Verify Installation
import torch
print(torch.__version__)
print(torch.cuda.is_available())
Tensors
Creating Tensors
import torch
x = torch.tensor([1, 2, 3])        # From list
z = torch.zeros(3, 4)              # All zeros
o = torch.ones(2, 3)               # All ones
r = torch.rand(2, 3)               # Uniform [0, 1)
n = torch.randn(2, 3)              # Standard normal
a = torch.arange(0, 10, 2)         # Range: [0, 2, 4, 6, 8]
l = torch.linspace(0, 1, steps=5)  # Evenly spaced
e = torch.eye(3)                   # Identity matrix
f = torch.full((2, 3), 7)          # Fill with constant
t = torch.empty(3, 3)              # Uninitialized
Tensor Operations
x = torch.randn(2, 3, 4)
x.reshape(6, 4)        # Change shape
x.view(6, 4)           # Change shape (requires contiguous)
x.squeeze()            # Remove dims of size 1
x.unsqueeze(0)         # Add dim at position 0
x.permute(2, 0, 1)     # Reorder dimensions
x.transpose(0, 1)      # Swap two dims
x.contiguous()         # Make contiguous in memory
x.flatten()            # Flatten to 1D
# Concatenation
torch.cat([a, b], dim=0)    # Concatenate along existing dim
torch.stack([a, b], dim=0)  # Stack along new dim
Math Operations
a = torch.randn(3, 4)
b = torch.randn(4, 5)
torch.matmul(a, b)     # Matrix multiplication -> (3, 5)
a @ b                  # Same as above
a.sum()                # Sum of all elements
a.sum(dim=1)           # Sum along dim 1
a.mean()               # Mean
a.max()                # Max value
a.min()                # Min value
a.argmax(dim=1)        # Index of max along dim 1
a.clamp(min=0)         # Clamp (same effect as ReLU)
torch.abs(a)           # Absolute value
torch.sqrt(a.abs())    # Square root (abs avoids NaN for negatives)
GPU Transfer & Gradients
# GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
x = torch.randn(3, 3).to(device) # Move to GPU
x_cpu = x.cpu() # Move back to CPU
x_np = x.cpu().numpy() # Convert to NumPy
# Gradient Tracking
x = torch.randn(3, requires_grad=True)
y = (x ** 2).sum()
y.backward() # Backpropagation
print(x.grad) # dy/dx = 2x
with torch.no_grad(): # Disable gradients (inference)
pred = model(input_data)
Neural Network Modules (nn.Module)
Common Layers
import torch.nn as nn
nn.Linear(in_features=784, out_features=256) # Fully connected
nn.Conv2d(in_channels=3, out_channels=16,
kernel_size=3, stride=1, padding=1) # 2D convolution
nn.LSTM(input_size=128, hidden_size=256,
num_layers=2, batch_first=True) # LSTM
nn.Embedding(num_embeddings=10000,
embedding_dim=300) # Word embedding
nn.ConvTranspose2d(16, 3, kernel_size=3) # Transposed conv
Activations & Regularization
nn.ReLU()              # max(0, x)
nn.Sigmoid()           # 1 / (1 + exp(-x))
nn.Tanh()              # tanh(x)
nn.Softmax(dim=1)      # Softmax normalization
nn.LeakyReLU(0.01)     # Leaky ReLU
nn.GELU()              # Gaussian Error Linear Unit
nn.Dropout(p=0.5)      # Random dropout during training
nn.BatchNorm2d(num_features=16)     # Batch normalization
nn.LayerNorm(normalized_shape=256)  # Layer normalization
nn.Sequential
model = nn.Sequential(
nn.Linear(784, 256),
nn.ReLU(),
nn.Dropout(0.2),
nn.Linear(256, 128),
nn.ReLU(),
nn.Linear(128, 10)
)
Custom Model (nn.Module)
class MyModel(nn.Module):
def __init__(self, input_dim, hidden_dim, output_dim):
super().__init__()
self.fc1 = nn.Linear(input_dim, hidden_dim)
self.relu = nn.ReLU()
self.dropout = nn.Dropout(0.3)
self.fc2 = nn.Linear(hidden_dim, output_dim)
def forward(self, x):
x = self.fc1(x)
x = self.relu(x)
x = self.dropout(x)
x = self.fc2(x)
return x
model = MyModel(784, 256, 10)
print(model) # View architecture
sum(p.numel() for p in model.parameters()) # Total params
Training Loop
DataLoader Setup
from torch.utils.data import DataLoader, TensorDataset
import torchvision.transforms as transforms
from torchvision.datasets import MNIST
transform = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.5,), (0.5,))
])
train_dataset = MNIST(root='./data', train=True,
download=True, transform=transform)
train_loader = DataLoader(train_dataset, batch_size=64,
shuffle=True, num_workers=4)
# Custom dataset
class MyDataset(torch.utils.data.Dataset):
def __init__(self, X, y):
self.X = X
self.y = y
def __len__(self):
return len(self.X)
def __getitem__(self, idx):
return self.X[idx], self.y[idx]
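A quick usage sketch of the custom-dataset pattern above, using synthetic tensors (the sizes and label range are made up for illustration):

```python
import torch
from torch.utils.data import DataLoader, Dataset

# Same pattern as MyDataset above: wrap two tensors
class MyDataset(Dataset):
    def __init__(self, X, y):
        self.X = X
        self.y = y
    def __len__(self):
        return len(self.X)
    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

X = torch.randn(100, 784)          # 100 synthetic samples
y = torch.randint(0, 10, (100,))   # random class labels in [0, 10)
loader = DataLoader(MyDataset(X, y), batch_size=32, shuffle=True)

for xb, yb in loader:
    print(xb.shape, yb.shape)      # torch.Size([32, 784]) torch.Size([32])
    break
```

With 100 samples and batch_size=32, the loader yields three full batches and one final batch of 4 (pass drop_last=True to skip it).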
Complete Training Loop
model = MyModel(784, 256, 10).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
num_epochs = 10
for epoch in range(num_epochs):
model.train()
running_loss = 0.0
for batch_idx, (data, target) in enumerate(train_loader):
data, target = data.to(device), target.to(device)
data = data.view(data.size(0), -1) # flatten
optimizer.zero_grad() # Zero gradients
output = model(data) # Forward pass
loss = criterion(output, target) # Compute loss
loss.backward() # Backward pass
optimizer.step() # Update params
running_loss += loss.item()
print(f"Epoch {epoch+1}/{num_epochs}, "
f"Loss: {running_loss/len(train_loader):.4f}")
Validation Loop
model.eval()
correct = 0
total = 0
with torch.no_grad():
for data, target in val_loader:
data, target = data.to(device), target.to(device)
data = data.view(data.size(0), -1)
output = model(data)
_, predicted = torch.max(output, 1)
total += target.size(0)
correct += (predicted == target).sum().item()
accuracy = 100 * correct / total
print(f"Validation Accuracy: {accuracy:.2f}%")
Save & Load Model
# Save (recommended: state_dict only)
torch.save(model.state_dict(), 'model_weights.pth')
# Load
model = MyModel(784, 256, 10)
model.load_state_dict(torch.load('model_weights.pth'))
model.eval()
# Save entire model (with architecture; pickle-based, less portable)
torch.save(model, 'full_model.pth')
model = torch.load('full_model.pth', weights_only=False)  # required on PyTorch >= 2.6
# Save checkpoint (resume training)
torch.save({
'epoch': epoch,
'model_state_dict': model.state_dict(),
'optimizer_state_dict': optimizer.state_dict(),
'loss': loss,
}, 'checkpoint.pth')
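To resume training, rebuild the model and optimizer, then restore their states from the checkpoint. A runnable sketch with a toy nn.Linear standing in for the real model:

```python
import torch
import torch.nn as nn

model = torch.nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Save a checkpoint (same dict layout as above, toy model for illustration)
torch.save({
    'epoch': 3,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
}, 'checkpoint.pth')

# Resume: recreate model/optimizer with the same config, then restore
checkpoint = torch.load('checkpoint.pth')
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
start_epoch = checkpoint['epoch'] + 1   # continue from the next epoch
```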
Common Patterns
Image Classification (CNN)
class CNN(nn.Module):
def __init__(self, num_classes=10):
super().__init__()
self.features = nn.Sequential(
nn.Conv2d(3, 32, 3, padding=1),
nn.ReLU(),
nn.MaxPool2d(2),
nn.Conv2d(32, 64, 3, padding=1),
nn.ReLU(),
nn.MaxPool2d(2),
)
self.classifier = nn.Sequential(
nn.Flatten(),
nn.Linear(64 * 8 * 8, 256),  # assumes 32x32 input (two 2x2 poolings -> 8x8)
nn.ReLU(),
nn.Dropout(0.5),
nn.Linear(256, num_classes),
)
def forward(self, x):
x = self.features(x)
x = self.classifier(x)
return x
Text Classification (Embedding + LSTM)
class TextClassifier(nn.Module):
def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes):
super().__init__()
self.embedding = nn.Embedding(vocab_size, embed_dim)
self.lstm = nn.LSTM(embed_dim, hidden_dim,
batch_first=True, bidirectional=True)
self.fc = nn.Linear(hidden_dim * 2, num_classes)
def forward(self, x):
x = self.embedding(x) # (batch, seq, embed)
_, (h, _) = self.lstm(x) # h: (2, batch, hidden)
h = torch.cat([h[0], h[1]], dim=1) # (batch, hidden*2)
return self.fc(h)
Transfer Learning
import torchvision.models as models
# Load pretrained model
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
# Freeze all parameters
for param in model.parameters():
param.requires_grad = False
# Replace final layer
model.fc = nn.Linear(model.fc.in_features, num_classes)
# Only train new layer
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
Learning Rate Scheduling
from torch.optim.lr_scheduler import StepLR, CosineAnnealingLR, ReduceLROnPlateau
# Decay by 0.1 every 10 epochs
scheduler = StepLR(optimizer, step_size=10, gamma=0.1)
# Cosine annealing
scheduler = CosineAnnealingLR(optimizer, T_max=50)
# Reduce lr when metric plateaus
scheduler = ReduceLROnPlateau(optimizer, mode='min', patience=5)
# Call in training loop
for epoch in range(num_epochs):
train(...)
val_loss = validate(...)
scheduler.step() # StepLR / CosineAnnealing
# scheduler.step(val_loss) # ReduceLROnPlateau
Early Stopping
best_loss = float('inf')
patience = 5
counter = 0
for epoch in range(num_epochs):
val_loss = validate(model, val_loader)
if val_loss < best_loss:
best_loss = val_loss
counter = 0
torch.save(model.state_dict(), 'best_model.pth')
else:
counter += 1
if counter >= patience:
print("Early stopping triggered")
break
Mixed Precision Training
from torch.amp import autocast, GradScaler  # torch.cuda.amp is deprecated
scaler = GradScaler('cuda')
for data, target in train_loader:
    data, target = data.to(device), target.to(device)
    optimizer.zero_grad()
    with autocast('cuda'):           # Run forward pass in mixed precision
        output = model(data)
        loss = criterion(output, target)
    scaler.scale(loss).backward()    # Scale loss to avoid gradient underflow
    scaler.step(optimizer)           # Unscales gradients, then steps
    scaler.update()
Loss Functions
| Loss Function | Use Case | Code |
|---|---|---|
| CrossEntropyLoss | Multi-class classification | nn.CrossEntropyLoss() |
| MSELoss | Regression | nn.MSELoss() |
| BCEWithLogitsLoss | Binary classification | nn.BCEWithLogitsLoss() |
| L1Loss | Regression (MAE) | nn.L1Loss() |
| NLLLoss | With LogSoftmax | nn.NLLLoss() |
| SmoothL1Loss | Robust regression | nn.SmoothL1Loss() |
| CosineEmbeddingLoss | Similarity learning | nn.CosineEmbeddingLoss() |
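Input shapes are a common stumbling block: CrossEntropyLoss expects raw logits of shape (N, C) and integer class indices of shape (N,) (not one-hot), while BCEWithLogitsLoss expects raw logits and float targets. The sizes below are illustrative:

```python
import torch
import torch.nn as nn

# Multi-class: logits (N, C), targets are class indices (N,)
logits = torch.randn(8, 5)               # 8 samples, 5 classes
targets = torch.randint(0, 5, (8,))      # integer labels, NOT one-hot
loss = nn.CrossEntropyLoss()(logits, targets)  # softmax applied internally

# Binary: raw logits plus float targets (sigmoid applied internally)
bin_logits = torch.randn(8)
bin_targets = torch.randint(0, 2, (8,)).float()
bin_loss = nn.BCEWithLogitsLoss()(bin_logits, bin_targets)
print(loss.item(), bin_loss.item())
```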
Optimizers
| Optimizer | Best For | Code |
|---|---|---|
| SGD | Basic optimization, better with momentum | optim.SGD(params, lr=0.01, momentum=0.9) |
| Adam | General default choice | optim.Adam(params, lr=1e-3) |
| AdamW | Adam with decoupled weight decay (Transformers) | optim.AdamW(params, lr=1e-3, weight_decay=0.01) |
| RMSprop | RNNs / non-stationary objectives | optim.RMSprop(params, lr=1e-3) |
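All of these plug into the same training loop. One pattern worth knowing is per-parameter-group learning rates, common in fine-tuning (the backbone/head split below is a made-up example on a toy model):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))

# Single learning rate for everything
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

# Per-group learning rates: small lr for the "backbone" (first layer),
# larger lr for the "head" (last layer)
optimizer = torch.optim.AdamW([
    {'params': model[0].parameters(), 'lr': 1e-5},
    {'params': model[2].parameters(), 'lr': 1e-3},
], weight_decay=0.01)
print([g['lr'] for g in optimizer.param_groups])  # [1e-05, 0.001]
```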
Debugging Tips
1. Shape Mismatch
# Print shapes in forward()
def forward(self, x):
print(f"Input: {x.shape}")
x = self.conv1(x)
print(f"After conv1: {x.shape}")
return x
RuntimeError: mat1 and mat2 shapes cannot be multiplied -- usually means a Linear layer's in_features does not match the previous layer's output size.
2. Exploding / Vanishing Gradients
# Gradient clipping
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
# Check gradients
for name, p in model.named_parameters():
if p.grad is not None:
print(f"{name}: grad norm = {p.grad.norm():.4f}")
3. Loss is NaN
Common causes: learning rate too high, log(0), division by zero. Lower lr, add epsilon to log inputs: torch.log(x + 1e-8).
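A minimal demonstration of the log(0) failure mode and the epsilon fix:

```python
import torch

x = torch.tensor([0.0, 0.5, 1.0])
print(torch.log(x))          # first element is -inf -> NaNs propagate downstream
print(torch.log(x + 1e-8))   # finite everywhere
```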
4. CUDA Out of Memory
# Reduce batch size, or use gradient accumulation
accumulation_steps = 4
for i, (data, target) in enumerate(train_loader):
loss = criterion(model(data), target)
loss = loss / accumulation_steps
loss.backward()
if (i + 1) % accumulation_steps == 0:
optimizer.step()
optimizer.zero_grad()
# Clear cache
torch.cuda.empty_cache()
5. Model Not Learning (Loss Stagnant)
Checklist: (1) Confirm optimizer.zero_grad() is called; (2) Confirm loss.backward() and optimizer.step() are called; (3) Ensure data labels are correct; (4) Overfit on a small subset first to verify model capacity.
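Point (4) in practice: a healthy model and loop should drive the loss near zero on one tiny fixed batch. A runnable sketch with made-up dimensions (if this fails, the bug is in the model or the loop, not the data scale):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 3))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

X = torch.randn(16, 20)            # one small, fixed batch
y = torch.randint(0, 3, (16,))     # random labels: memorizable by any
                                   # model with enough capacity
for step in range(200):
    optimizer.zero_grad()
    loss = criterion(model(X), y)
    loss.backward()
    optimizer.step()
print(loss.item())                 # should be close to 0 after a few hundred steps
```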
FAQ
What is the difference between PyTorch and TensorFlow?
PyTorch uses dynamic computation graphs (define-by-run), making code more Pythonic and easier to debug. TensorFlow 2.x narrowed the gap with eager execution, but PyTorch is more popular in academia, while TensorFlow has a more mature production ecosystem (TFLite, TF Serving).
What is the difference between model.train() and model.eval()?
model.train() enables training-mode behavior for Dropout and BatchNorm. model.eval() disables Dropout and uses running mean/variance for BatchNorm. For inference, always call model.eval() along with torch.no_grad() to save memory.
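The difference is easy to see with a lone Dropout layer: in train mode it zeroes entries at random and rescales the survivors by 1/(1-p); in eval mode it is the identity:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(8)

drop.train()
print(drop(x))   # entries are 0.0 or 2.0 (survivors scaled by 1/(1-0.5))
drop.eval()
print(drop(x))   # identity: all ones
```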
When should I use .view() vs .reshape()?
.view() requires the tensor to be contiguous in memory and will raise an error otherwise. .reshape() returns a view when possible, and copies data otherwise. Prefer .reshape() by default unless you need to ensure no data copy occurs.
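A small demonstration: transposing makes a tensor non-contiguous, after which .reshape() silently copies while .view() raises:

```python
import torch

x = torch.arange(6).reshape(2, 3)
t = x.t()                  # transpose returns a non-contiguous view
print(t.is_contiguous())   # False
print(t.reshape(6))        # works: copies, since no zero-copy view exists
try:
    t.view(6)              # fails: .view() requires contiguous memory
except RuntimeError as e:
    print("view failed:", e)
```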
How do I choose the right learning rate?
Typically start with 1e-3 for Adam, 0.01 to 0.1 for SGD. Use the lr finder technique: gradually increase lr from a tiny value and observe loss changes, then pick the region with fastest descent. For fine-tuning in transfer learning, use smaller lr (1e-4 to 1e-5).
How do I train on multiple GPUs?
The simplest approach is nn.DataParallel (single node, multi-GPU), but DistributedDataParallel (DDP) is recommended for better communication efficiency. Launch DDP training with torchrun --nproc_per_node=4 train.py. For very large models, consider FSDP (Fully Sharded Data Parallel).