PyTorch Cheat Sheet

Introduction

PyTorch is an open-source deep learning framework developed by Facebook AI Research (FAIR). Its core design idea is the dynamic computation graph (define-by-run), which lets you modify the network structure at runtime and greatly simplifies debugging and experimentation. PyTorch has become one of the most popular deep learning frameworks in both academic research and industrial deployment, and is widely used in computer vision, natural language processing, and reinforcement learning.

Installation

pip (CPU)

pip install torch torchvision torchaudio

pip (CUDA 12.1)

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

conda

conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia

Verify the installation

import torch
print(torch.__version__)
print(torch.cuda.is_available())

Tensors

Creating tensors

import torch

x = torch.tensor([1, 2, 3])              # from a Python list
z = torch.zeros(3, 4)                     # all zeros
o = torch.ones(2, 3)                      # all ones
r = torch.rand(2, 3)                      # uniform on [0, 1)
n = torch.randn(2, 3)                     # standard normal
a = torch.arange(0, 10, 2)                # arithmetic sequence
l = torch.linspace(0, 1, steps=5)         # evenly spaced values
e = torch.eye(3)                           # identity matrix
f = torch.full((2, 3), 7)                 # filled with a constant
t = torch.empty(3, 3)                      # uninitialized

Tensor operations

x = torch.randn(2, 3, 4)

x.reshape(6, 4)           # change shape
x.view(6, 4)              # change shape (requires contiguous memory)
x.squeeze()                # remove size-1 dimensions
x.unsqueeze(0)             # insert a dimension at position 0
x.permute(2, 0, 1)         # reorder dimensions
x.transpose(0, 1)          # swap two dimensions
x.contiguous()             # force contiguous storage
x.flatten()                # flatten to 1D

# Concatenation
a, b = torch.randn(2, 3), torch.randn(2, 3)
torch.cat([a, b], dim=0)    # concatenate along an existing dim -> (4, 3)
torch.stack([a, b], dim=0)  # stack along a new dim -> (2, 2, 3)

Math operations

a = torch.randn(3, 4)
b = torch.randn(4, 5)

torch.matmul(a, b)         # matrix multiplication -> (3, 5)
a @ b                       # same as above
a.sum()                     # sum of all elements
a.sum(dim=1)                # sum along dim 1
a.mean()                    # mean
a.max()                     # maximum
a.min()                     # minimum
a.argmax(dim=1)             # index of the maximum
a.clamp(min=0)              # clamp (same effect as ReLU)
torch.abs(a)                # absolute value
torch.sqrt(a.abs())         # square root

GPU transfer and gradients

# GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
x = torch.randn(3, 3).to(device)   # move to GPU
x_cpu = x.cpu()                      # move back to CPU
x_np = x.cpu().numpy()               # convert to NumPy

# Gradient tracking
x = torch.randn(3, requires_grad=True)
y = (x ** 2).sum()
y.backward()                          # backpropagate
print(x.grad)                         # dy/dx = 2x

with torch.no_grad():                 # disable gradient tracking (for inference)
    pred = model(input_data)

Neural network modules (nn.Module)

Common layers

import torch.nn as nn

nn.Linear(in_features=784, out_features=256)  # fully connected layer
nn.Conv2d(in_channels=3, out_channels=16,
          kernel_size=3, stride=1, padding=1)  # 2D convolution
nn.LSTM(input_size=128, hidden_size=256,
        num_layers=2, batch_first=True)        # LSTM
nn.Embedding(num_embeddings=10000,
             embedding_dim=300)                # word embedding
nn.ConvTranspose2d(16, 3, kernel_size=3)       # transposed convolution

Activation functions and regularization

nn.ReLU()                    # max(0, x)
nn.Sigmoid()                 # 1 / (1 + exp(-x))
nn.Tanh()                    # tanh(x)
nn.Softmax(dim=1)            # softmax normalization
nn.LeakyReLU(0.01)           # leaky ReLU
nn.GELU()                    # Gaussian Error Linear Unit

nn.Dropout(p=0.5)            # random drop during training
nn.BatchNorm2d(num_features=16)  # batch normalization
nn.LayerNorm(normalized_shape=256)  # layer normalization

nn.Sequential

model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(0.2),
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Linear(128, 10)
)

Custom models (nn.Module)

class MyModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.3)
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.dropout(x)
        x = self.fc2(x)
        return x

model = MyModel(784, 256, 10)
print(model)                     # inspect the architecture
sum(p.numel() for p in model.parameters())  # total parameter count

Training loop

Data loading

from torch.utils.data import DataLoader, TensorDataset
import torchvision.transforms as transforms
from torchvision.datasets import MNIST

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])

train_dataset = MNIST(root='./data', train=True,
                      download=True, transform=transform)
train_loader = DataLoader(train_dataset, batch_size=64,
                          shuffle=True, num_workers=4)

# Custom dataset
class MyDataset(torch.utils.data.Dataset):
    def __init__(self, X, y):
        self.X = X
        self.y = y
    def __len__(self):
        return len(self.X)
    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]
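
Wrapping the dataset in a DataLoader then yields shuffled mini-batches. A minimal usage sketch, assuming X and y are pre-built tensors (the shapes below are only illustrative):

X = torch.randn(100, 784)
y = torch.randint(0, 10, (100,))
loader = DataLoader(MyDataset(X, y), batch_size=32, shuffle=True)
for xb, yb in loader:
    print(xb.shape, yb.shape)   # torch.Size([32, 784]) torch.Size([32])
    break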

Full training loop

model = MyModel(784, 256, 10).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

num_epochs = 10
for epoch in range(num_epochs):
    model.train()
    running_loss = 0.0
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        data = data.view(data.size(0), -1)   # flatten

        optimizer.zero_grad()                 # zero the gradients
        output = model(data)                  # forward pass
        loss = criterion(output, target)      # compute the loss
        loss.backward()                       # backward pass
        optimizer.step()                      # update the parameters

        running_loss += loss.item()
    print(f"Epoch {epoch+1}/{num_epochs}, "
          f"Loss: {running_loss/len(train_loader):.4f}")

Validation loop

model.eval()
correct = 0
total = 0
with torch.no_grad():
    for data, target in val_loader:
        data, target = data.to(device), target.to(device)
        data = data.view(data.size(0), -1)
        output = model(data)
        _, predicted = torch.max(output, 1)
        total += target.size(0)
        correct += (predicted == target).sum().item()

accuracy = 100 * correct / total
print(f"Validation Accuracy: {accuracy:.2f}%")

Saving and loading models

# Save (recommended: save only the weights)
torch.save(model.state_dict(), 'model_weights.pth')

# Load
model = MyModel(784, 256, 10)
model.load_state_dict(torch.load('model_weights.pth'))
model.eval()

# Save the full model (including its architecture)
torch.save(model, 'full_model.pth')
model = torch.load('full_model.pth')   # needs the class definition to be importable

# Save a checkpoint (to resume training)
torch.save({
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': loss,
}, 'checkpoint.pth')
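
To resume training, load the checkpoint and restore both states. A minimal sketch, assuming the model and optimizer have been constructed as above:

checkpoint = torch.load('checkpoint.pth')
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
start_epoch = checkpoint['epoch'] + 1   # continue from the next epoch
model.train()                           # or model.eval() for inference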

Common patterns

Image classification (CNN)

class CNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, 256),   # assumes 32x32 input: two 2x pools -> 8x8
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(256, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        x = self.classifier(x)
        return x

Text classification (Embedding + LSTM)

class TextClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden_dim * 2, num_classes)

    def forward(self, x):
        x = self.embedding(x)              # (batch, seq, embed)
        _, (h, _) = self.lstm(x)            # h: (2, batch, hidden)
        h = torch.cat([h[0], h[1]], dim=1)  # (batch, hidden*2)
        return self.fc(h)

Transfer learning

import torchvision.models as models

# Load a pretrained model
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze all parameters
for param in model.parameters():
    param.requires_grad = False

# Replace the final layer
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Train only the new layer
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
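
A common second stage (optional, and not part of the snippet above) is to unfreeze everything once the new head has converged and fine-tune the whole network at a much smaller learning rate:

for param in model.parameters():
    param.requires_grad = True
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)   # small lr for fine-tuning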

Learning rate scheduling

from torch.optim.lr_scheduler import StepLR, CosineAnnealingLR, ReduceLROnPlateau

# Decay lr by a factor of 0.1 every 10 epochs
scheduler = StepLR(optimizer, step_size=10, gamma=0.1)

# Cosine annealing
scheduler = CosineAnnealingLR(optimizer, T_max=50)

# Reduce lr when a metric stops improving
scheduler = ReduceLROnPlateau(optimizer, mode='min', patience=5)

# Call in the training loop
for epoch in range(num_epochs):
    train(...)
    val_loss = validate(...)
    scheduler.step()          # StepLR / CosineAnnealing
    # scheduler.step(val_loss)  # ReduceLROnPlateau

Early stopping

best_loss = float('inf')
patience = 5
counter = 0

for epoch in range(num_epochs):
    val_loss = validate(model, val_loader)

    if val_loss < best_loss:
        best_loss = val_loss
        counter = 0
        torch.save(model.state_dict(), 'best_model.pth')
    else:
        counter += 1
        if counter >= patience:
            print("Early stopping triggered")
            break

Mixed precision training

from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

for data, target in train_loader:
    data, target = data.to(device), target.to(device)
    optimizer.zero_grad()

    with autocast():                    # automatic mixed precision
        output = model(data)
        loss = criterion(output, target)

    scaler.scale(loss).backward()       # backprop on the scaled loss
    scaler.step(optimizer)
    scaler.update()
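
Note: on recent PyTorch 2.x releases the torch.cuda.amp spellings are deprecated in favor of torch.amp, i.e. torch.amp.autocast('cuda') and torch.amp.GradScaler('cuda'); the code above still runs but may emit a deprecation warning.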

Loss functions

Loss function        Use case                    Code
CrossEntropyLoss     Multi-class classification  nn.CrossEntropyLoss()
MSELoss              Regression                  nn.MSELoss()
BCEWithLogitsLoss    Binary classification       nn.BCEWithLogitsLoss()
L1Loss               Regression (MAE)            nn.L1Loss()
NLLLoss              Paired with LogSoftmax      nn.NLLLoss()
SmoothL1Loss         Outlier-robust regression   nn.SmoothL1Loss()
CosineEmbeddingLoss  Similarity learning         nn.CosineEmbeddingLoss()

Optimizers

Optimizer  Use case                                         Code
SGD        Basic choice, better with momentum               optim.SGD(params, lr=0.01, momentum=0.9)
Adam       Good general-purpose default                     optim.Adam(params, lr=1e-3)
AdamW      Adam with decoupled weight decay (Transformers)  optim.AdamW(params, lr=1e-3, weight_decay=0.01)
RMSprop    RNNs / non-stationary objectives                 optim.RMSprop(params, lr=1e-3)

Debugging tips

1. Shape mismatches

# Print shapes inside forward()
def forward(self, x):
    print(f"Input: {x.shape}")
    x = self.conv1(x)
    print(f"After conv1: {x.shape}")
    return x

RuntimeError: mat1 and mat2 shapes cannot be multiplied — check that the Linear layer's in_features matches the output size of the previous layer.
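
A minimal reproduction of the error and its fix:

import torch
import torch.nn as nn

x = torch.randn(32, 784)
fc = nn.Linear(100, 10)
# fc(x)  # RuntimeError: mat1 and mat2 shapes cannot be multiplied (32x784 and 100x10)
fc = nn.Linear(784, 10)   # fix: in_features must equal the incoming feature count
print(fc(x).shape)        # torch.Size([32, 10])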

2. Exploding / vanishing gradients

# Gradient clipping
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Inspect gradient norms
for name, p in model.named_parameters():
    if p.grad is not None:
        print(f"{name}: grad norm = {p.grad.norm():.4f}")

3. NaN loss

Common causes: learning rate too high, log(0), division by zero. Lower the learning rate and add an epsilon to log inputs: torch.log(x + 1e-8)
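
To locate where NaNs first appear, you can check the loss explicitly or enable autograd's anomaly detection. A sketch, assuming it runs inside the training loop above:

# Stop as soon as the loss goes bad
if torch.isnan(loss):
    raise RuntimeError(f"loss became NaN at step {batch_idx}")

# Pinpoint the op that produced NaN/Inf gradients (slow; debugging only)
torch.autograd.set_detect_anomaly(True)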

4. CUDA out of memory

# Reduce the batch size, or use gradient accumulation
accumulation_steps = 4
for i, (data, target) in enumerate(train_loader):
    loss = criterion(model(data), target)
    loss = loss / accumulation_steps
    loss.backward()
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

# Free cached GPU memory
torch.cuda.empty_cache()

5. Model not learning (loss not decreasing)

Checklist: (1) confirm optimizer.zero_grad() is called; (2) confirm both loss.backward() and optimizer.step() are called; (3) make sure the data labels are correct; (4) first overfit a small dataset to verify the model has enough capacity (see the sketch below).
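
For check (4), a quick sanity test is to train repeatedly on a single batch; if the loss cannot be driven near zero even there, the model or the pipeline is broken. A sketch, reusing the model, optimizer, criterion, device, and train_loader from the training-loop section:

data, target = next(iter(train_loader))
data, target = data.to(device), target.to(device)
data = data.view(data.size(0), -1)
for step in range(200):
    optimizer.zero_grad()
    loss = criterion(model(data), target)
    loss.backward()
    optimizer.step()
print(loss.item())   # should be close to 0 for a healthy model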

FAQ

What is the difference between PyTorch and TensorFlow?

PyTorch uses dynamic computation graphs (define-by-run), so the code is more Pythonic and easier to debug. TensorFlow 2.x narrowed the gap by introducing eager execution, but PyTorch remains more popular in academia, while TensorFlow has a more mature production-deployment ecosystem (TFLite, TF Serving).

What is the difference between model.train() and model.eval()?

model.train() enables the training-mode behavior of Dropout and BatchNorm. model.eval() disables Dropout and makes BatchNorm use its running mean/variance. Always call model.eval() for inference, together with torch.no_grad() to save memory.
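
A quick demonstration of the mode switch using Dropout:

import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(6)
drop.train()
print(drop(x))   # roughly half the entries zeroed, survivors scaled to 2.0
drop.eval()
print(drop(x))   # identity: tensor([1., 1., 1., 1., 1., 1.])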

When should I use .view() and when .reshape()?

.view() requires the tensor to be contiguous in memory and raises an error otherwise. .reshape() returns a view when possible and copies the data otherwise. Prefer .reshape() by default, unless you explicitly need a guarantee that no copy happens.
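
The difference shows up after operations that break contiguity, such as transpose:

import torch

x = torch.randn(3, 4)
t = x.transpose(0, 1)                  # non-contiguous view of x
# t.view(12)                           # RuntimeError: view size is not compatible ...
print(t.reshape(12).shape)             # works: falls back to a copy
print(t.contiguous().view(12).shape)   # explicit equivalent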

How do I choose a learning rate?

A common starting point is 1e-3 for Adam and 0.01 to 0.1 for SGD. You can also use an LR finder: increase the lr step by step from a very small value, watch how the loss changes, and pick the region where the loss drops fastest. For transfer-learning fine-tuning, smaller values (around 1e-4 to 1e-5) are typical.
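
A bare-bones LR range test (libraries such as torch-lr-finder automate this); a sketch, reusing the model, criterion, device, and train_loader from above:

import math

lr, lrs, losses = 1e-6, [], []
optimizer = torch.optim.SGD(model.parameters(), lr=lr)
for data, target in train_loader:
    data, target = data.to(device), target.to(device)
    data = data.view(data.size(0), -1)
    optimizer.param_groups[0]['lr'] = lr   # sweep the lr geometrically
    optimizer.zero_grad()
    loss = criterion(model(data), target)
    loss.backward()
    optimizer.step()
    lrs.append(lr)
    losses.append(loss.item())
    lr *= 1.2
    if lr > 1 or math.isnan(losses[-1]):
        break
# Plot losses against lrs and pick an lr from the steepest-descent region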

How do I train on multiple GPUs?

The simplest option is nn.DataParallel (single machine, multiple GPUs), but DistributedDataParallel (DDP) is recommended because its communication is more efficient. Launch DDP training with torchrun: torchrun --nproc_per_node=4 train.py. For very large models, consider FSDP (Fully Sharded Data Parallel).
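
A minimal DDP skeleton for torchrun (each process reads LOCAL_RANK from the launcher); a sketch reusing the MyModel class from above, not a complete training script:

# Launch with: torchrun --nproc_per_node=4 train.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data.distributed import DistributedSampler

dist.init_process_group(backend='nccl')
local_rank = int(os.environ['LOCAL_RANK'])
torch.cuda.set_device(local_rank)

model = MyModel(784, 256, 10).to(local_rank)
model = DDP(model, device_ids=[local_rank])

# Each process must see a different shard of the data:
# sampler = DistributedSampler(train_dataset)
# train_loader = DataLoader(train_dataset, batch_size=64, sampler=sampler)

dist.destroy_process_group()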