Neural Network Architectures
What Are Neural Networks?
Neural networks are a class of machine learning models inspired by biological neurons. They consist of interconnected computational nodes (neurons) that learn patterns in data by adjusting the weights between connections. Since Rosenblatt's Perceptron in 1958, neural networks have gone through multiple waves: backpropagation in the 1980s, AlexNet sparking the deep learning revolution in 2012, the Transformer architecture revolutionizing NLP in 2017, and diffusion models breaking through in generative AI in the 2020s.
Different tasks demand different architectures: CNNs for image recognition, RNNs/LSTMs for sequential data, Transformers for language and multimodal tasks, and GANs or diffusion models for generation. This guide systematically covers the principles, core components, code implementations, and best use cases for each architecture.
Feedforward Neural Network (FNN)
The Feedforward Neural Network is the most fundamental neural network architecture. Data flows in one direction from input layer through one or more hidden layers to the output layer, with no cyclic connections. Each layer's neurons are fully connected to the next layer, with activation functions (e.g., ReLU, Sigmoid) introducing non-linearity.
FNNs are suitable for simple classification and regression tasks on structured data, such as tabular data prediction and credit scoring. When data has spatial structure (images) or temporal characteristics, CNN or RNN should be used instead.
Key Pipeline
Input → [Linear → ReLU] × N hidden layers → Linear → Output
PyTorch
import torch
import torch.nn as nn

class FNN(nn.Module):
    def __init__(self, in_dim, hidden, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):
        return self.net(x)

model = FNN(784, 256, 10)  # e.g. MNIST digits
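To show how such a model is actually fit, here is a minimal single training step with Adam and cross-entropy loss. The random batch is a stand-in for a real DataLoader, and the Sequential model is a compact restatement of the FNN above so the snippet runs on its own.

```python
import torch
import torch.nn as nn

# Compact stand-in for the FNN(784, 256, 10) defined above
model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 10),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One training step on a random batch (stand-in for a real DataLoader)
x = torch.randn(32, 784)         # batch of 32 flattened 28x28 images
y = torch.randint(0, 10, (32,))  # integer class labels

optimizer.zero_grad()
loss = criterion(model(x), y)  # forward pass + loss
loss.backward()                # backpropagation
optimizer.step()               # gradient descent update
```

In a real run this step is repeated over mini-batches for several epochs, with the loss tracked on a held-out validation set.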
Convolutional Neural Network (CNN)
CNNs use convolutional filters (kernels) that slide across input data to automatically extract local features like edges, textures, and shapes. Convolutional layers use weight sharing to drastically reduce parameter count; pooling layers reduce spatial dimensions and enhance translation invariance. By stacking multiple convolutional layers, the network progressively abstracts from low-level features (edges) to high-level semantic features (object parts, full objects).
CNNs are the foundational architecture for computer vision, widely used in image classification, object detection, and semantic segmentation. While Vision Transformers (ViT) have surpassed CNNs on some tasks in recent years, CNNs remain important due to their efficiency and mature tooling.
Core Component Pipeline
Input image → [Conv → BatchNorm → ReLU → Pooling] × N → Flatten → Fully connected layers → Output
Famous CNN Models
| Model | Year | Key Innovation | Layers | Params |
|---|---|---|---|---|
| LeNet-5 | 1998 | First practical CNN, handwriting recognition | 5 | 60K |
| AlexNet | 2012 | ReLU + Dropout + GPU training, sparked DL revolution | 8 | 60M |
| VGG-16 | 2014 | Uniform 3x3 small kernels, deeper networks | 16 | 138M |
| ResNet-50 | 2015 | Residual connections (skip connections), solved vanishing gradient | 50 | 25M |
| EfficientNet | 2019 | Compound scaling (depth/width/resolution) | ~82 | 5-66M |
PyTorch
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # 64 * 8 * 8 assumes 32x32 inputs (e.g. CIFAR-10):
        # two 2x2 max-pools halve the spatial size twice, 32 -> 16 -> 8
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, 256),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(256, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))
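A quick shape check makes the flatten size concrete. The block below rebuilds just the feature extractor from SimpleCNN and pushes a dummy 32x32 batch through it, confirming that the two pooling layers reduce 32 → 16 → 8, which is where `64 * 8 * 8` comes from.

```python
import torch
import torch.nn as nn

# The feature extractor from SimpleCNN above, restated so this runs standalone
features = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(), nn.MaxPool2d(2),
)

x = torch.randn(4, 3, 32, 32)  # dummy batch of 4 CIFAR-10-sized images
out = features(x)              # each MaxPool2d(2) halves height and width
# out has shape (4, 64, 8, 8), i.e. 64 * 8 * 8 values per image after flattening
```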
RNN / LSTM / GRU
Recurrent Neural Networks (RNN) are designed for sequential data. The hidden state h(t) at each time step depends on both the current input x(t) and the previous hidden state h(t-1), capturing temporal dependencies in sequences. However, standard RNNs suffer from severe vanishing/exploding gradient problems on long sequences.
LSTM (Long Short-Term Memory) introduces three gating mechanisms (forget gate, input gate, output gate) and an independent cell state, effectively solving the long-range dependency problem. GRU (Gated Recurrent Unit) is a simplified variant that merges the forget and input gates into a single "update gate," requiring fewer parameters and training faster, while achieving comparable performance on many tasks.
RNN vs LSTM vs GRU Comparison
| Feature | RNN | LSTM | GRU |
|---|---|---|---|
| Gating | None | 3 gates (forget/input/output) | 2 gates (reset/update) |
| Long-range Deps | Poor (vanishing gradient) | Excellent | Good |
| Parameters | Fewest | Most (4x hidden) | Medium (3x hidden) |
| Training Speed | Fast | Slow | Medium |
| Best For | Short sequences | Long text, speech | Medium sequences, limited resources |
PyTorch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden, num_classes):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden, num_layers=2,
                            batch_first=True, dropout=0.3)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, x):
        x = self.emb(x)             # (B, T, E)
        _, (h_n, _) = self.lstm(x)  # h_n: (num_layers, B, H)
        return self.fc(h_n[-1])     # classify from the last layer's final hidden state
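Since the text describes GRU as a lighter-weight alternative, it is worth noting that `nn.GRU` is nearly a drop-in replacement in the classifier above; the one API difference is that it returns only `(output, h_n)`, with no separate cell state to unpack. A minimal sketch with illustrative dimensions:

```python
import torch
import torch.nn as nn

# GRU drop-in for the LSTM classifier above: no cell state, so the second
# return value is just h_n rather than the (h_n, c_n) tuple nn.LSTM returns.
gru = nn.GRU(input_size=128, hidden_size=64, num_layers=2, batch_first=True)

x = torch.randn(8, 20, 128)         # (batch, time, embedding) dummy input
output, h_n = gru(x)                # h_n: (num_layers, batch, hidden)
logits = nn.Linear(64, 5)(h_n[-1])  # classify from the last layer's final hidden state
```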
Transformer
The Transformer was proposed by Google in the 2017 paper "Attention Is All You Need." It completely eliminates recurrence, relying solely on the self-attention mechanism to model dependencies between any positions in a sequence. This enables highly parallel training, solving the sequential bottleneck of RNNs.
At its core is the Multi-Head Attention mechanism: input is projected into Query, Key, and Value matrices, and scaled dot-product attention computes how much each position should attend to every other position. Positional Encoding injects sequence order information, since self-attention itself is permutation-invariant.
Core Formula
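As given in "Attention Is All You Need", scaled dot-product attention is:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
```

where $Q$, $K$, $V$ are the query, key, and value matrices and $d_k$ is the key dimension; dividing by $\sqrt{d_k}$ keeps the dot products from growing with dimension and saturating the softmax.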
Architecture Components
Token embedding + positional encoding → [Multi-Head Attention → Add & Norm → Feed-Forward → Add & Norm] × N
Transformer Variants
| Variant | Structure | Models | Primary Task | Params |
|---|---|---|---|---|
| Encoder-only | Encoder only, bidirectional attention | BERT, RoBERTa, DeBERTa | Understanding (classification, NER, QA) | 110M-340M |
| Decoder-only | Decoder only, causal (unidirectional) attention | GPT-4, Claude, LLaMA, Gemini | Generation (chat, code, reasoning) | 7B-1.8T |
| Encoder-Decoder | Full enc-dec, cross attention | T5, BART, mBART | Translation, summarization, Seq2Seq | 220M-11B |
| Vision Transformer | Image patches + Transformer encoder | ViT, DeiT, Swin Transformer | Image classification, detection | 86M-632M |
PyTorch — Self-Attention
import torch
import torch.nn as nn
import math

class SelfAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.n_heads = n_heads
        self.d_k = d_model // n_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, x):
        B, T, C = x.shape
        # Project and split into heads: (B, n_heads, T, d_k)
        q = self.W_q(x).view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        k = self.W_k(x).view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        v = self.W_v(x).view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        # Scaled dot-product attention
        attn = (q @ k.transpose(-2, -1)) / math.sqrt(self.d_k)
        attn = torch.softmax(attn, dim=-1)
        out = (attn @ v).transpose(1, 2).contiguous().view(B, T, C)
        return self.W_o(out)

# Usage: attn = SelfAttention(d_model=512, n_heads=8)
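Because self-attention is permutation-invariant, the positional encoding mentioned above is essential. A sketch of the sinusoidal scheme from the original paper, where even dimensions use sine and odd dimensions use cosine at geometrically spaced frequencies:

```python
import math
import torch

def sinusoidal_positional_encoding(max_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = torch.arange(max_len).unsqueeze(1).float()  # (max_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2).float()
                    * (-math.log(10000.0) / d_model))  # (d_model/2,)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)  # even dimensions
    pe[:, 1::2] = torch.cos(pos * div)  # odd dimensions
    return pe  # (max_len, d_model), added to the token embeddings

pe = sinusoidal_positional_encoding(100, 512)
```

The encoding is fixed (not learned) and is simply summed with the embedding matrix before the first attention layer; many modern models instead use learned or rotary position embeddings.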
GAN (Generative Adversarial Network)
Proposed by Ian Goodfellow in 2014, GANs consist of two competing networks: a Generator that attempts to produce realistic data from random noise, and a Discriminator that tries to distinguish real data from generated data. The two are trained in a minimax game until, ideally, the generator produces samples the discriminator cannot tell apart from real data.
Training instability is the main challenge of GANs, with issues like mode collapse and oscillation. Techniques such as WGAN and Spectral Normalization have partially addressed these problems.
GAN Variants
| Variant | Key Innovation | Application |
|---|---|---|
| DCGAN | Convolutional layers replace FC layers | Image generation |
| StyleGAN (1/2/3) | Style mapping network, layer-wise control | High-quality face generation |
| CycleGAN | Unpaired image-to-image translation (cycle consistency) | Style transfer, season conversion |
| Pix2Pix | Paired conditional image generation | Image translation (sketch to photo) |
| WGAN | Wasserstein distance replaces JS divergence | More stable training |
PyTorch — Simple GAN
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, z_dim=100, img_dim=784):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, img_dim), nn.Tanh(),
        )

    def forward(self, z):
        return self.net(z)

class Discriminator(nn.Module):
    def __init__(self, img_dim=784):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(img_dim, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x)
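The minimax game described above translates into an alternating update: one discriminator step, then one generator step. A minimal sketch, with the two networks restated compactly and a random tensor standing in for a batch of real images:

```python
import torch
import torch.nn as nn

# Compact stand-ins for the Generator / Discriminator defined above
G = nn.Sequential(nn.Linear(100, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

real = torch.randn(16, 784)  # stand-in for a batch of real images
z = torch.randn(16, 100)     # random noise
fake = G(z)

# 1) Discriminator step: push D(real) -> 1 and D(fake) -> 0.
#    detach() blocks gradients from flowing into G during this step.
opt_d.zero_grad()
d_loss = (bce(D(real), torch.ones(16, 1))
          + bce(D(fake.detach()), torch.zeros(16, 1)))
d_loss.backward()
opt_d.step()

# 2) Generator step: update G so that D labels its fakes as real (1)
opt_g.zero_grad()
g_loss = bce(D(fake), torch.ones(16, 1))
g_loss.backward()
opt_g.step()
```

In practice these two steps alternate for many iterations, and keeping the two losses balanced is exactly where the instability issues discussed above (mode collapse, oscillation) show up.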
Diffusion Models
Diffusion models are a class of probabilistic generative models based on two processes: the forward process gradually adds Gaussian noise to data until it becomes pure noise; the reverse process learns to progressively denoise to recover the original data. By parameterizing a denoising network (typically a U-Net), the model learns to predict and remove noise at each step.
Compared to GANs, diffusion models offer more stable training, higher generation quality, and better diversity, but slower inference (requiring many denoising steps). The original DDPM sampler needs hundreds to a thousand steps; faster samplers such as DDIM cut this to a few dozen, substantially accelerating inference.
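The forward (noising) process has a convenient closed form: the noisy sample at any step t can be drawn directly from the clean data, without simulating every intermediate step. A sketch with a linear noise schedule (the schedule values here are illustrative DDPM-style defaults):

```python
import torch

# DDPM forward process in closed form:
#   x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps,  eps ~ N(0, I)
T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # linear noise schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative signal retention

def q_sample(x0, t, eps):
    """Jump straight from clean data x0 to the noisy sample at step t."""
    a = alpha_bar[t]
    return a.sqrt() * x0 + (1 - a).sqrt() * eps

x0 = torch.randn(4, 3, 32, 32)   # stand-in for a batch of images
eps = torch.randn_like(x0)
x_999 = q_sample(x0, 999, eps)   # at the last step, x_t is almost pure noise
```

The denoising U-Net is trained to predict `eps` from `x_t` and `t`; at sampling time the learned reverse process runs this in the opposite direction, step by step, from pure noise back to data.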
Representative Models
| Model | Architecture | Features |
|---|---|---|
| Stable Diffusion | Latent Diffusion (LDM) + U-Net + CLIP | Open source, text-to-image, fine-tunable |
| DALL-E 2/3 | CLIP + diffusion prior + decoder | High-quality text-to-image |
| Midjourney | Proprietary diffusion architecture | Strong artistic style |
| Sora | Diffusion Transformer (DiT) | Text-to-video generation |
Architecture Comparison
| Architecture | Best For | Key Innovation | Params Range | Year |
|---|---|---|---|---|
| FNN | Tabular classification/regression | Fully connected + backpropagation | 1K - 10M | 1986 |
| CNN | Image/vision tasks | Convolution + weight sharing + pooling | 60K - 138M | 1998 |
| RNN/LSTM | Short/medium sequences | Recurrent connections + gating | 100K - 50M | 1997 |
| Transformer | NLP, multimodal, long sequences | Self-attention + positional encoding | 110M - 1.8T | 2017 |
| GAN | Image generation/style transfer | Generator-discriminator adversarial training | 1M - 200M | 2014 |
| Diffusion | High-quality image/video generation | Iterative denoising + probabilistic modeling | 100M - 8B | 2020 |
Architecture Selection Guide
Choose the most suitable neural network architecture based on your task:
| Task | Recommended Architecture | Recommended Models |
|---|---|---|
| Image Classification | CNN / ViT | ResNet, EfficientNet, ViT, ConvNeXt |
| Object Detection | CNN | YOLOv8, Faster R-CNN, DETR |
| Semantic Segmentation | CNN / Transformer | U-Net, SegFormer, Mask2Former |
| Text Classification / NLU | Transformer (Encoder) | BERT, RoBERTa, DeBERTa |
| Text Generation / Chat | Transformer (Decoder) | GPT-4, Claude, LLaMA 3, Gemini |
| Machine Translation | Transformer (Enc-Dec) | T5, mBART, NLLB |
| Time Series Forecasting | LSTM / Transformer | LSTM, Temporal Fusion Transformer, PatchTST |
| Image Generation | Diffusion / GAN | Stable Diffusion, DALL-E 3, StyleGAN |
| Video Generation | Diffusion Transformer | Sora, Runway Gen-3, Kling |
| Multimodal Understanding | Vision Transformer | CLIP, LLaVA, GPT-4V, Gemini |
| Speech Recognition | Transformer | Whisper, Wav2Vec 2.0, Conformer |
| Recommendation System | FNN / Transformer | DeepFM, DLRM, SASRec |
Frequently Asked Questions
Is CNN or Transformer better for image tasks?
Both have advantages. CNNs perform better on smaller datasets due to their inductive biases (locality, translation invariance) providing good priors; they are also more efficient for training and inference. Vision Transformers (ViT) typically outperform CNNs on large-scale datasets (ImageNet-21K, JFT-300M) because self-attention captures global dependencies. The current trend is hybrid architectures (e.g., ConvNeXt borrows Transformer design principles while using convolutions) and fine-tuning pretrained ViTs on smaller datasets.
Why did Transformers replace RNNs as the dominant NLP architecture?
Three main reasons: (1) Parallelization -- RNNs must process sequentially, while Transformer self-attention allows all positions to be computed simultaneously, providing orders-of-magnitude speedup in training; (2) Long-range dependencies -- self-attention directly models relationships between any two positions regardless of distance (theoretically), whereas LSTM still forgets on extremely long sequences despite gating; (3) Scalability -- the Transformer architecture continues improving performance as parameters scale to hundreds of billions (scaling law), something RNNs cannot match.
Which is better: GAN or Diffusion Models?
Diffusion models have comprehensively surpassed GANs in generation quality and diversity, especially for text-conditioned generation. However, GANs still have an advantage in inference speed (single forward pass vs. 20-50 denoising steps for diffusion). For real-time applications or resource-constrained scenarios, GANs may still be preferable. That said, distillation techniques (Consistency Models, LCM) are drastically reducing diffusion model inference steps, approaching real-time. When speed is not a constraint, diffusion models are currently the preferred choice for image generation.
Which architecture should beginners learn first?
Recommended learning order: (1) Start with FNN (fully connected networks) to understand forward propagation, backpropagation, and gradient descent fundamentals; (2) Then learn CNN to understand convolution, pooling, and feature extraction -- practice with MNIST/CIFAR-10; (3) Learn basic RNN/LSTM concepts (even though Transformers dominate, understanding recurrent structure helps grasp the essence of sequence modeling); (4) Finally, dive deep into Transformers, the most important architecture today. PyTorch is recommended as the learning framework due to its intuitive code and easier debugging.
How do I choose the right model size (parameter count)?
Model size depends on three factors: (1) Data volume -- according to Chinchilla scaling laws, optimal training tokens should be approximately 20x the parameter count; insufficient data can lead to overfitting with larger models; (2) Compute resources -- a 7B parameter model requires ~14GB VRAM for inference (FP16), 70B requires ~140GB needing multi-GPU parallelism; (3) Task complexity -- simple classification tasks work fine with BERT-base (110M), while complex reasoning may need 70B+ models. The practical recommendation is to start small, evaluate on a validation set, and scale up only if performance is insufficient.
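The VRAM figures above come from a simple back-of-envelope rule: weight memory is roughly parameter count times bytes per parameter (2 for FP16). A sketch of that estimate, which deliberately ignores activations, KV cache, and optimizer state:

```python
def weight_memory_gb(n_params, bytes_per_param=2):
    """Rough VRAM for model weights alone.
    FP16/BF16 = 2 bytes/param, FP32 = 4, INT8 = 1, 4-bit ~= 0.5.
    Excludes activations, KV cache, and optimizer state, which add more."""
    return n_params * bytes_per_param / 1e9

print(weight_memory_gb(7e9))    # 7B params in FP16  -> 14.0 GB
print(weight_memory_gb(70e9))   # 70B params in FP16 -> 140.0 GB
print(weight_memory_gb(70e9, bytes_per_param=0.5))  # 4-bit quantized 70B
```

Quantization changes the bytes-per-param factor, which is why a 4-bit 70B model can fit on hardware that could never hold it in FP16.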