Blocks and Stacking

How a complete Transformer block is assembled, how the number of layers is chosen, and where the parameters and compute are concentrated

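Key points:

A complete block: attention sublayer + FFN sublayer + 2 norms + 2 residual connections

Vaswani's original post-norm has largely been abandoned; pre-norm is the modern mainstream

Block parameters: FFN 2/3, attention 1/3 (an algebraic identity for MHA with a 4× FFN)

Model depth: $h/L \in [100, 130]$ is the empirical industry band

Kaplan 2020: the aspect ratio can vary by about 40× with only a small loss penalty (< 3% in the reported example); total parameter count matters far more than shape

Tay 2021 DeepNarrow and Depth Delusion 2026 argue in opposite directions; the question is still evolving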

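Glossary

Terms shared across this chapter (Transformer block / Pre-norm / Post-norm) are defined in the 5.1 overview. New terms introduced in this article:

Term	Definition
Aspect ratio	The ratio $h / L$ of the model's hidden size to its number of layers; it indicates whether a model is "deep and narrow" or "shallow and wide"
Sandwich-LN	Pre-norm plus one extra LN on each sublayer's output; introduced by CogView, with Gemma 2 applying the same idea using RMSNorm
Parallel block	PaLM design: a single LN feeds both attention and the FFN, and the two paths are computed in parallel; faster to train, but with little measured quality gain
DeepNarrow	The "deep and narrow" strategy proposed by Tay 2021: under a fixed compute budget, increasing depth and reducing width beats the reverse

@tbl-block-stacking-glossary Terms newly introduced in this article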

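What does a complete Transformer block look like?

Core question: Chapter 04 covered attention, and the first two articles of this chapter covered the FFN and normalization. What does "one complete block" look like once these are put together? Where exactly do Vaswani's original and modern Llama differ in ordering?

The forward function of a modern (pre-norm) block has only two lines: one attention path and one FFN path, each applying LN, then the sublayer, then the residual add.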

Vaswani's original (post-norm)

In the words of Vaswani 2017 §3.1: "the output of each sub-layer is LayerNorm(x + Sublayer(x))", that is:

$$\begin{align} \mathbf{h} &= \mathrm{LayerNorm}(\mathbf{x} + \mathrm{Attention}(\mathbf{x})) \\ \mathbf{out} &= \mathrm{LayerNorm}(\mathbf{h} + \mathrm{FFN}(\mathbf{h})) \label{eq:block-vaswani} \end{align}$$

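The residual add is inside and LayerNorm is outside. Deep stacks train unstably and need a carefully tuned warmup (see 03-归一化与残差, the article on normalization and residuals).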

Pre-norm since GPT-2 (the modern mainstream)

In the words of GPT-2 (Radford 2019): "Layer normalization was moved to the input of each sub-block..."

$$\begin{align} \mathbf{h} &= \mathbf{x} + \mathrm{Attention}(\mathrm{LayerNorm}(\mathbf{x})) \\ \mathbf{out} &= \mathbf{h} + \mathrm{FFN}(\mathrm{LayerNorm}(\mathbf{h})) \label{eq:block-pre-norm} \end{align}$$

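LayerNorm sits before each sublayer, so the residual path is never interrupted by a normalization, and training is stable without special tuning.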
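The difference between the two orderings can be seen on a toy vector. The following is a minimal pure-Python sketch, assuming a LayerNorm without learnable scale or shift; the sublayers `toy_attn`, `toy_ffn`, and `zero` are hypothetical stand-ins, not code from any of the cited implementations.

```python
import math

def layer_norm(x, eps=1e-5):
    """LayerNorm without learnable parameters: zero mean, unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add(a, b):
    return [u + v for u, v in zip(a, b)]

def post_norm_block(x, attn, ffn):
    # Vaswani 2017: LayerNorm wraps the residual sum.
    h = layer_norm(add(x, attn(x)))
    return layer_norm(add(h, ffn(h)))

def pre_norm_block(x, attn, ffn):
    # GPT-2 onward: LayerNorm is applied only to the sublayer input.
    h = add(x, attn(layer_norm(x)))
    return add(h, ffn(layer_norm(h)))

# Hypothetical stand-in sublayers, for illustration only.
def toy_attn(x):
    return [0.1 * v for v in x]

def toy_ffn(x):
    return [0.2 * v for v in x]

def zero(x):
    return [0.0] * len(x)

x = [10.0, -2.0, 4.0, 0.0]

# With sublayers that output zero, pre-norm is exactly the identity,
# while post-norm still renormalizes the residual stream.
identity_pre = pre_norm_block(x, zero, zero)
identity_post = post_norm_block(x, zero, zero)

pre = pre_norm_block(x, toy_attn, toy_ffn)
post = post_norm_block(x, toy_attn, toy_ffn)
print("pre-norm keeps x when sublayers are zero:", identity_pre == x)
print("post-norm keeps x when sublayers are zero:", identity_post == x)
```

In the pre-norm version the input reaches the output through additions only, which is the uninterrupted residual path described above; in the post-norm version the output is always renormalized, so the scale of the input is not carried forward.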

The minimal nanoGPT implementation

Karpathy's nanoGPT translates the equations above into a two-line forward:

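import torch.nn as nn

# LayerNorm, CausalSelfAttention, and MLP are defined alongside Block in nanoGPT's model.py.
class Block(nn.Module):
    def __init__(self, config):
        super().__init__()  # nn.Module must be initialized before submodules are assigned
        self.ln_1 = LayerNorm(config.n_embd, bias=config.bias)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = LayerNorm(config.n_embd, bias=config.bias)
        self.mlp = MLP(config)

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))   # attention path
        x = x + self.mlp(self.ln_2(x))    # FFN path
        return x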

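At the very end of the top-level model there is one more ln_f (final layer norm), applied after all blocks and before the LM head. GPT-2 introduced this detail and it is still in use today.

Llama 3: a modern implementation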

Llama 3 model.py TransformerBlock.forward[1]:

def forward(self, x, start_pos, freqs_cis, mask):
    h = x + self.attention(self.attention_norm(x), start_pos, freqs_cis, mask)
    out = h + self.feed_forward(self.ffn_norm(h))
    return out

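The structure is exactly the same; LayerNorm is simply replaced with RMSNorm:

attention_norm and ffn_norm are both RMSNorm instances (not LayerNorm)
attention internally uses GQA + RoPE
feed_forward is a SwiGLU FFN
At the end of the top-level model there is also self.norm (the final RMSNorm) + self.output (the LM head)

Llama differs from nanoGPT in only 3 places: RMSNorm instead of LayerNorm, GQA + RoPE instead of MHA + learned positional embeddings, and SwiGLU instead of a GELU FFN. The skeleton is identical.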

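Two block-ordering variants worth mentioning

Pre-norm is the mainstream, but two variants deserve a brief mention:

Sandwich-LN (CogView 2021)

Ding et al., CogView 2021[2], add one more LN on each sublayer's output on top of pre-norm: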

$$\begin{equation} \mathbf{out} = \mathbf{x} + \mathrm{LN}(\mathrm{Sublayer}(\mathrm{LN}(\mathbf{x}))) \label{eq:block-sandwich} \end{equation}$$

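Google's Gemma 2 (2024) follows the same idea, applying RMSNorm to both the input and the output of each sublayer. This "double LN" adds some stability at the cost of one extra normalization per sublayer (two per block).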

Parallel block (PaLM 2022)

PaLM (Chowdhery 2022)[3] lets attention and the FFN share a single LN and computes the two paths in parallel:

$$\begin{equation} \mathbf{out} = \mathbf{x} + \mathrm{Attention}(\mathrm{LN}(\mathbf{x})) + \mathrm{FFN}(\mathrm{LN}(\mathbf{x})) \label{eq:block-parallel} \end{equation}$$

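Each block needs only 1 LN and the two paths can run in parallel; PaLM reports roughly 15% faster training at large scale. It shows no significant quality advantage, though, and among the models discussed here only the PaLM family adopted it; Llama, Qwen, and DeepSeek did not follow.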
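Both variants are small rearrangements of the same three operations. The sketch below is a pure-Python illustration under the same simplification as the equations (LN without learnable parameters); `make_scaled` is a hypothetical stand-in sublayer that just multiplies its input by a constant.

```python
import math

def layer_norm(x, eps=1e-5):
    """LayerNorm without learnable parameters: zero mean, unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add(*vectors):
    """Element-wise sum of any number of equal-length vectors."""
    return [sum(vals) for vals in zip(*vectors)]

def sandwich_sublayer(x, sublayer):
    # out = x + LN(Sublayer(LN(x)))
    return add(x, layer_norm(sublayer(layer_norm(x))))

def parallel_block(x, attn, ffn):
    # out = x + Attention(LN(x)) + FFN(LN(x)); LN is computed once and shared.
    normed = layer_norm(x)
    return add(x, attn(normed), ffn(normed))

def make_scaled(scale):
    """Hypothetical sublayer that multiplies its input by `scale`."""
    return lambda v: [scale * u for u in v]

x = [10.0, -2.0, 4.0, 0.0]

# Sandwich-LN: the residual update has unit variance no matter how large
# the raw sublayer output is, so a 1000x larger sublayer changes almost nothing.
small = sandwich_sublayer(x, make_scaled(1.0))
large = sandwich_sublayer(x, make_scaled(1000.0))
max_gap = max(abs(a - b) for a, b in zip(small, large))

# Parallel block: both sublayers read the same normalized input.
par = parallel_block(x, make_scaled(0.1), make_scaled(0.2))
print("max difference between small and large sublayer:", max_gap)
print("parallel block output:", par)
```

The bounded update is a plausible reading of why the extra output norm helps stability; the shared `normed` value is what removes one LN per block in the parallel design.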

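How many layers should be stacked?

Core question: With the block written, the model is a stack of $L$ blocks. How should $L$ be chosen? Is there a scaling law to guide the choice?

In practice, the layer counts of industrial dense LLMs cluster in the band $h / L \in [100, 130]$, a narrow range distilled from years of experiments in which training neither stalls nor diverges. Scaling laws say total parameter count matters far more than shape, but extreme shapes are still penalized.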

Layer counts of real models

Model	$L$	$h$	$h / L$
GPT-2 small	12	768	64
GPT-2 large	36	1280	36
GPT-3 175B	96	12288	128
Llama 3 8B	32	4096	128
Llama 3 70B	80	8192	102
Llama 3 405B	126	16384	130
Qwen2.5 72B	80	8192	102
DeepSeek-V3 (MoE)	61	7168	118
Mixtral 8×7B (MoE)	32	4096	128

@tbl-block-layers Layer counts and $h/L$ ratios of mainstream LLMs

Two patterns:

$h/L \approx 100-130$ is the empirical industry band for dense LLMs: GPT-3, Llama, and Qwen all fall inside it
The MoE models in the table ($h/L \approx 118-128$) sit in the same band: they add capacity through the number of experts rather than through extra layers
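The $h/L$ column is simple enough to recompute. The sketch below does so for a subset of the rows above, rounding to the nearest integer as the table does; the dictionary name `MODELS` and the helper `aspect_ratio` are illustrative choices, not part of any library.

```python
# (number of layers L, hidden size h) for a subset of the table rows.
MODELS = {
    "GPT-2 small": (12, 768),
    "GPT-3 175B": (96, 12288),
    "Llama 3 8B": (32, 4096),
    "Llama 3 70B": (80, 8192),
    "Llama 3 405B": (126, 16384),
    "Qwen2.5 72B": (80, 8192),
    "DeepSeek-V3 (MoE)": (61, 7168),
}

def aspect_ratio(layers, hidden):
    """h / L, rounded to the nearest integer as in the table."""
    return round(hidden / layers)

ratios = {name: aspect_ratio(L, h) for name, (L, h) in MODELS.items()}
in_band = [name for name, r in ratios.items() if 100 <= r <= 130]

for name, r in ratios.items():
    marker = "in band" if name in in_band else "outside"
    print(f"{name:20s} h/L = {r:4d}  ({marker})")
```

Of these rows, only GPT-2 small falls outside the 100 to 130 band, which matches the observation that the band describes the later, larger models rather than the GPT-2 generation.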

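Kaplan 2020: shape is secondary, total parameters dominate

Kaplan et al. 2020, Scaling Laws[4], give empirical evidence on shape:

At a fixed parameter count $N$, the depth/width shape has only a weak effect
Concretely: the aspect ratio can vary by about 40× with only a small effect on loss; for example, (L=6, h=4288) reaches a loss within 3% of (L=48, h=1600)
Total parameter count $N$ matters far more than model shape

This means that for the vast majority of choices, changing $L$ helps less than changing $N$. Extreme aspect ratios (too deep or too shallow) still degrade performance, however, which is why industry avoids the extremes.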
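To see why the two configurations in the Kaplan example are a fair comparison, it helps to compute their sizes and shapes. The sketch below assumes the usual approximation of $12 L h^2$ non-embedding parameters; the function and variable names are illustrative.

```python
def non_embedding_params(n_layer, d_model):
    """Approximate non-embedding parameter count: 12 * L * h^2."""
    return 12 * n_layer * d_model ** 2

def aspect_ratio(n_layer, d_model):
    return d_model / n_layer

wide_shallow = (6, 4288)   # (L, h)
deep_narrow = (48, 1600)   # (L, h)

n_wide = non_embedding_params(*wide_shallow)
n_deep = non_embedding_params(*deep_narrow)
ar_wide = aspect_ratio(*wide_shallow)
ar_deep = aspect_ratio(*deep_narrow)

print(f"wide/shallow: N = {n_wide / 1e9:.2f}B, h/L = {ar_wide:.1f}")
print(f"deep/narrow:  N = {n_deep / 1e9:.2f}B, h/L = {ar_deep:.1f}")
print(f"aspect ratios differ by {ar_wide / ar_deep:.1f}x")
```

Under this approximation the two models have a similar parameter count (about 1.3B versus 1.5B) while their aspect ratios differ by roughly 21×, so the small loss gap is attributable to shape rather than size.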

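Chinchilla 2022: compute-optimal, shape still secondary

Hoffmann 2022, Chinchilla[5], mainly establishes the compute-optimal relationship between $N$ and the number of training tokens $D$. It does not prescribe a depth/width shape and treats it as a secondary hyperparameter.

Chinchilla 70B itself uses $L = 80, h = 8192, h/L = 102$, which places its shape inside the mainstream "safe zone".

Tay 2021: DeepNarrow beats WideShallow

Tay et al. 2021, Scale Efficiently[6], find on the T5 encoder-decoder:

The DeepNarrow strategy beats WideShallow: under the same parameter budget, increasing depth and reducing width works better
Comparable performance to T5-base with 50% fewer parameters and 40% faster training
The depth scaling operator has a larger effect on the Pareto frontier than the other scaling operators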

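Depth Delusion 2026: the opposite argument

More recent work[7], however, reports the opposite:

Width should grow $2.8\times$ faster than depth: the optimum scales as $W^* \sim C^{0.34}$ versus $D^* \sim C^{0.12}$
There is a critical depth $D_{\text{crit}} \sim W^{0.44}$ beyond which adding layers raises the loss
Measured: 64L/6.38B has a loss 0.12 nats higher than 32L/6.86B (the deeper model is worse at the same compute)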
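The exponents above translate into concrete growth factors. The sketch below only evaluates the stated power laws; since the proportionality constants are not given here, it reports relative growth rather than absolute widths or depths, and the function name `growth` is illustrative.

```python
WIDTH_EXP = 0.34       # W* ~ C^0.34
DEPTH_EXP = 0.12       # D* ~ C^0.12
CRIT_DEPTH_EXP = 0.44  # D_crit ~ W^0.44

def growth(compute_multiplier):
    """Relative growth of optimal width and depth when compute is multiplied."""
    return compute_multiplier ** WIDTH_EXP, compute_multiplier ** DEPTH_EXP

exponent_ratio = WIDTH_EXP / DEPTH_EXP
width_x10, depth_x10 = growth(10.0)
crit_depth_per_width_doubling = 2.0 ** CRIT_DEPTH_EXP

print(f"exponent ratio: {exponent_ratio:.2f}")
print(f"10x compute -> width x{width_x10:.2f}, depth x{depth_x10:.2f}")
print(f"doubling width raises critical depth by x{crit_depth_per_width_doubling:.2f}")
```

Read this way, the "2.8×" figure is the ratio of the two exponents: a tenfold increase in compute would roughly double the optimal width while raising the optimal depth by only about a third.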

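The field is still evolving. The safest current practice is to design within the mainstream band $h/L \in [100, 130]$ and then adjust for hardware parallelism constraints (TP/PP partitioning).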

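Where do the parameters and compute go?

Core question: In a complete Transformer model, how are total parameters and FLOPs distributed across attention, FFN, normalization, embedding, and LM head? Which parts dominate?

Inside a block the FFN dominates (2/3 of the parameters), the embedding + LM head are a significant extra cost, and the normalization parameters are negligible.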

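Parameter distribution inside one block (algebraic identity)

Component	Parameters (with GQA / SwiGLU adjustments)	Share (standard)
Attention ($W_Q, W_K, W_V, W_O$)	$4h^2$ (MHA), or fewer (GQA)	$4/12 \approx 33\%$
FFN ($W_{\text{up}}, W_{\text{down}}$), standard	$8h^2$	$8/12 \approx 67\%$
FFN (SwiGLU: $W_{\text{gate}}, W_{\text{up}}, W_{\text{down}}$, $h_{\text{ffn}}=8h/3$)	$8h^2$ (preserved by design)	$\approx 67\%$
Norm layers (2 × RMSNorm)	$2h$ (negligible)	$< 0.1\%$
Block total	$\approx 12h^2$	100%

@tbl-block-param-distribution Parameter distribution of a single block (for a given $h$)

For MHA with a 4× FFN (or SwiGLU with $h_{\text{ffn}} = 8h/3$), the split of 2/3 FFN and 1/3 attention is an algebraic identity that does not depend on $h$ (see 02-激活与FFN).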
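The identity can be checked with exact arithmetic. The sketch below counts weight-matrix parameters only (no biases, no norms); the function names and the GQA head counts in the last case are illustrative assumptions rather than values taken from a specific model.

```python
from fractions import Fraction

def attention_params(h, n_heads=1, n_kv_heads=1):
    """W_Q and W_O are h x h; W_K and W_V shrink by n_kv_heads / n_heads under GQA."""
    kv_dim = h * n_kv_heads // n_heads
    return 2 * h * h + 2 * h * kv_dim

def ffn_params(h, d_ff, gated=False):
    """Two matrices for a standard FFN, three for a gated (SwiGLU-style) FFN."""
    return (3 if gated else 2) * h * d_ff

def ffn_share(h, d_ff, gated=False, n_heads=1, n_kv_heads=1):
    attn = attention_params(h, n_heads, n_kv_heads)
    ffn = ffn_params(h, d_ff, gated)
    return Fraction(ffn, attn + ffn)

# Standard MHA + 4x FFN: exactly 2/3 for every h.
standard = [ffn_share(h, 4 * h) for h in (768, 3072, 12288)]

# SwiGLU with d_ff = 8h/3 (h = 3072 gives an integer d_ff of 8192): still 2/3.
swiglu = ffn_share(3072, 8192, gated=True)

# With GQA (assumed 32 query heads, 8 KV heads) attention shrinks,
# so the FFN share rises above 2/3.
gqa = ffn_share(3072, 8192, gated=True, n_heads=32, n_kv_heads=8)

print(standard, swiglu, gqa)
```

The last case shows the limit of the identity: once GQA or a wider FFN is used, the split moves further toward the FFN.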

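Parameter distribution of the whole model

Total model parameters ≈ $L \cdot 12h^2 + V \cdot h + V \cdot h$ (the last two terms are the embedding and the LM head; with weight tying they collapse into one):

Stacked blocks: $12 L h^2$
Embedding: $V h$ (Llama 3 8B: $128256 \times 4096 \approx 525M$)
LM head: $V h$ (Llama 3 8B does not tie weights, so another 525M)

For example, for Llama 3 8B:

Blocks: $12 \cdot 32 \cdot 4096^2 \approx 6.4B$
Embedding + LM head: $2 \cdot 525M \approx 1.05B$
Total: $\approx 7.5B$, close to the nominal 8B; the $12h^2$ rule slightly undercounts here because Llama 3 8B's FFN is wider than $8h/3$, which outweighs the savings from GQA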
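The estimate can be compared against an exact count. The sketch below assumes the publicly documented Llama 3 8B hyperparameters (32 query heads, 8 KV heads, FFN hidden size 14336); these values are not stated in this article and are labeled as assumptions in the code.

```python
V, h, L = 128256, 4096, 32

# Rule-of-thumb estimate from the text: 12 L h^2 + 2 V h.
estimate = 12 * L * h ** 2 + 2 * V * h

# Assumed Llama 3 8B hyperparameters (public model config, not from this article).
n_heads, n_kv_heads, d_ff = 32, 8, 14336
head_dim = h // n_heads

attn = 2 * h * h + 2 * h * (n_kv_heads * head_dim)  # W_Q, W_O full; W_K, W_V reduced by GQA
ffn = 3 * h * d_ff                                   # SwiGLU: gate, up, down
norms = 2 * h                                        # two RMSNorm weight vectors per block
per_block = attn + ffn + norms

# Blocks + untied embedding and LM head + final RMSNorm.
exact = L * per_block + 2 * V * h + h

print(f"estimate: {estimate / 1e9:.2f}B")
print(f"exact:    {exact / 1e9:.2f}B")
print(f"per block: attention {attn / h**2:.1f} h^2, FFN {ffn / h**2:.1f} h^2")
```

Under these assumptions the attention sublayer shrinks to $2.5h^2$ while the FFN grows to $10.5h^2$, so a block holds about $13h^2$ parameters instead of $12h^2$, which accounts for the gap between the estimate and the nominal 8B.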

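FLOPs distribution

Forward FLOPs per token (during training, multiply by $T$ for a whole sequence). The first three rows are per block; the LM head is paid once per model:

Component	FLOPs per token
Attention projections (Q/K/V/O)	$8h^2$
Attention scores + softmax × V	$4 T h$
FFN (standard 4×)	$16h^2$
LM head	$2 V h$

@tbl-block-flops-distribution Forward FLOPs per token

For short sequences ($T \ll h$) the FFN dominates FLOPs at roughly 2/3; for long sequences the share of attention, whose cost is $O(T^2)$ per sequence, rises.

Hoffmann's Chinchilla Appendix F reports that in typical settings the FFN contributes about 2/3 of training FLOPs, consistent with the algebra above.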
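How the split shifts with sequence length follows directly from the per-token formulas in the table. The sketch below evaluates the three per-block rows for $h = 4096$; it uses the full $T$ in the $4Th$ term exactly as the table does (no causal-mask halving), and the function names are illustrative.

```python
def block_flops_per_token(h, T):
    """Per-block forward FLOPs per token, using the formulas from the table."""
    return {
        "attn_proj": 8 * h * h,    # Q/K/V/O projections
        "attn_scores": 4 * T * h,  # QK^T and softmax x V
        "ffn": 16 * h * h,         # standard 4x FFN
    }

def ffn_share(h, T):
    flops = block_flops_per_token(h, T)
    return flops["ffn"] / sum(flops.values())

h = 4096
for T in (512, 8192, 131072):
    print(f"T = {T:6d}: FFN share = {ffn_share(h, T):.3f}")

# Attention (projections + scores) matches the FFN at T = 2h,
# and the score term alone matches everything else at T = 6h.
at_2h = block_flops_per_token(h, 2 * h)
at_6h = block_flops_per_token(h, 6 * h)
```

For $h = 4096$ the FFN share falls from about 65% at $T = 512$ to 50% at $T = 8192$ and to roughly 10% at $T = 128K$, which is the shift from FFN-dominated to attention-dominated compute described above.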

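Engineering implications

At short context, optimizing the FFN pays off more than optimizing attention: quantization and sparsity target the FFN first
MoE replaces the FFN with sparse experts, directly attacking the 2/3 compute bottleneck; see interconnect/05-LLM并行通信/08-专家并行
At long context the attention share rises: at $T = 128K$ attention dominates compute, which is the root of Flash Attention's importance (see 04-注意力机制/04-因果掩码)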

The GPT backbone is now complete

With chapters 04 and 05 done, the GPT backbone is complete:

input text
    │
    ▼
[tokenizer + embedding + position encoding]  ← ch. 03
    │
    ▼
┌────────────────────────────────────┐
│  Transformer block × L             │  ← ch. 04 + 05
│  ┌──────────────────────────────┐  │
│  │  pre-RMSNorm + attention +   │  │  ← ch. 04 attention sublayer
│  │  residual                    │  │
│  ├──────────────────────────────┤  │
│  │  pre-RMSNorm + FFN(SwiGLU) + │  │  ← 05-02 + 05-03
│  │  residual                    │  │
│  └──────────────────────────────┘  │  ← 05-04 assembly (this article)
└────────────────────────────────────┘
    │
    ▼
[final RMSNorm + LM head + sampling]   ← 03-token-embedding + 08-推理

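What remains:

06-预训练 (pretraining): next-token-prediction training on this skeleton, covering the training objective, the training loop, and scaling laws
07-微调与对齐 (fine-tuning and alignment): SFT / RLHF / DPO for aligning to human preferences
08-推理 (inference): prefill / decode / KV cache / sampling / quantization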

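Takeaway

Topic	Key conclusion
Modern block structure	pre-RMSNorm → attention → residual → pre-RMSNorm → FFN(SwiGLU) → residual
Vaswani's original vs modern	post-norm → pre-norm, LayerNorm → RMSNorm, 4× FFN → SwiGLU 8/3
nanoGPT vs Llama	Identical skeleton; only norm, position encoding, and FFN are swapped
Block-ordering variants	Sandwich-LN (used by Gemma 2) / Parallel block (used by PaLM); neither is mainstream
Block parameter distribution	FFN 2/3, attention 1/3 (algebraic identity, independent of $h$)
Total model parameters	$\approx 12 L h^2 + 2 V h$
FLOPs distribution	FFN about 2/3 at short sequences; attention share rises at long sequences
Kaplan 2020	Shape (aspect ratio) changes loss by < 3%; total parameters dominate
Chinchilla 2022	Compute-optimal analysis focuses on $N$ and $D$; shape is secondary
Tay 2021	DeepNarrow beats WideShallow (T5 encoder-decoder)
Depth Delusion 2026	The opposite: width should grow 2.8× faster than depth, and a critical depth exists
Industry practice	$h/L \in [100, 130]$ is the robust band for dense LLMs
MoE depth	Capacity is added through the number of experts rather than extra layers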

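Open questions

DeepNarrow vs Depth Delusion, which is right: the two lines of work point to opposite conclusions, and mainstream practice is still waiting in the $h/L \approx 100-130$ band
Optimal depth for MoE: DeepSeek-V3 chose 61 layers + 256 experts, but there is no scaling law yet for the optimal relationship between expert count and depth
Whether very deep (>200 layers) or very shallow (<20 layers for a large model) designs hide unexplored sweet spots: Kaplan 2020 is limited to a moderate range, and extreme shapes have not been studied systematically
Industrial value of variants such as Sandwich-LN and Parallel block: the empirical gains are small but not zero; are there specific scenarios where they are worth adopting? Still open
Next-generation block structures: Mamba / SSM / linear attention all challenge the Transformer block; whether a "post-Transformer" era will arrive is still open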

References

[1] Meta Llama 3 model.py. https://github.com/meta-llama/llama3/blob/main/llama/model.py
[2] Ding et al. CogView: Mastering Text-to-Image Generation via Transformers (Sandwich-LN). 2021. https://arxiv.org/abs/2105.13290
[3] Chowdhery et al. PaLM: Scaling Language Modeling with Pathways (parallel block). 2022. https://arxiv.org/abs/2204.02311
[4] Kaplan et al. Scaling Laws for Neural Language Models. 2020. https://arxiv.org/abs/2001.08361
[5] Hoffmann et al. Training Compute-Optimal Large Language Models. 2022. https://arxiv.org/abs/2203.15556
[6] Tay et al. Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers. 2021. https://arxiv.org/abs/2109.10686
[7] The Depth Delusion: Width should grow 2.8× faster than depth. 2026. arXiv preprint.
