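Pipeline Parallelism (PP)

How to split a model into stages by layer, how much activation data each P2P transfer carries, and where the bubble comes from and how to shrink it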

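Key points:

The basic principle of how PP splits a model by layer into stages

How 1F1B scheduling keeps the pipeline fully loaded

The formula for the size of a single P2P message

The pipeline bubble ratio formula and the factors that affect it

Three engineering techniques for reducing the bubble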

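Pipeline parallelism (PP) splits a model by layer into multiple stages and uses P2P transfers to pass activations and gradients between adjacent stages. Megatron-LM v2 systematized how to combine it with TP and DP [1]. Communication volume is small; the main overhead comes from the pipeline bubble.

How does PP split the model, and how does it keep the pipeline fully loaded?

PP splits an $L$-layer Transformer evenly into $p$ stages, with one device per stage: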

Stage 0          Stage 1          Stage 2          Stage 3
[Layers 0-7]  →  [Layers 8-15]  →  [Layers 16-23]  →  [Layers 24-31]
              P2P              P2P               P2P
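The even split above can be sketched in a few lines. This is a minimal illustration, not any framework's actual partitioner; the function name `partition_layers` is hypothetical, and handing leftover layers to the earliest stages is an assumption for the case where $L$ is not divisible by $p$.

```python
def partition_layers(num_layers: int, num_stages: int) -> list:
    """Split layer indices 0..num_layers-1 into contiguous, near-equal stages."""
    if not 1 <= num_stages <= num_layers:
        raise ValueError("need 1 <= num_stages <= num_layers")
    base, extra = divmod(num_layers, num_stages)
    stages, start = [], 0
    for stage in range(num_stages):
        # Assumption: the first `extra` stages each take one leftover layer.
        size = base + (1 if stage < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages


for i, layers in enumerate(partition_layers(32, 4)):
    print(f"Stage {i}: layers {layers.start}-{layers.stop - 1}")
```

With 32 layers and 4 stages this yields the same 8-layer blocks as the diagram above.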

To keep devices from sitting idle, the global batch is split into $m$ micro-batches that are fed into the pipeline one after another.

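With 1F1B (One Forward, One Backward) scheduling, each device alternates between one forward pass and one backward pass once the pipeline reaches steady state. The backward pass of a micro-batch starts on the last stage and flows back toward stage 0. The schedule below uses $p=4$, $m=4$ and draws forward and backward passes as equal-length slots for simplicity (a dot marks an idle slot):

Time →
Stage 0: F0 F1 F2 F3 .  .  .  B0 .  B1 .  B2 .  B3
Stage 1: .  F0 F1 F2 .  .  B0 F3 B1 .  B2 .  B3 .
Stage 2: .  .  F0 F1 .  B0 F2 B1 F3 B2 .  B3 .  .
Stage 3: .  .  .  F0 B0 F1 B1 F2 B2 F3 B3 .  .  .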
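The per-stage operation order under 1F1B can be sketched as follows. This is a minimal sketch assuming the common non-interleaved formulation: stage $i$ runs $p-1-i$ warmup forwards, then alternates forward and backward, then drains the remaining backwards. The function names are hypothetical, and the code models only the order of operations on each stage, not their timing.

```python
def one_f_one_b_order(stage: int, num_stages: int, num_microbatches: int) -> list:
    """Return the sequence of ops ("F3", "B0", ...) that one stage executes."""
    warmup = min(num_stages - 1 - stage, num_microbatches)
    steady = num_microbatches - warmup
    ops = [f"F{i}" for i in range(warmup)]          # warmup: forwards only
    for i in range(steady):                         # steady state: 1 forward, 1 backward
        ops.append(f"F{warmup + i}")
        ops.append(f"B{i}")
    ops += [f"B{i}" for i in range(steady, num_microbatches)]  # drain: backwards only
    return ops


def peak_in_flight(ops: list) -> int:
    """Largest number of micro-batches whose forward ran but backward has not."""
    in_flight = peak = 0
    for op in ops:
        in_flight += 1 if op.startswith("F") else -1
        peak = max(peak, in_flight)
    return peak


p, m = 4, 8
for stage in range(p):
    ops = one_f_one_b_order(stage, p, m)
    print(f"Stage {stage}: {' '.join(ops)}  (peak in flight: {peak_in_flight(ops)})")
```

In this sketch no stage ever holds more than $p$ micro-batches in flight, regardless of $m$.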

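When does PP communicate?

PP communication happens at the boundaries between adjacent stages, following a linear, chain-like P2P pattern:

Forward: stage $i$ output activations → P2P Send → stage $i+1$
Backward: stage $i+1$ gradients → P2P Send → stage $i$

No collective communication primitives are involved; stage $i$ communicates only with stages $i \pm 1$.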

How large is a single P2P message?

The size of a single P2P message equals the size of the activation tensor:

$$\begin{equation} M_{\text{PP}} = b \cdot s \cdot h \cdot \text{dtype\_size} \label{eq:par-pp-comm-volume} \end{equation}$$

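Here $b$ is the micro-batch size, $s$ the sequence length, and $h$ the hidden size. This is the same order of magnitude as a TP AllReduce (both move a $b \times s \times h$ tensor), but the semantics differ: PP uses point-to-point transfers, while TP uses collective communication.

Typical value ($b=1$, $s=4096$, $h=7168$, BF16): about 58.7 MB per transfer.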
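The arithmetic behind the typical value can be checked with a short sketch. The helper names and the `DTYPE_BYTES` table are hypothetical. The per-step figure assumes one activation send and one gradient send of equal size per micro-batch at each stage boundary, which is a simplification.

```python
DTYPE_BYTES = {"fp32": 4, "bf16": 2, "fp16": 2}


def pp_message_bytes(b: int, s: int, h: int, dtype: str = "bf16") -> int:
    """Bytes in one P2P message: a b x s x h activation (or gradient) tensor."""
    return b * s * h * DTYPE_BYTES[dtype]


def pp_boundary_bytes_per_step(m: int, b: int, s: int, h: int, dtype: str = "bf16") -> int:
    """Bytes crossing one stage boundary per training step.

    Assumption: each of the m micro-batches sends one activation forward
    and one gradient backward, both of the same size.
    """
    return 2 * m * pp_message_bytes(b, s, h, dtype)


msg = pp_message_bytes(b=1, s=4096, h=7168, dtype="bf16")
print(f"single message: {msg / 1e6:.1f} MB")
print(f"per boundary per step (m=8): {pp_boundary_bytes_per_step(8, 1, 4096, 7168) / 1e6:.1f} MB")
```

Note that 58.7 MB uses decimal megabytes ($10^6$ bytes); in binary units the same tensor is exactly 56 MiB.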

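PP communication characteristics:

Characteristic	Value
Communication primitive	P2P Send / Recv
Message size	10 ~ 100 MB
Communication pattern	Linear chain (stage $i \leftrightarrow i \pm 1$)
Frequency	Once per micro-batch (at each stage boundary)
Latency sensitivity	Medium (the pipeline bubble can partly hide communication)
Recommended topology	High-bandwidth links between adjacent stages; full connectivity is not required

@tbl-pp-01 Summary of PP communication characteristics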

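What is the pipeline bubble, and how is it calculated?

The pipeline bubble is the device idle time during the pipeline's warmup and drain phases. The bubble, not the communication itself, is the main overhead of PP.

Bubble ratio formula [1]: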

$$\begin{equation} \text{Bubble ratio} = \frac{p - 1}{m + p - 1} \label{eq:par-pp-bubble-ratio} \end{equation}$$

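$p$: number of stages (pipeline depth)
$m$: number of micro-batches
Numerator $p-1$: the $p-1$ idle slots each device spends in warmup and drain combined, where one slot is the forward plus backward time of one micro-batch
Denominator $m + p - 1$: total time slots ($m$ busy slots + $p-1$ idle slots)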
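A short derivation shows where the ratio comes from. It is a sketch under simplifying assumptions: every stage takes the same forward time $t_f$ and backward time $t_b$ per micro-batch, and communication time is ignored. The symbols $t_f$, $t_b$, $T_{\text{ideal}}$, and $T_{\text{bubble}}$ are introduced here for illustration only.

$$\begin{equation} \begin{aligned} T_{\text{ideal}} &= m \, (t_f + t_b) \\ T_{\text{bubble}} &= (p - 1) \, (t_f + t_b) \\ \text{Bubble ratio} &= \frac{T_{\text{bubble}}}{T_{\text{ideal}} + T_{\text{bubble}}} = \frac{p - 1}{m + p - 1} \end{aligned} \end{equation}$$

The factor $(t_f + t_b)$ cancels, which is why the ratio depends only on $p$ and $m$ under these assumptions.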

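When $m \gg p$ the ratio is approximately $(p-1)/m$. The approximation always overestimates the bubble, and the error is large unless $m \gg p$:

$p$	$m$	Exact	Approx. $(p-1)/m$	Relative error
4	8	27.3%	37.5%	+37.5%
8	8	46.7%	87.5%	+87.5%
8	16	30.4%	43.8%	+43.8%
4	32	8.6%	9.4%	+9.4%

@tbl-pp-02 Exact bubble formula vs. the approximation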
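The comparison can be recomputed with a few lines of Python. This is an illustrative sketch with hypothetical function names. One thing it makes visible: because approx / exact $= (m+p-1)/m$, the relative error of the approximation is itself equal to $(p-1)/m$.

```python
def bubble_exact(p: int, m: int) -> float:
    """Exact bubble ratio (p-1)/(m+p-1)."""
    return (p - 1) / (m + p - 1)


def bubble_approx(p: int, m: int) -> float:
    """Approximation (p-1)/m, intended for m >> p."""
    return (p - 1) / m


def compare(configs):
    """Return (p, m, exact, approx, relative_error) for each (p, m) pair."""
    rows = []
    for p, m in configs:
        exact, approx = bubble_exact(p, m), bubble_approx(p, m)
        rows.append((p, m, exact, approx, approx / exact - 1))
    return rows


print("p\tm\texact\tapprox\trel. error")
for p, m, exact, approx, rel in compare([(4, 8), (8, 8), (8, 16), (4, 32)]):
    print(f"{p}\t{m}\t{exact:.1%}\t{approx:.1%}\t{rel:+.1%}")
```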

How can the bubble be reduced?

There are three engineering techniques, each with its own cost:

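Increase $m$ (more micro-batches): $(p-1)/(m+p-1) \to 0$. The cost is a larger global batch, or smaller and less efficient micro-batches if the global batch is fixed. Under a schedule that runs all forwards before any backward, activation memory also grows with $m$, whereas 1F1B keeps at most $p$ micro-batches in flight per stage [1]
Decrease $p$ (fewer stages): this runs counter to the reason for using PP in the first place
Interleaved 1F1B (interleaved pipeline) [2]: each stage is split into $v$ chunks, and the bubble drops to $\frac{p-1}{vm + p - 1}$, which is approximately $\frac{p-1}{vm}$ when $vm \gg p$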
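The effect of the three knobs can be compared numerically with the formulas above. This is a minimal sketch; `bubble_ratio` and `min_microbatches` are hypothetical helper names, and the code evaluates only the closed-form ratio, not any memory or communication cost.

```python
def bubble_ratio(p: int, m: int, v: int = 1) -> float:
    """Bubble ratio (p-1)/(v*m + p - 1); v=1 is plain (non-interleaved) 1F1B."""
    return (p - 1) / (v * m + p - 1)


def min_microbatches(p: int, target: float, v: int = 1) -> int:
    """Smallest m whose bubble ratio is at or below `target` (simple linear search)."""
    m = 1
    while bubble_ratio(p, m, v) > target:
        m += 1
    return m


p, m = 8, 16
for v in (1, 2, 4):
    print(f"p={p}, m={m}, v={v}: bubble = {bubble_ratio(p, m, v):.1%}")

print("m needed for <= 15% bubble, v=1:", min_microbatches(8, 0.15, v=1))
print("m needed for <= 10% bubble, v=2:", min_microbatches(8, 0.10, v=2))
```

In the formula, doubling $v$ has the same effect on the ratio as doubling $m$, which is why interleaving helps when $m$ cannot grow further.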

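P2P latency directly widens the bubble: any P2P transfer latency $\alpha_{\text{P2P}}$ that is not overlapped with compute is added to the bubble. Adjacent stages should therefore be connected by low-latency links.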

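Takeaway

Topic	Key conclusion
How PP splits the model	By layer into $p$ stages, one device per stage
Scheduling strategy	1F1B keeps the pipeline fully loaded in steady state
Communication pattern	Chain-like P2P between adjacent stages, not collective communication
Single message size	$b \times s \times h \times \text{dtype}$, same order of magnitude as TP
Bubble formula	$(p-1)/(m+p-1)$, approximately $(p-1)/m$ when $m \gg p$
Main overhead	The pipeline bubble, not the communication itself
Ways to reduce the bubble	Increase $m$ / decrease $p$ / Interleaved 1F1B
Topology requirement	Low latency between adjacent stages is enough; full connectivity is not required

@tbl-pp-03 Key PP takeaways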

References

[1] Narayanan et al., Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM, SC 2021. https://arxiv.org/abs/2104.04473
[2] Narayanan et al., Memory-Efficient Pipeline-Parallel DNN Training, ICML 2021. https://arxiv.org/abs/2006.09503
