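Prefill and Decode

How the two inference phases differ in compute characteristics (compute-bound vs. memory-bound), and why that motivates prefill/decode (PD) disaggregation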

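Key points:

Prefill runs one forward pass over the whole prompt; measured AI is about 75-200 FLOP/byte, and it becomes compute-bound for long prompts

Decode computes only 1 token per step and is memory-bound (AI ~ 1)

H100 ridge point is 295 FLOP/byte: AI < 295 is bandwidth-limited, AI > 295 is compute-limited

MFU (prefill 30-45%) vs. MBU (decode 50-80%)

Llama-3 8B decode on H100: theoretical ceiling 209 token/s, 125-167 in practice

DistServe / Splitwise / Mooncake: physically separate prefill and decode, 3-5× throughput

Mooncake (FAST 2025 Best Paper) processes > 100 billion tokens per day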

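Terminology

Terms shared across this chapter are defined in the 8.1 overview (Arithmetic intensity / Roofline / Compute-bound / Memory-bound / PD disaggregation). New in this section:

Term	Definition
Ridge point	The AI value at which the "compute-limited" and "bandwidth-limited" lines of the Roofline model intersect; about 295 FLOP/byte on H100
MFU (Model FLOPS Utilization)	Achieved throughput × FLOPs/token divided by peak FLOPS; measures compute utilization
MBU (Memory Bandwidth Utilization)	Achieved bandwidth / peak bandwidth; measures efficiency in the memory-bound phase
TTFT (Time To First Token)	Latency from request submission to the first generated token; dominated by prefill
TPOT (Time Per Output Token)	Latency to generate each subsequent token; dominated by decode
TBT (Time Between Tokens)	Same as TPOT, emphasizing the interval between two consecutive tokens
Goodput	Effective throughput that satisfies the SLO constraints; the key metric introduced by DistServe

@tbl-pd-glossary Terms introduced in this section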
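To make the latency terms concrete, here is a minimal Python sketch that derives TTFT and TPOT from per-token timestamps and counts the requests that meet both SLOs. The `Request` record, the helper names, and the SLO thresholds are hypothetical and chosen only for illustration. This is a simplified reading of goodput (SLO-compliant requests per second), not DistServe's exact definition.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    submit_time: float        # seconds: when the prompt was submitted
    token_times: List[float]  # seconds: emission time of each output token


def ttft(req: Request) -> float:
    """Time to first token: first emission minus submission."""
    return req.token_times[0] - req.submit_time


def tpot(req: Request) -> float:
    """Mean gap between consecutive output tokens (0 if only one token)."""
    n = len(req.token_times)
    if n < 2:
        return 0.0
    return (req.token_times[-1] - req.token_times[0]) / (n - 1)


def goodput(requests, window_s, ttft_slo_s, tpot_slo_s):
    """Requests per second that satisfy both the TTFT and the TPOT SLO."""
    ok = sum(1 for r in requests
             if ttft(r) <= ttft_slo_s and tpot(r) <= tpot_slo_s)
    return ok / window_s


# Hypothetical trace: only the first request meets both SLOs.
requests = [
    Request(0.0, [0.20, 0.24, 0.28, 0.32]),  # TTFT 0.20 s, TPOT 0.04 s
    Request(1.0, [1.90, 1.94, 1.98]),        # TTFT 0.90 s, misses the TTFT SLO
    Request(2.0, [2.15, 2.35, 2.55]),        # TPOT 0.20 s, misses the TPOT SLO
]
print(goodput(requests, window_s=10.0, ttft_slo_s=0.5, tpot_slo_s=0.05))
```

Raw throughput would count all three requests; goodput counts only the one that a user would have experienced as within SLO.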

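Why do prefill and decode have opposite compute characteristics?

Core question: chapter 02, "What Is a Large Model", already explained that prefill and decode run the same pipeline but behave differently. How exactly do they differ? Why is prefill compute-bound while decode is memory-bound?

Prefill computes the forward pass for T tokens at once, so each weight read is amortized over T tokens and arithmetic intensity is ~ T (compute-bound). Decode computes 1 token per step, so each weight read serves only 1 token and AI ~ 1 (memory-bound).

Arithmetic intensity formula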

$$\begin{equation} \text{AI} = \frac{\text{FLOPs}}{\text{memory bytes read/written}} \label{eq:pd-ai} \end{equation}$$

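High AI: compute-intensive, limited by compute throughput
Low AI: memory-intensive, limited by bandwidth

Deriving decode arithmetic intensity

Per decode step:

Forward pass for 1 token, FLOPs $\approx 2N$ (N is the parameter count)
Read all weights from HBM, $\approx 2N$ bytes (BF16)
Plus the KV cache read: $\approx 2 L \cdot n_{kv} \cdot d_{head} \cdot s \cdot 2$ bytes (K and V for each of $L$ layers, $n_{kv}$ KV heads of dimension $d_{head}$, context length $s$, 2 bytes per BF16 value)

Ignoring the KV cache (short context):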

$$\begin{equation} \text{AI}_{\text{decode}} \approx \frac{2N}{2N} = 1 \text{ FLOP/byte} \label{eq:pd-decode-ai} \end{equation}$$

Llama-2-7B decode at B=1 measures AI ≈ 0.99 ops/byte (LLM-Viewer on A6000): a typical memory-bound workload.
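The derivation above can be checked numerically. The sketch below evaluates decode AI with and without the KV cache term. The model shape (32 layers, 32 KV heads, head dimension 128, about 6.7B parameters) is an assumed Llama-2-7B-like configuration rather than a value taken from this text, and attention FLOPs over the context are ignored, as in the derivation.

```python
def decode_ai(n_params, n_layers, n_kv_heads, d_head, context_len,
              bytes_per_elem=2):
    """Arithmetic intensity (FLOP/byte) of one decode step at batch size 1."""
    flops = 2 * n_params                       # one multiply-add per weight
    weight_bytes = bytes_per_elem * n_params   # every weight is read once
    # K and V, per layer, per KV head, per cached token
    kv_bytes = 2 * n_layers * n_kv_heads * d_head * context_len * bytes_per_elem
    return flops / (weight_bytes + kv_bytes)


# Assumed Llama-2-7B-like shape (illustrative only).
shape = dict(n_params=6.7e9, n_layers=32, n_kv_heads=32, d_head=128)

for s in (0, 512, 4096):
    print(f"context={s:5d}  AI={decode_ai(context_len=s, **shape):.3f} FLOP/byte")
```

With no cached context the result is exactly 1 FLOP/byte; a longer context only pushes AI further below 1, because the KV cache adds bytes without adding weight FLOPs.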

Prefill arithmetic intensity

Prefill runs one forward pass over T tokens:

Weights are read once and shared by all T tokens
FLOPs $\approx 2 N \cdot T$ (linear layers) + $O(T^2 \cdot h)$ (attention)
Memory read $\approx 2N$ bytes (weights)

$$\begin{equation} \text{AI}_{\text{prefill}} \approx \frac{2 N T}{2 N} = T \text{ FLOP/byte} \label{eq:pd-prefill-ai} \end{equation}$$

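Llama-2-7B at T = 2048, B = 1 measures prefill AI of 75-200 ops/byte. The measured value is well below T, in part because the estimate above counts only weight reads and omits activation and KV cache traffic. Long prompts (T ≥ 4096) become compute-bound.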
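As a rough illustration of why the weight-only estimate is optimistic, the sketch below compares it with a per-layer estimate that also counts activation reads and writes for a single linear layer. The 4096 × 4096 layer shape is an assumed example, and this simplified model is not meant to reproduce the measured 75-200 figure.

```python
def weight_only_ai(tokens, bytes_per_elem=2):
    """AI when only weight reads are counted: 2*N*T / (bytes_per_elem*N)."""
    return 2 * tokens / bytes_per_elem


def linear_layer_ai(tokens, d_in, d_out, bytes_per_elem=2):
    """AI of one linear layer, counting weights plus input/output activations."""
    flops = 2 * tokens * d_in * d_out
    weight_bytes = d_in * d_out * bytes_per_elem
    activation_bytes = tokens * (d_in + d_out) * bytes_per_elem
    return flops / (weight_bytes + activation_bytes)


for t in (1, 256, 2048):
    print(f"T={t:5d}  weight-only AI={weight_only_ai(t):7.1f}  "
          f"linear-layer AI={linear_layer_ai(t, 4096, 4096):7.1f}")
```

At T = 1 both estimates are close to 1, matching the decode case. As T grows, activation traffic becomes comparable to the weight read, so the per-layer AI falls increasingly short of T.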

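Roofline model: where compute meets bandwidth

Core question: how large must AI be to count as "compute-limited", and how does that depend on the hardware?

Roofline model: throughput = min(peak FLOPS, memory bandwidth × AI). The intersection is called the ridge point, about 295 FLOP/byte on H100.

The Williams 2009 Roofline formula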

Williams et al. 2009[1]:

$$\begin{equation} \text{AttainablePerf}(\text{AI}) = \min(\text{PeakFLOPS}, \text{MemBW} \cdot \text{AI}) \label{eq:pd-roofline} \end{equation}$$

AI < ridge point → memory-bound (the bandwidth × AI term dominates)
AI > ridge point → compute-bound (the peak FLOPS term dominates)
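The Roofline formula translates directly into a few lines of Python. This is a minimal sketch that uses the H100 SXM5 figures quoted in this section (989 TFLOPS BF16, 3.35 TB/s HBM3); the function names are illustrative.

```python
def attainable_flops(ai, peak_flops, mem_bw):
    """Roofline: attainable FLOP/s for a kernel with arithmetic intensity `ai`."""
    return min(peak_flops, mem_bw * ai)


def ridge_point(peak_flops, mem_bw):
    """AI (FLOP/byte) at which the bandwidth roof meets the compute roof."""
    return peak_flops / mem_bw


def regime(ai, peak_flops, mem_bw):
    """Classify a workload by which roof limits it."""
    if ai > ridge_point(peak_flops, mem_bw):
        return "compute-bound"
    return "memory-bound"


H100_PEAK = 989e12   # FLOP/s, BF16
H100_BW = 3.35e12    # bytes/s, HBM3

print(f"ridge point = {ridge_point(H100_PEAK, H100_BW):.0f} FLOP/byte")
for ai in (1, 150, 4096):
    tflops = attainable_flops(ai, H100_PEAK, H100_BW) / 1e12
    print(f"AI={ai:5d}  {regime(ai, H100_PEAK, H100_BW):13s}  {tflops:7.1f} TFLOP/s attainable")
```

At AI = 1 the attainable rate is 3.35 TFLOP/s, under 1% of peak, which is why decode cannot be judged by FLOPS alone.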

H100 SXM5 figures

Quantity	Value
BF16 peak FLOPS	989 TFLOPS
HBM3 bandwidth	3.35 TB/s
Ridge point	$989 \times 10^{12} / (3.35 \times 10^{12}) \approx$ 295 FLOP/byte

@tbl-pd-h100 H100 SXM5 hardware parameters and ridge point

Decode AI ≈ 1 << 295: memory-bound; throughput is limited by HBM bandwidth, not compute

Prefill AI 75-200 < 295: still memory-bound but approaching the ridge; long prompts (T = 4096+) cross the ridge and become compute-bound

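Theoretical ceiling for decode

Llama-3 8B (16 GB of BF16 weights) on H100 (3.35 TB/s HBM). Every decode step must read all the weights once, so the step rate is bounded by bandwidth divided by model size:

$$\begin{equation} \text{Decode limit} = \frac{3350 \text{ GB/s}}{16 \text{ GB}} \approx 209 \text{ token/s} \quad (B=1) \label{eq:pd-decode-limit} \end{equation}$$

At a realistic MBU of 60-80%, this gives 125-167 token/s.

Implication: this is the ceiling for a single GPU at batch size 1. Throughput can only be raised by batching or by hardware with higher HBM bandwidth.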
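The same bound as a small calculator. This sketch assumes weight reads dominate memory traffic (the KV cache is ignored), so the batch scaling it shows is an upper bound rather than a prediction.

```python
def decode_tokens_per_s(weight_bytes, mem_bw, mbu=1.0, batch=1):
    """Bandwidth-bound decode rate: each step reads all weights once
    and emits one token per request in the batch (KV cache ignored)."""
    steps_per_s = mbu * mem_bw / weight_bytes
    return steps_per_s * batch


WEIGHTS = 16e9    # bytes: Llama-3 8B in BF16, as in the text
H100_BW = 3.35e12  # bytes/s

print(f"ceiling (MBU 100%): {decode_tokens_per_s(WEIGHTS, H100_BW):.0f} token/s")
for mbu in (0.6, 0.8):
    rate = decode_tokens_per_s(WEIGHTS, H100_BW, mbu=mbu)
    print(f"MBU {mbu:.0%}: {rate:.1f} token/s")
print(f"batch 16, MBU 100%: {decode_tokens_per_s(WEIGHTS, H100_BW, batch=16):.0f} token/s total")
```

The per-request rate does not improve with batching; only the aggregate does, because the same weight read now serves every request in the batch.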

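MFU vs. MBU: two metrics for inference efficiency

Core question: training uses MFU to measure compute utilization, but decode is memory-bound, so MFU loses its meaning there. What should be measured instead?

MFU fits the compute-bound phase (prefill) and MBU fits the memory-bound phase (decode); each metric diagnoses its own phase.

Definitions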

$$\begin{equation} \text{MFU} = \frac{\text{throughput tokens/s} \cdot \text{FLOPs/token}}{\text{peak FLOPS}} \label{eq:pd-mfu} \end{equation}$$

$$\begin{equation} \text{MBU} = \frac{\text{model size bytes} \cdot \text{token/s}}{\text{peak HBM BW}} \label{eq:pd-mbu} \end{equation}$$

(Inference FLOPs/token $\approx 2N$, unlike the 6N used for training, because there is no backward pass)
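Both definitions are one-line formulas. The sketch below applies them to a hypothetical measurement of 150 token/s for Llama-3 8B decode at B = 1 on H100. The 150 token/s figure and the round 8e9 parameter count are assumptions for illustration, and the MBU formula as written presumes one full weight read per generated token (that is, B = 1).

```python
def mfu(tokens_per_s, n_params, peak_flops):
    """Model FLOPS utilization, using ~2N FLOPs per token for inference."""
    return tokens_per_s * 2 * n_params / peak_flops


def mbu(tokens_per_s, model_bytes, peak_bw):
    """Memory bandwidth utilization, assuming one weight read per token (B=1)."""
    return model_bytes * tokens_per_s / peak_bw


H100_PEAK = 989e12  # FLOP/s, BF16
H100_BW = 3.35e12   # bytes/s

measured = 150.0    # token/s, hypothetical measurement
print(f"MBU = {mbu(measured, 16e9, H100_BW):.1%}")
print(f"MFU = {mfu(measured, 8e9, H100_PEAK):.2%}")
```

The same run looks healthy under MBU and nearly idle under MFU, which is the reason to report the metric that matches the bottleneck of each phase.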

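Typical figures

Phase	Metric	Typical
Prefill	MFU	30-45% (Pope 2022 reports up to 76% on TPU)
Decode (unoptimized)	MBU	50-80%
Decode (optimized)	MBU	90-95% (FlashAttention + PagedAttention)
Decode	MFU	5-15% (tensor cores largely idle)

@tbl-pd-mfu-mbu Typical MFU and MBU figures

A decode MFU of only 5-15% is the fundamental motivation for PD disaggregation: the tensor cores sit idle, and placing decode and prefill on different hardware lets each be optimized separately.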

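The role of batching: pulling decode from memory-bound toward compute-bound

Core question: decode AI ≈ 1, but real deployments run batch ≥ 16. How does AI change?

Batching lets multiple requests share one weight read, so AI ≈ B (batch size); it is the key lever for moving decode from memory-bound toward compute-bound.

Deriving AI under batching

With batch = B, one decode step:

Forward pass for B requests, FLOPs $\approx 2 N B$
Weights are read once, $\approx 2 N$ bytes (the KV cache adds $B \cdot s \cdot \text{KV/token}$)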

$$\begin{equation} \text{AI}_{\text{decode, batch B}} \approx \frac{2 N B}{2 N} = B \label{eq:pd-batch-ai} \end{equation}$$

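For AI to reach the ridge point of 295, B needs to be ~ 300. Production deployments are constrained by VRAM and typically run B = 16-64, which is still memory-bound but already delivers much higher throughput.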
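The effect of the KV cache term on batched AI can be explored with a short sketch. The KV size per token below assumes a Llama-3-8B-like shape (32 layers, 8 KV heads, head dimension 128, BF16), which is not stated in this text, and attention FLOPs are ignored as in the derivation.

```python
import math


def batched_decode_ai(batch, n_params, kv_bytes_per_token, context_len,
                      bytes_per_elem=2):
    """AI of one decode step: weights are shared, the KV cache is per request."""
    flops = 2 * n_params * batch
    weight_bytes = bytes_per_elem * n_params
    kv_bytes = batch * context_len * kv_bytes_per_token
    return flops / (weight_bytes + kv_bytes)


N = 8e9                          # parameters (assumed round figure)
KV_PER_TOKEN = 2 * 32 * 8 * 128 * 2   # bytes: K and V, assumed shape
RIDGE = 989e12 / 3.35e12         # H100 ridge point from the text

print("batch needed to reach the ridge with no KV traffic:", math.ceil(RIDGE))
for b in (1, 16, 64, 296):
    short = batched_decode_ai(b, N, KV_PER_TOKEN, 0)
    long_ctx = batched_decode_ai(b, N, KV_PER_TOKEN, 4096)
    print(f"B={b:4d}  AI(no KV)={short:6.1f}  AI(s=4096)={long_ctx:5.1f}")
```

Under these assumptions AI equals B only when the KV cache is negligible. At a 4096-token context the per-request KV reads grow with B, so AI levels off below 30 FLOP/byte no matter how large the batch gets.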

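Pope 2022 §4 measurements

Pope et al. 2022 (Google)[2] give a systematic analysis:

B = 1 decode sits to the left of the ridge point and is memory-bound
Increasing the batch size markedly improves throughput at the cost of slightly higher per-request latency
On TPU, large-batch decode MFU can exceed 30%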

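PD disaggregation: physically separating the phases to optimize each

Core question: prefill and decode interfere with each other when they run on the same GPU; a long-prompt prefill makes decode stall. Can they be placed on physically separate hardware?

DistServe / Splitwise / Mooncake take three routes to physically separating prefill and decode, letting each phase choose its own best parallelism strategy; throughput gains measured in industry are 3-5×.

Pain points of colocating prefill and decode

Running prefill + decode on the same GPU:

A large prefill occupies the GPU for hundreds of milliseconds and blocks in-flight decodes, causing TPOT spikes
The optimal parallelism strategies conflict: prefill favors TP (high FLOPS), decode favors PP (low batch)
No single configuration can meet strict SLOs for both TTFT and TPOT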

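DistServe (OSDI 2024): the first systematic treatment

DistServe[3] is the foundational paper on PD disaggregation:

Physically separates prefill and decode instances, and searches for the best parallelism strategy for each independently
KV cache transfer: NVLink at 600 GB/s within a node, InfiniBand at 800 Gbps across nodes; transferring 2048 tokens for OPT-175B takes only 17.6 ms (< one decode step)
Measured against vLLM:
- Summarization workload: goodput +4.48×
- SLO tightened by 10.2×
- Best case overall: serves 7.4× more requests, or meets a 12.6× tighter SLO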
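The cost of moving a KV cache between instances is easy to estimate: its size divided by the link bandwidth. The sketch below assumes the publicly documented OPT-175B shape (96 layers, 96 heads, head dimension 128) with a 2-byte KV element; those values are not given in this text, and the result is a back-of-the-envelope estimate, not the paper's measurement.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, tokens, bytes_per_elem=2):
    """Size of the KV cache for one request: K and V, per layer, per token."""
    return 2 * n_layers * n_kv_heads * d_head * tokens * bytes_per_elem


def transfer_ms(num_bytes, bandwidth_bytes_per_s):
    """Ideal transfer time in milliseconds (no protocol overhead)."""
    return 1e3 * num_bytes / bandwidth_bytes_per_s


size = kv_cache_bytes(n_layers=96, n_kv_heads=96, d_head=128, tokens=2048)
print(f"KV cache: {size / 1e9:.2f} GB")

links = {
    "NVLink 600 GB/s": 600e9,
    "InfiniBand 800 Gbps": 800e9 / 8,  # bits/s -> bytes/s
}
for name, bw in links.items():
    print(f"{name}: {transfer_ms(size, bw):.1f} ms")
```

At 600 GB/s this idealized estimate is about 16 ms, the same order as the 17.6 ms quoted above; the slower cross-node link is where transfer time starts to matter.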

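Splitwise (ISCA 2024): heterogeneous hardware

Splitwise[4] exploits the fact that prefill and decode run with different efficiency on different hardware:

H100 has 3.43× the compute of A100 but only 1.64× the HBM bandwidth
Prefill runs on H100 (high FLOPS utilization)
Decode runs on A100 (a better bandwidth-to-compute ratio)
The KV cache is transferred layer by layer over InfiniBand; after optimization the overhead is 0.8% of end-to-end latency

Measured results:

2.35× throughput at the same budget
Or 1.4× throughput at 20% lower cost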
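The hardware argument can be reduced to two ratios. The sketch below uses only the 3.43× and 1.64× figures quoted above and treats each phase as purely compute-bound or purely memory-bound, which is an idealization; the function names are illustrative.

```python
COMPUTE_RATIO = 3.43     # H100 compute relative to A100 (from the text)
BANDWIDTH_RATIO = 1.64   # H100 HBM bandwidth relative to A100 (from the text)


def ideal_speedup(compute_bound: bool) -> float:
    """Best-case H100-over-A100 speedup for a phase limited by one resource."""
    return COMPUTE_RATIO if compute_bound else BANDWIDTH_RATIO


def relative_bandwidth_per_flop() -> float:
    """HBM bandwidth per unit of compute on H100, relative to A100."""
    return BANDWIDTH_RATIO / COMPUTE_RATIO


print(f"prefill (compute-bound): up to {ideal_speedup(True):.2f}x on H100")
print(f"decode  (memory-bound):  up to {ideal_speedup(False):.2f}x on H100")
print(f"bandwidth per FLOP, H100 vs A100: {relative_bandwidth_per_flop():.2f}x")
```

Decode gains less than half of what prefill gains from the newer GPU, and H100 offers roughly half the bandwidth per FLOP of A100, which is the sense in which A100 has the better ratio for decode.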

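Mooncake (FAST 2025 Best Paper): a global KV pool

Mooncake[5] is the production system behind Kimi (Moonshot):

A global KVCache pool (CPU DRAM / SSD): prefill nodes write to it, decode nodes read from it
The Transfer Engine supports RDMA at up to 800 Gbps, with bandwidth aggregated across multiple NICs

Production measurements (23k traces, mean 7955 input tokens):

Handles 75% more requests within SLO
Peak improvement of 498% for long context (128K)
Runs on thousands of nodes and processes > 100 billion tokens per day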

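Industry frameworks

vLLM: integrated the Mooncake Transfer Engine in December 2024; 8 KV connectors; PD disaggregation is marked Experimental
NVIDIA Dynamo (GTC 2025): PD disaggregation is part of the core design; compatible with TRT-LLM / vLLM / SGLang; included in NVIDIA AI Enterprise
Since 2025, PD disaggregation has become the "default approach" for hyperscale inference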

Takeaway

Topic	Key conclusion
Prefill AI	$\approx T$ FLOP/byte; compute-bound for long prompts
Decode AI	$\approx 1$ FLOP/byte, memory-bound
Roofline formula	$\min(\text{Peak FLOPS}, \text{MemBW} \cdot \text{AI})$
H100 ridge point	295 FLOP/byte (989 TFLOPS / 3.35 TB/s)
Decode theoretical ceiling	Llama-3 8B on H100 = 209 token/s (B=1)
Where MFU applies	Prefill 30-45%; Pope reports 76% on TPU
Where MBU applies	Decode 50-95%
Decode MFU only 5-15%	Tensor cores idle; the fundamental motivation for PD disaggregation
Effect of batching	$\text{AI} \approx B$, moving toward the ridge point
DistServe	Physically separates prefill and decode; goodput +4.48×, SLO 10.2× tighter
Splitwise	Prefill on H100 + decode on A100; 2.35× throughput
Mooncake	Global KV pool + RDMA at 800 Gbps; 128K context +498%; 100 billion tokens per day
Industry status	Integrated in NVIDIA Dynamo / vLLM; the default approach since 2025

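Open questions

PD disaggregation vs. chunked prefill: both address prefill-decode interference; will they merge into one unified scheme?
The ceiling on decode MFU: with so many tensor cores idle, will a GPU architecture optimized specifically for decode emerge?
The limits of cross-node KV cache transfer: 800 Gbps RDMA can still become a bottleneck, and cross-node KV latency for 100K+ contexts still needs work
Is heterogeneous hardware (the Splitwise route) sustainable: can A100/H100 mixed deployments extend to newer hardware such as B200 / GB200?
Inference characteristics of reasoning models (o1/R1): CoT makes decode length explode; does PD disaggregation pay off even more?

References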

[1] Williams, Waterman, Patterson. Roofline: An Insightful Visual Performance Model for Multicore Architectures. CACM 2009.
[2] Pope et al. Efficiently Scaling Transformer Inference. 2022. https://arxiv.org/abs/2211.05102
[3] Zhong et al. DistServe: Disaggregating Prefill and Decoding for Goodput-optimized LLM Serving. OSDI 2024. https://arxiv.org/abs/2401.09670
[4] Patel et al. Splitwise: Efficient Generative LLM Inference Using Phase Splitting. ISCA 2024. https://arxiv.org/abs/2311.18677
[5] Qin et al. Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving. FAST 2025 (Best Paper). https://arxiv.org/abs/2407.00079
