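Inference Deployment Patterns

How the Prefill/Decode asymmetry gives rise to the co-located and disaggregated architectures, and where the bandwidth for KV cache migration comes from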

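Key points:

- The compute/communication asymmetry between Prefill and Decode
- How to choose between the co-located and disaggregated deployment architectures
- How migration protocols differ for homogeneous vs heterogeneous KV caches
- Full-stack bandwidth requirements of long-context deployment
- How Speculative Decoding and Paged KV affect communication
- Dynamic CP scheduling (NanoCP): per-request CP degree + MoE/KV decoupling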

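The fundamental difference between LLM inference and training communication comes from the asymmetry between the two phases, Prefill and Decode: prefill is compute-intensive batch processing (large messages, beta-dominated), while decode is bandwidth-sensitive, fine-grained incremental work (small messages, alpha-dominated). This asymmetry gives rise to two deployment architectures: co-located and disaggregated.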

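Why are Prefill and Decode asymmetric?

The two phases have completely different compute profiles:

| Dimension | Prefill | Decode |
|---|---|---|
| Input length | The entire prompt (up to 1M tokens) | 1 new token |
| Attention compute | $O(S^2)$, or $O(Sk)$ when sparsified | $O(S)$, reading the accumulated KV |
| GEMM shape | Large batch ($S \times d$) | Small batch ($1 \times d$) |
| Compute utilization | High (compute-bound) | Low (memory-bound) |
| Main bottleneck | Compute throughput / CP bandwidth | KV cache read bandwidth |
| Per-token latency sensitivity | Low (one-shot prefill) | High (user-perceived TPOT) |

@tbl-infer-01 Compute asymmetry between Prefill and Decode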
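The compute-bound vs memory-bound split in the table can be checked with a rough roofline estimate. The sketch below is only illustrative: `PEAK_FLOPS`, `HBM_BANDWIDTH` and the hidden size `d` are assumed round numbers rather than figures for any particular GPU or model, and the byte count ignores caching effects.

```python
# Illustrative roofline check; the hardware numbers are assumed round values,
# not measurements of any specific GPU.
PEAK_FLOPS = 1.0e15      # assumed dense BF16 peak, FLOP/s
HBM_BANDWIDTH = 3.0e12   # assumed HBM bandwidth, bytes/s


def gemm_intensity(m: int, d_in: int, d_out: int, elem_bytes: int = 2) -> float:
    """FLOPs per byte for an (m x d_in) @ (d_in x d_out) GEMM."""
    flops = 2 * m * d_in * d_out
    # Read activations and weights once, write the output once.
    bytes_moved = elem_bytes * (m * d_in + d_in * d_out + m * d_out)
    return flops / bytes_moved


def is_compute_bound(m: int, d: int) -> bool:
    """True if the GEMM's intensity reaches the machine's ridge point."""
    ridge = PEAK_FLOPS / HBM_BANDWIDTH  # FLOPs per byte needed to saturate compute
    return gemm_intensity(m, d, d) >= ridge


d = 8192  # hypothetical hidden size
prefill_intensity = gemm_intensity(8192, d, d)  # S = 8192 prompt tokens
decode_intensity = gemm_intensity(1, d, d)      # 1 new token
```

With these assumptions the single-token GEMM does less than one FLOP per byte moved, because the weight read dominates, while the prefill GEMM sits well above the ridge point.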

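Communication pressure diverges as well:

| Communication type | Prefill | Decode |
|---|---|---|
| TP AllReduce | Once per layer, large messages ($S \cdot d$) | Once per layer, tiny messages ($1 \cdot d$) |
| CP Ring | Required, heavy traffic | Not needed when the KV fits; when a 1M-token KV does not fit on one GPU, split the KV and pass Q (see CP in the decode phase) |
| EP a2a | Many tokens dispatched at once | 1-token a2a, alpha-dominated |
| PP P2P | Large activation transfers | Small activation transfers |

@tbl-infer-02 Communication pressure in the two phases

In the decode phase every communication message becomes small, so the alpha term (startup latency) dominates. This is fundamentally different from prefill, where the beta term (bandwidth) dominates, and it is what drives the split in deployment architectures.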
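A minimal sketch of the alpha-beta cost model makes the contrast concrete. It models a single point-to-point message, not a full AllReduce, and `ALPHA_S`, `BANDWIDTH` and the hidden size are assumed values chosen only for illustration.

```python
def transfer_time_s(msg_bytes: float, alpha_s: float, bandwidth: float) -> float:
    """Alpha-beta cost of one message: startup latency plus bytes over bandwidth."""
    return alpha_s + msg_bytes / bandwidth


def alpha_share(msg_bytes: float, alpha_s: float, bandwidth: float) -> float:
    """Fraction of the transfer time spent in the fixed startup term."""
    return alpha_s / transfer_time_s(msg_bytes, alpha_s, bandwidth)


ALPHA_S = 5e-6      # assumed startup latency: 5 microseconds
BANDWIDTH = 400e9   # assumed link bandwidth: 400 GB/s
D_MODEL = 8192      # hypothetical hidden size
ELEM_BYTES = 2      # BF16

decode_msg = 1 * D_MODEL * ELEM_BYTES        # one token: 16 KiB
prefill_msg = 32768 * D_MODEL * ELEM_BYTES   # 32K-token prompt: 512 MiB

decode_share = alpha_share(decode_msg, ALPHA_S, BANDWIDTH)
prefill_share = alpha_share(prefill_msg, ALPHA_S, BANDWIDTH)
```

Under these assumptions the startup term accounts for more than 99% of the decode message time and less than 1% of the prefill message time.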

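Co-located or disaggregated: how to choose?

Co-located: prefill and decode run on the same group of GPUs. Representative systems: early vLLM / SGLang.

GPU 0-7: handle every request in the batch
  request arrives → prefill immediately → enter the decode queue → keep decoding until done

Characteristics:

- Static resources: every GPU needs both compute and KV cache memory
- Mixed batch shapes: long prefills and short decodes share a batch
- Latency interference: while a long prefill holds resources, the TPOT of decodes in the same batch is dragged down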

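Disaggregated: prefill and decode run in separate GPU pools, and the KV cache is migrated to the decode pool once prefill finishes. Representative systems: DistServe[1] / Splitwise[2] / Mooncake[3].

Prefill Pool (high-compute GPUs):
  request arrives → prefill → emit first token + KV cache → migrate to the decode pool

KV Cache Migration (cross-cluster NVLink / RDMA)
  ↓

Decode Pool (high-bandwidth + large-memory GPUs):
  receive KV cache → keep decoding → stream output tokens

Characteristics:

- Dedicated resources: the prefill pool stacks compute, the decode pool stacks bandwidth / memory
- No interference: long prefills do not block decode TPOT
- New communication burden: the KV cache must be migrated across clusters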

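Selection criteria:

| Factor | Favors co-located | Favors disaggregated |
|---|---|---|
| Request length | Short (< 4K) | Long (128K+) |
| Prefill / Decode time ratio | Close to 1 | Highly skewed (e.g. 1000× for long prompts) |
| Cluster size | Small (tens of GPUs) | Large (100+ GPUs) |
| GPU heterogeneity | Homogeneous | Can be heterogeneous (prefill on H100, decode on L40S) |
| KV cache compression | Weak (migration is expensive) | Strong (migration is affordable) |

@tbl-infer-03 Decision factors for co-located vs disaggregated

Long context plus strong KV compression makes disaggregation pay off: the cache is small, migration is cheap, and prefill compute pressure is high.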

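How should the KV cache migration protocol be designed?

KV cache migration is the core communication burden of a disaggregated deployment, and the protocol's complexity depends on whether the cache types are homogeneous or heterogeneous.

Homogeneous KV cache (V3 / GQA / MQA / MLA)

Every layer's cache is a dense tensor of the same shape and precision:

- Simple protocol: a single RDMA write, with destination addresses laid out by per-layer offset
- High bandwidth utilization (large messages)
- The default assumption in vLLM Disaggregated / Mooncake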

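Heterogeneous KV cache (V4-style hybrid attention)

Different layers use different kinds of attention, so their cache types differ:

| Layer type | Cache contents | Precision | Shape |
|---|---|---|---|
| SWA layer | The most recent $n_{\text{win}}$ uncompressed KV entries | BF16 | Linear in the window size |
| HCA layer | One compressed entry per $m'$ tokens (shared by K and V) | RoPE dims in BF16, the rest in FP8 | Linear in $S/m'$ |
| CSA layer | Indexer key cache (no V, dim=128) + main compressed entry (shared by K and V, dim=512) | Main path: RoPE dims in BF16, the rest in FP8; indexer key in FP8 | Linear in $S/m$ |

@tbl-infer-04 Layouts of a heterogeneous KV cache

Migration protocol requirements:

- Split by per-layer schema: the migrating side knows each layer's type in advance
- Heterogeneous transfers: different chunk sizes and different precision combinations
- Metadata synchronization: cache indices, free list, PagedAttention block table

The implementation is a notch more complex than the homogeneous case, but the absolute data volume actually drops (a V4-Pro 1M-token cache is about 10% of the V3.2 one).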
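One way to picture a schema-aware migration is a per-layer plan of destination offsets and chunk sizes. The sketch below is hypothetical throughout: `LayerSchema`, `migration_plan`, the RoPE width of 64, the SWA head dimension of 128, the HCA entry width of 512, and the values chosen for $n_{\text{win}}$, $m'$ and $m$ are assumptions made for illustration, not the actual V4 layout or any framework's API.

```python
from dataclasses import dataclass
from math import ceil
from typing import List, Optional, Tuple

BF16, FP8 = 2, 1  # bytes per element


@dataclass(frozen=True)
class LayerSchema:
    kind: str                     # "swa", "hca" or "csa"
    entry_bytes: int              # bytes per cached entry
    stride: int                   # tokens covered by one entry (1 = uncompressed)
    window: Optional[int] = None  # SWA only: cap on retained entries

    def chunk_bytes(self, seq_len: int) -> int:
        entries = ceil(seq_len / self.stride)
        if self.window is not None:
            entries = min(entries, self.window)
        return entries * self.entry_bytes


def mixed_entry_bytes(dim: int, rope_dim: int) -> int:
    """One shared K=V entry: RoPE dims in BF16, the remaining dims in FP8."""
    return rope_dim * BF16 + (dim - rope_dim) * FP8


def migration_plan(schemas: List[LayerSchema], seq_len: int) -> List[Tuple[int, str, int, int]]:
    """Return (layer, kind, dst_offset, nbytes) per layer, packed back to back."""
    plan, offset = [], 0
    for layer, schema in enumerate(schemas):
        nbytes = schema.chunk_bytes(seq_len)
        plan.append((layer, schema.kind, offset, nbytes))
        offset += nbytes
    return plan


# Hypothetical parameters: rope_dim=64, head_dim=128, n_win=4096, m'=128, m=4.
schemas = [
    LayerSchema("swa", entry_bytes=2 * 128 * BF16, stride=1, window=4096),
    LayerSchema("hca", entry_bytes=mixed_entry_bytes(512, 64), stride=128),
    LayerSchema("csa", entry_bytes=128 * FP8 + mixed_entry_bytes(512, 64), stride=4),
]
plan = migration_plan(schemas, seq_len=1_000_000)
```

Each layer ends up with a different chunk size, which is why a single fixed-stride RDMA write no longer suffices; the receiver needs the same schema list to interpret the offsets.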

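Migration bandwidth requirement

Migration time must be less than the first-token generation time, otherwise user-perceived latency goes up: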

$$\begin{equation} T_{\text{mig}} = \frac{M_{\text{KV}}}{B_{\text{interconnect}}} < T_{\text{first-token}} \label{eq:par-infer-mig-budget} \end{equation}$$

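V4-Pro 1M-context example: with a 4 GB cache and 200 GB/s of aggregate cross-cluster InfiniBand bandwidth (a single NDR port is 400 Gb/s, about 50 GB/s, so roughly four ports), migration takes 20 ms, which is fully hidden behind the first-token generation time (hundreds of ms).

Concrete requirement on the interconnect: the bandwidth between the prefill and decode pools should be at least within the same order of magnitude as the NVLink bandwidth inside the decode pool; otherwise migration becomes the bottleneck.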
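The budget check is simple enough to write down directly. This sketch reuses the 4 GB / 200 GB/s figures from the example; the 300 ms first-token time is an assumed stand-in for "hundreds of ms", and the function names are illustrative.

```python
def migration_time_s(kv_bytes: float, link_bytes_per_s: float) -> float:
    """T_mig = M_KV / B_interconnect."""
    return kv_bytes / link_bytes_per_s


def within_budget(kv_bytes: float, link_bytes_per_s: float, first_token_s: float) -> bool:
    """True if migration can hide behind first-token generation."""
    return migration_time_s(kv_bytes, link_bytes_per_s) < first_token_s


def min_bandwidth(kv_bytes: float, first_token_s: float) -> float:
    """Smallest link bandwidth (bytes/s) that still meets the budget."""
    return kv_bytes / first_token_s


FIRST_TOKEN_S = 0.3  # assumed first-token time

t_mig = migration_time_s(4e9, 200e9)                    # 4 GB over 200 GB/s
fits = within_budget(4e9, 200e9, FIRST_TOKEN_S)
too_slow = within_budget(40e9, 50e9, FIRST_TOKEN_S)     # 10x larger cache, one 50 GB/s link
needed = min_bandwidth(4e9, FIRST_TOKEN_S)
```

The last two lines show how quickly the budget breaks when the cache is not compressed: a cache ten times larger over a single 50 GB/s link takes 0.8 s.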

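Dynamic CP scheduling: how does NanoCP make the CP degree per-request?

NanoCP (2026-05) moves CP from static sharding to request-level dynamic scheduling: each request chooses its own CP degree, and MoE communication is decoupled from the KV cache[4]. It is the next step for disaggregated architectures under hybrid DP-EP MoE serving: when request lengths vary widely, a static CP degree cannot balance attention load and MoE load at the same time.

The tension is that attention latency grows with KV size (sequence length), while MoE communication latency grows with batch size (number of requests). Existing systems bind both to the same DP instance, so balancing one necessarily worsens the other and produces EP stragglers. NanoCP decouples the two bindings:

| Binding | Definition | Cardinality |
|---|---|---|
| MoE binding | The DP instance that executes the request's MoE dispatch/combine | Exactly 1 |
| KV binding | The set of instances that store the KV and compute attention | 1 to $k$ (the CP degree) |

@tbl-infer-nanocp-binding NanoCP's decoupled MoE / KV bindings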

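WaterFill allocation: given a request's sequence length $l_r$ and its set of participating instances $P_r$, tokens are poured into the least-loaded instances first (classic water-filling), flattening the peak KV load. Here $K_s$ is the KV load already on instance $s$ and $\text{Split}_r[s]$ is the number of the request's tokens assigned to it: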

$$\begin{equation} \min \max_{s \in P_r} (K_s + \text{Split}_r[s]) \quad \text{s.t.} \quad \sum_{s \in P_r} \text{Split}_r[s] = l_r \label{eq:par-infer-waterfill} \end{equation}$$

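The CP degree $k_r$ is determined by a length-bucket function obtained from offline profiling: short requests (4K) complete locally ($k=1$), while long requests (512K) span multiple instances. At peak only 1.1% of requests use cross-instance CP, and the peak DCP all-to-all latency is only 35.5 μs. The MoE binding is reassigned to the least-loaded instance within the KV binding set, at zero KV migration cost (only the routing table is updated)[4].

Performance (32×H200, 1T MoE, 32DP-32EP)[4]: maximum request rate +88% to +227% (1.88–3.27×), P99 tail latency −44% to −53%, CP communication overhead −90% (630→60 μs). The sequence-splitting mechanism of CP itself is covered in the context parallelism section; this section focuses on making it dynamic at the serving scheduler layer and decoupling it from MoE.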
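The min-max objective above can be solved with a textbook water-filling pass. The sketch below is a generic integer water-fill written for illustration, not NanoCP's implementation; `water_fill` and the instance names are hypothetical, loads are counted in tokens, and the set $P_r$ is taken as given.

```python
from typing import Dict


def water_fill(loads: Dict[str, int], tokens: int) -> Dict[str, int]:
    """Split `tokens` across instances so the peak of load + split is minimal."""
    if not loads or tokens < 0:
        raise ValueError("need at least one instance and a non-negative token count")
    order = sorted(loads, key=lambda s: loads[s])  # least-loaded first
    prefix, level, active = 0, 0, 0
    for j, s in enumerate(order, start=1):
        prefix += loads[s]
        # Water level if only the j least-loaded instances receive tokens.
        candidate = (tokens + prefix) // j
        if j == len(order) or candidate <= loads[order[j]]:
            level, active = candidate, j
            break
    split = {s: 0 for s in loads}
    for s in order[:active]:
        split[s] = level - loads[s]  # raise each active instance to the level
    leftover = tokens - sum(split.values())  # remainder of the integer division
    for s in order[:leftover]:
        split[s] += 1
    return split


loads = {"a": 10, "b": 4, "c": 7}  # existing KV load K_s per instance
split = water_fill(loads, tokens=9)
peak = max(loads[s] + split[s] for s in loads)
```

In this example the nine tokens go entirely to `b` and `c`, lifting both to the load already on `a`, so the peak stays at 10.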

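What does Speculative Decoding mean for communication?

Speculative Decoding lets each decode step verify several candidate tokens in parallel, raising effective throughput. From a communication standpoint, messages get slightly larger but remain far below prefill scale.

Two approaches:

Separate draft model

A small model generates $k$ candidate tokens and the large model verifies them in one pass:

- Per-step communication volume is about $k\times$ that of a single-token decode step (verify processes $k$ tokens)
- In effect this is a small prefill-style batch, so the alpha term is amortized
- Rolling back after a failed verify costs almost no extra communication (only local logits are discarded)

Multi-Token Prediction (MTP)

The model itself predicts the next-2 token (as used in DeepSeek V3 / V4):

- The main model gains one extra MTP block (in DeepSeek V3, a single additional Transformer block appended after the main stack, not one per layer), so the communication path adds one more TP AllReduce + MoE a2a
- Message size is the same as on the main path, but the batch grows by 1 token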

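Communication characteristics compared:

| Dimension | Standard decode | Speculative decode |
|---|---|---|
| Tokens per step | 1 | $1 + k$ |
| TP message size | $1 \cdot d$ | $(1+k) \cdot d$ |
| Verify failure | N/A | Some tokens are discarded, but the communication has already happened |
| Effective TPOT | $T_{\text{step}}$ | $T_{\text{step}} / \mathbb{E}[\text{accepted}]$ |

@tbl-infer-05 Communication characteristics of speculative decoding

Key point: the slightly larger messages ease alpha sensitivity, but they still fall short of a prefill batch, so the alpha term remains important.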
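To get a feel for the effective TPOT row, the sketch below assumes each draft token is accepted independently with probability `p`, which gives the standard closed form $(1 - p^{k+1})/(1 - p)$ for the expected tokens emitted per verify step (accepted drafts plus the one token the target model always produces). The i.i.d. assumption, the reading of $\mathbb{E}[\text{accepted}]$ as tokens emitted per step, and the numbers used are illustrative choices, not values from the source.

```python
def expected_tokens_per_step(k: int, p: float) -> float:
    """Expected tokens emitted per verify step with k drafts, i.i.d. acceptance p."""
    if not 0.0 <= p <= 1.0 or k < 0:
        raise ValueError("need k >= 0 and 0 <= p <= 1")
    if p == 1.0:
        return float(k + 1)
    return (1.0 - p ** (k + 1)) / (1.0 - p)


def effective_tpot_s(t_step_s: float, k: int, p: float) -> float:
    """T_step divided by the expected tokens emitted per step."""
    return t_step_s / expected_tokens_per_step(k, p)


def tp_message_bytes(k: int, d_model: int, elem_bytes: int = 2) -> int:
    """TP AllReduce payload for one verify step: (1 + k) tokens of width d."""
    return (1 + k) * d_model * elem_bytes


tokens = expected_tokens_per_step(k=4, p=0.8)
tpot = effective_tpot_s(0.030, k=4, p=0.8)    # assumed 30 ms step time
msg = tp_message_bytes(k=4, d_model=8192)     # hypothetical hidden size, BF16
```

With four drafts at 80% acceptance the step emits about 3.4 tokens, while the TP message grows only fivefold to 80 KiB, still orders of magnitude below a prefill-sized message.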

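How do Continuous Batching and Paged KV affect communication?

Mainstream inference frameworks (vLLM / SGLang) use continuous batching: at each step finished requests are swapped out and new ones are swapped in. This affects communication in two ways.

Dynamic batch size

The batch size can change at every step, and TP / EP message sizes change with it. The collective communication library must support variable-length messages and cannot afford a separate warm-up for every size.

Non-contiguous memory of the Paged KV cache

PagedAttention manages KV in blocks (typically 16 tokens/block) drawn from a global free list, so the KV is not contiguous in physical memory:

- One request's KV spans multiple non-contiguous blocks
- Migration / gather needs scatter-gather style DMA
- NVLink P2P / RDMA must support scatter-gather lists (NCCL ncclSend does not, since it takes a single contiguous buffer; an NVSHMEM-style device-initiated interface is needed)

@tbl-infer-06 Requirements that the Paged KV cache places on communication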
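The scatter-gather requirement can be illustrated by turning a request's block table into a list of (offset, length) segments. This is a simplified sketch: `build_sg_list` is a hypothetical helper, it assumes one flat per-layer KV pool addressed by physical block index, and `bytes_per_token` is an arbitrary value.

```python
from typing import List, Tuple


def build_sg_list(block_table: List[int], num_tokens: int,
                  block_tokens: int = 16, bytes_per_token: int = 1024
                  ) -> List[Tuple[int, int]]:
    """Return (byte_offset, byte_length) segments, merging adjacent blocks."""
    block_bytes = block_tokens * bytes_per_token
    remaining = num_tokens * bytes_per_token
    segments: List[Tuple[int, int]] = []
    for phys_block in block_table:
        if remaining <= 0:
            break
        length = min(block_bytes, remaining)  # the last block may be partly filled
        offset = phys_block * block_bytes
        if segments and segments[-1][0] + segments[-1][1] == offset:
            # Physically adjacent to the previous segment: extend it.
            segments[-1] = (segments[-1][0], segments[-1][1] + length)
        else:
            segments.append((offset, length))
        remaining -= length
    if remaining > 0:
        raise ValueError("block table too short for num_tokens")
    return segments


# A 56-token request whose logical blocks map to physical blocks 7, 8, 3, 12.
sg = build_sg_list([7, 8, 3, 12], num_tokens=56)
```

Only blocks 7 and 8 happen to be adjacent, so the request still needs three separate segments; a transport that accepts only one contiguous buffer would need three sends or a staging copy.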

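What does long-context deployment require from the interconnect?

Long-context (128K+) inference deployment is a "full-stack bandwidth" challenge, and optimizing a single point yields limited gains:

| Communication load | Key metric | Typical scale |
|---|---|---|
| Prefill CP ring | Large-message bandwidth | Tens of MB per ring step per layer |
| Prefill EP a2a | Irregular + wave-scheduled small messages | MB-scale per wave |
| Decode TP AllReduce | Small-message latency | A few KB per token per layer, alpha-dominated |
| Cache migration (disaggregated) | High cross-cluster bandwidth | GB-scale, one-shot |
| Spec decode verify | Medium messages ($(1+k) \cdot d$) | Tens of KB |

@tbl-infer-07 What long-context inference requires from the interconnect

NVLink (intra-node large messages) + NVSwitch (CP / EP all-to-all) + RDMA (cross-pool migration) + HBM (decode reads): all four kinds of bandwidth must be in place at the same time.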

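Takeaway

| Topic | Key conclusion |
|---|---|
| Two-phase asymmetry | Prefill is compute-bound with large messages; Decode is memory-bound with small messages |
| Alpha-dominated | Decode messages are small, so alpha (startup latency) dominates, unlike training |
| Deployment architecture | Co-located vs disaggregated; long context + strong KV compression favors disaggregation |
| KV migration protocol | Homogeneous is simple (a single RDMA write); heterogeneous needs schema-aware, heterogeneous transfers |
| Migration budget | $T_{\text{mig}} < T_{\text{first-token}}$; cross-pool bandwidth must be sufficient |
| Speculative decode | Messages are slightly larger but still smaller than prefill; alpha still matters |
| Paged KV | Physically non-contiguous; communication needs scatter-gather DMA support |
| Full-stack bandwidth | NVLink + NVSwitch + RDMA + HBM must all be in place |

@tbl-infer-08 Key takeaways for inference deployment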

References

Zhong et al., DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving, OSDI 2024. https://arxiv.org/abs/2401.09670
Patel et al., Splitwise: Efficient Generative LLM Inference Using Phase Splitting, ISCA 2024. https://arxiv.org/abs/2311.18677
Qin et al., Mooncake: Trading More Storage for Less Computation — A KVCache-centric Architecture for Serving LLM Chatbot, 2024. https://arxiv.org/abs/2407.00079
Chen et al., NanoCP: Request-Level Dynamic Context Parallelism for Data-Expert Parallel Decoding, arXiv:2605.21100, 2026. https://arxiv.org/abs/2605.21100

