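Overview

Scope of this chapter: the algorithms and communication patterns of context parallelism (Context Parallelism, CP), which splits a single sequence along the token dimension across multiple GPUs so that long contexts (128K to 1M+ tokens) no longer overflow a single GPU's memory with activations and KV cache. Target readers: engineers who design or evaluate long-context parallel deployments.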

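Scope and Boundaries (Scope)

In scope: why CP is essential for long context, Ring Attention, DeepSpeed-Ulysses, choosing between and combining Ring and Ulysses, decode-stage CP (pass-Q), and CP communication under heterogeneous attention.
Out of scope:
- Sequence splitting inside the TP group (Megatron SP) → see 4.1 Overview
- Prefill/decode deployment architecture, PD disaggregation, KV cache migration → see 9.3 Inference Deployment Modes
- Model- and algorithm-level long-context techniques (position-encoding extrapolation, architectural KV compression) → see knowledge/03-长上下文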

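Glossary

All sub-documents in this chapter assume the terms below are already defined; each sub-document adds only the terms it newly introduces.

Term	Definition
Context Parallelism (CP)	Splits a single sequence along the token dimension across multiple GPUs; each GPU holds the intermediate state of $S/N$ tokens. An independent dimension, orthogonal to TP/DP/EP
Ring Attention	One CP implementation: K/V blocks are passed around a ring of GPUs and combined with online softmax for incremental computation, so each GPU holds only $S/N$ tokens of K/V
DeepSpeed-Ulysses	Another CP implementation: one All-to-All before attention and one after it transpose tensors between a sequence-split layout and a head-split layout
Online softmax	A softmax that keeps a running max and running sum and updates them incrementally instead of storing all logits, which enables streaming computation over K/V blocks
Pass-KV	The prefill form of Ring Attention: K/V blocks rotate around the ring (large data volume)
Pass-Q	The decode form of Ring Attention: queries travel around the ring (small data volume) and each GPU computes partial attention over its local KV segment
Causal balancing	Under a causal mask the per-GPU load in Ring Attention is uneven; balanced chunk split or striped reordering of the sequence blocks evens it out
USP (Unified Sequence Parallelism)	A Ring + Ulysses hybrid: Ulysses within a group (splits heads), Ring across groups (splits the sequence)
$N_{\text{CP}}$	CP degree: the number of GPUs the sequence is split across
Tree Attention	A decode implementation of CP: the cross-GPU reduction changes from ring passing to a tree AllReduce that carries only the three online-softmax quantities, so the number of communication steps is $O(\log N)$
Linear attention SP	Sequence parallelism for linear attention / SSMs: passes a fixed $d\times d$ state matrix instead of K/V, so communication volume is independent of sequence length

@tbl-par-cpov-glossary Shared glossary for the context parallelism chapter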
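
To make the "causal balancing" entry concrete, the following is a minimal sketch of one balanced chunk split: the sequence is cut into $2N$ chunks and rank $r$ receives chunk $r$ together with its mirror chunk $2N-1-r$. The function names and the work model (a query at position $t$ attends to $t+1$ keys) are illustrative assumptions, not the API of any framework.

```python
def contiguous_split(seq_len: int, n_cp: int) -> list[list[int]]:
    """Naive split: rank r gets one contiguous block of tokens."""
    per_rank = seq_len // n_cp
    return [list(range(r * per_rank, (r + 1) * per_rank)) for r in range(n_cp)]


def balanced_chunk_split(seq_len: int, n_cp: int) -> list[list[int]]:
    """Balanced split: 2*n_cp chunks, rank r gets chunk r and chunk 2*n_cp-1-r."""
    n_chunks = 2 * n_cp
    if seq_len % n_chunks != 0:
        raise ValueError("seq_len must be divisible by 2 * n_cp")
    chunk = seq_len // n_chunks
    assignment = []
    for r in range(n_cp):
        mirror = n_chunks - 1 - r
        early = list(range(r * chunk, (r + 1) * chunk))
        late = list(range(mirror * chunk, (mirror + 1) * chunk))
        assignment.append(early + late)
    return assignment


def causal_work(token_ids: list[int]) -> int:
    """Illustrative cost model: under a causal mask, query t attends to t + 1 keys."""
    return sum(t + 1 for t in token_ids)


naive = [causal_work(ids) for ids in contiguous_split(16, 4)]
balanced = [causal_work(ids) for ids in balanced_chunk_split(16, 4)]
print("contiguous:", naive)
print("balanced:  ", balanced)
```

Pairing an early chunk with a late chunk gives every rank the same number of query-key pairs, whereas the contiguous split leaves the last rank with several times the work of the first.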

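Why is CP essential for long context?

At long context, per-GPU memory pressure comes from two sources, activations and KV cache; CP spreads both across GPUs by splitting tokens.

In the prefill stage, the forward activation of each layer grows linearly with sequence length. With hidden dimension $d$ and batch size 1, one layer takes roughly $S \cdot d \cdot \text{dtype}$ bytes, where $\text{dtype}$ is the element size in bytes: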

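Context	$d = 7168$, BF16	Notes
8K	0.11 GiB	Fits easily on one GPU
128K	1.75 GiB	Fits on one GPU
1M	14 GiB	An 80 GiB H100 holds only about 5 layers; still tight even with activation checkpointing

@tbl-par-cpov-activation Per-layer activation memory pressure at long context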
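
The table values follow directly from $S \cdot d \cdot \text{dtype}$. A minimal sketch that reproduces them, assuming 2 bytes per BF16 element and binary units (8K = 8192 tokens, 1 GiB = $2^{30}$ bytes); the function name is illustrative:

```python
GIB = 1024 ** 3


def layer_activation_bytes(seq_len: int, hidden: int = 7168,
                           dtype_bytes: int = 2, batch: int = 1) -> int:
    """Size of one S x d activation tensor, the per-layer estimate used in the table."""
    return batch * seq_len * hidden * dtype_bytes


for label, seq_len in [("8K", 8 * 1024), ("128K", 128 * 1024), ("1M", 1024 * 1024)]:
    gib = layer_activation_bytes(seq_len) / GIB
    # Layers that fit if the whole 80 GiB were available for activations.
    layers_in_80_gib = int(80 // gib)
    print(f"{label:>4}: {gib:.2f} GiB per layer, {layers_in_80_gib} layers in 80 GiB")
```

This counts only a single $S \times d$ tensor per layer; real layers keep several intermediates, so the estimate is a lower bound on the pressure.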

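In the decode stage, the accumulated KV cache grows linearly with $S$. Even with compression such as MLA (Multi-head Latent Attention) or GQA (Grouped-Query Attention), a 1M-token context on a DeepSeek-V3-style model still needs tens of GB per sequence.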
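
As a rough cross-check of the "tens of GB per sequence" figure, here is a minimal sketch. The MLA dimensions (a 512-dimensional compressed KV latent plus a 64-dimensional decoupled RoPE key per token per layer, 61 layers, BF16) come from the public DeepSeek-V3 configuration rather than from this chapter, so treat them as assumptions.

```python
GIB = 1024 ** 3


def mla_kv_cache_bytes(seq_len: int, num_layers: int = 61,
                       kv_latent_dim: int = 512, rope_dim: int = 64,
                       dtype_bytes: int = 2) -> int:
    """KV cache for one sequence when each layer caches one latent + RoPE key per token."""
    per_token_per_layer = (kv_latent_dim + rope_dim) * dtype_bytes
    return seq_len * num_layers * per_token_per_layer


one_million = 1024 * 1024
print(f"{mla_kv_cache_bytes(one_million) / GIB:.3f} GiB per 1M-token sequence")
```

Under these assumptions a single 1M-token sequence occupies most of an 80 GiB GPU before any weights or activations are counted.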

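CP spreads both pressures across $N_{\text{CP}}$ GPUs by splitting tokens, so each GPU holds the intermediate state of only $S / N_{\text{CP}}$ tokens. The main problem CP has to handle is that attention is global across ranks: the K/V that one rank's queries must attend to may live on other ranks.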

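Note that "KV is divided by $N_{\text{CP}}$" is a property of distributed-KV schemes in the Ring / Ulysses style and does not hold for every CP implementation. Production CP for DSA sparse-attention models (DeepSeek-V3.2 / GLM-5) AllGathers the full KV and writes the full KV back on every GPU, which divides compute but not KV memory; capacity pressure is left to MLA compression and PP. See 7.5 CP under Heterogeneous Attention. Before estimating the memory benefit of a CP configuration, first confirm which class the implementation belongs to.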
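
The difference between the two classes can be captured in a small estimator. This is a sketch under simplifying assumptions (layers split evenly across PP stages, no allocator overhead); the function and parameter names are hypothetical.

```python
def kv_bytes_per_gpu(total_kv_bytes: float, n_cp: int, n_pp: int,
                     kv_sharded: bool) -> float:
    """Per-GPU KV cache for one sequence.

    kv_sharded=True  -> Ring / Ulysses style: KV is divided across CP ranks.
    kv_sharded=False -> AllGather-CP style: every CP rank keeps the full KV,
                        so only PP (fewer layers per GPU) reduces it.
    """
    per_stage = total_kv_bytes / n_pp
    return per_stage / n_cp if kv_sharded else per_stage


total = 64.0  # GiB, an arbitrary example value
print(kv_bytes_per_gpu(total, n_cp=8, n_pp=4, kv_sharded=True))   # distributed KV
print(kv_bytes_per_gpu(total, n_cp=8, n_pp=4, kv_sharded=False))  # replicated KV
```

With the same CP degree, the replicated-KV class leaves per-GPU KV memory $N_{\text{CP}}$ times larger, which is why the class must be identified before sizing a deployment.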

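CP plays different roles across the training and inference stages:

- Training forward + backward / inference prefill: typical ring or Ulysses traffic; see 7.2 Ring Attention and 7.3 DeepSpeed-Ulysses
- Inference decode: each step adds only 1 token, so the communication pattern flips from pass-KV to pass-Q; see 7.4 Decode-Stage CP. For replacing ring passing with a tree AllReduce in the cross-GPU reduction, see 7.6 Tree Attention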
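
Both pass-Q and the tree reduction rely on the same fact: partial attention results computed over disjoint KV segments can be merged exactly with online softmax. The sketch below uses pure Python for a single query; the `(m, l, o)` triple (running max, sum of exponentials, unnormalized output) and all function names are illustrative, and real implementations operate on batched tensors.

```python
import math


def partial_attention(q, keys, values):
    """Attention of one query over a local KV segment, kept unnormalized."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)                                  # running max
    weights = [math.exp(s - m) for s in scores]
    l = sum(weights)                                 # running sum of exp
    o = [sum(w * v[j] for w, v in zip(weights, values))
         for j in range(len(values[0]))]             # unnormalized output
    return m, l, o


def merge(a, b):
    """Combine two partial states; associative, so ring or tree order both work."""
    (m_a, l_a, o_a), (m_b, l_b, o_b) = a, b
    m = max(m_a, m_b)
    s_a, s_b = math.exp(m_a - m), math.exp(m_b - m)
    return m, l_a * s_a + l_b * s_b, [x * s_a + y * s_b for x, y in zip(o_a, o_b)]


def finalize(state):
    """Normalize the accumulated output by the accumulated softmax denominator."""
    _, l, o = state
    return [x / l for x in o]


q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

full = finalize(partial_attention(q, keys, values))
# Two "ranks", each holding half of the KV cache.
sharded = finalize(merge(partial_attention(q, keys[:2], values[:2]),
                         partial_attention(q, keys[2:], values[2:])))
print(full, sharded)
```

Because `merge` only needs the triple, the data exchanged per query is independent of how many KV tokens each rank holds.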

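How does CP fundamentally differ from sequence parallelism (SP)?

CP and Megatron SP have similar names, but SP is not an independent parallel dimension: its degree is tied to the TP degree and it does not split the attention computation itself. CP is the independent dimension, orthogonal to TP/DP/EP, that actually splits attention.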

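Dimension	Megatron SP	Context Parallelism
Independent parallel dimension?	No (degree is tied to the TP degree; does not add GPUs)	Yes (orthogonal to TP / DP / EP; multiplies the GPU count)
What is split	Activations in the LayerNorm/Dropout regions (sequence dimension)	Attention K/V and the activations of all layers (sequence dimension)
Attention computation	Not split: after AG restores the full sequence it is still TP head parallelism	Split: each GPU computes Q for only $S/N_{\text{CP}}$ tokens
Handling attention's global dependency	Not needed (each GPU sees the full sequence)	Must handle all-pairs attention across ranks
Key communication	AllGather + ReduceScatter (replacing the TP AllReduce)	K/V ring passing (Ring Attention) or All-to-All (Ulysses)
Purpose	Remove redundant activations under TP and split the AllReduce into AG + RS that can be overlapped	Spread the sequence across GPUs when long-sequence activations/KV do not fit on one GPU
Applicable context length	Any (mainly optimizes overlap and activation memory)	Required for long sequences (128K+)

@tbl-par-cpov-vs-sp CP versus Megatron SP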

For SP's naming lineage, mechanism formulas, and overlap implementation, see 4.1 Overview.

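Do CP and PP split the same thing?

Both CP and PP can spread the workload of a very long sequence across GPUs, but they split different things: CP splits the sequence tokens, PP splits the model layers.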

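Dimension	CP	PP
What is split	Sequence tokens	Model layers
Memory effect	Per-layer activation shrinks by a factor of $N_{\text{CP}}$	Layers held per stage shrink by a factor of $N_{\text{PP}}$
Communication pattern	Ring attention (inside attention)	P2P activation transfer between stages
Bubble	None	Yes (fraction $(N_{\text{PP}}-1)/(m+N_{\text{PP}}-1)$ of step time for $m$ micro-batches under GPipe/1F1B schedules)
Best suited for	Long sequences	Deep models

@tbl-par-cpov-vs-pp CP versus PP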

The two combine: for example, with a 1M-token context and a 60-layer model, CP=8 plus PP=4 leaves each GPU holding 1/8 of the sequence × 15 layers.
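
The arithmetic of this example can be sketched as follows. The hidden size $d = 7168$ and BF16 are carried over from the activation table; the even split of layers and the "one $S \times d$ tensor per layer, all resident" model are simplifying assumptions.

```python
GIB = 1024 ** 3


def per_gpu_share(seq_len: int, num_layers: int, n_cp: int, n_pp: int,
                  hidden: int = 7168, dtype_bytes: int = 2):
    """Tokens, layers, and activation bytes held by one GPU under CP x PP."""
    if seq_len % n_cp or num_layers % n_pp:
        raise ValueError("sequence and layers must divide evenly")
    tokens = seq_len // n_cp      # CP splits the sequence
    layers = num_layers // n_pp   # PP splits the layers
    activation_bytes = tokens * hidden * dtype_bytes * layers
    return tokens, layers, activation_bytes


tokens, layers, act = per_gpu_share(1024 * 1024, 60, n_cp=8, n_pp=4)
print(tokens, layers, f"{act / GIB:.2f} GiB")
```

CP shrinks the per-layer term and PP shrinks the layer count, so the two factors multiply.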

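How does CP work together with the other parallel dimensions?

CP is orthogonal to TP / EP / DP / PP; in 4D / 5D hybrid parallelism the global GPU count = DP × PP × TP × CP × EP. Each GPU plays several roles at once:

- During attention it uses CP communication (ring or all-to-all)
- During dense GEMMs it uses TP communication (AllReduce or AG+RS)
- During MoE it uses EP communication (all-to-all)
- During backward gradient synchronization it uses DP communication (AllReduce)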
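
A minimal sketch of how a global rank can be decomposed into one coordinate per dimension, following the product formula above. The dimension ordering (last key varies fastest) is an arbitrary assumption for illustration; real frameworks choose their own ordering and may fold EP into other dimensions.

```python
from math import prod


def world_size(degrees: dict[str, int]) -> int:
    """Global GPU count as the product of all parallel degrees."""
    return prod(degrees.values())


def rank_to_coords(rank: int, degrees: dict[str, int]) -> dict[str, int]:
    """Mixed-radix decomposition; the last dimension in `degrees` varies fastest."""
    if not 0 <= rank < world_size(degrees):
        raise ValueError("rank out of range")
    coords = {}
    for name in reversed(list(degrees)):
        coords[name] = rank % degrees[name]
        rank //= degrees[name]
    return coords


degrees = {"dp": 2, "pp": 4, "tp": 2, "cp": 8, "ep": 1}
print(world_size(degrees))
print(rank_to_coords(9, degrees))
```

Ranks that share every coordinate except `cp` form one CP group and exchange K/V or Q during attention; the same GPU belongs to a TP group and a DP group at the same time.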

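These communications compete for bandwidth:

- CP and TP are usually both active inside the attention / MLP blocks and may compete for the same NVLink
- EP runs in the MoE sublayer, staggered in time from the attention sublayer, so it does not conflict directly with CP
- DP gradient synchronization runs at the end of the backward pass and does not conflict with forward-pass CP

The core of scheduling design is therefore to keep CP and TP communication from peaking at the same time.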

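Concurrent bandwidth contention between CP and the other parallel dimensions

When CP is active at the same time as TP/EP, they share the same NVLink topology, and bandwidth contention directly affects MFU. The industry has taken three routes:

Route 1: Megatron mutual exclusion. CP and TP overlap cannot be enabled together. Yuan et al. (USENIX ATC 2024) confirm that Megatron-LM raises a runtime error when both are present. The reason: CP's ring P2P and TP's AllReduce compete for the same NVLink bandwidth and CUDA streams, and running them together causes severe degradation or deadlock[1]. In practice, tp_comm_overlap must be disabled whenever CP is enabled; CP coexists safely only with DP overlap.

Route 2: DeepSeek V3 avoids TP. V3 training deliberately avoids TP (AllReduce efficiency drops sharply once H800 NVLink is cut to 400 GB/s) and instead uses EP plus DualPipe dual-micro-batch overlap, with a measured all-to-all overlap of 88.9%[2]. DeepEP V2 uses IB Virtual Lanes to isolate EP traffic from other workloads and cuts SM usage from 24 to 4-6[3]. No public data exists, however, on concurrent CP and EP contention in V3 training.

Route 3: Helix temporal decoupling. The attention phase uses CP/KVP (splitting along the sequence dimension) and the FFN phase uses TP/TP×EP, switching on the same GPU pool with zero idle time. The A2A switching volume is independent of sequence length (it scales only with B×H); HOP-B overlaps the A2A with attention computation along the batch dimension, hiding the cost in time rather than through static resource isolation[^helix].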

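Rules of thumb:

Rule	Source	Applicable scenario
Disable TP overlap when CP and TP run concurrently	Megatron-LM runtime constraint	Megatron ecosystem
6-8% of SMs can saturate NVLink communication bandwidth	TokenWeave (measured on H100)	AllReduce with NV-SHARP
Intra-node NVLink to cross-node IB bandwidth ≈ 4:1	DeepSeek V3 (H800)	EP communication routing optimization
Overlap gains are largest when communication time < compute time	Lagom[4]	Multi-dimensional concurrent overlap
A2A switching volume ∝ B×H, independent of sequence length	Helix[^helix]	CP→TP mode switching

@tbl-par-cpov-bandwidth-rules Rules of thumb for bandwidth contention between CP and the other parallel dimensions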

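For simulation modeling, we recommend offering users three concurrency modes: (1) mutually exclusive: $T_{\text{total}} = T_{\text{CP}} + T_{\text{TP}}$ (conservative, no contention); (2) staggered: $T_{\text{total}} = \max(T_{\text{CP}}, T_{\text{TP}})$ (ideal, Helix style); (3) contended: model the crowding-out effect with a Lagom-style subtractive bandwidth model, $B_{\text{available}} = B_{\text{peak}} - \sum_k V(NC_k, C_k)$.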
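
The three modes can be sketched as a small cost model. The function names and mode strings are hypothetical, and because $V(NC_k, C_k)$ is not defined in this chapter, the sketch takes the bandwidth already occupied by each concurrent stream as plain input numbers.

```python
def total_comm_time(mode: str, t_cp: float, t_tp: float) -> float:
    """Modes (1) and (2): combine already-computed CP and TP communication times."""
    if mode == "exclusive":   # serialized, no contention
        return t_cp + t_tp
    if mode == "staggered":   # ideal, Helix-style overlap
        return max(t_cp, t_tp)
    raise ValueError(f"unknown mode: {mode}")


def available_bandwidth(b_peak: float, occupied: list[float]) -> float:
    """Mode (3): B_available = B_peak - sum_k V_k, with V_k supplied by the caller."""
    b = b_peak - sum(occupied)
    if b <= 0:
        raise ValueError("link is fully occupied by competing streams")
    return b


def contended_time(num_bytes: float, b_peak: float, occupied: list[float]) -> float:
    """Transfer time of one stream on the bandwidth left over by the others."""
    return num_bytes / available_bandwidth(b_peak, occupied)


print(total_comm_time("exclusive", 3.0, 2.0))
print(total_comm_time("staggered", 3.0, 2.0))
# 900 GB to move on a 450 GB/s link where another stream already uses 150 GB/s.
print(contended_time(900.0, 450.0, [150.0]))
```

Modes (1) and (2) bound the answer from above and below, while mode (3) lands in between once realistic values for the occupied bandwidth are supplied.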

How do mainstream frameworks support CP?

Framework / approach	CP implementation	Status	Notes
Megatron-LM	Context Parallel	Stable	Improved Ring + balanced load split
DeepSpeed	Ulysses	Stable	All-to-all rearranges the head dimension; suits many heads and a fully connected interconnect
USP / yunchang	Ring + Ulysses hybrid	Stable	Ulysses within a group, Ring across groups
Meta (production inference)	pass-KV / pass-Q	Production	Passes KV in prefill and Q in decode; chosen dynamically by KV miss rate (crossover threshold ~5%)
vLLM	DCP (Decode CP) + Helix A2A	DCP merged; Helix A2A (PR #34883) merged	PCP (Prefill CP) under active development; already supported by the vllm-ascend backend
SGLang	Chunked PP + DSA allgather-CP + Helix (TP+CP)	DSA prefill CP in production (V3.2/GLM-5, `cp_utils.py`); roadmap (#21788) continues to generalize it	Two sequence-split modes (zigzag in-seq-split / round-robin-split); AllGather of the full KV + full write-back on every GPU; MLA PCP (#23292) / Qwen3-MoE PCP (#18233) in development
DeepSeek V4	Round-robin token sharding	Production (2026-04)	Tightly coupled with the CSA/HCA/SWA hybrid attention; not Ring/Ulysses
Untied Ulysses	Headwise chunking	ICML 2026	Intermediate tensor memory -87.5%; 5M tokens on 8×H100
NanoCP	Request-level dynamic CP	arXiv 2026-05	Throughput +3.27×; decoupled MoE communication; see 09-推理部署模式
Tree Attention (Zyphra)	Tree AllReduce decode	arXiv 2024-08	Decode 4-8× faster than Ring; exact; JAX implementation
DistFlashAttn / LightSeq	Dynamic scheduling + recomputation-aware	arXiv 2023-10	Causal work-stealing; 1.67× vs Ring; 8× longer sequences

@tbl-par-cpov-frameworks CP support in mainstream frameworks and approaches (updated through 2026-06)
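
The "all-to-all rearranges the head dimension" note in the DeepSpeed row can be illustrated with a single-process simulation of the layout change. This is a sketch only: it uses nested Python lists with `(token, head)` labels in place of tensors, and the function name is hypothetical rather than DeepSpeed's API.

```python
def ulysses_all_to_all(seq_shards):
    """Simulate the pre-attention All-to-All.

    Input:  seq_shards[r] = tokens held by rank r, each token a list of all heads.
    Output: head_shards[r] = every token of the sequence, restricted to rank r's heads.
    """
    n = len(seq_shards)
    num_heads = len(seq_shards[0][0])
    if num_heads % n != 0:
        raise ValueError("head count must be divisible by the CP degree")
    heads_per_rank = num_heads // n
    head_shards = []
    for dst in range(n):
        lo, hi = dst * heads_per_rank, (dst + 1) * heads_per_rank
        # Rank dst receives its head slice from every source rank, in sequence order.
        head_shards.append([token[lo:hi] for shard in seq_shards for token in shard])
    return head_shards


seq_len, num_heads, n_cp = 4, 4, 2
tokens = [[(s, h) for h in range(num_heads)] for s in range(seq_len)]
per_rank = seq_len // n_cp
seq_shards = [tokens[r * per_rank:(r + 1) * per_rank] for r in range(n_cp)]

for rank, shard in enumerate(ulysses_all_to_all(seq_shards)):
    print(rank, shard)
```

After the exchange each rank sees the full sequence for its own heads, so ordinary single-GPU attention kernels apply; this also shows why the CP degree cannot exceed the number of heads that can be divided.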

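Sub-document Index (Index)

- 7.2 Ring Attention: K/V ring passing + incremental computation with online softmax, communication volume, overlap conditions, causal balancing
- 7.3 DeepSpeed-Ulysses: all-to-all rearrangement of head/seq, choosing between Ring and Ulysses, the USP hybrid
- 7.4 Decode-Stage CP: when decode needs CP, pass-Q communication, Helix temporal pipelining, industry progress
- 7.5 CP under Heterogeneous Attention: CP communication primitives for SWA / CSA / HCA / sparse top-k / block sparse (MoBA) / linear attention (LASP-2/2H); two production cases, V4 round-robin and DSA (V3.2/GLM-5) AllGather-CP; the tradeoff between the transfer-then-select and select-then-transfer routes
- 7.6 Tree Attention: tree AllReduce replacing ring passing, three-quantity reduction, communication volume decoupled from sequence length, decode only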

References

[1] Yuan et al., Accelerating the Training of Large Language Models using Efficient Communication Scheduling, USENIX ATC 2024. https://www.usenix.org/system/files/atc24-yuan.pdf
[2] DeepSeek-AI, Insights into DeepSeek-V3: Scaling Challenges and Reflections, arXiv:2505.09343, 2025. https://arxiv.org/abs/2505.09343
[3] DeepSeek-AI, DeepEP: Efficient Expert-Parallel Communication Library, GitHub, 2025. https://github.com/deepseek-ai/DeepEP
[4] Lagom: Unleashing the Power of Communication and Computation Overlap, arXiv:2602.20656, 2026. https://arxiv.org/abs/2602.20656
