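New Communication Requirements in V4

Key points:

No new primitives: V4 introduces no new communication primitives; what changes is how existing primitives are scheduled and what their traffic patterns look like

Wave-scheduled EP: many waves of small, pipelined messages; alpha (the startup-latency term) dominates, and 16+ concurrent channels are needed

Pull-based dispatch: the GPU issues RDMA Reads against remote memory, bypassing high-latency notifications

Two-stage CP: P2P boundary KV (4-128 KB) + AllGather of compressed KV (4-128 MB)

Routing constraint removed: the All-to-All send matrix shifts from balanced to sparse + long-tail

Hardware priorities: small-message latency + RDMA Read + multiple channels; bandwidth is a secondary priority (balance point at 6144 FLOPs/Byte)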

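DeepSeek V4's architectural innovations (mHC / CSA / HCA / MoE upgrades / 1M context / FP4 quantization) challenge the classic LLM communication stack from several directions at once. This section collects the communication-related designs that the V4 paper describes explicitly, separates what the paper demonstrates from external derivation, and lists the engineering implications for V4 deployment.

Source labels: every requirement is tagged with its source. [Paper §X.X] marks something the paper states explicitly; [Derived] marks a communication-level analysis based on the paper's architecture that the paper itself does not discuss.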

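Overview of V4 communication changes

Core question: where do V4's communication changes relative to V3 mainly fall? Does V4 introduce new primitives?

Requirement	Source	Related document	Type
Routing node-count constraint removed	[Paper §2.1]	04-moe	MoE routing policy change
Wave-scheduled fine-grained EP	[Paper §3.1]	04-moe	EP communication-computation overlap scheme
Pull-based dispatch communication pattern	[Paper §3.1]	This section	Choice of communication primitive
Two-stage CP communication scheme	[Paper §3.4.3]	06-上下文并行 (context parallelism)	CP communication scheme
Heterogeneous KV cache layout	[Paper §3.5.1, Figure 6]	03-attention	KV cache management
Bandwidth contention under four-way parallelism	[Derived]	This section	Topology mapping analysis

@tbl-dsv4-comm-overview Communication-related designs in V4

Compared with V3, V4 introduces no new communication primitives: the underlying AllGather / All-to-All / P2P primitives are unchanged. The changes fall on two levels: (1) how existing primitives are scheduled (wave scheduling / pull-based); (2) the traffic patterns of existing primitives (the All-to-All send matrix changes once the routing constraint is removed).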

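How EP communication was upgraded

Core question: how do removing the routing constraint, wave scheduling, pull-based dispatch, and FP4 dispatch each affect communication characteristics?

Routing node-count constraint removed [Paper §2.1]

Paper text: "For DeepSeek-V4, we remove the constraint on the number of routing target nodes, and carefully redesign the parallelism strategy to maintain training efficiency."

V3 forces each token's top-k experts onto a fixed number of nodes (see 04-moe); V4 drops this constraint. The paper does not describe the concrete effect on communication patterns (it does not use terms such as "sparse", "irregular", or "imbalanced"), but industry research (RailS [arXiv:2510.19262], HD-MoE [arXiv:2509.09420]) indicates that once such constraints are removed, All-to-All traffic tends toward a sparse + long-tail distribution.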

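Dimension	V3	V4
Nodes hit per token	Fixed number (e.g. 4)	Whichever nodes host the chosen 6 of 384 experts
Global send-volume matrix	Close to balanced	Data-dependent; long tails may appear [Derived]
Corresponding MPI interface	`MPI_Alltoall` (equal-length)	`MPI_Alltoallv` (variable-length) [Derived]

@tbl-dsv4-comm-routing-traffic Effect of removing the routing constraint on traffic distribution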
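
The shift from an equal-length to a variable-length exchange can be illustrated with a toy simulation. The sketch below is a minimal model, not the paper's routing: the skewed expert popularity (weight 1/(e+1)), the placement of 48 experts per rank, and the function name `send_counts` are all assumptions made for illustration. It counts, for one sender, how many (token, expert) copies go to each destination rank, which is the kind of per-destination count array that `MPI_Alltoallv` takes.

```python
import random

def send_counts(num_tokens, num_experts=384, top_k=6, experts_per_rank=48, seed=0):
    """Toy model: per-destination-rank send counts for one sender."""
    rng = random.Random(seed)
    # Assumed popularity skew (hypothetical): expert e has weight 1 / (e + 1).
    weights = [1.0 / (e + 1) for e in range(num_experts)]
    experts = range(num_experts)
    counts = [0] * (num_experts // experts_per_rank)
    for _ in range(num_tokens):
        chosen = set()
        while len(chosen) < top_k:  # draw top_k distinct experts
            chosen.add(rng.choices(experts, weights=weights)[0])
        for e in chosen:
            counts[e // experts_per_rank] += 1  # one copy per routed expert
    return counts

counts = send_counts(num_tokens=1000)
print(counts)  # unequal entries: a variable-length (Alltoallv-style) exchange
```

With a uniform popularity the entries would be close to equal; the skew is what produces the long tail, and how strong it is in a real deployment depends on the router and the data.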

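Five-stage breakdown of the MoE layer [Paper §3.1]

Before discussing EP communication, look at how V4 divides the MoE layer into stages. Paper §3.1 gives five stages: two communication-bound and three compute (one light, two heavy):

Stage	Type	Resources used
Dispatch	comm	NVLink / RDMA Read
Linear-1 (gate + up)	compute	Tensor core
Activation + FP8 Cast	compute (light)	Vector unit
Linear-2 (down)	compute	Tensor core
Combine	comm	NVLink / RDMA Write

@tbl-dsv4-comm-moe-stages Five-stage breakdown of the V4 MoE layer (comm/compute labels)

Key invariant (verbatim from Paper §3.1): "within a single MoE layer, the total time of communication is less than that of the computation". This is the precondition for wave scheduling to hide communication completely behind computation. It is also why, once bandwidth is high enough to satisfy the $C/B \le 6144$ FLOPs/Byte balance point, further hardware investment in bandwidth is wasted.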

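Wave-Scheduled Fine-Grained EP [Paper §3.1]

Paper §3.1 text: "we introduce a finer-grained expert partitioning scheme ... we split and schedule the experts into waves. Each wave consists of a small portion of experts."

V4 splits the experts into multiple waves; each wave's Dispatch All-to-All → Compute → Combine All-to-All forms a pipeline (paper Figure 5 (c)):

Theoretical speedup: 1.92× (vs 1.42× for Comet)
Measured: 1.50-1.73× for general inference, up to 1.96× for RL rollout
Open-source implementation: MegaMoE (DeepGEMM PR #304)

Impact on the interconnect [Derived]:

Requirement	Description
Small-message startup latency	With 8-16 waves, each wave's message is on the order of ~100 KB, so alpha dominates
Multi-channel concurrency	Different waves use independent channels to avoid blocking
Completion-event notification	Wave completion triggers the next compute step, which needs low-latency CQ notification

@tbl-dsv4-comm-wave-needs Impact of wave-scheduled EP on the interconnect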
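
Why alpha dominates at this message size can be seen with the classic alpha-beta cost model. The following is a minimal sketch under stated assumptions: the 5 µs startup latency and 50 GB/s bandwidth are hypothetical link parameters, the function names are illustrative, and the pipeline formula assumes identical waves flowing through three stages with no contention. None of these numbers come from the paper.

```python
def message_time(num_bytes, alpha_s, bandwidth_bytes_per_s):
    """Alpha-beta cost model: startup latency plus payload over bandwidth."""
    return alpha_s + num_bytes / bandwidth_bytes_per_s

def wave_pipeline_time(dispatch_bytes, combine_bytes, compute_s, waves,
                       alpha_s, bandwidth_bytes_per_s):
    """Makespan of a dispatch -> compute -> combine pipeline over equal waves."""
    d = message_time(dispatch_bytes / waves, alpha_s, bandwidth_bytes_per_s)
    c = compute_s / waves
    m = message_time(combine_bytes / waves, alpha_s, bandwidth_bytes_per_s)
    # Identical waves through a 3-stage pipeline: fill once, then the slowest
    # stage paces the remaining waves.
    return d + c + m + (waves - 1) * max(d, c, m)

# Hypothetical link: 5 us startup latency, 50 GB/s bandwidth.
ALPHA, BW = 5e-6, 50e9
t = message_time(100_000, ALPHA, BW)  # one ~100 KB wave message
alpha_share = ALPHA / t               # fraction of the time spent in startup
print(f"{t * 1e6:.1f} us per message, alpha share {alpha_share:.0%}")
```

Under these assumed parameters the startup term accounts for most of each message's time, which is why lowering alpha helps a wave-scheduled workload more than adding bandwidth.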

Pull-Based Dispatch communication pattern [Paper §3.1]

The "Communication Primitives" paragraph of Paper §3.1 reads: "In the dispatch stage, we adopt a pull-based approach where each GPU actively reads activations from remote GPUs, avoiding the high notification latency that fine-grained push entails. Future hardware with lower-latency cross-GPU signaling would make push viable and enable more natural communication patterns."

This means V4's Dispatch All-to-All is not the traditional push (the sender writes) but a pull (the receiver reads remote memory). Pull was chosen because, under fine-grained wave scheduling, the notification latency of push is too high.

Impact on hardware:

RDMA Read capability is required (a GPU reads remote GPU memory directly)
NVSHMEM-style symmetric memory fits this pattern
The paper implies that cross-GPU signaling latency is the bottleneck on current hardware; once future hardware improves it, push becomes viable again
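
The receiver-driven pattern can be mimicked on the host with plain lists. This is a toy sketch, not the MegaMoE implementation: `build_symmetric_buffer` and `pull_dispatch` are hypothetical names, a list slice stands in for one RDMA Read, and the readiness signaling that a real system needs before reading is not modeled.

```python
def build_symmetric_buffer(tokens_with_dest, world_size):
    """Sort one rank's outgoing tokens by destination and record slice offsets."""
    ordered = sorted(tokens_with_dest, key=lambda td: td[1])  # stable sort
    offsets = [0] * (world_size + 1)
    for _, dest in ordered:
        offsets[dest + 1] += 1
    for r in range(world_size):  # prefix sum -> slice boundaries
        offsets[r + 1] += offsets[r]
    return [tok for tok, _ in ordered], offsets

def pull_dispatch(my_rank, buffers, offsets):
    """Receiver-driven dispatch: read this rank's slice from every peer's buffer."""
    received = []
    for src in range(len(buffers)):
        lo, hi = offsets[src][my_rank], offsets[src][my_rank + 1]
        received.extend(buffers[src][lo:hi])  # stands in for one RDMA Read
    return received

# Two ranks, each holding (token, destination_rank) pairs.
outgoing = [[("a", 1), ("b", 0)], [("c", 0), ("d", 1)]]
built = [build_symmetric_buffer(o, world_size=2) for o in outgoing]
buffers = [b for b, _ in built]
offsets = [o for _, o in built]
print(pull_dispatch(0, buffers, offsets), pull_dispatch(1, buffers, offsets))
```

The sender only lays out data and offsets at known locations; all data movement is initiated by the reader, which is the property the paper relies on to avoid per-message notifications.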

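V4's four verbatim proposals to hardware vendors [Paper §3.1]

The "Observations and Proposals" paragraph at the end of Paper §3.1 gives four concrete suggestions, a by-product of V4's communication design experience:

Proposal	Paper excerpt	Implication for Chinese domestic chips
Computation-Communication Ratio	"$C/B \le 2d$ ... once bandwidth meets this threshold, devoting additional silicon area to further bandwidth brings diminishing returns"	Once interconnect bandwidth satisfies $C/B \le 6144$ FLOPs/Byte, returns on further bandwidth diminish; shift investment to compute / power
Power Budget	"extreme kernel fusion drives compute, memory, and network to high load simultaneously, making power throttling a key performance limiter"	Hardware needs power headroom for fully-concurrent workloads to avoid throttling
Communication Primitives	"future hardware with lower-latency cross-GPU signaling would make push viable"	Pull-based dispatch is a workaround for today's hardware; push becomes viable once hardware offers low-latency signaling
Activation Function	"we propose replacing SwiGLU with a low-cost element-wise activation that involves no exponential or division operations"	Less post-GEMM computation stalling the GEMM pipeline

@tbl-dsv4-comm-hw-proposals The four verbatim proposals to hardware vendors in Paper §3.1

The fourth is the most counterintuitive: V4 itself uses SwiGLU yet concedes that its sigmoid (an exponential plus a division) and gating multiply stall the GEMM pipeline. It is a two-sided constraint on next-generation model architectures and hardware: either the model drops SwiGLU, or the hardware accelerates SwiGLU post-processing.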

Hardware balance point [Paper §3.1]

The V4 paper gives an explicit recommendation for the communication-computation ratio:

$$\begin{equation} \frac{C}{B} \le 2d = 6144 \text{ FLOPs/Byte} \label{eq:dsv4-comm-balance} \end{equation}$$

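Paper text: "Once bandwidth meets this threshold, it ceases to be the bottleneck, and devoting additional silicon area to further bandwidth brings diminishing returns. We encourage future hardware designs to target such balance points rather than scale bandwidth unconditionally."

In other words, every 1 GB/s of interconnect bandwidth can hide the communication for about 6.1 TFLOP/s of compute ($6144 \text{ FLOPs/Byte} \times 10^{9} \text{ Byte/s} \approx 6.1 \times 10^{12}$ FLOP/s). Adding bandwidth beyond this ratio yields diminishing returns.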
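
The balance point converts directly between compute and bandwidth. The sketch below only restates the $C/B \le 6144$ FLOPs/Byte arithmetic; the function names and the 1000 TFLOP/s accelerator are hypothetical, chosen purely as an example.

```python
BALANCE_FLOPS_PER_BYTE = 6144  # 2d in the paper's C/B <= 2d

def hidden_compute_tflops(bandwidth_gb_s):
    """Compute rate whose communication a given bandwidth can hide."""
    return bandwidth_gb_s * 1e9 * BALANCE_FLOPS_PER_BYTE / 1e12

def min_bandwidth_gb_s(peak_tflops):
    """Smallest bandwidth that satisfies C/B <= 6144 for a given peak compute."""
    return peak_tflops * 1e12 / BALANCE_FLOPS_PER_BYTE / 1e9

print(hidden_compute_tflops(1.0))           # TFLOP/s hidden per 1 GB/s
print(round(min_bandwidth_gb_s(1000.0), 1)) # GB/s for a hypothetical 1000 TFLOP/s chip
```

Bandwidth above the value returned by `min_bandwidth_gb_s` no longer shortens the MoE layer under this model, because communication is already fully hidden behind compute.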

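FP4 weights leave dispatch volume unchanged [Derived]

After V4 quantizes expert weights to FP4 (Paper §5.2), tokens in the dispatch stage can be transmitted in FP4 / FP8; the legend of Figure 5 in Paper §3.1 reads "FP8 Dispatch + BF16 Combine", so the per-token accounting is:

V3: FP8 dispatch + BF16 combine $\approx 3h$ Bytes / token / expert
V4: FP8 dispatch + BF16 combine $\approx 3h$ Bytes / token / expert (same accounting as V3)

The absolute communication volume is unchanged, but the frequency of small messages rises with wave scheduling, pushing traffic closer to the knee where the network hardware's alpha term takes over.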
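
The $3h$ figure is one byte per element for the FP8 dispatch plus two bytes per element for the BF16 combine. The sketch below works that out; $h = 7168$ is used only as an example value (the Pro hidden size listed in this document), and the FP4-dispatch row is a hypothetical what-if, not something the paper reports.

```python
def bytes_per_token_per_expert(h, dispatch_bytes_per_elem=1.0, combine_bytes_per_elem=2.0):
    """Dispatch + combine payload for one token routed to one expert."""
    return h * (dispatch_bytes_per_elem + combine_bytes_per_elem)

h = 7168  # example hidden size
fp8_dispatch = bytes_per_token_per_expert(h)                               # 3h
fp4_dispatch = bytes_per_token_per_expert(h, dispatch_bytes_per_elem=0.5)  # 2.5h, hypothetical
print(fp8_dispatch, fp4_dispatch, fp4_dispatch / fp8_dispatch)
```

Even in the what-if case the total drops only from $3h$ to $2.5h$, because the BF16 combine dominates the round trip.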

How Context Parallelism is done

Core question: which primitives does V4's CP use? Do CSA and HCA share one scheme?

Paper §3.4.3, titled "Contextual Parallelism for Long-Context Attention", describes V4's CP implementation explicitly. CSA and HCA share the same CP communication scheme.

Stage 1: P2P boundary KV exchange

Paper text: "each rank $i$ sends its last $m$ uncompressed KV entries to rank $i+1$. Then, rank $i+1$ compresses some of these received entries together with its local $s$ uncompressed KV entries, producing a fixed length of $s/m + 1$ compressed entries"

Each rank sends its last $m$ (CSA) or $m'$ (HCA) uncompressed KV entries to the next rank; the receiver compresses these boundary KV entries together with its local KV. Communication primitive: P2P send / recv (the same as PP's P2P).

Stage 2: AllGather compressed KV

Paper text: "an all-gather operation across all CP ranks collects the locally compressed KV entries. Then, a fused select-and-pad operator reorganizes them into the full set of compressed KV entries with a total length of cp_size $\cdot s/m$"

All ranks AllGather their compressed KV to obtain the complete global set of compressed KV. Communication primitive: AllGather (the same as TP-SP's AllGather).

Local selection (no communication)

After the AllGather completes, each rank decides locally, by rule or by top-k, which entries each query attends to:

HCA and the CSA indexer: the visible range is precomputed by rules ("the visible range of compressed KV entries for each query token can be precomputed by rules")
CSA sparse attention: "the top-$k$ selector explicitly specifies the indices of visible compressed KV entries for each query"

Relationship to the general CP document

The general CP document derives the theory for CP under heterogeneous attention; for example, sparse top-k attention under CP can use distributed top-k plus a sparse AllGather. The V4 paper chose a simpler scheme: AllGather all compressed KV first (CSA's 4× compression already makes the data small), then run top-k locally. This avoids introducing a new communication primitive.
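
The "AllGather first, select locally" choice can be sketched without any communication library. This is a toy host-side model: `all_gather` simulates the collective with list concatenation, `local_top_k` is a hypothetical helper, and the scores are made-up numbers standing in for indexer outputs.

```python
import heapq

def all_gather(per_rank_entries):
    """Simulated AllGather: every rank ends up with the rank-ordered concatenation."""
    gathered = [e for entries in per_rank_entries for e in entries]
    return [list(gathered) for _ in per_rank_entries]

def local_top_k(scores, k):
    """Indices of the k highest-scoring compressed entries for one query."""
    best = heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)
    return sorted(best)

# Two CP ranks, each holding scores for its locally compressed entries.
views = all_gather([[0.1, 0.9], [0.5, 0.7]])
selected = local_top_k(views[0], k=2)  # runs on rank 0 with no further traffic
print(selected)
```

After the gather every rank holds the same global view, so the top-k step needs no additional exchange; the price is that each rank receives all compressed entries, including those it will not select.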

CP communication volume analysis

Taking V4-Pro with 1M context and CP degree $N$ as an example:

Stage	Communication primitive	Message size (per rank)	Notes
Stage 1: P2P boundary KV	P2P send / recv	CSA: $m \times c \times s_{\text{dtype}}$ ($m=4, c=512$) $\approx 4$ KB; HCA: $m' \times c \times s_{\text{dtype}}$ ($m'=128$) $\approx 128$ KB (both figures correspond to $s_{\text{dtype}}=2$, i.e. BF16)	Tiny, negligible
Stage 2: AllGather compressed KV	AllGather	CSA: $\frac{S}{N \cdot m} \times c \times s_{\text{dtype}}$ ($S=1$ M, $m=4$); HCA: $\frac{S}{N \cdot m'} \times c \times s_{\text{dtype}}$ ($m'=128$)	CSA volume > HCA volume

@tbl-dsv4-comm-cp-volume Communication volume of the two CP stages

With $N=8$ (8-rank CP) and FP8 ($s_{\text{dtype}}=1$):

CSA Stage 2 per rank: $\frac{1\text{M}}{8 \times 4} \times 512 \times 1 = 16$ MB; 128 MB globally after the AllGather
HCA Stage 2 per rank: $\frac{1\text{M}}{8 \times 128} \times 512 \times 1 = 512$ KB; 4 MB globally after the AllGather

The CSA AllGather volume is 32 times HCA's ($= m'/m = 128/4$), yet still far below the ring communication volume of dense attention ($\sim S \cdot d_{\text{model}} = 7$ GB).
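
The per-rank and global sizes can be reproduced with a few lines of arithmetic. In this sketch 1M is read as $2^{20}$ tokens and sizes are in binary units; using 2 bytes per element for Stage 1 is an assumption that reproduces the 4 KB / 128 KB figures, and the function names are illustrative.

```python
KIB, MIB = 1 << 10, 1 << 20

def stage1_bytes(ratio, c=512, dtype_bytes=2):
    """Boundary KV sent to the next rank: `ratio` uncompressed entries."""
    return ratio * c * dtype_bytes

def stage2_bytes_per_rank(seq_len, cp_size, ratio, c=512, dtype_bytes=1):
    """Compressed KV contributed by one rank to the AllGather."""
    return seq_len // (cp_size * ratio) * c * dtype_bytes

S, N = 1 << 20, 8  # 1M tokens, CP degree 8
csa = stage2_bytes_per_rank(S, N, ratio=4)
hca = stage2_bytes_per_rank(S, N, ratio=128)
print(csa // MIB, "MiB per rank,", csa * N // MIB, "MiB global (CSA)")
print(hca // KIB, "KiB per rank,", hca * N // MIB, "MiB global (HCA)")
```

Changing `cp_size` leaves the global AllGather size untouched and only changes each rank's contribution, which is why the global figures depend on the compression ratio alone.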

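Cross-cluster KV reuse: three on-disk strategies [Paper §3.5.2]

Core question: how does V4's shared-prefix reuse scheme trade off storage, network, and recomputation?

Paper §3.5.2 is devoted to V4's on-disk KV cache storage, designed to reuse KV across clusters in shared-prefix scenarios (multiple requests sharing a prefix). V3 has no such section. It is the only new design in the V4 communication stack that directly involves large-scale disk and cross-cluster networking.

Traffic asymmetry between SWA and compressed KV

Cache type	On-disk volume
CSA Indexer KV + CSA Main KV + HCA KV	Already compressed by $m$ / $m'$; small
SWA KV	Every layer stores the most recent 128 tokens, uncompressed; roughly 8× the compressed KV

@tbl-dsv4-comm-swa-vs-compressed On-disk volume of SWA vs compressed KV

SWA is the main bandwidth consumer in on-disk reuse.

Three SWA on-disk strategies

Strategy	Storage	Network reads	Recomputation	Best suited for
Full SWA Caching	Full SWA per token per layer written to disk	Reads only the KV of the most recent $n_{\text{win}}$ tokens	None	Compute-constrained deployments; but SSD write amplification is severe
Periodic Checkpointing	One checkpoint every $p$ tokens	Loads the most recent checkpoint	Recomputes the tail up to $n_{\text{win}}$	Default strategy; $p$ tunes the trade-off
Zero SWA Caching	SWA not written to disk	Reads only the CSA/HCA compressed KV	Recomputes SWA for the last $n_{\text{win}} \cdot L$ tokens	Severely storage-constrained deployments

@tbl-dsv4-comm-on-disk The three on-disk SWA KV strategies in V4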
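
To give a sense of scale for the Zero SWA Caching row, the recomputation span $n_{\text{win}} \cdot L$ can be worked out from numbers given elsewhere in this document. This is a rough sketch under stated assumptions: $n_{\text{win}} = 128$ is taken from the SWA window mentioned above, the layer counts 61 and 43 are the Pro and Flash values from the configuration table, and the helper name is hypothetical.

```python
N_WIN = 128  # SWA window, in tokens

def zero_swa_recompute_tokens(num_layers, n_win=N_WIN):
    """Tokens whose SWA state is recomputed when no SWA is cached on disk."""
    return n_win * num_layers

pro = zero_swa_recompute_tokens(61)
flash = zero_swa_recompute_tokens(43)
share_of_1m = pro / (1 << 20)  # relative to a 1M-token shared prefix
print(pro, flash, f"{share_of_1m:.2%}")
```

For a prefix near 1M tokens the recomputed span is under one percent of the context, which is the intuition behind trading SWA reads for local compute.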

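Communication implication of Zero SWA Caching: SWA network reads are replaced entirely by local recomputation, so the demand on cross-cluster RDMA bandwidth is lowest. Prefill-decode disaggregation in the V3 era was estimated to need 100+ GB/s of cross-cluster bandwidth; with KV cache compression plus the Zero SWA option, V4 brings this down to the ~10 GB/s range [Derived]. This significantly relaxes the requirements on cross-node RoCE networks for Chinese domestic chips.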

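Relationship to disaggregated prefill-decode

[Derived] V4's heterogeneous KV layout makes the KV migration protocol for disaggregated deployment more complex:

Element	Migration protocol requirement
State Cache (SWA + tail)	Migrated as one chunk per request; varying tail lengths require a padding protocol
KV Cache (CSA Indexer + Main + HCA)	Migrated per block; block size aligned to lcm($m, m'$)
Three entry types with heterogeneous precision	Metadata describes the precision of each segment (BF16/FP8/FP4); the receiver dequantizes segment by segment

@tbl-dsv4-comm-disaggregated-migration Protocol requirements for disaggregated KV migration

The V4 paper does not disclose disaggregated deployment details, but the heterogeneous layout forces the migration protocol to describe per-segment precision; the simple V3-era protocol of shipping the whole KV tensor in one shot no longer suffices.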
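
One way to picture "metadata that describes per-segment precision" is a small descriptor per segment. The sketch below is purely illustrative: the `Segment` type, the segment names, and the assignment of a dtype to each cache type are hypothetical, since the paper does not publish a migration format. Only the lcm($m, m'$) alignment and the BF16/FP8/FP4 precision set come from the table above.

```python
import math
from dataclasses import dataclass

DTYPE_BITS = {"BF16": 16, "FP8": 8, "FP4": 4}

@dataclass(frozen=True)
class Segment:
    """Hypothetical per-segment descriptor carried in the migration metadata."""
    name: str
    dtype: str
    num_elements: int

    @property
    def num_bytes(self):
        # Round up so that an odd number of FP4 elements still fits.
        return (self.num_elements * DTYPE_BITS[self.dtype] + 7) // 8

def block_tokens(m=4, m_prime=128):
    """Block size in tokens, aligned to both compression ratios."""
    return math.lcm(m, m_prime)

def payload_bytes(segments):
    return sum(s.num_bytes for s in segments)

# Hypothetical dtype assignment, for illustration only.
segments = [
    Segment("state_swa", "BF16", 1024),
    Segment("csa_main", "FP8", 1024),
    Segment("hca", "FP4", 1024),
]
print(block_tokens(), payload_bytes(segments))
```

A receiver that walks such a list knows how many bytes to read for each segment and which dequantization to apply, which a single "one tensor, one dtype" header cannot express.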

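What the heterogeneous KV cache layout implies [Paper §3.5.1]

Core question: how does V4's KV cache layout affect inference frameworks and cross-cluster migration?

Paper §3.5.1 describes in detail how V4's KV cache is organized into two independent structures (paper Figure 6):

State Cache (per request): SWA KV + uncompressed tail tokens
KV Cache (block pool): three kinds of compressed entries, CSA Indexer KV / CSA Main KV / HCA KV

Each cache type has its own shape, precision, update frequency, and eviction policy. The paper does not discuss cross-cluster KV migration (the disaggregated prefill-decode scenario), but the heterogeneous layout adds protocol complexity in migration scenarios [Derived].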

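Where bandwidth contention arises under four-way parallelism [Derived]

Core question: how are TP / DP / CP / EP coordinated when they compete for NVLink on the same GPU?

A typical V4 parallel configuration:

Total GPUs = DP × TP × CP × EP

Each GPU plays all four roles at once, and they compete for the NVLink bandwidth of the same node.

Timeline coordination

Sequence of communication events for a single V4 layer during prefill:

[t0] Attention sublayer starts
  [t0~t1] CP communication:
    - Stage 1: P2P boundary KV (tiny)
    - Stage 2: AllGather compressed KV
  [t1~t2] TP AllReduce: merge attention output across TP
[t2] Attention sublayer ends, MoE sublayer starts
  [t2~t3] mHC residual mixing (local, no communication)
  [t3~t4] EP communication (wave-scheduled):
    - Wave 1 dispatch → wave 1 compute → wave 1 combine
    - Wave 2 dispatch ...
[t4] Layer ends

Key bandwidth contention points:

Period	Roles contending for the same NVLink	Mitigation
$t_0 \sim t_2$	CP communication + TP communication	Time-slice (CP first, then TP), or use separate NVLink groups
$t_3 \sim t_4$	Multiple EP waves	Multi-channel isolation
Across layers	Backward DP gradients (training)	Staggered against the forward pass; usually no conflict

@tbl-dsv4-comm-bw-contention Bandwidth contention points under four-way parallelism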

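Topology constraints

V4's implicit requirements on deployment topology [Derived]:

TP within node + CP within node: both consume NVLink and need high intra-node bandwidth (NVL72 / NVSwitch class)
EP across nodes: cross-node All-to-All over IB / RoCE
CP may also span nodes: CP AllGather traffic is modest (compressed KV) and not latency-sensitive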

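CP vs PP

PP can also spread the memory cost of a very long sequence across GPUs, but it does so differently:

Dimension	CP	PP
What is partitioned	Sequence tokens	Model layers
Memory effect	Per-layer activation shrinks by $N_{\text{CP}}$×	Layers held per stage shrink by $N_{\text{PP}}$×
Communication pattern	Two-stage: P2P + AllGather [Paper §3.4.3]	P2P activation transfer between stages
Bubble	None	Yes (fraction $\approx (N_{\text{PP}}-1)/(M+N_{\text{PP}}-1)$ with $M$ micro-batches under a GPipe/1F1B-style schedule)
Best suited for	Long sequences	Deep models

@tbl-dsv4-comm-cp-vs-pp CP vs PP

The two can be combined. For example, with 1M context and a 60-layer model, CP=8 + PP=4 gives each GPU 1/8 of the sequence × 15 layers.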
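
The CP=8 + PP=4 example can be checked with a short calculation. This sketch reads 1M as $2^{20}$ tokens; the bubble estimate uses the standard GPipe/1F1B formula $(N_{\text{PP}}-1)/(M+N_{\text{PP}}-1)$, which is an assumption here because V4's actual pipeline schedule may differ, and the 12 micro-batches are a made-up example value.

```python
def per_gpu_share(seq_len, num_layers, cp, pp):
    """Tokens and layers held by one GPU under combined CP and PP."""
    if seq_len % cp or num_layers % pp:
        raise ValueError("sequence and layers must divide evenly")
    return seq_len // cp, num_layers // pp

def bubble_fraction(pp, micro_batches):
    """Idle share of a GPipe/1F1B-style pipeline step (standard estimate)."""
    return (pp - 1) / (micro_batches + pp - 1)

tokens, layers = per_gpu_share(1 << 20, 60, cp=8, pp=4)
print(tokens, layers)          # tokens and layers per GPU
print(bubble_fraction(4, 12))  # with a hypothetical 12 micro-batches
```

CP shrinks the first number without adding idle time, while PP shrinks the second at the cost of a bubble that only more micro-batches can amortize.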

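Indirect communication effects of mHC / MTP

Core question: what extra communication cost do mHC's 4-way residual and the MTP module bring?

The hidden cost of mHC

mHC's 4-way residual has almost no effect on cross-rank communication (it is token-wise local; see 02-mhc), but:

Activation memory grows 4× → indirectly forces finer TP partitioning → TP AllReduce frequency rises
Paper §3.4.2 reports that mHC's wall-time overhead is only 6.7% (via fused kernels + recomputation + DualPipe adjustments)

Extra communication from MTP

MTP depth=1 adds one mini-block to every decode step:

One extra TP AllReduce + MoE All-to-All per layer
Message granularity is small; alpha dominates
Fallback on rejection costs almost nothing (only local logits are discarded)

Practical effect: decode TPOT rises slightly, but the net gain is positive when the acceptance rate is high.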

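Where Pro and Flash differ in communication

Core question: how much does deployment-side communication pressure differ between the two variants?

The two variants share every architectural innovation (mHC / CSA / HCA / Muon / MoE framework) and differ only in scale and a few key parameters.

Key parameter differences (communication-related only)

Dimension	Flash	Pro	Ratio
Layers $L$	43	61	1.42×
Hidden $d$	4096	7168	1.75×
Attention in the first 2 layers	Pure SWA	Pure HCA	Different mode
CSA top-k	512	1024	2×
MoE routed experts	256	384	1.5×
Attention query heads	64	128	2×
Query compression dim $d_c$	1024	1536	1.5×

@tbl-dsv4-comm-pro-vs-flash-cfg Communication-related configuration differences between Pro and Flash

Deployment-side impact

Dimension	Flash	Pro
Inference hardware threshold	13B activated → fits on a single node	49B activated → requires multi-node TP
CP AllGather volume (CSA, 1M, CP=8)	~9 MB per rank	~16 MB per rank
CP communication in the first 2 layers	Almost free (SWA is local)	Requires AllGather of HCA compressed KV
EP partitioning granularity	256 experts	384 experts, needs wider EP
CSA top-k communication volume	No extra communication (local selection)	No extra communication (local selection)

@tbl-dsv4-comm-pro-vs-flash-deploy Deployment-side communication pressure: Pro vs Flash

Note: in the paper's scheme, CSA top-k selection is a local operation after the AllGather, so Pro's top-k=1024 vs Flash's top-k=512 does not affect communication volume, only local compute.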

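Hardware recommendations for next-generation Chinese domestic chips

Core question: how do hardware priorities change from the V3 era to the V4 era?

If the interconnect of a next-generation domestic chip is designed with V4 as the target workload:

Dimension	V3-era recommendation	V4-era recommendation	Source
Peak interconnect bandwidth	The higher the better	Once the 6144 FLOPs/Byte balance point is met, invest in compute instead	[Paper §3.1]
Small-message startup latency	Medium priority	High priority (wave-scheduled + decode-dominated)	[Paper §3.1]
RDMA Read capability	Optional	Mandatory (pull-based dispatch needs the GPU to read remote memory)	[Paper §3.1]
Multiple channels / QPs	4-8 suffice	16+ (matching the wave count)	[Derived]
Cross-cluster RDMA	Medium demand	Medium demand (V4 caches are small; migration pressure drops)	[Derived]
Scatter-gather DMA	Medium demand	High priority (PagedAttention + heterogeneous cache)	[Derived]

@tbl-dsv4-comm-hw-priorities Hardware requirement priorities: V3 era vs V4 era

Changes relative to the previous version of these recommendations:

Added RDMA Read: the paper explicitly adopts pull-based dispatch, a hardware requirement V3 did not have
Removed "adaptive routing is mandatory": the paper does not discuss network routing strategy; "sparse All-to-All needs adaptive routing" is an external derivation, downgraded to [Derived]
Removed "distributed top-k acceleration": the paper's CP scheme uses AllGather + local top-k and needs no new communication primitive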

Takeaway

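Topic	Core conclusion
No new primitives	V4 introduces no new communication primitives; what changes is the scheduling and traffic pattern of existing ones
Five-stage MoE layer	Dispatch / L1 / Act+Cast / L2 / Combine; communication time < computation time is the precondition for overlap
Wave-scheduled EP	Many waves of small pipelined messages; alpha dominates; needs 16+ concurrent channels
Pull-based dispatch	The GPU issues RDMA Reads against remote memory, bypassing high-latency notifications
Two-stage CP	P2P boundary KV (4-128 KB) + AllGather of compressed KV (4-128 MB)
Routing constraint removed	The All-to-All send matrix shifts from balanced to sparse + long-tail, matching `MPI_Alltoallv`
Hardware balance point	$C/B \le 2d = 6144$ FLOPs/Byte; bandwidth beyond this point yields diminishing returns
Four hardware proposals	C/B ratio / Power Budget / low-latency signaling / a lightweight activation to replace SwiGLU
Cross-cluster on-disk	Three strategies: Full / Periodic-$p$ / Zero SWA; SWA takes 8× the on-disk volume but can be eliminated in exchange for recomputation
Disaggregated KV migration	The heterogeneous layout forces the migration protocol to describe per-segment precision
Hardware priorities	Small-message latency + RDMA Read + multiple channels are high priority; bandwidth is secondary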
