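EP All-to-All Communication Modeling

How the α-β formula yields Dispatch/Combine latency, and where the NVLink vs. RDMA bottleneck crossover lies

Key points:

Per-token Dispatch / Combine payload formulas (with asymmetric precision)

Whole-batch communication volume, with the load-imbalance factor and the node-limited routing correction

Latency models for Normal vs. Low-latency mode

NVLink-bound vs. RDMA-bound crossover

Model predictions agree with published DeepEP measurements to within 1% for large messages; for small messages the startup term dominates

This section uses the $\alpha$-$\beta$ model to analyze the two AllToAll operations of MoE EP, Dispatch and Combine. It derives the per-token payload, the total communication volume, the latency formulas, and the bottleneck crossover, and compares them with published DeepEP measurements.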

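Prerequisites:

General terms (EP / Dispatch / Combine / top-K) → 8.1 Overview, Glossary
EP routing mechanism → 8.1 Overview
General AllToAll algorithms → 4.9 AllToAll
$\alpha$-$\beta$ model → 6.2 Alpha-Beta Model
DeepEP library implementation → 8.2 DeepEP
EPLB algorithm → 8.3 EPLB
End-to-end measurements on 96 H100s → 8.5 Large-Scale EP Deployment Measurements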

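Notation: $B$ = total number of tokens in the EP group, $K$ = top-K, $h$ = hidden dimension, $s_d$ = bytes per element (FP8 = 1, BF16 = 2), $\alpha$ = startup latency, $\beta$ = effective unidirectional bandwidth (bytes/s), $N$ = number of GPUs in the EP group, $G$ = GPUs per node (typically 8), $M_{\text{node}}$ = maximum number of nodes a single token can reach (4 for DeepSeek-V3).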

How large is the per-token payload?

The number of bytes a single token actually sends during Dispatch depends on the routing distribution:

$$\begin{equation} \text{payload}_{\text{dispatch}}^{\text{token}} = K_{\text{eff}} \cdot h \cdot s_d^{\text{disp}} \label{eq:ep-model-payload-dispatch} \end{equation}$$

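$K_{\text{eff}} \leq K$ is the number of remote GPUs the token actually reaches (multiple selected experts on the same node are merged)
The DeepSeek-V3 report cites an average of 3.2 experts per node as what node-limited routing can serve[1]; because the selected experts cluster on a few nodes, $K_{\text{eff}}$ is well below the naive $K=8$

In Combine, each expert sends its FFN output back to the source GPU. The formula has the same form; only the precision differs: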

$$\begin{equation} \text{payload}_{\text{combine}}^{\text{token}} = K_{\text{eff}} \cdot h \cdot s_d^{\text{comb}} \label{eq:ep-model-payload-combine} \end{equation}$$

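Why asymmetric precision?

DeepSeek-V3 / DeepEP default to $s_d^{\text{disp}} = 1$ byte (FP8 E4M3) and $s_d^{\text{comb}} = 2$ bytes (BF16), so the Combine payload is twice the Dispatch payload[2].

The Combine output goes through the top-K weighted sum straight into the residual stream, where quantization error accumulates, so it needs higher precision
The Dispatch payload is the expert FFN input, whose quantization error can be absorbed by the GEMMs inside the expert, so FP8 suffices

Typical magnitude (DeepSeek-V3, $h=7168$, $K=8$)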

$$\begin{equation} \text{payload}_{\text{disp}} \approx 8 \cdot 7168 \cdot 1 = 57.3\ \text{KB/token} \label{eq:ep-model-payload-disp-example} \end{equation}$$

$$\begin{equation} \text{payload}_{\text{comb}} \approx 8 \cdot 7168 \cdot 2 = 114.7\ \text{KB/token} \label{eq:ep-model-payload-comb-example} \end{equation}$$

These figures take $K_{\text{eff}} = K$ as an upper bound. The actual $K_{\text{eff}}$ is constrained by node-limited routing and is lower on average.
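The per-token payload formulas can be checked with a few lines of arithmetic. The following is a minimal sketch; the function name `per_token_payload_bytes` is illustrative and not part of DeepEP, and it uses $K_{\text{eff}} = K = 8$ as the upper bound.

```python
def per_token_payload_bytes(k_eff: float, hidden: int, bytes_per_elem: int) -> float:
    """Bytes one token sends in one direction: K_eff * h * s_d."""
    return k_eff * hidden * bytes_per_elem


HIDDEN = 7168   # h for DeepSeek-V3
TOP_K = 8       # K, used here as the upper bound for K_eff

dispatch_bytes = per_token_payload_bytes(TOP_K, HIDDEN, 1)  # FP8 dispatch
combine_bytes = per_token_payload_bytes(TOP_K, HIDDEN, 2)   # BF16 combine

print(f"dispatch: {dispatch_bytes / 1e3:.1f} KB/token")
print(f"combine:  {combine_bytes / 1e3:.1f} KB/token")
print(f"combine / dispatch = {combine_bytes / dispatch_bytes:.1f}")
```

Passing a smaller `k_eff` scales both payloads linearly, which is how node-limited routing enters the estimate.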

How large is the whole-batch communication volume?

Uniform-routing baseline

With $B$ tokens in the EP group, the total Dispatch payload across the group is:

$$\begin{equation} M_{\text{total}}^{\text{disp}} = B \cdot K \cdot h \cdot s_d^{\text{disp}} \label{eq:ep-model-total-disp} \end{equation}$$

Average per GPU pair (over all $N^2$ ordered sender/receiver pairs, self-sends included):

$$\begin{equation} m_{\text{pair}}^{\text{disp}} = \frac{B \cdot K \cdot h \cdot s_d^{\text{disp}}}{N^2} \label{eq:ep-model-pair-disp} \end{equation}$$

Volume sent per GPU:

$$\begin{equation} m_{\text{send}}^{\text{disp}} = \frac{B \cdot K \cdot h \cdot s_d^{\text{disp}}}{N} \label{eq:ep-model-send-disp} \end{equation}$$
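As a worked instance of the three uniform-routing formulas, the sketch below evaluates them for $B = 4096$, $K = 8$, $h = 7168$, FP8 dispatch. The choice $N = 64$ and the helper name `dispatch_volumes` are illustrative assumptions.

```python
def dispatch_volumes(b: int, k: int, h: int, s_d: int, n: int) -> dict:
    """Uniform-routing dispatch volumes in bytes."""
    total = b * k * h * s_d
    return {
        "total": total,               # M_total: whole EP group
        "per_pair": total / n ** 2,   # m_pair: average per ordered GPU pair
        "per_gpu_send": total / n,    # m_send: sent by each GPU
    }


volumes = dispatch_volumes(b=4096, k=8, h=7168, s_d=1, n=64)
for name, value in volumes.items():
    print(f"{name:>13}: {value / 1e6:10.3f} MB")
```

With these inputs the per-pair message is the same size as one token's full $K$-way payload, which shows how quickly pairwise messages shrink as $N$ grows.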

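Load-imbalance correction

When routing is non-uniform, the token count $R_{\max}$ received by the most heavily loaded GPU determines when the whole group finishes (synchronous semantics). Define the imbalance factor: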

$$\begin{equation} \eta_{\text{imb}} = \frac{R_{\max}}{R_{\text{avg}}} = \frac{N \cdot R_{\max}}{B \cdot K} \label{eq:ep-model-imb} \end{equation}$$

$\eta_{\text{imb}} = 1$: perfectly uniform
$\eta_{\text{imb}} > 1$: hot experts exist

GShard introduces an expert capacity factor $C$ (typically 1.0 to 1.25) as a hard upper bound; tokens beyond capacity are dropped or overflow to a backup expert[3]:

$$\begin{equation} R_{\max} \leq C \cdot \frac{B \cdot K}{N}, \qquad \eta_{\text{imb}} \leq C \label{eq:ep-model-capacity} \end{equation}$$

Redundant-expert mechanisms such as EPLB replicate hot experts to push $\eta_{\text{imb}}$ back toward 1 (see 8.3 EPLB).
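The imbalance factor and the capacity bound are easy to evaluate once per-GPU receive counts are known. The sketch below uses made-up receive counts for a 4-GPU group ($B = 500$, $K = 8$, so $B \cdot K = 4000$ routed tokens); both the counts and the function names are hypothetical.

```python
def imbalance_factor(recv_tokens_per_gpu: list[int]) -> float:
    """eta_imb = R_max / R_avg over the GPUs of the EP group."""
    r_avg = sum(recv_tokens_per_gpu) / len(recv_tokens_per_gpu)
    return max(recv_tokens_per_gpu) / r_avg


def capacity_limit(b: int, k: int, n: int, c: float) -> float:
    """Upper bound on R_max under capacity factor C: C * B * K / N."""
    return c * b * k / n


# Hypothetical routing outcome: one hot GPU, three lighter ones.
recv = [1200, 1000, 900, 900]
eta = imbalance_factor(recv)
limit = capacity_limit(b=500, k=8, n=4, c=1.25)

print(f"eta_imb = {eta:.2f}")
print(f"R_max = {max(recv)} <= capacity limit {limit:.0f}: {max(recv) <= limit}")
```

Here $\eta_{\text{imb}} = 1.2$ stays under $C = 1.25$, so no token would be dropped under this capacity setting.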

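Node-limited routing correction

DeepSeek-V3 limits each token to at most $M_{\text{node}} = 4$ nodes: the token first crosses nodes over RDMA to the GPU with the same in-node index on the target node, and is then forwarded inside the node over NVLink[1].

Cross-node RDMA traffic therefore aggregates per node rather than per GPU: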

$$\begin{equation} M_{\text{RDMA}}^{\text{disp}} = B \cdot M_{\text{node}} \cdot h \cdot s_d^{\text{disp}} \label{eq:ep-model-rdma-disp} \end{equation}$$

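$M_{\text{RDMA}}$ has no $K$ factor: each token crosses to a given target node only once, and NVLink then fans it out inside the node to the selected experts. Because $M_{\text{node}}$ is a cap, the expression is an upper bound. When $K > M_{\text{node}}$, node-limited routing cuts RDMA traffic substantially; with DeepSeek-V3's $K=8$, $M_{\text{node}}=4$, cross-node traffic is halved relative to sending $K$ copies.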
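The halving claim can be verified directly by comparing the node-limited volume with a naive scheme that sends one copy per selected expert across nodes. This is a small sketch with illustrative function names, using $B = 4096$ and FP8 dispatch.

```python
def rdma_dispatch_bytes(b: int, m_node: int, h: int, s_d: int) -> int:
    """Node-limited upper bound: one copy per target node, at most M_node nodes."""
    return b * m_node * h * s_d


def naive_cross_node_bytes(b: int, k: int, h: int, s_d: int) -> int:
    """Baseline without node aggregation: one copy per selected expert."""
    return b * k * h * s_d


limited = rdma_dispatch_bytes(b=4096, m_node=4, h=7168, s_d=1)
naive = naive_cross_node_bytes(b=4096, k=8, h=7168, s_d=1)

print(f"node-limited: {limited / 1e6:.1f} MB")
print(f"naive K-way:  {naive / 1e6:.1f} MB")
print(f"ratio = {limited / naive:.2f}  (equals M_node / K)")
```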

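What are the latency formulas for Normal and Low-latency mode?

In production deployments (NCCL P2P Direct / DeepEP normal), EP AllToAll issues its sends and receives concurrently, so the theoretical latency term is close to a single $\alpha$.

Normal mode (training / prefill): bandwidth-dominated

The DeepEP normal kernel explicitly layers NVLink and RDMA forwarding. Let $\beta_{\text{NV}}$ be the effective unidirectional intra-node NVLink bandwidth, $\beta_{\text{RD}}$ the effective unidirectional cross-node RDMA bandwidth, and $N_n = N / G$ the number of nodes: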

$$\begin{equation} T_{\text{disp}}^{\text{normal}} \approx \alpha_{\text{startup}} + \max\!\left( \frac{m_{\text{NV}}^{\text{disp}}}{\beta_{\text{NV}}},\ \frac{m_{\text{RD}}^{\text{disp}}}{\beta_{\text{RD}}} \right) \cdot \eta_{\text{imb}} \label{eq:ep-model-time-normal} \end{equation}$$

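$m_{\text{NV}}^{\text{disp}} = \dfrac{B \cdot K \cdot h \cdot s_d^{\text{disp}}}{N}$: total intra-node send/receive volume per GPU
$m_{\text{RD}}^{\text{disp}} = \dfrac{B \cdot M_{\text{node}} \cdot h \cdot s_d^{\text{disp}}}{N_n}$: outbound RDMA volume per node

$\alpha_{\text{startup}}$ includes the cost of setting up the chunk pipeline (a few μs) and is negligible next to the RDMA transfer time (hundreds of μs up to the millisecond range).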
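The normal-mode formula translates into a short function. The sketch below is one possible reading of it: it treats a single-node group as having no RDMA term, and the 5 μs startup value is a placeholder for "a few μs", not a measured figure. The bandwidths are the effective values this section uses for H800 + CX7.

```python
def dispatch_time_normal(b, k, h, s_d, n, g, m_node,
                         beta_nv, beta_rd, alpha=0.0, eta_imb=1.0):
    """T = alpha + max(m_NV / beta_NV, m_RD / beta_RD) * eta_imb, in seconds."""
    n_nodes = max(1, n // g)
    m_nv = b * k * h * s_d / n                      # per-GPU NVLink volume
    # Assumption: no cross-node traffic when the group fits in one node.
    m_rd = b * m_node * h * s_d / n_nodes if n_nodes > 1 else 0.0
    return alpha + max(m_nv / beta_nv, m_rd / beta_rd) * eta_imb


t_uniform = dispatch_time_normal(
    b=4096, k=8, h=7168, s_d=1, n=32, g=8, m_node=4,
    beta_nv=153e9, beta_rd=51e9,
    alpha=5e-6,          # placeholder startup cost
)
t_imbalanced = dispatch_time_normal(
    b=4096, k=8, h=7168, s_d=1, n=32, g=8, m_node=4,
    beta_nv=153e9, beta_rd=51e9, alpha=5e-6, eta_imb=1.25,
)
print(f"EP=32 uniform:        {t_uniform * 1e6:.0f} us")
print(f"EP=32 eta_imb = 1.25: {t_imbalanced * 1e6:.0f} us")
```

The startup term contributes under 1% here, which is why the model treats normal mode as bandwidth-dominated.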

Low-latency mode (decode): latency-dominated

Decode batches are tiny and do not saturate bandwidth, so traffic takes a pure RDMA path (no NVLink forwarding):

$$\begin{equation} T_{\text{disp}}^{\text{LL}} \approx \alpha_{\text{IBGDA}} + \frac{B \cdot K \cdot h \cdot s_d^{\text{disp}}}{N \cdot \beta_{\text{RD}}^{\text{small}}} \label{eq:ep-model-time-ll} \end{equation}$$

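$\alpha_{\text{IBGDA}}$: startup latency of issuing an RDMA Put from inside the kernel via IBGDA (no CPU submission involved); for reference, end-to-end dispatch at EP=8 typically takes 77 μs
$\beta_{\text{RD}}^{\text{small}}$: effective RDMA bandwidth for small messages; because the message is below the characteristic size $n_0$, it sits on the rising part of an S-shaped curve (see 6.2 Alpha-Beta Model)

The Combine formulas have the same form with $s_d^{\text{disp}}$ replaced by $s_d^{\text{comb}}$. Because of the asymmetric precision, Combine moves twice the bytes of Dispatch and its theoretical latency is also close to 2×; in DeepEP's published data, Combine latency is 1.5 to 2 times that of Dispatch[2].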
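A small calculation shows why the Combine/Dispatch latency ratio lands between 1 and 2 rather than at exactly 2: the shared startup term dilutes the 2× payload difference. In the sketch below the values of `alpha` and `beta_small` are hypothetical placeholders, since this section does not give numeric values for $\alpha_{\text{IBGDA}}$ or $\beta_{\text{RD}}^{\text{small}}$.

```python
def time_low_latency(b, k, h, s_d, n, beta_small, alpha):
    """T_LL = alpha + B*K*h*s_d / (N * beta_small), in seconds."""
    return alpha + b * k * h * s_d / (n * beta_small)


cfg = dict(b=128, k=8, h=7168, n=8, beta_small=50e9)  # beta_small: placeholder

# With a startup term (placeholder 10 us) the ratio falls below 2.
t_disp = time_low_latency(s_d=1, alpha=10e-6, **cfg)
t_comb = time_low_latency(s_d=2, alpha=10e-6, **cfg)
ratio_with_alpha = t_comb / t_disp

# Without a startup term the ratio is exactly the payload ratio.
ratio_no_alpha = (time_low_latency(s_d=2, alpha=0.0, **cfg)
                  / time_low_latency(s_d=1, alpha=0.0, **cfg))

print(f"combine/dispatch with alpha:    {ratio_with_alpha:.2f}")
print(f"combine/dispatch without alpha: {ratio_no_alpha:.2f}")
```

The larger the startup term is relative to the transfer term, the closer the ratio moves toward 1.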

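NVLink-bound or RDMA-bound: how to tell?

In EP normal mode, intra-node NVLink and outbound RDMA run concurrently, and completion time is the larger of the two (see $\eqref{eq:ep-model-time-normal}$). The most direct approach is to plug the hardware parameters into both transfer times and compare them.

H800 + CX7 configuration (DeepEP README): $B = 4096$, $K = 8$, $h = 7168$, $s_d^{\text{disp}} = 1$ (FP8), $G = 8$, $M_{\text{node}} = 4$, $\beta_{\text{NV}} \approx 153$ GB/s (measured), $\beta_{\text{RD}} \approx 51$ GB/s (measured).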

EP size $N$	Nodes $N_n$	NVLink path $m_{\text{NV}}/\beta_{\text{NV}}$	RDMA path $m_{\text{RD}}/\beta_{\text{RD}}$	Dominant bottleneck
4	1 (partial node)	58.7 MB / 153 GB/s = 384 μs	0 (no cross-node traffic)	NVLink
8	1 (one full node)	29.4 MB / 153 GB/s = 192 μs	0 (no cross-node traffic)	NVLink
16	2	14.7 MB / 153 GB/s = 96 μs	58.7 MB / 51 GB/s = 1151 μs	RDMA
32	4	7.34 MB / 153 GB/s = 48 μs	29.4 MB / 51 GB/s = 576 μs	RDMA
64	8	3.67 MB / 153 GB/s = 24 μs	14.7 MB / 51 GB/s = 288 μs	RDMA
128	16	1.84 MB / 153 GB/s = 12 μs	7.34 MB / 51 GB/s = 144 μs	RDMA

@tbl-ep-model-01 Single-dispatch latency for different EP sizes on the H800 configuration

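Conclusion: up to EP=8 the group fits in one node and Dispatch is NVLink-bound; from EP=16 on, traffic crosses nodes and the dominant bandwidth switches from NVLink to RDMA. For EP≥16 the NVLink path only carries the intra-node Dispatch fan-out and the main bottleneck stays on RDMA: the per-node outbound RDMA volume shrinks as $1/N_n$ as $N$ grows, but the RDMA term remains about 12× (an order of magnitude) slower than the NVLink term. This matches DeepEP's measurements, where EP=8 runs inside one node at NVLink speed and EP=64 Dispatch is limited to 51 GB/s by RDMA[2].

Intuition: node-limited routing aggregates cross-node traffic per node rather than per GPU (each token sends at most $M_{\text{node}}$ copies across nodes), but every GPU still performs the full $K$-way fan-out inside the node. Effective RDMA bandwidth (51 GB/s) is far below NVLink (153 GB/s), so once traffic crosses nodes RDMA almost inevitably becomes the bottleneck.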
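The comparison of the two paths can be reproduced with a short sweep over EP sizes. This is a sketch under the section's stated parameters; it uses decimal megabytes throughout and assumes zero RDMA traffic when the group fits in one node.

```python
B, K, H, S_D = 4096, 8, 7168, 1     # tokens, top-K, hidden dim, FP8 bytes
G, M_NODE = 8, 4                    # GPUs per node, node limit
BETA_NV, BETA_RD = 153e9, 51e9      # effective bandwidths in bytes/s


def path_times_us(n: int) -> tuple[float, float]:
    """Return (NVLink time, RDMA time) in microseconds for EP size n."""
    n_nodes = max(1, n // G)
    t_nv = B * K * H * S_D / n / BETA_NV * 1e6
    t_rd = 0.0 if n_nodes == 1 else B * M_NODE * H * S_D / n_nodes / BETA_RD * 1e6
    return t_nv, t_rd


for n in (4, 8, 16, 32, 64, 128):
    t_nv, t_rd = path_times_us(n)
    bound = "RDMA" if t_rd > t_nv else "NVLink"
    print(f"EP={n:<4} NVLink {t_nv:7.0f} us   RDMA {t_rd:7.0f} us   -> {bound}")
```

For every multi-node size the RDMA/NVLink time ratio is $(M_{\text{node}} \cdot G / K) \cdot (\beta_{\text{NV}} / \beta_{\text{RD}}) = 4 \cdot 3 = 12$, independent of $N$. In decimal units the EP=8 row evaluates to 29.4 MB and about 192 μs.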

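How closely do model predictions match measurements?

The DeepEP README's published data is listed in 8.2 DeepEP; this section checks the model formulas against a few key points.

Normal kernel, intra-node EP=8

Measured Dispatch bottleneck: 153 GB/s over NVLink. Substituting into $\eqref{eq:ep-model-time-normal}$: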

$$\begin{equation} m_{\text{NV}}^{\text{disp}} = \frac{4096 \cdot 8 \cdot 7168 \cdot 1}{8} \approx 29.4\ \text{MB/GPU} \label{eq:ep-model-nv-disp-example} \end{equation}$$

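The unidirectional NVLink ceiling is 160 GB/s, so 153 / 160 = 95.6% utilization. Taking $\beta_{\text{NV}} = 0.95 \times 160 = 152$ GB/s (to account for protocol efficiency), the predicted bottleneck bandwidth is within 1% of the measurement (152 vs. 153 GB/s).

Normal kernel, cross-node EP=64

Measured Dispatch bottleneck: 51 GB/s over RDMA. Substituting into $\eqref{eq:ep-model-rdma-disp}$: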

$$\begin{equation} M_{\text{RDMA}}^{\text{disp}} / N_n = \frac{4096 \cdot 4 \cdot 7168 \cdot 1}{8} = 14.7\ \text{MB/node} \label{eq:ep-model-rdma-disp-example} \end{equation}$$

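A single CX7 port tops out at 50 GB/s unidirectional. The measured 51 GB/s is slightly above the single-port figure (possibly multi-port aggregation or a difference in how bandwidth is counted) and is consistent with $\beta_{\text{RD}}^{\max} \approx 50$ GB/s.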
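The two large-message checks reduce to simple ratios. The sketch below recomputes them from the figures quoted in this section; the 0.95 efficiency factor is the section's own modeling choice, not an independently measured value.

```python
# Bandwidths in GB/s, as quoted in this section.
NVLINK_PEAK, NVLINK_MEASURED = 160.0, 153.0
RDMA_PEAK, RDMA_MEASURED = 50.0, 51.0

nvlink_utilization = NVLINK_MEASURED / NVLINK_PEAK
beta_nv_model = 0.95 * NVLINK_PEAK                      # modeled effective bandwidth
nvlink_rel_error = abs(beta_nv_model - NVLINK_MEASURED) / NVLINK_MEASURED
rdma_over_peak = RDMA_MEASURED / RDMA_PEAK

print(f"NVLink utilization:    {nvlink_utilization:.1%}")
print(f"Modeled beta_NV:       {beta_nv_model:.0f} GB/s")
print(f"NVLink relative error: {nvlink_rel_error:.2%}")
print(f"RDMA measured / peak:  {rdma_over_peak:.2f}")
```

The relative error comes out below 1% for NVLink, while the RDMA measurement sits about 2% above the nominal single-port figure.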

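Low-latency EP=8 dispatch

Measured: 77 μs at 98 GB/s. Substituting into $\eqref{eq:ep-model-time-ll}$ with $B = 128$ tokens for the whole EP group (the typical low-latency setting is 128 tokens; the per-GPU reading is addressed after the calculation):

$$\begin{equation} \text{payload}_{\text{disp,LL}} = \frac{128 \cdot 8 \cdot 7168 \cdot 1}{8} \approx 917.5\ \text{KB/GPU} \label{eq:ep-model-ll-disp-example} \end{equation}$$

917.5 KB / 98 GB/s ≈ 9.4 μs of pure transfer, about 12% of the measured 77 μs. The remaining roughly 68 μs comes from the $\alpha_{\text{IBGDA}}$ startup overhead and from small-message $\beta$ degradation. This indicates that the $\alpha$ term dominates in low-latency mode, consistent with the $\alpha$-$\beta$ model's criterion that small messages are latency-bound. The estimate is sensitive to how $B$ is read: 98 GB/s × 77 μs ≈ 7.5 MB, close to the 7.34 MB that 128 tokens per GPU would send, and under that reading the transfer term would account for most of the measured time.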

The 98 GB/s single-point Dispatch figure at EP=8 exceeds the 50 GB/s unidirectional ceiling of one CX7 port; at this size each GPU uses more than one NIC channel (see 8.2 DeepEP).
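The low-latency arithmetic can be laid out step by step. The sketch below computes the pure transfer share for $B = 128$ tokens in the whole group and, as a cross-check on how $B$ is read, the number of bytes implied by the reported bandwidth and latency. Variable names are illustrative.

```python
H, K, S_D, N = 7168, 8, 1, 8
MEASURED_TIME_S = 77e-6       # reported dispatch latency at EP=8
MEASURED_BW = 98e9            # reported dispatch bandwidth, bytes/s

# Reading 1: 128 tokens in the whole EP group (B = 128 in the formula).
payload_group_reading = 128 * K * H * S_D / N
transfer_s = payload_group_reading / MEASURED_BW
transfer_share = transfer_s / MEASURED_TIME_S

# Reading 2: 128 tokens on every GPU (B = 128 * N in the formula).
payload_per_gpu_reading = 128 * K * H * S_D

# Bytes implied by the reported numbers themselves.
implied_bytes = MEASURED_BW * MEASURED_TIME_S

print(f"group reading:   {payload_group_reading / 1e3:.1f} KB, "
      f"transfer {transfer_s * 1e6:.1f} us ({transfer_share:.0%} of 77 us)")
print(f"per-GPU reading: {payload_per_gpu_reading / 1e6:.2f} MB")
print(f"implied bytes:   {implied_bytes / 1e6:.2f} MB")
```

Which reading applies decides whether the startup term or the transfer term dominates, so the benchmark's definition of batch size is worth confirming before relying on the 12% figure.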

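Where does the model break down?

Scenario	Error source	Error magnitude
Severely imbalanced routing	$\eta_{\text{imb}}$ is only known after the fact; when the routing distribution is unknown, only the capacity-factor upper bound is available	50 to 200% before EPLB is applied
Decode with tiny batches	$\beta(n)$ is S-shaped for $n \ll n_0$, while the formula assumes a constant	30 to 100%
NVLink and RDMA congested at the same time	$\max(\cdot)$ assumes the two paths run independently and concurrently; in practice there is SM preemption and PCIe contention	10 to 30%
Chunked overlap pipeline	The formula does not capture the serialization overhead at chunk boundaries	5 to 20%
Network-layer congestion-control oscillation	Dynamic behavior such as DCQCN / PFC cannot be expressed in the $\alpha$-$\beta$ model	> 30%

@tbl-ep-model-02 Limitations of the $\alpha$-$\beta$ model on EP AllToAll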
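The small-message row can be illustrated with the plain $\alpha$-$\beta$ model itself: a message of $n$ bytes takes $\alpha + n/\beta$, so the effective bandwidth is $n / (\alpha + n/\beta)$ and reaches half of $\beta$ at $n = \alpha \beta$. The sketch below assumes that this half-performance size plays the role of the characteristic size $n_0$; the numeric values of `ALPHA` and `BETA` are hypothetical.

```python
def effective_bandwidth(n_bytes: float, alpha: float, beta: float) -> float:
    """Bytes per second actually achieved for one message of n_bytes."""
    return n_bytes / (alpha + n_bytes / beta)


def half_performance_size(alpha: float, beta: float) -> float:
    """Message size at which effective bandwidth equals beta / 2."""
    return alpha * beta


ALPHA = 10e-6   # hypothetical startup latency, seconds
BETA = 50e9     # hypothetical asymptotic bandwidth, bytes/s

n0 = half_performance_size(ALPHA, BETA)
print(f"n0 = {n0 / 1e3:.0f} KB")
for n in (n0 / 100, n0 / 10, n0, 10 * n0, 100 * n0):
    share = effective_bandwidth(n, ALPHA, BETA) / BETA
    print(f"n = {n / 1e3:9.0f} KB -> {share:6.1%} of beta")
```

Plotted against $\log n$ the curve is S-shaped, so a constant $\beta$ overestimates bandwidth by a large factor whenever $n \ll n_0$.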

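Relation to packet-level simulation: the formulas here suit fast design-space sweeps, where a single configuration evaluates in seconds. To capture fine-grained behavior such as congestion oscillation or irregular routing, use packet-level simulation (NS-3 / SimAI), at the cost of simulation time rising from seconds to minutes or hours.

Which questions are still open in public material?

Optimal $M_{\text{node}}$ for node-limited routing: DeepSeek-V3's choice of 4 is empirical. Should it shrink at larger EP ($N \geq 256$)? No systematic sweep has been published
Analytical estimate of $K_{\text{eff}}$: it depends on how experts are placed within nodes (EPLB); only Monte Carlo estimates exist so far, with no closed-form formula
Cost of FP8 Combine: there is no published sweep of the accuracy-loss threshold behind the industry practice of forcing BF16 for Combine, so it is unknown whether every MoE needs asymmetric precision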

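Takeaway

Topic	Core conclusion
Per-token payload	$K_{\text{eff}} \cdot h \cdot s_d$; Combine is 2× Dispatch because it uses BF16
Load-imbalance factor	Upper bound on $\eta_{\text{imb}}$ = capacity factor $C$
Node-limited routing	RDMA traffic scales with $M_{\text{node}}$ rather than $K$, which greatly reduces cross-node pressure
Normal mode	Bandwidth-dominated, $\max(\text{NVLink}, \text{RDMA})$
Low-latency mode	Latency-dominated, with a large share from the $\alpha_{\text{IBGDA}}$ term
EP cross-node crossover	EP≤8 stays in one node (NVLink-bound); EP≥16 crosses nodes (RDMA-bound)
Large-message prediction accuracy	Within 1% of measurements
Small-message prediction accuracy	Startup term dominates; pure transfer is only ~12%

@tbl-ep-model-03 Core takeaways of EP communication modeling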

References

[1] DeepSeek-AI, DeepSeek-V3 Technical Report, arXiv:2412.19437, 2024. https://arxiv.org/abs/2412.19437
[2] DeepSeek-AI, DeepEP: an efficient expert-parallel communication library, GitHub README. https://github.com/deepseek-ai/DeepEP
[3] Lepikhin et al., GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding, ICLR 2021. https://arxiv.org/abs/2006.16668
