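Overview

Scope of this chapter: take the Transformer skeleton assembled in chapter 05 and train it on a trillion-token-scale corpus, turning randomly initialized weights into an LLM with language ability. It follows three main threads: the training objective (next-token CLM), the training loop plus data preparation, and scaling laws. Target readers: engineers and researchers. The chapter assumes you have read 05-组装GPT (Assembling GPT) and know the model architecture.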

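Scope and boundaries

Included: CLM cross-entropy loss / shift-by-one / teacher forcing / training data preparation (sources + deduplication + mixture) / training loop (forward + backward + optimizer + lr schedule) / Kaplan 2020 + Chinchilla 2022 + the later evolution of scaling laws / a brief look at emergence.
Not included (each item links out):
- Fine-tuning / alignment (SFT / RLHF / DPO) → 07-微调与对齐; this chapter stops at "pretraining is done and the model has basic language ability"
- Inference / decode / KV cache → 08-推理
- Data privacy / copyright / data governance: engineering topics, not in this chapter
- Full distributed training parallelism (TP / PP / DP / ZeRO) → interconnect/05-LLM并行通信; this chapter only mentions it in passing
- Model architecture choices (depth / width / heads, etc.) → 05-组装GPT/04-block与堆叠
- Quantized training (FP8 / INT8): not in this chapter, mentioned only in passing in the scaling laws section
- Continual pretraining / domain adaptation: mostly engineering, not expanded here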

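Glossary

The sub-documents of this chapter assume the terms below are already defined and do not repeat them. Terms already defined in the parent overview (1 总览), namely Pretraining / CLM / Forward / Backward, are not listed again here.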

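Term	Definition
Cross-entropy loss	The LLM training objective: $\mathcal{L} = -\sum_t \log P(x_t \mid x_{<t})$, which measures how far the model's predicted distribution is from the true next token
Perplexity	$\exp(\mathcal{L} / N)$, where $N$ in this formula is the number of predicted tokens (not the parameter count used in $C \approx 6ND$); the average per-token "confusion" of the model and a standard LLM evaluation metric
Shift-by-one	Training data uses input = `[t_1, ..., t_s]` and label = `[t_2, ..., t_{s+1}]`, so position $i$ predicts the next token; combined with the causal mask, the loss at all positions is computed in parallel
Teacher forcing	During training the whole ground-truth sequence is fed to the model and every position predicts its next token independently, with no sequential generation (see 04-04 因果掩码)
Token budget ($D$)	Total number of training tokens in pretraining: 15T for Llama 3, 300B for GPT-3, 1.4T for Chinchilla 70B
Compute budget ($C$)	Total training compute in FLOPs; the classic approximation is $C \approx 6ND$ ($N$ = model parameters, $D$ = number of tokens)
Optimal allocation	The core question of Chinchilla: given $C$, how to split it between $N$ and $D$ so that the loss is minimized
Emergence	Abilities that appear abruptly once model scale passes some threshold (in-context learning / CoT reasoning); whether the effect is real is still debated in the field
Loss spike	A sudden jump in the loss during training, common when training large models; typically handled by annealing the LR and skipping the offending data

@tbl-pretrain-glossary Shared terms for this chapter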
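The first three glossary entries fit together in a few lines. The following is a minimal pure-Python sketch, not the implementation used in the sub-documents: the helper names (`shift_by_one`, `clm_loss`) and the toy vocabulary of size 4 are assumptions made for illustration, and uniform logits stand in for a real model's output.

```python
import math

def shift_by_one(tokens):
    """Split a sequence of s+1 tokens into inputs [t_1..t_s] and labels [t_2..t_{s+1}]."""
    return tokens[:-1], tokens[1:]

def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def clm_loss(logits_per_position, labels):
    """Summed cross-entropy: L = -sum_t log P(x_t | x_<t)."""
    loss = 0.0
    for logits, label in zip(logits_per_position, labels):
        loss -= math.log(softmax(logits)[label])
    return loss

# Toy sequence over a hypothetical vocabulary of size 4.
tokens = [0, 2, 1, 3]
inputs, labels = shift_by_one(tokens)

# Stand-in for a model: uniform logits at every position (assumption).
vocab_size = 4
uniform_logits = [[0.0] * vocab_size for _ in inputs]

loss = clm_loss(uniform_logits, labels)
perplexity = math.exp(loss / len(labels))  # exp(L / N), N = predicted tokens
print(inputs, labels, perplexity)
```

A model that predicts uniformly over the vocabulary has a perplexity equal to the vocabulary size, which is a useful sanity check on the value right after random initialization.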

Sub-document index

Ordered as "objective → process → laws":

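Doc	One-line summary	Core technical content owned (covered in depth)	Negative boundary (not covered)
02-语言建模目标 (language-modeling objective)	Cross-entropy loss + shift-by-one + perplexity	CLM cross-entropy formula → shift-by-one data construction → numerical implementation of softmax + cross-entropy → definition and meaning of perplexity → LM head + weight tying reuse (see 03-token-embedding) → brief notes on temperature and label smoothing → training-signal density of CLM vs. MLM (see 04-因果掩码)	Does not cover the full BERT MLM training objective; does not cover SFT / RLHF losses (these belong to chapter 07); does not cover reasoning RL (the o1 / R1 family)
03-训练循环与数据 (training loop and data)	Data preparation + forward/backward/optimizer + lr schedule	Training data sources (CommonCrawl / C4 / FineWeb / RedPajama) → deduplication (MinHash / SimHash) → data mixture (code / math / multilingual) → the data wall → batch size + gradient accumulation → default AdamW configuration → lr schedule (linear warmup + cosine decay) → loss spikes and skip mechanisms → mixed precision (bf16)	Full distributed parallelism (TP/PP) is linked out to interconnect; data privacy, copyright, and RLHF data are all out of scope; QAT (quantization-aware training) is not covered
04-scaling laws	Kaplan 2020 → Chinchilla 2022 → later revisions	Derivation of $C \approx 6ND$ → the three Kaplan 2020 power laws (N/D/C) → how Kaplan and Chinchilla differ on "optimal allocation" → the Chinchilla result $N_{\text{opt}} \propto C^{0.49}$, $D_{\text{opt}} \propto C^{0.51}$ → empirical comparison of Chinchilla 70B vs. GPT-3 175B → later revisions of Chinchilla by DeepSeek, Llama 3, and others → the data wall problem → the emergence debate (the Schaeffer 2023 rebuttal) → inference-aware scaling (the engineering motivation for Llama 3 departing from Chinchilla)	No full mathematical proofs; no multimodal scaling; no compute estimates for RLHF / SFT

@tbl-pretrain-index Sub-document index (with boundary contracts)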
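The "optimal allocation" thread in 04-scaling laws can be previewed with a small calculation. This is a rough sketch under two stated assumptions, not the Chinchilla fitting procedure: it treats the exponents 0.49 / 0.51 as approximately 0.5 / 0.5, so that $D$ is a fixed multiple of $N$, and it takes that multiple from the Chinchilla 70B / 1.4T data point in the glossary. The function name `compute_optimal_split` is hypothetical.

```python
import math

# Tokens per parameter implied by Chinchilla 70B trained on 1.4T tokens (about 20).
TOKENS_PER_PARAM = 1.4e12 / 70e9

def compute_optimal_split(compute_flops, tokens_per_param=TOKENS_PER_PARAM):
    """Solve C = 6 * N * D with D = r * N, giving N = sqrt(C / (6 r)).

    Assumption: a fixed tokens-per-parameter ratio r, which corresponds to
    exponents of exactly 0.5 / 0.5 rather than the fitted 0.49 / 0.51.
    """
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Round trip: the Chinchilla budget should give back 70B params and 1.4T tokens.
chinchilla_flops = 6.0 * 70e9 * 1.4e12
n_chinchilla, d_chinchilla = compute_optimal_split(chinchilla_flops)

# The GPT-3 budget (3.14E23 FLOPs) under the same ratio.
n_gpt3_budget, d_gpt3_budget = compute_optimal_split(3.14e23)
print(f"{n_gpt3_budget:.3g} params, {d_gpt3_budget:.3g} tokens")
```

Under these assumptions the GPT-3 compute budget would be spent on a model of roughly 51B parameters trained on roughly 1T tokens, which is the intuition behind the Chinchilla 70B vs. GPT-3 175B comparison: a smaller model with more data for the same compute.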

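Where pretraining sits in the overall LLM pipeline

This chapter covers the stage that turns the skeleton into an LLM with actual capabilities: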

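Stage	Input	Output	Share of compute
Pretraining (this chapter)	Massive unlabeled text + randomly initialized model	A "base model" with basic language ability	$\sim 99\%$
Fine-tuning (chapter 07)	High-quality instruction data + base model	An instruct model that follows instructions	$\sim 1\%$
Alignment (chapter 07)	Preference data + instruct model	A safe and useful aligned model	$< 1\%$
Inference (chapter 08)	Aligned model + user input	Generated responses	(user-side, ongoing)

@tbl-pretrain-stage-cost Compute share across the four stages of the LLM lifecycle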

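Pretraining consumes about 99% of the training compute: GPT-3 was trained with 3.14E23 FLOPs, and Llama 3 405B with 3.8E25 FLOPs. This is the root of the "compute is power" dynamic of the LLM era.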
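The quoted FLOPs figures can be sanity-checked against the glossary approximation $C \approx 6ND$. The sketch below uses only numbers that appear in this overview (175B and 405B parameters, 300B and 15T tokens); `train_flops` is a hypothetical helper name, and the token counts are the rounded values from the glossary, so only approximate agreement is expected.

```python
def train_flops(n_params, n_tokens):
    """Classic estimate of total training compute: C ~= 6 * N * D."""
    return 6.0 * n_params * n_tokens

# GPT-3: 175B parameters, 300B tokens.
gpt3_flops = train_flops(175e9, 300e9)

# Llama 3 405B: 405B parameters, rounded 15T token budget from the glossary.
llama3_flops = train_flops(405e9, 15e12)

print(f"GPT-3:        {gpt3_flops:.3g} FLOPs (quoted: 3.14e23)")
print(f"Llama 3 405B: {llama3_flops:.3g} FLOPs (quoted: 3.8e25)")
```

The GPT-3 estimate lands almost exactly on the quoted figure, and the Llama 3 estimate is within about 5% of it, which is as close as the rounded token count allows.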

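Seams with external topics

External topic	Where this chapter references it	Target
Distributed training parallelism (TP / PP / DP / ZeRO)	End of 03-训练循环	interconnect/05-LLM并行通信
Model architecture choices	Mentioned in passing in 04-scaling laws	05-组装GPT/04-block与堆叠
KV cache inference optimization	Not in this chapter	08-推理/03-kv-cache
Fine-tuning and alignment	End of 04-scaling laws	07-微调与对齐
Industry model comparison	End of 04-scaling laws	knowledge/01-业界动态
Quantization (FP8/INT8)	Not in this chapter	08-推理/05-量化简介

@tbl-pretrain-external External links for related topics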

References

Teaching blueprint:

Sebastian Raschka. Build a Large Language Model (From Scratch). Manning, 2024. Chapter 5.

Key papers:

Kaplan et al. Scaling Laws for Neural Language Models. 2020. https://arxiv.org/abs/2001.08361
Hoffmann et al. Training Compute-Optimal Large Language Models (Chinchilla). 2022. https://arxiv.org/abs/2203.15556
Brown et al. Language Models are Few-Shot Learners (GPT-3). NeurIPS 2020. https://arxiv.org/abs/2005.14165
Meta AI. The Llama 3 Herd of Models. 2024. https://arxiv.org/abs/2407.21783

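Cross-references within the chapter: the parent overview (1 总览) provides the map of the whole chapter, the shared terms, and the shape conventions.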
