Add files via upload

FonaTech · web-flow · commit f3ef1938f073 · 2026-04-23T21:35:32.000+08:00
diff --git a/README.md b/README.md
@@ -137,13 +137,13 @@ flowchart LR
 Before M2, the lookahead head was just a head with no real supervision. M2 adds a proper soft-target objective:
 
 $$
-\mathcal{L}_{\text{lookahead}}
-= \frac{1}{|\mathcal{K}_{\text{valid}}|}
-\sum_{k \in \mathcal{K}_{\text{valid}}}
+\mathcal{L}_{\mathrm{lookahead}}
+= \frac{1}{|\mathcal{K}_{\mathrm{valid}}|}
+\sum_{k \in \mathcal{K}_{\mathrm{valid}}}
 \mathbb{E}_{b,t}
 \left[
   - \sum_{e=1}^{E}
-  \operatorname{stopgrad}\!\left(P_{b,t+k,e}\right)
+  \mathrm{sg}\!\left(P_{b,t+k,e}\right)
   \log Q_{b,t,e}^{(k)}
 \right].
 $$
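
The lookahead objective above can be sketched in plain Python to make the indexing concrete. This is an illustrative sketch, not the Chronos implementation: the function name `lookahead_loss`, the nested-list layout, the `eps` smoothing term, and the rule that a horizon `k` is valid only when `k < T` are all assumptions. There is no autodiff here, so `sg(P)` simply means the target is read as a constant.

```python
import math

def lookahead_loss(P, Q, eps=1e-12):
    """Soft-target cross entropy, averaged over valid horizons k.

    P: router probabilities, nested list indexed [b][t][e].
    Q: dict mapping horizon k -> lookahead predictions indexed [b][t][e],
       where Q[k][b][t] is the prediction made at step t for step t + k.
    """
    T = len(P[0])
    per_horizon = []
    for k, Qk in Q.items():
        if k >= T:
            continue  # assumption: no (t, t + k) pair fits, so k is not valid
        total, count = 0.0, 0
        for b in range(len(P)):
            for t in range(T - k):
                target = P[b][t + k]  # sg(P): treated as a constant target
                pred = Qk[b][t]
                total += -sum(p * math.log(q + eps) for p, q in zip(target, pred))
                count += 1
        per_horizon.append(total / count)  # E_{b,t} for this horizon
    if not per_horizon:
        return 0.0
    return sum(per_horizon) / len(per_horizon)  # 1/|K_valid| * sum over k

# One sequence, three steps, two experts; the router alternates experts.
P = [[[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]]
uniform = {1: [[[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]]}
exact = {1: [[[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]]}
print(lookahead_loss(P, uniform), lookahead_loss(P, exact))
```

A uniform prediction over two experts costs about `log 2` per position, while a prediction that matches the future routing exactly drives the loss to roughly zero.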
@@ -262,45 +262,42 @@ Honest note: upstream PyTorch does not ship a real OpenCL backend, and Vulkan su
 ## Objective
 
 $$
-\begin{aligned}
-\mathcal{L}_{\text{total}}
-&= \mathcal{L}_{\text{base}}
- + \lambda_{\text{bal}} \mathcal{L}_{\text{aux-raw}}
- + \lambda_{\text{tmp}} \mathcal{L}_{\text{temporal}} \\
-&\quad
- + \lambda_{\text{LA}} \mathcal{L}_{\text{lookahead}}
- + \lambda_{\text{anc}} \mathcal{L}_{\text{router-KL-anchor}} .
-\end{aligned}
+\mathcal{L}_{\mathrm{total}}
+= \mathcal{L}_{\mathrm{base}}
++ \lambda_{\mathrm{bal}} \mathcal{L}_{\mathrm{aux}}
++ \lambda_{\mathrm{tmp}} \mathcal{L}_{\mathrm{temporal}}
++ \lambda_{\mathrm{LA}} \mathcal{L}_{\mathrm{lookahead}}
++ \lambda_{\mathrm{anc}} \mathcal{L}_{\mathrm{routerKL}}
 $$
 
 $$
-\mathcal{L}_{\text{aux-raw}}
-= E \sum_{e=1}^{E} \operatorname{load}_e \cdot \overline{p}_e
+\mathcal{L}_{\mathrm{aux}}
+= E \sum_{e=1}^{E} \mathit{load}_e \cdot \overline{p}_e
 $$
 
 $$
-\mathcal{L}_{\text{temporal}}
+\mathcal{L}_{\mathrm{temporal}}
 = \mathbb{E}_{b,t}
 \left[
   \left\| P_{b,t,:} - P_{b,t-1,:} \right\|_2^2
 \right]
 $$
 
 $$
-\mathcal{L}_{\text{router-KL-anchor}}
+\mathcal{L}_{\mathrm{routerKL}}
 = D_{\mathrm{KL}}
 \left(
-  \pi_{\theta}^{\text{router}}
-  \,\middle\|\,
-  \pi_{\text{ref}}^{\text{router}}
+  \pi_{\theta}^{\mathrm{router}}
+  \|
+  \pi_{\mathrm{ref}}^{\mathrm{router}}
 \right)
 $$
 
-- $\mathcal{L}_{\text{base}}$: stage-specific objective (`CE`, `DPO`, `ORPO`, `GRPO`, or distillation).
-- $\mathcal{L}_{\text{aux-raw}}$: the unscaled MoE load-balance auxiliary term; Chronos applies $\lambda_{\text{bal}}$ once in `chronos_loss_term`.
-- $\mathcal{L}_{\text{temporal}}$: encourages adjacent tokens to reuse similar expert distributions.
-- $\mathcal{L}_{\text{lookahead}}$: soft-target cross entropy from the real future router distribution to the lookahead prediction.
-- $\mathcal{L}_{\text{router-KL-anchor}}$: keeps alignment-stage updates from destroying the routing layout captured at stage start.
+- $\mathcal{L}_{\mathrm{base}}$: stage-specific objective (`CE`, `DPO`, `ORPO`, `GRPO`, or distillation).
+- $\mathcal{L}_{\mathrm{aux}}$: the unscaled MoE load-balance auxiliary term; Chronos applies $\lambda_{\mathrm{bal}}$ once in `chronos_loss_term`.
+- $\mathcal{L}_{\mathrm{temporal}}$: encourages adjacent tokens to reuse similar expert distributions.
+- $\mathcal{L}_{\mathrm{lookahead}}$: soft-target cross entropy from the real future router distribution to the lookahead prediction. Here $\mathrm{sg}(\cdot)$ means stop-gradient.
+- $\mathcal{L}_{\mathrm{routerKL}}$: keeps alignment-stage updates from destroying the routing layout captured at stage start.
 
 All lambda terms are searchable with Optuna TPE, together with structural hyperparameters such as `hidden_size`, `num_experts`, and `kv_latent_dim`.
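
As a rough illustration of how the terms in this section combine, the sketch below evaluates the auxiliary, temporal, and router-KL terms on small nested lists and sums them with their weights. It is not the code behind `chronos_loss_term`: every function name, the averaging of the KL term over tokens, the skipping of `t = 0` in the temporal term, and the lambda values are assumptions chosen for the example.

```python
import math

def aux_loss(load, p_mean):
    """E * sum_e load_e * p_mean_e, the unscaled load-balance term."""
    E = len(load)
    return E * sum(l * p for l, p in zip(load, p_mean))

def temporal_loss(P):
    """Mean squared L2 distance between routing at t and t - 1 (t >= 1)."""
    dists = [
        sum((a - b) ** 2 for a, b in zip(seq[t], seq[t - 1]))
        for seq in P
        for t in range(1, len(seq))
    ]
    return sum(dists) / len(dists)

def router_kl(P_theta, P_ref):
    """KL(pi_theta || pi_ref), averaged over tokens (averaging is assumed)."""
    kls = []
    for seq_t, seq_r in zip(P_theta, P_ref):
        for p, q in zip(seq_t, seq_r):
            # Terms with p_e = 0 contribute nothing to the sum.
            kls.append(sum(pe * math.log(pe / qe) for pe, qe in zip(p, q) if pe > 0))
    return sum(kls) / len(kls)

def total_loss(base, terms, lambdas):
    """L_base plus each auxiliary term scaled once by its lambda."""
    return base + sum(lambdas[name] * value for name, value in terms.items())

P = [[[1.0, 0.0], [0.0, 1.0]]]           # [b][t][e], hypothetical routing
terms = {
    "bal": aux_loss([0.25] * 4, [0.25] * 4),
    "tmp": temporal_loss(P),
    "LA": 0.5,                           # placeholder lookahead value
    "anc": router_kl(P, P),
}
lambdas = {"bal": 0.1, "tmp": 0.01, "LA": 0.2, "anc": 1.0}  # illustrative only
print(total_loss(1.0, terms, lambdas))
```

With perfectly balanced experts the auxiliary term evaluates to 1, and anchoring a router to itself gives a KL of 0, so only the base, balance, temporal, and lookahead terms contribute in this toy case.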
 
diff --git a/README_zh.md b/README_zh.md
@@ -131,13 +131,13 @@ flowchart LR
 Before M2, LookaheadRouter had no supervision at all; it was just an untrained head. M2 introduces:
 
 $$
-\mathcal{L}_{\text{lookahead}}
-= \frac{1}{|\mathcal{K}_{\text{valid}}|}
-\sum_{k \in \mathcal{K}_{\text{valid}}}
+\mathcal{L}_{\mathrm{lookahead}}
+= \frac{1}{|\mathcal{K}_{\mathrm{valid}}|}
+\sum_{k \in \mathcal{K}_{\mathrm{valid}}}
 \mathbb{E}_{b,t}
 \left[
   - \sum_{e=1}^{E}
-  \operatorname{stopgrad}\!\left(P_{b,t+k,e}\right)
+  \mathrm{sg}\!\left(P_{b,t+k,e}\right)
   \log Q_{b,t,e}^{(k)}
 \right].
 $$
@@ -254,45 +254,42 @@ d.describe()       # human-readable capability overview
 ## Loss Function (Full Form)
 
 $$
-\begin{aligned}
-\mathcal{L}_{\text{total}}
-&= \mathcal{L}_{\text{base}}
- + \lambda_{\text{bal}} \mathcal{L}_{\text{aux-raw}}
- + \lambda_{\text{tmp}} \mathcal{L}_{\text{temporal}} \\
-&\quad
- + \lambda_{\text{LA}} \mathcal{L}_{\text{lookahead}}
- + \lambda_{\text{anc}} \mathcal{L}_{\text{router-KL-anchor}} .
-\end{aligned}
+\mathcal{L}_{\mathrm{total}}
+= \mathcal{L}_{\mathrm{base}}
++ \lambda_{\mathrm{bal}} \mathcal{L}_{\mathrm{aux}}
++ \lambda_{\mathrm{tmp}} \mathcal{L}_{\mathrm{temporal}}
++ \lambda_{\mathrm{LA}} \mathcal{L}_{\mathrm{lookahead}}
++ \lambda_{\mathrm{anc}} \mathcal{L}_{\mathrm{routerKL}}
 $$
 
 $$
-\mathcal{L}_{\text{aux-raw}}
-= E \sum_{e=1}^{E} \operatorname{load}_e \cdot \overline{p}_e
+\mathcal{L}_{\mathrm{aux}}
+= E \sum_{e=1}^{E} \mathit{load}_e \cdot \overline{p}_e
 $$
 
 $$
-\mathcal{L}_{\text{temporal}}
+\mathcal{L}_{\mathrm{temporal}}
 = \mathbb{E}_{b,t}
 \left[
   \left\| P_{b,t,:} - P_{b,t-1,:} \right\|_2^2
 \right]
 $$
 
 $$
-\mathcal{L}_{\text{router-KL-anchor}}
+\mathcal{L}_{\mathrm{routerKL}}
 = D_{\mathrm{KL}}
 \left(
-  \pi_{\theta}^{\text{router}}
-  \,\middle\|\,
-  \pi_{\text{ref}}^{\text{router}}
+  \pi_{\theta}^{\mathrm{router}}
+  \|
+  \pi_{\mathrm{ref}}^{\mathrm{router}}
 \right)
 $$
 
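-- $\mathcal{L}_{\text{base}}$: stage-specific objective (CE / DPO / ORPO / GRPO / KD).
-- $\mathcal{L}_{\text{aux-raw}}$: the unscaled MoE load-balance auxiliary term; Chronos applies $\lambda_{\text{bal}}$ exactly once, in `chronos_loss_term`.
-- $\mathcal{L}_{\text{temporal}}$: keeps the routing distributions of adjacent tokens from jumping sharply, which improves expert reuse and cache locality.
-- $\mathcal{L}_{\text{lookahead}}$: soft-target cross entropy from the real future routing distribution to the lookahead prediction.
-- $\mathcal{L}_{\text{router-KL-anchor}}$: during alignment stages, anchors the router to the reference routing distribution captured at stage start, so that RL/DPO/ORPO/GRPO gradients do not destroy the clustered layout.
+- $\mathcal{L}_{\mathrm{base}}$: stage-specific objective (CE / DPO / ORPO / GRPO / KD).
+- $\mathcal{L}_{\mathrm{aux}}$: the unscaled MoE load-balance auxiliary term; Chronos applies $\lambda_{\mathrm{bal}}$ exactly once, in `chronos_loss_term`.
+- $\mathcal{L}_{\mathrm{temporal}}$: keeps the routing distributions of adjacent tokens from jumping sharply, which improves expert reuse and cache locality.
+- $\mathcal{L}_{\mathrm{lookahead}}$: soft-target cross entropy from the real future routing distribution to the lookahead prediction. Here $\mathrm{sg}(\cdot)$ means stop-gradient.
+- $\mathcal{L}_{\mathrm{routerKL}}$: during alignment stages, anchors the router to the reference routing distribution captured at stage start, so that RL/DPO/ORPO/GRPO gradients do not destroy the clustered layout.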
 
 All `λ` weights can be searched automatically with Optuna TPE, as can structural hyperparameters such as `hidden_size` / `num_experts` / `kv_latent_dim`.