- = E \sum_{e=1}^{E} \operatorname{load}_e \cdot \overline{p}_e
+ \mathcal{L}_{\mathrm{aux}}
+ = E \sum_{e=1}^{E} \mathit{load}_e \cdot \overline{p}_e
  $$

  $$
- \mathcal{L}_{\text{temporal}}
+ \mathcal{L}_{\mathrm{temporal}}
  = \mathbb{E}_{b,t}
  \left[
  \left\| P_{b,t,:} - P_{b,t-1,:} \right\|_2^2
  \right]
  $$

  $$
- \mathcal{L}_{\text{router-KL-anchor}}
+ \mathcal{L}_{\mathrm{routerKL}}
  = D_{\mathrm{KL}}
  \left(
- \pi_{\theta}^{\text{router}}
- \,\middle\|\,
- \pi_{\text{ref}}^{\text{router}}
+ \pi_{\theta}^{\mathrm{router}}
+ \|
+ \pi_{\mathrm{ref}}^{\mathrm{router}}
  \right)
  $$

- - $\mathcal{L}_{\text{base}}$: stage-specific objective (`CE`, `DPO`, `ORPO`, `GRPO`, or distillation).
- - $\mathcal{L}_{\text{aux-raw}}$: the unscaled MoE load-balance auxiliary term; Chronos applies $\lambda_{\text{bal}}$ once in `chronos_loss_term`.
- - $\mathcal{L}_{\text{temporal}}$: encourages adjacent tokens to reuse similar expert distributions.
- - $\mathcal{L}_{\text{lookahead}}$: soft-target cross entropy from the real future router distribution to the lookahead prediction.
- - $\mathcal{L}_{\text{router-KL-anchor}}$: keeps alignment-stage updates from destroying the routing layout captured at stage start.
+ - $\mathcal{L}_{\mathrm{base}}$: stage-specific objective (`CE`, `DPO`, `ORPO`, `GRPO`, or distillation).
+ - $\mathcal{L}_{\mathrm{aux}}$: the unscaled MoE load-balance auxiliary term; Chronos applies $\lambda_{\mathrm{bal}}$ once in `chronos_loss_term`.
+ - $\mathcal{L}_{\mathrm{temporal}}$: encourages adjacent tokens to reuse similar expert distributions.
+ - $\mathcal{L}_{\mathrm{lookahead}}$: soft-target cross entropy from the real future router distribution to the lookahead prediction. Here $\mathrm{sg}(\cdot)$ means stop-gradient.
+ - $\mathcal{L}_{\mathrm{routerKL}}$: keeps alignment-stage updates from destroying the routing layout captured at stage start.

  All lambda terms are searchable with Optuna TPE, together with structural hyperparameters such as `hidden_size`, `num_experts`, and `kv_latent_dim`.
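
For readers checking the three equations in this hunk, a minimal NumPy sketch of the unscaled terms might look like the following. It is not the Chronos implementation. The function names are hypothetical, `P` is assumed to be an array of router probabilities with shape `(B, T, E)`, `load_e` is assumed to be the fraction of top-k routing slots assigned to expert `e`, and `\overline{p}_e` is assumed to be the mean router probability of expert `e`. The lambda weights and `chronos_loss_term` are deliberately left out.

```python
import numpy as np


def aux_load_balance(P, top_k=1):
    """Unscaled load-balance term: E * sum_e load_e * mean_p_e.

    P has shape (B, T, E). Assumption: load_e is the fraction of
    top-k routing slots that land on expert e.
    """
    num_experts = P.shape[-1]
    flat = P.reshape(-1, num_experts)
    # Indices of the top_k highest-probability experts per token.
    chosen = np.argsort(-flat, axis=-1)[:, :top_k]
    counts = np.bincount(chosen.ravel(), minlength=num_experts)
    load = counts / counts.sum()
    # Assumption: mean_p is the average router probability per expert.
    mean_p = flat.mean(axis=0)
    return float(num_experts * np.sum(load * mean_p))


def temporal_loss(P):
    """Mean squared L2 distance between adjacent-token distributions."""
    diff = P[:, 1:, :] - P[:, :-1, :]
    return float(np.mean(np.sum(diff ** 2, axis=-1)))


def router_kl(P_theta, P_ref, eps=1e-12):
    """Mean per-token KL(P_theta || P_ref); eps guards against log(0)."""
    log_ratio = np.log(P_theta + eps) - np.log(P_ref + eps)
    return float(np.mean(np.sum(P_theta * log_ratio, axis=-1)))
```

Under these assumptions, a perfectly uniform router gives `aux_load_balance` a value of 1, a router that never changes its distribution between adjacent tokens gives `temporal_loss` a value of 0, and `router_kl` is 0 when the current and reference routers agree.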