Add GLM 4.7 SPD sidecar acceleration #866

Draft
i386 wants to merge 107 commits into main from jd/jianyang-spd-on-mtp

i386 (Collaborator) commented Jun 17, 2026

Goal

Add a GLM 4.7 SPD sidecar path and publish the first trained sidecar artifact so we can compare vanilla GLM decode against verified GLM+SPD-sidecar decode on SPEED-Bench prompts.

This PR is the canonical feature branch for GLM 4.7 SPD sidecar acceleration. It supersedes the compact donor work in #860 after review. #859 remains useful proof archaeology, but this PR deliberately does not transplant its broad live-serving/protocol changes.

Published Artifact

Model repo: https://huggingface.co/meshllm/glm-4.7-flash-spd-sidecar

Hub revision: 9aad350802f697c42d5001d1e05e6c7cc1c530e9

The repo contains the exported sidecar plus reproduction artifacts:

  • train/spd-head.safetensors
  • train/speculation_head_final.pt
  • train/skippy-spd-head.json
  • data/draft_vocab_top_16000.json
  • eval/summary/pipeline_eval__train__speculation_head_final__nt12__summary.json
  • repro/train.sh
  • repro/export.sh
  • repro/manifest-validation.log

Exported safetensors:

| field | value |
| --- | --- |
| format | safetensors-spd-head-v1 |
| dtype | F16 |
| tensors | 19 |
| size | 321,399,280 bytes |
| sha256 | d9f9d47728d4e3093b272feeb739532452c6780e0fd45a60cc7d62e853c1cdd2 |
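
To check a downloaded copy against these values, something like the following could be used. It is a minimal sketch: the function names are hypothetical, and the header parsing relies only on the public safetensors layout (an 8-byte little-endian header length followed by a JSON header), not on any Skippy code.

```python
import hashlib
import json
import struct
from pathlib import Path

# Expected values copied from this PR description.
EXPECTED_SIZE = 321_399_280
EXPECTED_SHA256 = "d9f9d47728d4e3093b272feeb739532452c6780e0fd45a60cc7d62e853c1cdd2"
EXPECTED_TENSORS = 19
EXPECTED_DTYPE = "F16"


def describe_safetensors(path):
    """Return size, sha256, tensor count, and dtype set for a safetensors file."""
    path = Path(path)
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # Layout: u64 little-endian header length, then that many bytes of JSON.
        (header_len,) = struct.unpack("<Q", fh.read(8))
        header = json.loads(fh.read(header_len))
        # Rewind and hash the whole file in 1 MiB chunks.
        fh.seek(0)
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    # "__metadata__" is an optional free-form entry, not a tensor.
    tensors = {k: v for k, v in header.items() if k != "__metadata__"}
    return {
        "size": path.stat().st_size,
        "sha256": digest.hexdigest(),
        "tensors": len(tensors),
        "dtypes": sorted({v["dtype"] for v in tensors.values()}),
    }


def matches_published(info):
    """Compare a describe_safetensors() result with the published values."""
    return (
        info["size"] == EXPECTED_SIZE
        and info["sha256"] == EXPECTED_SHA256
        and info["tensors"] == EXPECTED_TENSORS
        and info["dtypes"] == [EXPECTED_DTYPE]
    )
```

For a local copy of the Hub repo, `matches_published(describe_safetensors("train/spd-head.safetensors"))` should return `True` if the file is intact; the relative path assumes the working directory is the repo root.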

Training Result

Overnight run on micstudio:

uv run --script evals/spd/hf_train_eval_qwen06.py \
  --work-dir /tmp/skippy-spd-glm47-overnight8k \
  --model-name /Volumes/models/huggingface/hub/models--zai-org--GLM-4.7-Flash/snapshots/7dd20894a642a0aa287e9827cb1a1f7f91386b67 \
  --dataset HuggingFaceH4/ultrachat_200k \
  --dataset-split train_sft \
  --train-rows 8192 \
  --eval-rows-per-set 4 \
  --num-stages 3 \
  --stage-layer-boundaries 15,31,47 \
  --num-spec-layers 1 \
  --epochs 1 \
  --max-length 512 \
  --max-new-tokens 64 \
  --batch-size 1 \
  --gradient-accumulation-steps 8 \
  --learning-rate 2e-5 \
  --warmup-steps 50 \
  --save-steps 128 \
  --log-interval 20 \
  --build-draft-vocab-size 16000 \
  --draft-vocab-json '' \
  --draft-top-k 1 \
  --attn-implementation sdpa \
  --device mps \
  --upload-repo none

The run loaded 8192 UltraChat rows, filtered to 7377 usable shifted-label rows, and completed 923 optimizer steps.
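
The 923 figure is consistent with the flags above under one assumption: with `--batch-size 1` and `--gradient-accumulation-steps 8`, the trainer takes one optimizer step per 8 rows and also steps on the final partial group. A quick check of that arithmetic (the rounding-up behaviour is an assumption about the trainer, not something the run log confirms):

```python
import math

usable_rows = 7377   # rows left after filtering (from the run above)
batch_size = 1       # --batch-size
grad_accum = 8       # --gradient-accumulation-steps
epochs = 1           # --epochs

micro_batches = math.ceil(usable_rows / batch_size) * epochs
# Assumption: a trailing partial accumulation group still triggers a step.
optimizer_steps = math.ceil(micro_batches / grad_accum)
print(optimizer_steps)  # 923
```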

Verified donor eval over 12 mini prompts:

| metric | value |
| --- | --- |
| generated tokens | 768 |
| decode loop steps | 2076 |
| accepted draft flags | 109 / 768 |
| acceptance rate | 0.3699 |
| equivalent accept length | 1.1098 |
| theoretical throughput gain | +11.09% |
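
The metrics above do not come with a definition of equivalent accept length. One reading that reproduces the reported 1.1098 is generated tokens divided by full passes through the pipeline, where a full pass costs `--num-stages` (3) decode loop steps. The sketch below checks that arithmetic; the formula is an inference from the numbers, not taken from the evaluator source.

```python
generated_tokens = 768
decode_loop_steps = 2076
num_stages = 3  # --num-stages from the training command

# Assumption: one full pipeline pass costs `num_stages` decode loop steps.
full_passes = decode_loop_steps / num_stages
equiv_accept_length = generated_tokens / full_passes
print(round(equiv_accept_length, 4))  # 1.1098
```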

Summary artifact:

/tmp/skippy-spd-glm47-overnight8k/artifacts/20260618-192253/eval/summary/pipeline_eval__train__speculation_head_final__nt12__summary.json

SPEED-Bench Comparison

This is a bounded SPEED-Bench subset using one prompt from each of the 11 SPEED-Bench categories, max_new_tokens=32, greedy decode, draft_top_k=1, and target verification enabled for the SPD row.
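
The subset selection described above can be sketched as follows. The record layout (a `category` key and a `prompt` key) and both function names are hypothetical; the actual SPEED-Bench data files may be structured differently.

```python
def one_prompt_per_category(records):
    """Keep the first record seen for each category, preserving order."""
    picked = {}
    for rec in records:
        picked.setdefault(rec["category"], rec)
    return list(picked.values())


def token_budget(num_prompts, max_new_tokens):
    """Upper bound on generated tokens when every prompt runs to the cap."""
    return num_prompts * max_new_tokens


# 11 categories at max_new_tokens=32 matches the 352 tokens in the results.
print(token_budget(11, 32))  # 352
```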

Important caveat: these numbers come from the Python donor's verified evaluator run on SPEED-Bench prompts, not yet from the production Rust/Skippy OpenAI server executing the safetensors sidecar live. The table separates verified acceptance from wall-clock speed because the Python donor pipeline still has sequential overhead.

Command:

PYTHONPATH=/private/tmp/skippy-spd-glm47-overnight8k/speculative_pipeline_decoding \
/Users/micn/.cache/uv/environments-v2/hf-train-eval-qwen06-51e39d356c3e90ad/bin/python3 \
  /tmp/skippy-spd-glm47-speedbench-subset/eval_speed_bench.py \
  --spec_head_ckpt /tmp/skippy-spd-glm47-overnight8k/artifacts/20260618-192253/train/speculation_head_final.pt \
  --base_model_path /Volumes/models/huggingface/hub/models--zai-org--GLM-4.7-Flash/snapshots/7dd20894a642a0aa287e9827cb1a1f7f91386b67 \
  --data_dir /tmp/skippy-spd-glm47-speedbench-subset/data \
  --output_dir /tmp/skippy-spd-glm47-speedbench-subset/eval \
  --gpus 0 \
  --max_new_tokens 32 \
  --temperature 0.0 \
  --draft_top_k 1 \
  --baseline \
  --baseline_cache_dir /tmp/skippy-spd-glm47-speedbench-subset/baseline

Result summary:

| row | prompts | tokens | decode tok/s | speedup vs vanilla | accepted flags | acceptance | equiv. accept length | theoretical gain |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Vanilla GLM 4.7 Flash HF generate | 11 | 352 | 6.99 | 1.000x | n/a | n/a | n/a | n/a |
| GLM 4.7 Flash + verified SPD sidecar | 11 | 352 | 3.29 wall / 6.53 ideal | 0.471x wall / 0.935x ideal | 59 / 352 | 0.3789 | 1.1367 | +14.10% |
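
The "speedup vs vanilla" figures are the ratio of each SPD throughput to the vanilla throughput. Recomputing from the rounded tok/s values gives 0.471x for wall clock and 0.934x for the ideal figure; the reported 0.935x presumably comes from unrounded throughputs.

```python
vanilla_tps = 6.99
spd_wall_tps = 3.29
spd_ideal_tps = 6.53

wall_speedup = spd_wall_tps / vanilla_tps
ideal_speedup = spd_ideal_tps / vanilla_tps
print(f"{wall_speedup:.3f}x wall, {ideal_speedup:.3f}x ideal")
# 0.471x wall, 0.934x ideal
```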

Summary artifact:

/tmp/skippy-spd-glm47-speedbench-subset/eval/summary/pipeline_eval__train__speculation_head_final__nt11__summary.json

Conclusion: the sidecar now has meaningful verified acceptance (EAL=1.1367 on SPEED-Bench subset; EAL=1.1098 on the 12-prompt mini eval), but live wall-clock acceleration still needs Rust/Skippy sidecar execution rather than the donor Python pipeline.

Validation

Python compile:

python3 -m py_compile \
  evals/spd/glm47_frontload.py \
  evals/spd/hf_train_eval_qwen06.py \
  evals/spd/export_spd_head.py \
  evals/spd/simulate_latency.py

Skippy SPD manifest validation:

SKIPPY_SPD_MANIFEST=/tmp/glm-4.7-flash-spd-sidecar-hub/train/skippy-spd-head.json \
  cargo test -p skippy-runtime --features dynamic-native-runtime \
  validates_external_manifest_when_skippy_spd_manifest_is_set

Result:

running 1 test
test spd::tests::validates_external_manifest_when_skippy_spd_manifest_is_set ... ok
test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 64 filtered out; finished in 0.70s

Earlier local validation on this branch:

cargo test -p skippy-runtime --lib spd
cargo fmt --all -- --check
cargo check -p skippy-runtime
cargo clippy -p skippy-runtime --all-targets -- -D warnings
git diff --check

cargo test -p skippy-runtime --lib spd: 11 passed, 0 failed.

What changed

  • Adds SPD_SKIPPY_PROJECT.md rewritten around the GLM 4.7 SPD sidecar hypothesis.
  • Adds docs/design/GLM47_SPD_EXECUTION_PLAN.md with the GLM sidecar training/export/eval plan.
  • Adds evals/spd/ scripts for GLM checkpoint inspection, reference SPD training/eval, safetensors export, and latency simulation.
  • Adds skippy-runtime::spd manifest/checkpoint/safetensors validation.
  • Fixes the reduced-vocab zero-loss pathology in the donor trainer so GLM SPD training uses usable assistant-labeled shifted targets.
  • Exposes trainer schedule controls needed for short smoke runs and overnight checkpointed training.
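
The zero-loss fix itself is not shown in this description. As an illustration of the kind of failure involved, and assuming the trainer masks targets that fall outside the reduced draft vocabulary with an ignore index, a row whose shifted assistant labels are all masked contributes no loss. The sketch below filters such rows; the function names, the -100 ignore index, and the masking scheme are assumptions, not the donor trainer's code.

```python
IGNORE_INDEX = -100  # conventional cross-entropy ignore index; an assumption here


def shifted_draft_labels(token_ids, assistant_mask, draft_vocab):
    """Next-token labels remapped into the draft vocabulary.

    Position i is labelled with token i+1 when that token belongs to the
    assistant turn and is in the draft vocabulary; otherwise it is ignored.
    """
    labels = []
    for nxt, is_assistant in zip(token_ids[1:], assistant_mask[1:]):
        if is_assistant and nxt in draft_vocab:
            labels.append(draft_vocab[nxt])
        else:
            labels.append(IGNORE_INDEX)
    return labels


def is_usable(labels):
    """A row is usable if at least one label survives masking."""
    return any(label != IGNORE_INDEX for label in labels)


draft_vocab = {10: 0, 11: 1}  # full-vocab id -> draft-vocab index
row_ok = shifted_draft_labels([5, 10, 11], [False, True, True], draft_vocab)
row_bad = shifted_draft_labels([5, 99, 98], [False, True, True], draft_vocab)
print(row_ok, is_usable(row_ok))    # [0, 1] True
print(row_bad, is_usable(row_bad))  # [-100, -100] False
```

Under this reading, dropping rows where `is_usable` is false is what would turn 8192 loaded rows into 7377 usable ones.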

Deliberately not included

  • The broad live-serving/protocol changes from #859 (see Goal).

Next Work

Wire the exported safetensors sidecar into live Skippy/Rust serving, then rerun the same SPEED-Bench comparison through the production OpenAI-compatible path instead of the Python donor evaluator.

coderabbitai Bot commented Jun 17, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 1cae6749-7220-4e4e-b6b1-2aa782060299

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Base automatically changed from feat/jianyang-glm-llama-patches to main June 18, 2026 05:40
i386 changed the title from "Add GLM 4.7 SPD-on-MTP experiment path" to "Add GLM 4.7 SPD sidecar acceleration" on Jun 18, 2026