Add GLM 4.7 SPD sidecar acceleration#866
Draft
i386 wants to merge 107 commits into
Draft
Conversation
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Plus Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
17 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Goal
Add a GLM 4.7 SPD sidecar path and publish the first trained sidecar artifact so we can compare vanilla GLM decode against verified GLM+SPD-sidecar decode on SPEED-Bench prompts.
This PR is the canonical feature branch for GLM 4.7 SPD sidecar acceleration. It supersedes the compact donor work in #860 after review. #859 remains useful proof archaeology, but this PR deliberately does not transplant its broad live-serving/protocol changes.
Published Artifact
Model repo: https://huggingface.co/meshllm/glm-4.7-flash-spd-sidecar
Hub revision:
9aad350802f697c42d5001d1e05e6c7cc1c530e9The repo contains the exported sidecar plus reproduction artifacts:
train/spd-head.safetensorstrain/speculation_head_final.pttrain/skippy-spd-head.jsondata/draft_vocab_top_16000.jsoneval/summary/pipeline_eval__train__speculation_head_final__nt12__summary.jsonrepro/train.shrepro/export.shrepro/manifest-validation.logExported safetensors:
safetensors-spd-head-v1F16d9f9d47728d4e3093b272feeb739532452c6780e0fd45a60cc7d62e853c1cdd2Training Result
Overnight run on
micstudio:uv run --script evals/spd/hf_train_eval_qwen06.py \ --work-dir /tmp/skippy-spd-glm47-overnight8k \ --model-name /Volumes/models/huggingface/hub/models--zai-org--GLM-4.7-Flash/snapshots/7dd20894a642a0aa287e9827cb1a1f7f91386b67 \ --dataset HuggingFaceH4/ultrachat_200k \ --dataset-split train_sft \ --train-rows 8192 \ --eval-rows-per-set 4 \ --num-stages 3 \ --stage-layer-boundaries 15,31,47 \ --num-spec-layers 1 \ --epochs 1 \ --max-length 512 \ --max-new-tokens 64 \ --batch-size 1 \ --gradient-accumulation-steps 8 \ --learning-rate 2e-5 \ --warmup-steps 50 \ --save-steps 128 \ --log-interval 20 \ --build-draft-vocab-size 16000 \ --draft-vocab-json '' \ --draft-top-k 1 \ --attn-implementation sdpa \ --device mps \ --upload-repo noneThe run loaded 8192 UltraChat rows, filtered to 7377 usable shifted-label rows, and completed 923 optimizer steps.
Verified donor eval over 12 mini prompts:
Summary artifact:
/tmp/skippy-spd-glm47-overnight8k/artifacts/20260618-192253/eval/summary/pipeline_eval__train__speculation_head_final__nt12__summary.jsonSPEED-Bench Comparison
This is a bounded SPEED-Bench subset using one prompt from each of the 11 SPEED-Bench categories,
max_new_tokens=32, greedy decode,draft_top_k=1, and target verification enabled for the SPD row.Important caveat: this is the Python donor verified evaluator on SPEED-Bench prompts. It is not yet the production Rust/Skippy OpenAI server executing the safetensors sidecar live. The table separates verified acceptance from wall-clock speed, because the Python donor pipeline still has sequential overhead.
Command:
Result summary:
Summary artifact:
/tmp/skippy-spd-glm47-speedbench-subset/eval/summary/pipeline_eval__train__speculation_head_final__nt11__summary.jsonConclusion: the sidecar now has meaningful verified acceptance (
EAL=1.1367on SPEED-Bench subset;EAL=1.1098on the 12-prompt mini eval), but live wall-clock acceleration still needs Rust/Skippy sidecar execution rather than the donor Python pipeline.Validation
Python compile:
Skippy SPD manifest validation:
SKIPPY_SPD_MANIFEST=/tmp/glm-4.7-flash-spd-sidecar-hub/train/skippy-spd-head.json \ cargo test -p skippy-runtime --features dynamic-native-runtime \ validates_external_manifest_when_skippy_spd_manifest_is_setResult:
Earlier local validation on this branch:
cargo test -p skippy-runtime --lib spd cargo fmt --all -- --check cargo check -p skippy-runtime cargo clippy -p skippy-runtime --all-targets -- -D warnings git diff --checkcargo test -p skippy-runtime --lib spd: 11 passed, 0 failed.What changed
SPD_SKIPPY_PROJECT.mdrewritten around the GLM 4.7 SPD sidecar hypothesis.docs/design/GLM47_SPD_EXECUTION_PLAN.mdwith the GLM sidecar training/export/eval plan.evals/spd/scripts for GLM checkpoint inspection, reference SPD training/eval, safetensors export, and latency simulation.skippy-runtime::spdmanifest/checkpoint/safetensors validation.Deliberately not included
codex/skippy-spd-proof.Next Work
Wire the exported safetensors sidecar into live Skippy/Rust serving, then rerun the same SPEED-Bench comparison through the production OpenAI-compatible path instead of the Python donor evaluator.