feat(speculative): DraftModel trait + spec-decode primitives#144
feat(speculative): DraftModel trait + spec-decode primitives#144dusterbloom wants to merge 5 commits into
Conversation
Audit finding (2026-05-04): of the 3 commits originally planned for PR-3 (paged prefix cache, chunked prefill, trim_by), 2 are already on origin/main in superset form via panbanda's PR panbanda#74 squash and the 1514737 TurboQuant landing: - paged_prefix_cache.rs: main has 1072 lines (1072 vs our 855) with TqBlock + slice_axis1 + conv_pos additions on top of our work. - chunked_prefill: main wires it into a 2453-line simple.rs (vs our 1508), with compute_prefill_chunk_size + forward_chunked + the early validation our version was missing. - SteppingKeyValueCache::trim_by(usize): main has the per-layer trim with saturating_sub overflow guard (lib.rs:457). The genuinely-new piece is the AnyCache-level dispatcher: a single helper that walks every layer in either KV or Hybrid variant and trims the underlying SteppingKeyValueCache, while intentionally leaving LayerCache::Arrays (recurrent SSM state) untouched. Required by upcoming spec-decode verify-rollback paths in higgs-engine that operate on AnyCache rather than reaching into per-layer types directly. Tests: - any_cache_trim_by_kv_dispatches_to_each_layer: KV variant with mixed Some/None layers, no panic, all caches saturate to 0. - any_cache_trim_by_hybrid_skips_arrays_layers: Hybrid with KV+Arrays mix, asserts Arrays.offset stays unchanged (recurrent state cannot trim by offset alone). Suite: 332 passed, 0 failed, 24 ignored. Clippy + rustfmt clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ports pure-logic speculative-decoding primitives from feat/ane-spec-decode (efa05de + prefill method from 293793a) into higgs-engine: - accept_prefix() — prefix-match acceptance logic - speculative_step() — one draft→verify→accept cycle - speculative_loop() — run until EOS/max_tokens - DraftModel trait: prefill, draft, advance, rollback (all Send) MlxDraftModel CoreML impl intentionally not ported; magic-canvas has native AneBonsaiEngine / AneArDecodeEngine / finalize_ane_gdn_inline paths instead. Bonsai shim lands separately. 19 unit tests (accept_prefix 7, MockDraft 3, step 5, loop 4). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Ports config plumbing from feat/ane-spec-decode 8e3a5b8: - ServeArgs gains --draft-model (Option<String>) and --num-draft (usize, default 8) CLI flags. - ModelConfig gains draft_model + num_draft fields, threaded through build_simple_config, load_config_file, and ensure_auto_router_model. - Test fixtures updated across config.rs, doctor.rs, tui/mod.rs. No load-path wiring yet — that arrives with the simple.rs engine port. MlxDraftModel from the source commit deliberately omitted; magic-canvas uses AneBonsaiDraftModel and native paths instead. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
Warning Rate limit exceeded
To continue reviewing without waiting, purchase usage credits in the billing tab. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (9)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@dusterbloom would you mind checking the prs for bloat? This committed your deps and stuff |
Ports doctor validation from feat/ane-spec-decode 2a7480d: - check_draft_models() — resolves ModelConfig.draft_model path (FAIL if not found) and warns when a draft model is paired with batch=true (speculative decoding is SimpleEngine-only). - Documents draft_model / num_draft in the daemon init template and README.md config reference. 3 new doctor tests covering: unresolvable draft path, batch=true warn, no-draft-model short-circuit. Doctor suite now 29 tests (was 26). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Cherry-picked PR-4 commits had a file-level `#![allow(clippy::unwrap_used,
clippy::indexing_slicing, clippy::panic)]` on speculative.rs as a quick
fix to satisfy strict workspace deny lints. That's a workaround, not how
senior Rust code is written.
Refactor:
- `accept_prefix`: replace `for i in 0..k { target_ids[i] }` indexing with
`target_ids.iter().zip(draft_ids.iter())`. Lock-step iteration, types
enforce bounds, no library indexing. The K+1th token (the bonus when
every draft matched) is appended via `target_ids.last()` after the loop.
- File-level allow: deleted. Library code is clippy-clean.
- Test module allow: scoped tightly to `#[cfg(test)] mod tests` only —
`#[allow(clippy::panic, clippy::unwrap_used, clippy::indexing_slicing,
clippy::unreachable)]`. Tests legitimately use these (failure → test
fail, which is the intent).
doctor.rs new tests: introduce `test_model_config(path)` helper instead
of 13-field literals duplicated 3×. DRY test setup with struct-update
syntax overrides for the field each test cares about.
Verify: `cargo clippy --all-targets --all-features` clean,
`cargo test -p higgs-engine --lib speculative -- --test-threads=1` 19/19,
`cargo test -p higgs --lib doctor -- --test-threads=1` 39/39.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
48d31f9 to
96d0059
Compare
|
@panbanda cleaned up! |
Summary
Adds the
DraftModeltrait, spec-decode primitives (speculative_step,speculative_loop,accept_prefix), config plumbing, and doctor validation. Pure infrastructure — no concrete drafter is wired in this PR; a follow-up adds the Bonsai-Q1 binding once PR #142 lands.What lands
crates/higgs-engine/src/speculative.rs(NEW, ~360 lines):trait DraftModel: Send,accept_prefix,speculative_step,speculative_loop, 19 testsModelConfig.draft_model: Option<String>+num_draft: usizefields--draft-model/--num-draftCLI flagsdoctor::check_draft_models— path resolution +batch=truewarn (3 tests)higgs inittemplate documents the new flagsSenior-Rust hygiene
#![allow(...)].accept_prefixusesiter().zip()lockstep — types enforce bounds, no[i]indexing#[cfg(test)] mod testsonlyStacked on PR #143
This PR depends on
AnyCache::trim_byfrom #143 (panbanda PR-3). Merging #143 first reduces this diff. Validated on the dusterbloom fork: dusterbloom#15Test plan
cargo clippy --all-targets --all-features— cleancargo fmt --all -- --check— cleancargo test -p higgs-engine --lib speculative -- --test-threads=1— 19 passedcargo test -p higgs --lib doctor -- --test-threads=1— 39 passedNotes
cargo run -p higgs serve --draft-model <path>will fail at engine load until the follow-up wires a concrete drafter — the trait + plumbing are useful for downstream consumers without that.