feat(cpp-server): thinking-budget v2 + multi-dialect reasoning aliases + spec #269
Merged
63bd7a8 to
9b8fbb8
Compare
easel
added a commit
to easel/lucebox-hub
that referenced
this pull request
May 24, 2026
`maybe_force_close`'s `remaining = n_gen - committed_now` mixed two coordinate systems: `committed_now` is the absolute KV position (= prompt_len + tokens generated this AR call), while `n_gen` is gen-only. The subtraction treated prompt-length tokens as if they were generated output, firing `--hard-limit-reply-budget` `prompt_len` tokens early on every prompted request. For the spec-decode → AR tail-off path, `n_gen` is remapped to remaining-budget, so the subtraction could go negative and force-close immediately as AR took over — defeating the hard-limit reply budget. Fix: capture `committed_at_entry = committed` at entry, compute `generated = committed_now - committed_at_entry`, then `remaining = n_gen - generated`. Both call sites (post-prefill AR fallback at line 919 and spec-decode tail-off at line 985) now have correct budget arithmetic in the "generated since entry" frame. Caught by codex review on PR Luce-Org#269.
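The coordinate fix described above can be sketched as follows. This is a minimal illustration, not the code in `qwen35_backend.cpp`: `BudgetFrame` and the helper names are hypothetical, and the `<=` comparison is taken from the PR description's force-close condition.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the per-call budget state; the real BudgetHook
// carries more fields than this sketch.
struct BudgetFrame {
    int64_t committed_at_entry;       // absolute KV position when the AR call began
    int64_t n_gen;                    // generation-only budget for this call
    int64_t hard_limit_reply_budget;  // tokens reserved for the visible reply
};

// Tokens generated since entry, expressed in the generation-only frame.
inline int64_t generated_since_entry(const BudgetFrame& f, int64_t committed_now) {
    assert(committed_now >= f.committed_at_entry);
    return committed_now - f.committed_at_entry;
}

// Corrected arithmetic: both operands are generation-only counts.
inline int64_t remaining_budget(const BudgetFrame& f, int64_t committed_now) {
    return f.n_gen - generated_since_entry(f, committed_now);
}

// The pre-fix arithmetic, kept only for comparison: it subtracts an absolute
// KV position from a generation-only count.
inline int64_t remaining_budget_buggy(const BudgetFrame& f, int64_t committed_now) {
    return f.n_gen - committed_now;
}

inline bool should_force_close(const BudgetFrame& f, int64_t committed_now) {
    return remaining_budget(f, committed_now) <= f.hard_limit_reply_budget;
}
```

With a 500-token prompt, `n_gen = 1000`, and a 256-token reply budget, the corrected form closes once 744 tokens have been generated, while the buggy form reaches the same threshold after only 244, which is exactly `prompt_len` tokens early.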
easel
added a commit
to easel/lucebox-hub
that referenced
this pull request
May 24, 2026
…tracked split `finish_details.thinking_tokens` and `usage.completion_tokens_details.reasoning_tokens` used `phase1_tokens.size()` directly. When the model naturally emits `</think>` mid-stream and continues writing visible content, all of those post-close tokens were still being counted as thinking, and `content_tokens` was reporting only the phase-2 output — so the split mis-matched the `reasoning_content` / `content` strings the same response carried. Fix: expose `SseEmitter::mode()` and have the non-streaming `feed_tokens` lambda increment per-mode counters based on the emitter's mode BEFORE each `emit_token` call. Tokens that trigger the `</think>` transition are attributed to REASONING (they carry the close tag); the next token is the first CONTENT. Use those counters for both `finish_details.thinking_tokens`/`content_tokens` and `usage.completion_tokens_details.reasoning_tokens`. Caught by codex review on PR Luce-Org#269.
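A minimal sketch of the per-mode counting described above, assuming a toy emitter that flips from REASONING to CONTENT when it sees a piece containing `</think>`. `MiniEmitter`, `TokenSplit`, and `count_by_mode` are hypothetical names; only `mode()` and `emit_token` mirror identifiers mentioned in the commit message.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

enum class Mode { Reasoning, Content };

// Toy emitter (assumption): the real SseEmitter does far more than this.
class MiniEmitter {
public:
    explicit MiniEmitter(Mode start) : mode_(start) {}
    Mode mode() const { return mode_; }
    void emit_token(const std::string& piece) {
        // The close tag ends the reasoning phase; later pieces are content.
        if (mode_ == Mode::Reasoning && piece.find("</think>") != std::string::npos) {
            mode_ = Mode::Content;
        }
    }
private:
    Mode mode_;
};

struct TokenSplit {
    std::size_t thinking = 0;
    std::size_t content = 0;
};

// Sample the mode BEFORE emitting, so the token that carries the close tag
// is attributed to reasoning and the next token is the first content token.
inline TokenSplit count_by_mode(MiniEmitter& em, const std::vector<std::string>& pieces) {
    TokenSplit split;
    for (const std::string& piece : pieces) {
        if (em.mode() == Mode::Reasoning) {
            ++split.thinking;
        } else {
            ++split.content;
        }
        em.emit_token(piece);
    }
    return split;
}
```

Sampling the mode before the emit call is what keeps the counts aligned with the `reasoning_content` / `content` strings: the close tag itself lands on the reasoning side.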
easel
added a commit
to easel/lucebox-hub
that referenced
this pull request
May 24, 2026
`stats()` and `full_stats()` read `lifetime_hits_`, `full_lifetime_hits_`, and `full_disk_bytes_` from a client thread while the daemon thread mutates them via `lookup()` / `lookup_full()`. In C++ that's a data-race / UB, not just a tearable value (codex review). Fix: declare the three counters `std::atomic<int64_t>`, use `fetch_add(1, memory_order_relaxed)` on writes and `.load(relaxed)` on reads. Relaxed ordering is sufficient — no synchronization with other state is required. `entries_.size()` / `full_entries_.size()` are still read without a lock; that's a single integer load on all libstdc++/libc++ targets and matches the Python impl's tear-tolerant introspection contract. Acceptable for an ops dashboard; documented in the spec header comment alongside `stats()`/`full_stats()`. Caught by codex review on PR Luce-Org#269.
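The counter change can be illustrated with a small sketch. `HitCounters` and `run_writers` are hypothetical names, and adding a byte count to `full_disk_bytes_` on each full hit is an assumption about how that counter is updated; only the member names and the relaxed `fetch_add` / `load` pattern come from the commit message.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

class HitCounters {
public:
    // Writer side (daemon thread in the commit's description).
    void record_hit() { lifetime_hits_.fetch_add(1, std::memory_order_relaxed); }
    void record_full_hit(int64_t bytes) {
        assert(bytes >= 0);
        full_lifetime_hits_.fetch_add(1, std::memory_order_relaxed);
        full_disk_bytes_.fetch_add(bytes, std::memory_order_relaxed);  // assumption
    }

    // Reader side (client thread). Relaxed loads are enough because no other
    // state is published through these counters.
    int64_t lifetime_hits() const { return lifetime_hits_.load(std::memory_order_relaxed); }
    int64_t full_lifetime_hits() const { return full_lifetime_hits_.load(std::memory_order_relaxed); }
    int64_t full_disk_bytes() const { return full_disk_bytes_.load(std::memory_order_relaxed); }

private:
    std::atomic<int64_t> lifetime_hits_{0};
    std::atomic<int64_t> full_lifetime_hits_{0};
    std::atomic<int64_t> full_disk_bytes_{0};
};

// Spawns several writers and returns the final hit count after joining them.
inline int64_t run_writers(HitCounters& counters, int writers, int hits_each) {
    std::vector<std::thread> pool;
    for (int w = 0; w < writers; ++w) {
        pool.emplace_back([&counters, hits_each] {
            for (int i = 0; i < hits_each; ++i) counters.record_hit();
        });
    }
    for (std::thread& t : pool) t.join();
    return counters.lifetime_hits();
}
```

Because each increment is an atomic read-modify-write, no updates are lost even though the ordering is relaxed; a reader may see a slightly stale total, which matches the tear-tolerant introspection contract mentioned above.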
easel
added a commit
to easel/lucebox-hub
that referenced
this pull request
May 24, 2026
`/v1/messages` and `/v1/messages/count_tokens` carried near-identical
copies of Anthropic's `system`-field normalization — both must accept
string or typed-block-array shapes, strip Claude Code's
`x-anthropic-billing-header:` blocks, then prepend the remainder as a
`{role:"system", content:...}` message. Duplicate logic risked drift
between generation and token counting.
Extract `normalize_anthropic_system(body, messages)` and replace both
call sites. Net change: -13 lines, one canonical implementation.
Caught by cubic review on PR Luce-Org#269.
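A rough sketch of the shared normalization, using plain structs in place of the server's JSON types. Everything here is illustrative: `SystemField`, `TextBlock`, `Message`, and `normalize_system_sketch` are hypothetical, the newline joiner between surviving blocks is an assumption, and so is skipping the system message when nothing survives.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct TextBlock {
    std::string type;  // e.g. "text"
    std::string text;
};

// Models Anthropic's `system` field: absent, a plain string, or typed blocks.
struct SystemField {
    bool present = false;
    bool is_string = false;
    std::string text;               // used when is_string is true
    std::vector<TextBlock> blocks;  // used otherwise
};

struct Message {
    std::string role;
    std::string content;
};

inline bool is_billing_block(const TextBlock& block) {
    static const std::string kPrefix = "x-anthropic-billing-header:";
    return block.text.compare(0, kPrefix.size(), kPrefix) == 0;
}

// Accepts either shape, drops billing-header blocks, and prepends the
// remainder as a system message.
inline void normalize_system_sketch(const SystemField& system, std::vector<Message>& messages) {
    if (!system.present) return;
    std::string content;
    if (system.is_string) {
        content = system.text;
    } else {
        for (const TextBlock& block : system.blocks) {
            if (block.type != "text" || is_billing_block(block)) continue;
            if (!content.empty()) content += "\n";  // assumed joiner
            content += block.text;
        }
    }
    if (content.empty()) return;  // assumption: nothing left, nothing prepended
    messages.insert(messages.begin(), Message{"system", content});
}
```

Routing both endpoints through one function means token counting and generation see the same system prompt by construction.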
952d04c to
cb33fb2
Compare
easel
added a commit
to easel/lucebox-hub
that referenced
this pull request
May 24, 2026
…ffort + per-request controls

Folds in the codex re-review fixes and new design surface from PR Luce-Org#269 (pr/server tip cb33fb2):
- share/model_cards/qwen3.6-27b.json sidecar (32768 / 81920 from the HF card) + README. Server reads at startup via GGUF general.name lookup.
- dflash/src/server/model_card.{h,cpp}: sidecar parser, per-family fallback table (Qwen3.5/3.6, Gemma4, Laguna), hard fallback to antirez/ds4 ds4_eval.c reference values, effort-tier formulas.
- ServerConfig: EffortTiers, SamplingDefaults, model_card_source_label. CLI flags --reasoning-effort-{low,medium,high,x-high,max} override per-tier values; banner reports source per knob.
- Request parser: thinking.budget_tokens + thinking.reply_budget (clamped to server ceiling, never looser); 5-tier reasoning.effort with x-high/max as dflash extensions of OpenAI's 3-tier vocab; unknown effort tier falls back to high. §4.3 combined precedence: budget_tokens wins, effort still selects defaults.
- GenerateResult::budget_forced_close flag: backends signal when L2 injected </think>; HTTP server uses it to attribute close_kind="hard" (was previously a text-grep against decoded phase1 which couldn't distinguish injected from natural close).
- L1 phase-2 reprompt: emit only </think> through the REASONING→CONTENT transition (the prior "\n\nFinal answer: " scaffolding leak prefixed every hard-closed response).
- Gemma4 tail-off: seed out_tokens with last_tok when empty before the iter-0 do_decode call (eliminates UB on small-budget requests).
- --max-tokens legacy CLI flag now wired as documented alias for --default-max-tokens (was parsed to a dead field).
- Spec rewritten to ~500 lines covering: model-card resolution order, sidecar JSON shape with explicit reasoning_effort_tiers override, 5-tier effort vocab, per-request clamp rules, why client controls are bounded rather than full overrides.

CMakeLists.txt adds model_card.cpp to dflash_server + test_server_unit targets.
test_server_unit 1426/1426 PASS on pr/server build.
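The request-parser rules listed above (bounded client controls, unknown tier falling back to high, `budget_tokens` taking precedence) can be sketched like this. The struct layout, function names, and the tier values used in any example are hypothetical; flooring negative requests at zero is also an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <optional>
#include <string>

enum class Effort { Low, Medium, High, XHigh, Max };

// Unknown tier names fall back to High, as the commit message states.
inline Effort parse_effort(const std::string& name) {
    if (name == "low") return Effort::Low;
    if (name == "medium") return Effort::Medium;
    if (name == "x-high") return Effort::XHigh;
    if (name == "max") return Effort::Max;
    return Effort::High;  // covers "high" and anything unrecognized
}

// Hypothetical per-tier thinking budgets, in tokens.
struct EffortTiers {
    int low, medium, high, x_high, max;
};

inline int tier_budget(const EffortTiers& tiers, Effort effort) {
    switch (effort) {
        case Effort::Low: return tiers.low;
        case Effort::Medium: return tiers.medium;
        case Effort::High: return tiers.high;
        case Effort::XHigh: return tiers.x_high;
        case Effort::Max: return tiers.max;
    }
    return tiers.high;  // unreachable; keeps compilers quiet
}

// A client may tighten the budget but never loosen it past the server ceiling.
inline int clamp_to_ceiling(int requested, int server_ceiling) {
    assert(server_ceiling >= 0);
    return std::clamp(requested, 0, server_ceiling);
}

// An explicit budget_tokens wins over the effort tier; both are clamped.
inline int effective_think_budget(std::optional<int> budget_tokens, const std::string& effort,
                                  const EffortTiers& tiers, int server_ceiling) {
    const int wanted = budget_tokens ? *budget_tokens : tier_budget(tiers, parse_effort(effort));
    return clamp_to_ceiling(wanted, server_ceiling);
}
```

Clamping after tier selection means an operator can lower the ceiling without editing every tier value.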
269f4ef to
109cebd
Compare
easel
added a commit
to easel/lucebox-hub
that referenced
this pull request
May 25, 2026
…e] log
Snapshot/bench tooling can now read the full effective runtime config
off /props.runtime in one shot. Three new fields under runtime:
chunk — prefill chunk size (bargs.chunk)
target_device — resolved target-model placement ("auto:0", "cuda:0")
draft_device — resolved draft-model placement, null when no draft
Why: of 98 historical bench snapshots, only 1 had populated server_info
(post-c35a8a4). We just discovered that downstream ops have been
forcing --cache-type-k/v q4_0 since 2026-05-23 while the binary auto-
default is tq3_0 (when max_ctx>6144 on CUDA). Without /props capturing
the full runtime config, that 2-day divergence was invisible to
forensics. Adding chunk/target_device/draft_device closes the rest of
the introspection gap.
Companion: qwen35 do_ar_decode now emits an `[ar-decode] tokens=N
time=Ts speed=X tok/s` line to stderr per request — parity with the
existing `[spec-decode]` summary. AR fires when sampling needs logit
processing (temp != 0), which the existing log path didn't cover.
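For illustration, a per-request summary line in that shape could be produced as below. `format_ar_decode_line` is a hypothetical helper; the two-decimal precision and the zero-speed guard for a zero elapsed time are assumptions, since the commit message only gives the `tokens=N time=Ts speed=X tok/s` template.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Builds "[ar-decode] tokens=N time=Ts speed=X tok/s" for one request.
inline std::string format_ar_decode_line(int tokens, double seconds) {
    assert(tokens >= 0);
    // Guard against division by zero on an empty or instant request.
    const double speed = seconds > 0.0 ? tokens / seconds : 0.0;
    char buf[128];
    std::snprintf(buf, sizeof(buf), "[ar-decode] tokens=%d time=%.2fs speed=%.2f tok/s",
                  tokens, seconds, speed);
    return std::string(buf);
}

// Writes the summary to stderr, next to where a [spec-decode] line would go.
inline void log_ar_decode(int tokens, double seconds) {
    std::fprintf(stderr, "%s\n", format_ar_decode_line(tokens, seconds).c_str());
}
```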
Spec updates:
docs/specs/openapi-props.yaml — Runtime schema extended; example
payload updated.
docs/specs/props-endpoint.md — new §4.16 documenting the full
runtime block (the six pre-existing + three new).
Test: test_props_runtime_shape pins the full field set + null-draft
case. 1497 assertions pass.
This is the PR-side port of the same patch on integration tip 3ed5062.
The integration tip also carries data snapshots (perf-sweep) and
operational docs (KV q4 vs tq3 A/B design, run-request addenda) which
are not in scope for PR Luce-Org#269.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
easel
added a commit
to easel/lucebox-hub
that referenced
this pull request
May 25, 2026
…-Org#269 PR Luce-Org#269 now carries chunk + target_device + draft_device fields in /props.runtime, the [ar-decode] log emission, openapi/props-endpoint spec updates, and test_props_runtime_shape. Same surface as the integration commit 3ed5062. This merge keeps the two branches aligned on the shared code/spec content. Integration retains its data snapshots + operational docs that are out of PR Luce-Org#269's scope. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
easel
added a commit
to easel/lucebox-hub
that referenced
this pull request
May 25, 2026
…1_cap fix

Same surface as integration tip 069c412, code-only port: drops the operational snapshot + run-request docs that don't belong on PR Luce-Org#269. Three layers, all engine-owned:

Phase A: hard_limit_reply_budget made per-model
- share/model_cards/_schema.json: optional `hard_limit_reply_budget`
- share/model_cards/qwen3.6-27b.json: declared at 4096 (verbose-post-close)
- dflash/src/server/model_card.cpp: read field from sidecar, family fallback for qwen35/36/3 → 4096
- docs/specs/{thinking-budget.md,model-cards.md}: documented
- BUG FIX in http_server.cpp: phase1_cap = min(think_max + reply_budget, max_output). Without the `+ reply_budget` term, force-close fires at the FIRST AR iter when hard_limit equals think_max. Latent bug exposed by the hard_limit bump.

Phase B: soft-limit negotiated close (spec §5.3)
- BudgetHook gains soft_limit_remaining + soft_limit_close_rank
- ModelCard / ServerConfig / sidecar / schema: soft_limit_reply_budget (default 0 = disabled) + soft_limit_close_rank (default 8)
- qwen35_backend do_ar_decode: pre-hard-limit branch peeks top-K of logits; if `</think>` ranks in top-K AND remaining is in the (hard, soft] window, accept the close. Ports ds4_eval.c:3030's two-tier strategy.
- GenerateResult.budget_soft_close → close_kind="soft" in finish_details

Phase C: chat-template thinking_preamble (spec §3.3 / §5.4)
- Sidecar fields thinking_preamble (template with {think_max}/{reply_max} substitution) + thinking_preamble_format (comment|directive|none)
- Qwen3.6 sidecar ships an HTML-style comment preamble: Qwen3.6 has no trained native reasoning-effort signal, so we communicate budget via a comment in the reasoning region. Model self-plans.
- chat_template.cpp render_chat_template: injects preamble after the opening `<think>` when enable_thinking=true. Caller substitutes placeholders with EFFECTIVE per-request values (after §4.4 clamping), not server ceilings.

Single-case demonstration (recNu3MXkvWUzHZr9 GPQA Diamond, correct=B):

    BEFORE        given=C ✗  thk=3585  wall=244s  (reply clipped by force-close)
    AFTER-A       given=B ✓  thk=4097  wall=323s  (the fix)
    AFTER-A+B+C   given=B ✓  thk=0     wall=48s   (model self-closed, 6.7× faster)

Bench/snapshot tooling (post c35a8a4 + 3ed5062) wholesale-captures /props.budget_envelope so the per-config knobs land in result.json without further bench edits.

Test surface: test_server_unit 1497 assertions / 0 failures across all phases (Phase A, A+B, A+B+C). No props_schema bump needed: all new fields are additive under budget_envelope + model_card.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
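The Phase A cap and the Phase B two-tier close decision can be sketched together. This is a simplified reading of the commit message, not the backend code: `decide_close` and `rank_of` are hypothetical, rank is counted as the number of strictly larger logits, and "ranks in top-K" is taken to mean a zero-based rank below K.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Phase A fix: the phase-1 cap must leave room for the reply budget.
inline int phase1_cap(int think_max, int reply_budget, int max_output) {
    return std::min(think_max + reply_budget, max_output);
}

enum class CloseDecision { Continue, SoftClose, HardClose };

// Zero-based rank of a token: how many logits are strictly larger.
inline int rank_of(const std::vector<float>& logits, std::size_t token_id) {
    assert(token_id < logits.size());
    int rank = 0;
    for (float value : logits) {
        if (value > logits[token_id]) ++rank;
    }
    return rank;
}

// Phase B: accept a natural-looking close inside the (hard, soft] window when
// the close token already ranks in the top K; force it at the hard limit.
inline CloseDecision decide_close(int remaining, int hard_limit, int soft_limit, int close_rank_k,
                                  const std::vector<float>& logits, std::size_t close_token_id) {
    if (remaining <= hard_limit) return CloseDecision::HardClose;
    const bool in_soft_window = soft_limit > 0 && remaining <= soft_limit;  // 0 disables
    if (in_soft_window && rank_of(logits, close_token_id) < close_rank_k) {
        return CloseDecision::SoftClose;
    }
    return CloseDecision::Continue;
}
```

With `think_max = 4096` and `reply_budget = 4096`, the cap is 8192 rather than 4096, so the first AR iteration starts with more than `hard_limit` tokens remaining and the hook no longer fires immediately.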
easel
added a commit
to easel/lucebox-hub
that referenced
this pull request
May 25, 2026
easel
added a commit
to easel/lucebox-hub
that referenced
this pull request
May 25, 2026
… (PR Luce-Org#269 candidate)

The GEMMA4 case in chat_template.cpp didn't reference thinking at all, so:
- With enable_thinking=false: model self-emits its own <|channel>thought\n...<channel|> sequence and the rendered text (e.g. the literal "thought" token + newlines) leaks into visible content. Reproduced on bragi qwen via plain prompts like "What is 2+2?" returning '\n4//thought\n\n4\nthought\n\n4'.
- With enable_thinking=true: nothing in the prompt signals the model that it should route through the channel-thought block, so reasoning_content stays empty and the model never enters proper thinking mode.

Fix per the chat template embedded in the Gemma4 GGUF metadata (google/gemma-4-26B-A4B-it):
1. enable_thinking=true → emit `<|think|>\n` at top of system turn
2. add_generation_prompt + !enable_thinking → append `<|channel>thought\n<channel|>` after `<|turn>model\n` so the model SKIPS its own thought channel (the Qwen3 `<think></think>` guard analog)
3. system turn now wraps system content properly instead of prepending it inline to the first user message (which broke when the first message wasn't user)

cherry-pick: PR Luce-Org#269 (thinking-budget v2 series). Strictly a thinking-routing fix, no decode-path or model-arch change.

Repro: bragi /v1/chat/completions with gemma-4-26b-A4B-it Q4_K_M; post-fix the same prompts should return clean content without any "thought" substring leaking.

Run-request: dflash/docs/run-requests/bragi-gemma4-laguna-config-issues.md (item #1 and #2; #3 the smoke-mc ggml_mul_mat crash is a separate fix targeted at an independent PR).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
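Items 1 and 2 of the fix can be sketched as two small helpers. The function names are hypothetical and this covers only the two fragments whose literal text appears in the commit message; the surrounding system-turn wrapper from item 3 is not reproduced because its exact markup is not given here.

```cpp
#include <cassert>
#include <string>

// Item 1: marker emitted at the top of the system turn when thinking is on.
inline std::string gemma4_system_thinking_marker(bool enable_thinking) {
    return enable_thinking ? "<|think|>\n" : "";
}

// Item 2: generation prompt. With thinking off, an empty thought channel is
// pre-filled so the model skips its own (analogous to Qwen3's empty
// <think></think> guard).
inline std::string gemma4_generation_prompt(bool add_generation_prompt, bool enable_thinking) {
    if (!add_generation_prompt) return "";
    std::string out = "<|turn>model\n";
    if (!enable_thinking) {
        out += "<|channel>thought\n<channel|>";
    }
    return out;
}
```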
easel
added a commit
to easel/lucebox-hub
that referenced
this pull request
May 25, 2026
After landing the chat-template fix (f1d30f2) and rebuilding, the //thought leakage is mostly gone (Hello-world, Continue sequence, real prompts all return clean content). But a separate hard crash surfaces:

    GGML_ASSERT(ggml_can_mul_mat(a, b)) failed at ggml.c:3243
    → container exits with 139 immediately

Reproduces deterministically with a single 169-token prompt (the first HumanEval case). Trivial prompts (<20 tokens) work fine. The crash is INDEPENDENT of the chat-template work: it happens with no system message either.

This is a tensor-shape mismatch in the gemma4 forward path (Gemma4Backend / gemma4_decode.cu), needs sindri's eye on which mul_mat call is asserting. This belongs in an independent PR, not the Luce-Org#269 thinking-budget series.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
easel
added a commit
to easel/lucebox-hub
that referenced
this pull request
May 25, 2026
Update gemma4 RR to reflect that the chat-template fix (f1d30f2) is landed in PR Luce-Org#269 and the remaining ggml_can_mul_mat crash is being investigated by erik/Claude — will ship as an encapsulated independent PR. Add bisection notes: structure-specific (HumanEval prompt at 169 tokens crashes; 466 tokens of plain repeated text runs clean), single user message reproduces, 24 candidate mul_mat call sites in gemma4_graph.cpp. Stripped binary → next step is debug-symbol build or per-mul-mat name-tagging. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…l-card sidecars
Adds per-request thinking-budget controls, multi-dialect reasoning
emission, and JSON model-card sidecars to the native C++ dflash server.
## Thinking-budget mechanism
- Level 2 BudgetHook: backend AR / spec-decode injects the model's
`</think>` close-token sequence once when `n_gen − committed
≤ hard_limit_reply_budget`. KV-continuous, mid-stream, in-process —
applies uniformly to streaming and non-streaming requests.
- Degenerate-decode watchdog: detects post-close repetition / runaway
decode and aborts cleanly, surfacing the flag in finish_details and
bench output.
- Sidecar `thinking_terminator_hint` field: per-family close-token
text (e.g. `</think>\n\n` for qwen3, `<channel|>\n\n` for gemma4)
resolved at startup from the model card; `think_close_token_ids` is
populated by tokenizing the hint, so the BudgetHook is family-aware
without hardcoded archs.
- Resolution order documented in spec §3: CLI > sidecar > family
fallback > hard fallback. Hard fallback `hard_limit_reply_budget`
defaults to 4096 (raised from 512); terse models should override
down via sidecar.
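The resolution order in the last bullet amounts to a first-present-wins chain. A minimal sketch, assuming each layer either supplies a value or leaves it unset; `Resolved`, `resolve_knob`, and the source label strings are hypothetical.

```cpp
#include <cassert>
#include <optional>
#include <string>

struct Resolved {
    int value;
    std::string source;  // which layer supplied the value (labels are illustrative)
};

// CLI > sidecar > family fallback > hard fallback.
inline Resolved resolve_knob(std::optional<int> cli, std::optional<int> sidecar,
                             std::optional<int> family_fallback, int hard_fallback) {
    if (cli) return {*cli, "cli"};
    if (sidecar) return {*sidecar, "sidecar"};
    if (family_fallback) return {*family_fallback, "family-fallback"};
    return {hard_fallback, "hard-fallback"};
}
```

Returning the source alongside the value is what lets a startup banner report which layer each knob came from.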
## Multi-dialect reasoning emission
- SSE emitter splits reasoning ↔ content for OpenAI Chat
(`reasoning_content` delta), Anthropic (thinking/text content
blocks with separate lifecycle), and Responses API (reasoning
stripped per Codex r1 P2).
- Qwen3.6 `<think>` / `</think>` special token ids are forwarded as
text into the emitter; gemma4 `<|channel>` / `<channel|>` are
mapped onto the same channel so all archs share one state machine.
- `first_content_token_index()` derives the natural-close split from
the REASONING→CONTENT transition; leading `<think>` opener is
detected before fci capture so thinking_tokens accounts correctly
for the Qwen3.6 streamed-thinking path.
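One way to picture the shared state machine: both tag families drive the same two-state channel, and the first content index falls out of the REASONING→CONTENT transition. This sketch works on already-decoded text pieces; `first_content_index_sketch` is a hypothetical stand-in for `first_content_token_index()`, and returning `pieces.size()` when content never starts is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

enum class Channel { Reasoning, Content };

// Both dialects' openers and closers map onto the same channel switch.
inline bool is_open_tag(const std::string& piece) {
    return piece == "<think>" || piece == "<|channel>";
}
inline bool is_close_tag(const std::string& piece) {
    return piece == "</think>" || piece == "<channel|>";
}

// Index of the first piece emitted on the CONTENT channel.
// `start` lets callers begin in REASONING when the opener lives in the prompt.
inline std::size_t first_content_index_sketch(const std::vector<std::string>& pieces,
                                              Channel start = Channel::Content) {
    Channel channel = start;
    for (std::size_t i = 0; i < pieces.size(); ++i) {
        const std::string& piece = pieces[i];
        if (channel == Channel::Content) {
            if (is_open_tag(piece)) {  // leading opener: switch before capturing
                channel = Channel::Reasoning;
                continue;
            }
            return i;
        }
        if (is_close_tag(piece)) channel = Channel::Content;
    }
    return pieces.size();  // never reached content (assumed convention)
}
```

Handling the opener before capturing the index is what keeps a leading `<think>` from being mistaken for the first content token.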
## /props endpoint
- Wholesale model_card (verbatim sidecar JSON, validates against
share/model_cards/_schema.json) + budget_envelope (effective
think_max_tokens, default_max_tokens, hard_limit_reply_budget,
effort_tiers, model_card_source label) + runtime fields (chunk,
target_device, draft_device, speculative_enabled, fa_window,
ddtree_budget, kv_cache_k/v, runtime_backend). Captured by
server_main at startup so the handler doesn't crack BackendArgs.
## Model-card sidecars
- share/model_cards/{qwen3.6-27b,gemma-4-26b-a4b-it,gemma-4-31b-it,
laguna-xs.2}.json — each ships max_tokens, complex_problem_max_tokens,
hard_limit_reply_budget, thinking_terminator_hint, sampling defaults,
and reasoning_effort_tiers as applicable.
- share/model_cards/_schema.json — JSON Schema for sidecar
validation, exercised by server_main loader and shipped for
third-party authors.
- model_card.cpp resolver: keyword + family-fallback lookup, with
startup banner reporting the resolved `model_card_source` so
operators can confirm which envelope is in force.
## Spec docs
- docs/specs/thinking-budget.md — mechanism, resolution order, CLI
surface, finish_details/close_kind contract.
- docs/specs/model-cards.md — sidecar field reference, family
fallback table, ship/override guidance.
- docs/specs/props-endpoint.md + docs/specs/openapi-props.yaml —
shape and OpenAPI 3 schema for the /props payload.
## Tests
- test_server_unit gains coverage of the SSE emitter reasoning split
(OpenAI / Anthropic / Responses), first_content_token_index across
the natural-close, never-closed, content-only, and Qwen3.6
streamed-thinking paths, /props body shape (wholesale sidecar +
family-fallback null), and usage.timings across all three response
shapes. 1542 assertions, 0 failures.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Resolves merge conflict in dflash/src/qwen35/qwen35_backend.cpp at the AR fallback call site. PR Luce-Org#269 added a BudgetHook + forced_close/degenerate output params to do_ar_decode(); upstream main introduced a run_ar_decode_path() wrapper. Both symbols coexist; the conflict was only at the call site inside generate_dflash(). Kept PR Luce-Org#269's direct do_ar_decode() invocation so the thinking-budget force-close still fires when spec-decode is unavailable; the upstream wrapper would have lost the hook.

Tested on lucebox2 (RTX 3090, CUDA 12.6, sm_86):
- cmake build clean for all PR Luce-Org#269 + PR Luce-Org#262 targets
- dflash_server end-to-end on Qwen3.6-27B-Q4_K_M with reasoning.effort=low:
  - BudgetHook fired at committed=770/1000 remaining=256 == hard_limit=256
  - finish_details.close_kind=hard, degenerate watchdog triggered cleanly

Co-Authored-By: WOZCODE <contact@withwoz.com>
easel
added a commit
to easel/lucebox-hub
that referenced
this pull request
May 26, 2026
Brings in 52 upstream commits since merge-base 8c23234 (2 days ago). The headline is PR Luce-Org#269 (`403e598 feat(cpp-server): thinking-budget v2 + multi-dialect reasoning + model-card sidecars`), the squashed version of our own thinking-budget v2 work that we'd been carrying as 50+ small commits on this branch. Plus:
- PR Luce-Org#262 howard0su/powerinfer: hybrid MoE spec-decode with DFlash draft, GPU/CPU FFN overlap, persistent pre-FFN graph for DeltaNet, 4-5x decode speedup, MoE perf telemetry
- PR Luce-Org#263 weicj/feat-cpp-server-pflash-draft-placement: mixed-backend PFlash phase split
- 648a6e2 perf: GPU-resident hybrid decode (eliminate PCIe round-trips)
- d12ddde fix: dynamic placement uses --max-ctx instead of hardcoded 8192

Conflict resolution: where this branch has post-Luce-Org#269 refinements (gemma4 timings via 4e9abda, degenerate-decode watchdog via c2d725f / 8538ff9, laguna chat-template fix via 92f84cd, transition cue via 16bb31e, thinking-budget force-close fix via b86342d, etc.), keep ours. Where the only divergence is "this branch has 50 small commits that 403e598 squashes," accept the merged result.

# Conflicts:
#   dflash/scripts/server.py
#   dflash/src/qwen35/qwen35_backend.cpp
#   dflash/src/qwen35/qwen35_backend.h
#   dflash/src/server/http_server.h
#   dflash/src/server/sse_emitter.cpp
#   dflash/test/test_server_unit.cpp
#   share/model_cards/laguna-xs.2.json
easel
added a commit
to easel/lucebox-hub
that referenced
this pull request
May 27, 2026
server_main.cpp had two identical 119-line blocks of the thinking-budget
v2 model-card resolution code (general.name / general.architecture
read, resolve_model_card(), ServerConfig application, tier clamping).
g++ errored out on redeclarations:
redeclaration of 'std::string general_name'
redeclaration of 'std::string general_arch'
redeclaration of 'dflash::common::ModelCard card'
redeclaration of 'const int tier_ceiling'
conflicting declaration 'auto clamp_tier'
The duplicate was a merge artifact from 1df9099 (luce-org/main into
integration/-clean post-rename). Upstream PR Luce-Org#269 squashed our pre-PR
work into 403e598 and our integration branch carried the same code in
unsquashed form; the 3-way merge kept both copies.
Also drop a redundant `#include "gguf.h"` (lines 23 and 25 of the same
file). Harmless thanks to include guards but ugly merge residue.
Build proceeds past cmake configure with the dedupe.
This was referenced May 27, 2026
easel
added a commit
to easel/lucebox-hub
that referenced
this pull request
May 27, 2026
…nch in-tree

Collapses 134 commits of `integration/props-uv-squared-clean` onto current main as one reviewable change. Most of the underlying server-side work already landed via separate PRs: thinking-budget v2 + multi-dialect reasoning + sidecars (Luce-Org#269), dflash→server rename + optimizations/ grouping (Luce-Org#281), qwen35moe hybrid CPU/CUDA expert split (Luce-Org#262), and a stream of smaller fixes from bragi over the last week. What remained in integration is everything *above* the server: the host-side runner, the container image, the benchmark/profile evidence pipeline, the harness for driving real clients, and the luce-bench framework itself.

## What changed

### Docker + host wrapper
- Dockerfile (CUDA 12.8 base; copies server/, lucebox/, harness/, luce-bench/ into one image; wires `python -m lucebench.cli` as the `benchmark` entrypoint subcommand).
- `lucebox.sh` (~470 lines of host bash, zero deps beyond docker + nvidia-smi): `check`, `configure`, `pull`, `download-models`, `serve`, `install`/`start`/`status`/`logs` (user-systemd), `print-run`, `benchmark`, `profile`.
- `.github/workflows/docker.yml` builds + pushes `ghcr.io/luce-org/lucebox-hub` tags (`:cuda12`, `:vX.Y.Z-cuda12`, `:X.Y-cuda12`, `:sha-<short>-cuda12`).
- `server/scripts/entrypoint.sh` resolves draft GGUF by target architecture (gemma4 → gemma drafter, qwen3.6 → dflash-draft-3.6); warns when multiple targets are present in models/.

### lucebox Python package (in-container CLI)
- `lucebox/` workspace member: `cli.py`, `autotune.py` (VRAM-tiered tier selection + per-host config writeback), `config.py` (typed TOML), `download.py`, `docker_run.py`, `host_check.py`, `host_facts.py`, `profile.py` (profile sweep across DFLASH_MAX_CTX × DFLASH_BUDGET, KV cache types, pFlash modes, lazy-draft, prefix-cache slots), `smoke.py`, `types.py`.
- `lucebox/tests/` for the typed surfaces.
- Level1/Level2/Level3 profile gates; sweep results merged back into `~/.lucebox/config.toml` only after capability + ds4-eval/agentic-tools/agentic-session validation gates pass.

### luce-bench in monorepo
- `luce-bench/` workspace member at v0.2.4: the standalone bench framework (areas: ds4-eval, code, longctx, agent, forge; sweep + per-host snapshot output; v0.2.4 includes the forge area's EvalConfig + run_scenario signature realignment).
- `[tool.uv.sources] luce-bench = { workspace = true }` replaces the prior git tag pin.
- `.github/workflows/release-luce-bench.yml` publishes to PyPI from the monorepo on `luce-bench-v*` tags (trusted publisher, `pypi` environment).

### harness workspace
- `harness/` workspace member: client adapters (`claude_code`, `codex`, `opencode`, `hermes`, `pi`, `openclaw`), `client_test_runner.py`, `benchmarks/run_lucebox_vs_llamacpp.sh`, prompts. `lucebox profile` delegates the actual bench runs to harness.

### Bench + profile evidence
- `server/docs/BENCHMARK_SNAPSHOT_SPEC.md`: schema for tuning/profile artifacts. Snapshots themselves live in the standalone `luce-bench-baselines` repo (out of this tree).

### Misc
- Updated CI workflow path filters for `server/` (post-rename).
- README's "Quick start" section, hardware coverage table, env var reference table; minor edits to optimizations READMEs.
- model card sidecar updates landed alongside Luce-Org#269 but kept here at current values (qwen3.6, gemma-4-26b-a4b, gemma-4-31b, laguna-xs.2, `_schema.json`).

## Out of scope / follow-ups
- 31b backend wiring beyond what `share/model_cards/gemma-4-31b-it.json` shipped (working empirically @ 24GB on sindri AR-only; 26b spec-decode path already proven).
- gemma4 MoE expert split (howard0su's PR Luce-Org#262 territory; merged but not applied to gemma4 yet).
- Multi-Token Prediction (upstream PR #23398, draft).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds per-request thinking-budget controls, multi-dialect reasoning emission, and JSON model-card sidecars to the native C++ dflash server.
## What changed

- … `</think>` close-token sequence once when `n_gen − committed ≤ hard_limit_reply_budget`: KV-continuous, mid-stream, in-process, uniform across streaming and non-streaming. Degenerate-decode watchdog detects post-close runaway and aborts cleanly, surfaced as a flag in `finish_details` and bench output. Sidecar `thinking_terminator_hint` ↔ tokenizer resolves `think_close_token_ids` at startup so the hook is family-aware without hardcoded archs. Resolution order: CLI > sidecar > family fallback > hard fallback (default `hard_limit_reply_budget` = 4096; terse models override down).
- … `reasoning_content` delta), Anthropic (separate thinking / text content-block lifecycle), and Responses API (reasoning stripped per Codex r1 P2). Qwen3.6 `<think>`/`</think>` special-token ids are forwarded as text into the emitter; gemma4 `<|channel>`/`<channel|>` are mapped onto the same channel so all archs share one state machine. `first_content_token_index()` derives the natural-close split from the REASONING→CONTENT transition; leading `<think>` opener is detected up-front so `thinking_tokens` accounts correctly for the Qwen3.6 streamed-thinking path.
- … `model_card` (verbatim sidecar JSON, validates against `share/model_cards/_schema.json`) + `budget_envelope` (effective `think_max_tokens`, `default_max_tokens`, `hard_limit_reply_budget`, `effort_tiers`, `model_card_source` label) + `runtime` fields (`chunk`, `target_device`, `draft_device`, `speculative_enabled`, `fa_window`, `ddtree_budget`, `kv_cache_k/v`, `runtime_backend`). Captured by `server_main` at startup so the handler doesn't crack `BackendArgs`.
- `share/model_cards/{qwen3.6-27b,gemma-4-26b-a4b-it,gemma-4-31b-it,laguna-xs.2}.json` each ship `max_tokens`, `complex_problem_max_tokens`, `hard_limit_reply_budget`, `thinking_terminator_hint`, sampling defaults, and `reasoning_effort_tiers` as applicable. `share/model_cards/_schema.json` provides the JSON Schema for validation. `model_card.cpp` resolver does keyword + family-fallback lookup, with the startup banner reporting the resolved `model_card_source` so operators can confirm which envelope is in force.
- `docs/specs/thinking-budget.md` (mechanism, resolution order, CLI surface, `finish_details`/`close_kind` contract), `docs/specs/model-cards.md` (sidecar field reference, family fallback table, ship/override guidance), `docs/specs/props-endpoint.md` + `docs/specs/openapi-props.yaml` (shape + OpenAPI 3 schema for the `/props` payload).

## Validation

- `cmake --build dflash/build -j` clean (warnings only on pre-existing `-Wunused-result` `fread` calls in unrelated test code).
- `ctest -V -R server_unit` → 1542 assertions, 0 failures, including:
  - `first_content_token_index` across natural-close, never-closed, content-only, and Qwen3.6 streamed-thinking (new regression test added in this PR; leading `<think>` opener no longer captures `fci=0`).
  - `/props` body shape (wholesale sidecar + family-fallback null).
  - `usage.timings` across all three response shapes.
- … `dflash/src/server/sse_emitter.cpp` to pre-fix state makes `test_emitter_first_content_index_qwen36_streaming_thinking` fail with `em.first_content_token_index() > 0`.

## Out of scope / follow-ups

- `share/model_cards/gemma-4-31b-it.json` ships but the dense-30.7B backend path is not yet integrated; the sidecar is in place for when it lands.
- `gemma-4-26b-a4b-it` runs end-to-end at this PR's level of integration, but the MoE expert-split optimization is howard0su's PR #262 ("Split MoE weights between CPU & CUDA, support qwen35moe models") territory.
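
For reviewers, the budget resolution order and the force-close trigger described under "What changed" can be sketched roughly as below. This is an illustrative sketch, not the code in this PR: `BudgetSources`, `resolve_hard_limit_reply_budget`, and `should_force_close` are hypothetical names, and the arithmetic assumes the "generated since entry" framing from the follow-up fix (`generated = committed_now - committed_at_entry`).

```cpp
#include <cassert>
#include <optional>

// Hypothetical container for the sources named in the resolution order.
// The real server may store these differently.
struct BudgetSources {
    std::optional<int> cli;              // value passed on the command line, if any
    std::optional<int> sidecar;          // hard_limit_reply_budget from the model card
    std::optional<int> family_fallback;  // per-family default, if one matched
};

// Hard fallback quoted in the PR description.
constexpr int kHardFallbackReplyBudget = 4096;

// Resolution order: CLI > sidecar > family fallback > hard fallback.
inline int resolve_hard_limit_reply_budget(const BudgetSources& s) {
    if (s.cli) return *s.cli;
    if (s.sidecar) return *s.sidecar;
    if (s.family_fallback) return *s.family_fallback;
    return kHardFallbackReplyBudget;
}

// Returns true when the close-token sequence should be injected.
// n_gen is gen-only; committed_* are absolute KV positions, so the
// prompt length is cancelled out by subtracting the entry position.
inline bool should_force_close(int n_gen,
                               int committed_at_entry,
                               int committed_now,
                               int hard_limit_reply_budget,
                               bool already_closed) {
    if (already_closed) return false;  // the hook fires at most once
    const int generated = committed_now - committed_at_entry;
    const int remaining = n_gen - generated;
    return remaining <= hard_limit_reply_budget;
}
```

Keeping both quantities in the same frame means a long prompt does not shift the point at which the reply budget is reserved.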