feat(cpp-server): thinking-budget v2 + multi-dialect reasoning aliases + spec by easel · Pull Request #269 · Luce-Org/lucebox-hub

easel · 2026-05-23T23:57:49Z

Adds per-request thinking-budget controls, multi-dialect reasoning emission, and JSON model-card sidecars to the native C++ dflash server.

What changed

Thinking-budget mechanism (Level 2 BudgetHook + watchdog). Backend AR / spec-decode injects the model's </think> close-token sequence once when n_gen − committed ≤ hard_limit_reply_budget — KV-continuous, mid-stream, in-process, uniform across streaming and non-streaming. Degenerate-decode watchdog detects post-close runaway and aborts cleanly, surfaced as a flag in finish_details and bench output. Sidecar thinking_terminator_hint ↔ tokenizer resolves think_close_token_ids at startup so the hook is family-aware without hardcoded archs. Resolution order: CLI > sidecar > family fallback > hard fallback (default hard_limit_reply_budget = 4096; terse models override down).
Multi-dialect reasoning emission. SSE emitter splits reasoning ↔ content for OpenAI Chat (reasoning_content delta), Anthropic (separate thinking / text content-block lifecycle), and Responses API (reasoning stripped per Codex r1 P2). Qwen3.6 <think> / </think> special-token ids are forwarded as text into the emitter; gemma4 <|channel> / <channel|> are mapped onto the same channel so all archs share one state machine. first_content_token_index() derives the natural-close split from the REASONING→CONTENT transition; leading <think> opener is detected up-front so thinking_tokens accounts correctly for the Qwen3.6 streamed-thinking path.
/props endpoint. Wholesale model_card (verbatim sidecar JSON, validates against share/model_cards/_schema.json) + budget_envelope (effective think_max_tokens, default_max_tokens, hard_limit_reply_budget, effort_tiers, model_card_source label) + runtime fields (chunk, target_device, draft_device, speculative_enabled, fa_window, ddtree_budget, kv_cache_k/v, runtime_backend). Captured by server_main at startup so the handler doesn't crack BackendArgs.
Model-card sidecars. share/model_cards/{qwen3.6-27b,gemma-4-26b-a4b-it,gemma-4-31b-it,laguna-xs.2}.json each ship max_tokens, complex_problem_max_tokens, hard_limit_reply_budget, thinking_terminator_hint, sampling defaults, and reasoning_effort_tiers as applicable. share/model_cards/_schema.json provides the JSON Schema for validation. model_card.cpp resolver does keyword + family-fallback lookup, with the startup banner reporting the resolved model_card_source so operators can confirm which envelope is in force.
Spec docs. docs/specs/thinking-budget.md (mechanism, resolution order, CLI surface, finish_details / close_kind contract), docs/specs/model-cards.md (sidecar field reference, family fallback table, ship/override guidance), docs/specs/props-endpoint.md + docs/specs/openapi-props.yaml (shape + OpenAPI 3 schema for the /props payload).
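The resolution order named above (CLI > sidecar > family fallback > hard fallback) can be pictured as a first-match walk over optional sources. The sketch below is illustrative only: `BudgetSources`, `Resolved`, and `resolve_hard_limit` are hypothetical names, not the identifiers used in model_card.cpp, and only the 4096 default comes from the PR text.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical types for illustration; not the real model_card.cpp API.
struct BudgetSources {
    std::optional<int> cli;              // value passed on the command line, if any
    std::optional<int> sidecar;          // value read from a model-card sidecar, if any
    std::optional<int> family_fallback;  // per-family table entry, if any
};

struct Resolved {
    int value;
    std::string source;  // label an operator could check in the startup banner
};

// Default hard_limit_reply_budget stated in the PR description.
constexpr int kHardFallbackReplyBudget = 4096;

// Return the first source that is set, in precedence order.
inline Resolved resolve_hard_limit(const BudgetSources& s) {
    if (s.cli)             return {*s.cli, "cli"};
    if (s.sidecar)         return {*s.sidecar, "sidecar"};
    if (s.family_fallback) return {*s.family_fallback, "family-fallback"};
    return {kHardFallbackReplyBudget, "hard-fallback"};
}
```

Returning the source label together with the value is one way to support the banner behaviour described above, where operators confirm which envelope is in force.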

Validation

cmake --build dflash/build -j clean (warnings only on pre-existing -Wunused-result fread calls in unrelated test code).
ctest -V -R server_unit → 1542 assertions, 0 failures, including:
- SSE emitter reasoning split across OpenAI Chat / Anthropic / Responses.
- first_content_token_index across natural-close, never-closed, content-only, and Qwen3.6 streamed-thinking (new regression test added in this PR — leading <think> opener no longer captures fci=0).
- /props body shape (wholesale sidecar + family-fallback null).
- usage.timings across all three response shapes.
Confirmed the regression test catches the underlying bug: reverting dflash/src/server/sse_emitter.cpp to pre-fix state makes test_emitter_first_content_index_qwen36_streaming_thinking fail with em.first_content_token_index() > 0.

Out of scope / follow-ups

31b backend wiring. share/model_cards/gemma-4-31b-it.json ships but the dense-30.7B backend path is not yet integrated; the sidecar is in place for when it lands.
gemma4 MoE expert-split. gemma-4-26b-a4b-it runs end-to-end at this PR's level of integration, but the MoE expert-split optimization belongs to howard0su's PR #262 (Split MoE weights between CPU & CUDA, support qwen35moe models).
MTP (Multi-Token Prediction). Tracked by upstream PR #23398; orthogonal to the thinking-budget / sidecar work here.

cubic-dev-ai

2 issues found across 9 files


`maybe_force_close`'s `remaining = n_gen - committed_now` mixed two coordinate systems: `committed_now` is the absolute KV position (= prompt_len + tokens generated this AR call), while `n_gen` is gen-only. The subtraction treated prompt-length tokens as if they were generated output, firing `--hard-limit-reply-budget` `prompt_len` tokens early on every prompted request. For the spec-decode → AR tail-off path, `n_gen` is remapped to remaining-budget, so the subtraction could go negative and force-close immediately as AR took over — defeating the hard-limit reply budget. Fix: capture `committed_at_entry = committed` at entry, compute `generated = committed_now - committed_at_entry`, then `remaining = n_gen - generated`. Both call sites (post-prefill AR fallback at line 919 and spec-decode tail-off at line 985) now have correct budget arithmetic in the "generated since entry" frame. Caught by codex review on PR Luce-Org#269.
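A minimal sketch of the corrected arithmetic, assuming plain `int` counters. The helper names `remaining_budget` and `should_force_close` are hypothetical; only the variable names and the formula `remaining = n_gen - (committed_now - committed_at_entry)` come from the commit message.

```cpp
#include <cassert>

// n_gen:              gen-only token budget for this decode call
// committed_at_entry: absolute KV position captured when the call starts
// committed_now:      current absolute KV position
inline int remaining_budget(int n_gen, int committed_at_entry, int committed_now) {
    assert(committed_now >= committed_at_entry);
    const int generated = committed_now - committed_at_entry;  // "generated since entry" frame
    return n_gen - generated;
}

// Force-close condition from the PR description: remaining <= hard_limit_reply_budget.
inline bool should_force_close(int n_gen, int committed_at_entry, int committed_now,
                               int hard_limit_reply_budget) {
    return remaining_budget(n_gen, committed_at_entry, committed_now)
           <= hard_limit_reply_budget;
}
```

With a 500-token prompt, `n_gen = 1000`, and a hard limit of 256, the old expression `n_gen - committed_now` drops to 200 after only 300 generated tokens (committed_now = 800) and would fire. In the entry-relative frame the same state leaves 700 tokens, and the close fires only once 744 tokens have been generated.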

…tracked split `finish_details.thinking_tokens` and `usage.completion_tokens_details.reasoning_tokens` used `phase1_tokens.size()` directly. When the model naturally emits `</think>` mid-stream and continues writing visible content, all of those post-close tokens were still being counted as thinking, and `content_tokens` was reporting only the phase-2 output — so the split mis-matched the `reasoning_content` / `content` strings the same response carried. Fix: expose `SseEmitter::mode()` and have the non-streaming `feed_tokens` lambda increment per-mode counters based on the emitter's mode BEFORE each `emit_token` call. Tokens that trigger the `</think>` transition are attributed to REASONING (they carry the close tag); the next token is the first CONTENT. Use those counters for both `finish_details.thinking_tokens`/`content_tokens` and `usage.completion_tokens_details.reasoning_tokens`. Caught by codex review on PR Luce-Org#269.
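The counting rule can be sketched as follows: read the mode before each emit, so the token that triggers the transition is still counted as reasoning. `ToyEmitter` is a stand-in that flips on a single close-token id; the real `SseEmitter` works on text and is not reproduced here, and `count_split` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Mode { Reasoning, Content };

// Stand-in emitter: switches to Content after it has seen the close token.
struct ToyEmitter {
    int close_token;
    Mode current = Mode::Reasoning;

    Mode mode() const { return current; }
    void emit_token(int tok) {
        if (current == Mode::Reasoning && tok == close_token) current = Mode::Content;
    }
};

struct SplitCounts {
    std::size_t thinking = 0;
    std::size_t content = 0;
};

inline SplitCounts count_split(const std::vector<int>& tokens, int close_token) {
    ToyEmitter em{close_token};
    SplitCounts counts;
    for (int tok : tokens) {
        // Sample the mode BEFORE emitting: the close token itself is attributed
        // to reasoning, and the token after it is the first content token.
        if (em.mode() == Mode::Reasoning) ++counts.thinking;
        else                              ++counts.content;
        em.emit_token(tok);
    }
    return counts;
}
```

For the sequence `{1, 2, 99, 3, 4}` with close token `99`, this attributes three tokens to thinking and two to content, whereas a count based on the phase-1 buffer size alone would not see the mid-stream close.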

`stats()` and `full_stats()` read `lifetime_hits_`, `full_lifetime_hits_`, and `full_disk_bytes_` from a client thread while the daemon thread mutates them via `lookup()` / `lookup_full()`. In C++ that's a data-race / UB, not just a tearable value (codex review). Fix: declare the three counters `std::atomic<int64_t>`, use `fetch_add(1, memory_order_relaxed)` on writes and `.load(relaxed)` on reads. Relaxed ordering is sufficient — no synchronization with other state is required. `entries_.size()` / `full_entries_.size()` are still read without a lock; that's a single integer load on all libstdc++/libc++ targets and matches the Python impl's tear-tolerant introspection contract. Acceptable for an ops dashboard; documented in the spec header comment alongside `stats()`/`full_stats()`. Caught by codex review on PR Luce-Org#269.
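A minimal sketch of the counter pattern described above, using only standard `<atomic>` calls. The class name `HitCounters` and its methods are hypothetical; the member name `lifetime_hits_` is borrowed from the commit message.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

class HitCounters {
public:
    // Writer side (the daemon thread in the description above).
    void record_hit() {
        lifetime_hits_.fetch_add(1, std::memory_order_relaxed);
    }

    // Reader side (a stats call from a client thread).
    std::int64_t hits() const {
        return lifetime_hits_.load(std::memory_order_relaxed);
    }

private:
    std::atomic<std::int64_t> lifetime_hits_{0};
};
```

Relaxed ordering makes each increment and read atomic but orders nothing else around them, which fits a counter that is only reported and never used to publish other data.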

`/v1/messages` and `/v1/messages/count_tokens` carried near-identical copies of Anthropic's `system`-field normalization — both must accept string or typed-block-array shapes, strip Claude Code's `x-anthropic-billing-header:` blocks, then prepend the remainder as a `{role:"system", content:...}` message. Duplicate logic risked drift between generation and token counting. Extract `normalize_anthropic_system(body, messages)` and replace both call sites. Net change: -13 lines, one canonical implementation. Caught by cubic review on PR Luce-Org#269.

…ffort + per-request controls

Folds in the codex re-review fixes and new design surface from PR Luce-Org#269 (pr/server tip cb33fb2):

- share/model_cards/qwen3.6-27b.json sidecar (32768 / 81920 from the HF card) + README. Server reads at startup via GGUF general.name lookup.
- dflash/src/server/model_card.{h,cpp}: sidecar parser, per-family fallback table (Qwen3.5/3.6, Gemma4, Laguna), hard fallback to antirez/ds4 ds4_eval.c reference values, effort-tier formulas.
- ServerConfig: EffortTiers, SamplingDefaults, model_card_source_label. CLI flags --reasoning-effort-{low,medium,high,x-high,max} override per-tier values; banner reports source per knob.
- Request parser: thinking.budget_tokens + thinking.reply_budget (clamped to server ceiling, never looser); 5-tier reasoning.effort with x-high/max as dflash extensions of OpenAI's 3-tier vocab; unknown effort tier falls back to high. §4.3 combined precedence: budget_tokens wins, effort still selects defaults.
- GenerateResult::budget_forced_close flag: backends signal when L2 injected </think>; HTTP server uses it to attribute close_kind="hard" (was previously a text-grep against decoded phase1 which couldn't distinguish injected from natural close).
- L1 phase-2 reprompt: emit only </think> through the REASONING→CONTENT transition (the prior "\n\nFinal answer: " scaffolding leak prefixed every hard-closed response).
- Gemma4 tail-off: seed out_tokens with last_tok when empty before the iter-0 do_decode call (eliminates UB on small-budget requests).
- --max-tokens legacy CLI flag now wired as documented alias for --default-max-tokens (was parsed to a dead field).
- Spec rewritten to ~500 lines covering: model-card resolution order, sidecar JSON shape with explicit reasoning_effort_tiers override, 5-tier effort vocab, per-request clamp rules, why client controls are bounded rather than full overrides.

CMakeLists.txt adds model_card.cpp to dflash_server + test_server_unit targets.
test_server_unit 1426/1426 PASS on pr/server build.
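The request-parser rules above (unknown effort tier falls back to high; client budgets are clamped to the server ceiling and never looser) might look roughly like this. The enum, both function names, the exact tier spellings (taken from the CLI flag suffixes), and the floor at zero are assumptions for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

enum class Effort { Low, Medium, High, XHigh, Max };

// Map a tier string to an enum value; anything unrecognised becomes High.
inline Effort parse_effort(const std::string& s) {
    if (s == "low")    return Effort::Low;
    if (s == "medium") return Effort::Medium;
    if (s == "high")   return Effort::High;
    if (s == "x-high") return Effort::XHigh;
    if (s == "max")    return Effort::Max;
    return Effort::High;  // unknown tier falls back to high
}

// A client may tighten the budget but never exceed the server ceiling.
// Flooring at zero is an assumption of this sketch.
inline int clamp_budget(int requested, int server_ceiling) {
    assert(server_ceiling >= 0);
    return std::max(0, std::min(requested, server_ceiling));
}
```

Bounding rather than overriding means a request for 100000 thinking tokens against a ceiling of 32768 is served with 32768, while a request for 1024 is honoured as given.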

…e] log

Snapshot/bench tooling can now read the full effective runtime config off /props.runtime in one shot. Three new fields under runtime:

- chunk: prefill chunk size (bargs.chunk)
- target_device: resolved target-model placement ("auto:0", "cuda:0")
- draft_device: resolved draft-model placement, null when no draft

Why: of 98 historical bench snapshots, only 1 had populated server_info (post-c35a8a4). We just discovered that downstream ops have been forcing --cache-type-k/v q4_0 since 2026-05-23 while the binary auto-default is tq3_0 (when max_ctx>6144 on CUDA). Without /props capturing the full runtime config, that 2-day divergence was invisible to forensics. Adding chunk/target_device/draft_device closes the rest of the introspection gap.

Companion: qwen35 do_ar_decode now emits an `[ar-decode] tokens=N time=Ts speed=X tok/s` line to stderr per request, for parity with the existing `[spec-decode]` summary. AR fires when sampling needs logit processing (temp != 0), which the existing log path didn't cover.

Spec updates:

- docs/specs/openapi-props.yaml: Runtime schema extended; example payload updated.
- docs/specs/props-endpoint.md: new §4.16 documenting the full runtime block (the six pre-existing + three new).

Test: test_props_runtime_shape pins the full field set + null-draft case. 1497 assertions pass.

This is the PR-side port of the same patch on integration tip 3ed5062. The integration tip also carries data snapshots (perf-sweep) and operational docs (KV q4 vs tq3 A/B design, run-request addenda) which are not in scope for PR Luce-Org#269.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…-Org#269 PR Luce-Org#269 now carries chunk + target_device + draft_device fields in /props.runtime, the [ar-decode] log emission, openapi/props-endpoint spec updates, and test_props_runtime_shape. Same surface as the integration commit 3ed5062. This merge keeps the two branches aligned on the shared code/spec content. Integration retains its data snapshots + operational docs that are out of PR Luce-Org#269's scope. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…1_cap fix

Same surface as integration tip 069c412, code-only port: drops the operational snapshot + run-request docs that don't belong on PR Luce-Org#269. Three layers, all engine-owned:

Phase A: hard_limit_reply_budget made per-model

- share/model_cards/_schema.json: optional `hard_limit_reply_budget`
- share/model_cards/qwen3.6-27b.json: declared at 4096 (verbose-post-close)
- dflash/src/server/model_card.cpp: read field from sidecar, family fallback for qwen35/36/3 → 4096
- docs/specs/{thinking-budget.md,model-cards.md}: documented

BUG FIX in http_server.cpp: phase1_cap = min(think_max + reply_budget, max_output). Without the `+ reply_budget` term, force-close fires at the FIRST AR iter when hard_limit equals think_max. Latent bug exposed by the hard_limit bump.

Phase B: soft-limit negotiated close (spec §5.3)

- BudgetHook gains soft_limit_remaining + soft_limit_close_rank
- ModelCard / ServerConfig / sidecar / schema: soft_limit_reply_budget (default 0 = disabled) + soft_limit_close_rank (default 8)
- qwen35_backend do_ar_decode: pre-hard-limit branch peeks top-K of logits; if `</think>` ranks in top-K AND remaining is in the (hard, soft] window, accept the close. Ports ds4_eval.c:3030's two-tier strategy.
- GenerateResult.budget_soft_close → close_kind="soft" in finish_details

Phase C: chat-template thinking_preamble (spec §3.3 / §5.4)

- Sidecar fields thinking_preamble (template with {think_max}/{reply_max} substitution) + thinking_preamble_format (comment|directive|none)
- Qwen3.6 sidecar ships an HTML-style comment preamble. Qwen3.6 has no trained native reasoning-effort signal, so we communicate budget via a comment in the reasoning region. Model self-plans.
- chat_template.cpp render_chat_template: injects preamble after the opening `<think>` when enable_thinking=true. Caller substitutes placeholders with EFFECTIVE per-request values (after §4.4 clamping), not server ceilings.

Single-case demonstration (recNu3MXkvWUzHZr9 GPQA Diamond, correct=B):

- BEFORE: given=C ✗ thk=3585 wall=244s (reply clipped by force-close)
- AFTER-A: given=B ✓ thk=4097 wall=323s (the fix)
- AFTER-A+B+C: given=B ✓ thk=0 wall=48s (model self-closed, 6.7× faster)

Bench/snapshot tooling (post c35a8a4 + 3ed5062) wholesale-captures /props.budget_envelope so the per-config knobs land in result.json without further bench edits.

Test surface: test_server_unit 1497 assertions / 0 failures across all phases (Phase A, A+B, A+B+C). No props_schema bump needed: all new fields are additive under budget_envelope + model_card.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
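The Phase A cap and the Phase B two-tier close can be sketched together. Everything here is a simplified illustration: the function names and `CloseKind` are hypothetical, rank is computed by a linear scan over raw logits, and only the first token of the close sequence is considered, which is an assumption since the real close sequence may span several tokens.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Phase A: leave room for the reply on top of the thinking allowance.
inline int phase1_cap(int think_max, int reply_budget, int max_output) {
    return std::min(think_max + reply_budget, max_output);
}

// 0-based rank of a token: how many logits are strictly larger than its own.
inline int token_rank(const std::vector<float>& logits, std::size_t token_id) {
    assert(token_id < logits.size());
    int rank = 0;
    for (float v : logits) {
        if (v > logits[token_id]) ++rank;
    }
    return rank;
}

enum class CloseKind { None, Soft, Hard };

// Hard close at or below the hard limit; soft close only inside the
// (hard, soft] window and only when the close token is already in the top K.
inline CloseKind decide_close(int remaining, int hard_limit, int soft_limit,
                              int close_rank_k, const std::vector<float>& logits,
                              std::size_t close_token_id) {
    if (remaining <= hard_limit) return CloseKind::Hard;
    const bool soft_enabled = soft_limit > 0;  // 0 = disabled, per the defaults above
    if (soft_enabled && remaining <= soft_limit &&
        token_rank(logits, close_token_id) < close_rank_k) {
        return CloseKind::Soft;
    }
    return CloseKind::None;
}
```

The soft tier only accepts a close the model was already likely to emit, so it shortens thinking without forcing a token the model ranked poorly. The hard tier remains the unconditional backstop.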

…e-Org#269

… (PR Luce-Org#269 candidate)

The GEMMA4 case in chat_template.cpp didn't reference thinking at all, so:

- With enable_thinking=false: model self-emits its own <|channel>thought\n...<channel|> sequence and the rendered text (e.g. the literal "thought" token + newlines) leaks into visible content. Reproduced on bragi qwen via plain prompts like "What is 2+2?" returning '\n4//thought\n\n4\nthought\n\n4'.
- With enable_thinking=true: nothing in the prompt signals the model that it should route through the channel-thought block, so reasoning_content stays empty and the model never enters proper thinking mode.

Fix per the chat template embedded in the Gemma4 GGUF metadata (google/gemma-4-26B-A4B-it):

1. enable_thinking=true → emit `<|think|>\n` at top of system turn
2. add_generation_prompt + !enable_thinking → append `<|channel>thought\n<channel|>` after `<|turn>model\n` so the model SKIPS its own thought channel (the Qwen3 `<think></think>` guard analog)
3. system turn now wraps system content properly instead of prepending it inline to the first user message (which broke when the first message wasn't user)

cherry-pick: PR Luce-Org#269 (thinking-budget v2 series), strictly a thinking-routing fix, no decode-path or model-arch change.

Repro: bragi /v1/chat/completions with gemma-4-26b-A4B-it Q4_K_M; post-fix the same prompts should return clean content without any "thought" substring leaking.

Run-request: dflash/docs/run-requests/bragi-gemma4-laguna-config-issues.md (item #1 and #2; #3 the smoke-mc ggml_mul_mat crash is a separate fix targeted at an independent PR).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
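Items 1 and 2 of the fix can be sketched as two small string builders. The marker strings are the ones quoted in the commit message; the function names are hypothetical, and the rest of the Gemma4 template (turn wrappers, system content) is deliberately left out because it is not described here.

```cpp
#include <cassert>
#include <string>

// Item 1: a thinking marker at the top of the system turn when thinking is on.
inline std::string gemma4_system_turn_prefix(bool enable_thinking) {
    return enable_thinking ? "<|think|>\n" : "";
}

// Item 2: when thinking is off, pre-fill an empty thought channel after the
// model-turn opener so the model skips emitting its own.
inline std::string gemma4_generation_suffix(bool add_generation_prompt,
                                            bool enable_thinking) {
    if (!add_generation_prompt) return "";
    std::string out = "<|turn>model\n";
    if (!enable_thinking) out += "<|channel>thought\n<channel|>";
    return out;
}
```

Pre-filling an already-closed thought block plays the same role as the Qwen3 `<think></think>` guard mentioned above: the model sees its reasoning region as finished and goes straight to visible content.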

After landing the chat-template fix (f1d30f2) and rebuilding, the //thought leakage is mostly gone (Hello-world, Continue sequence, real prompts all return clean content). But a separate hard crash surfaces:

GGML_ASSERT(ggml_can_mul_mat(a, b)) failed at ggml.c:3243 → container exits with 139 immediately

Reproduces deterministically with a single 169-token prompt (the first HumanEval case). Trivial prompts (<20 tokens) work fine. The crash is INDEPENDENT of the chat-template work: it happens with no system message either.

This is a tensor-shape mismatch in the gemma4 forward path (Gemma4Backend / gemma4_decode.cu), needs sindri's eye on which mul_mat call is asserting. This belongs in an independent PR, not the Luce-Org#269 thinking-budget series.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Update gemma4 RR to reflect that the chat-template fix (f1d30f2) is landed in PR Luce-Org#269 and the remaining ggml_can_mul_mat crash is being investigated by erik/Claude — will ship as an encapsulated independent PR. Add bisection notes: structure-specific (HumanEval prompt at 169 tokens crashes; 466 tokens of plain repeated text runs clean), single user message reproduces, 24 candidate mul_mat call sites in gemma4_graph.cpp. Stripped binary → next step is debug-symbol build or per-mul-mat name-tagging. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…l-card sidecars

Adds per-request thinking-budget controls, multi-dialect reasoning emission, and JSON model-card sidecars to the native C++ dflash server.

## Thinking-budget mechanism

- Level 2 BudgetHook: backend AR / spec-decode injects the model's `</think>` close-token sequence once when `n_gen − committed ≤ hard_limit_reply_budget`. KV-continuous, mid-stream, in-process; applies uniformly to streaming and non-streaming requests.
- Degenerate-decode watchdog: detects post-close repetition / runaway decode and aborts cleanly, surfacing the flag in finish_details and bench output.
- Sidecar `thinking_terminator_hint` field: per-family close-token text (e.g. `</think>\n\n` for qwen3, `<channel|>\n\n` for gemma4) resolved at startup from the model card; `think_close_token_ids` is populated by tokenizing the hint, so the BudgetHook is family-aware without hardcoded archs.
- Resolution order documented in spec §3: CLI > sidecar > family fallback > hard fallback. Hard fallback `hard_limit_reply_budget` defaults to 4096 (raised from 512); terse models should override down via sidecar.

## Multi-dialect reasoning emission

- SSE emitter splits reasoning ↔ content for OpenAI Chat (`reasoning_content` delta), Anthropic (thinking/text content blocks with separate lifecycle), and Responses API (reasoning stripped per Codex r1 P2).
- Qwen3.6 `<think>` / `</think>` special token ids are forwarded as text into the emitter; gemma4 `<|channel>` / `<channel|>` are mapped onto the same channel so all archs share one state machine.
- `first_content_token_index()` derives the natural-close split from the REASONING→CONTENT transition; leading `<think>` opener is detected before fci capture so thinking_tokens accounts correctly for the Qwen3.6 streamed-thinking path.

## /props endpoint

- Wholesale model_card (verbatim sidecar JSON, validates against share/model_cards/_schema.json) + budget_envelope (effective think_max_tokens, default_max_tokens, hard_limit_reply_budget, effort_tiers, model_card_source label) + runtime fields (chunk, target_device, draft_device, speculative_enabled, fa_window, ddtree_budget, kv_cache_k/v, runtime_backend). Captured by server_main at startup so the handler doesn't crack BackendArgs.

## Model-card sidecars

- share/model_cards/{qwen3.6-27b,gemma-4-26b-a4b-it,gemma-4-31b-it,laguna-xs.2}.json: each ships max_tokens, complex_problem_max_tokens, hard_limit_reply_budget, thinking_terminator_hint, sampling defaults, and reasoning_effort_tiers as applicable.
- share/model_cards/_schema.json: JSON Schema for sidecar validation, exercised by server_main loader and shipped for third-party authors.
- model_card.cpp resolver: keyword + family-fallback lookup, with startup banner reporting the resolved `model_card_source` so operators can confirm which envelope is in force.

## Spec docs

- docs/specs/thinking-budget.md: mechanism, resolution order, CLI surface, finish_details/close_kind contract.
- docs/specs/model-cards.md: sidecar field reference, family fallback table, ship/override guidance.
- docs/specs/props-endpoint.md + docs/specs/openapi-props.yaml: shape and OpenAPI 3 schema for the /props payload.

## Tests

- test_server_unit gains coverage of the SSE emitter reasoning split (OpenAI / Anthropic / Responses), first_content_token_index across the natural-close, never-closed, content-only, and Qwen3.6 streamed-thinking paths, /props body shape (wholesale sidecar + family-fallback null), and usage.timings across all three response shapes. 1542 assertions, 0 failures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Resolves merge conflict in dflash/src/qwen35/qwen35_backend.cpp at the AR fallback call site. PR Luce-Org#269 added a BudgetHook + forced_close/degenerate output params to do_ar_decode(); upstream main introduced a run_ar_decode_path() wrapper. Both symbols coexist; the conflict was only at the call site inside generate_dflash(). Kept PR Luce-Org#269's direct do_ar_decode() invocation so the thinking-budget force-close still fires when spec-decode is unavailable; the upstream wrapper would have lost the hook.

Tested on lucebox2 (RTX 3090, CUDA 12.6, sm_86):

- cmake build clean for all PR Luce-Org#269 + PR Luce-Org#262 targets
- dflash_server end-to-end on Qwen3.6-27B-Q4_K_M with reasoning.effort=low: BudgetHook fired at committed=770/1000 remaining=256 == hard_limit=256 finish_details.close_kind=hard, degenerate watchdog triggered cleanly

Co-Authored-By: WOZCODE <contact@withwoz.com>

Brings in 52 upstream commits since merge-base 8c23234 (2 days ago). The headline is PR Luce-Org#269 (`403e598 feat(cpp-server): thinking-budget v2 + multi-dialect reasoning + model-card sidecars`), the squashed version of our own thinking-budget v2 work that we'd been carrying as 50+ small commits on this branch. Plus:

- PR Luce-Org#262 howard0su/powerinfer: hybrid MoE spec-decode with DFlash draft, GPU/CPU FFN overlap, persistent pre-FFN graph for DeltaNet, 4-5x decode speedup, MoE perf telemetry
- PR Luce-Org#263 weicj/feat-cpp-server-pflash-draft-placement: mixed-backend PFlash phase split
- 648a6e2 perf: GPU-resident hybrid decode (eliminate PCIe round-trips)
- d12ddde fix: dynamic placement uses --max-ctx instead of hardcoded 8192

Conflict resolution: where this branch has post-Luce-Org#269 refinements (gemma4 timings via 4e9abda, degenerate-decode watchdog via c2d725f / 8538ff9, laguna chat-template fix via 92f84cd, transition cue via 16bb31e, thinking-budget force-close fix via b86342d, etc.), keep ours. Where the only divergence is "this branch has 50 small commits that 403e598 squashes," accept the merged result.

# Conflicts:
# dflash/scripts/server.py
# dflash/src/qwen35/qwen35_backend.cpp
# dflash/src/qwen35/qwen35_backend.h
# dflash/src/server/http_server.h
# dflash/src/server/sse_emitter.cpp
# dflash/test/test_server_unit.cpp
# share/model_cards/laguna-xs.2.json

server_main.cpp had two identical 119-line blocks of the thinking-budget v2 model-card resolution code (general.name / general.architecture read, resolve_model_card(), ServerConfig application, tier clamping). g++ errored out on redeclarations:

- redeclaration of 'std::string general_name'
- redeclaration of 'std::string general_arch'
- redeclaration of 'dflash::common::ModelCard card'
- redeclaration of 'const int tier_ceiling'
- conflicting declaration 'auto clamp_tier'

The duplicate was a merge artifact from 1df9099 (luce-org/main into integration/-clean post-rename). Upstream PR Luce-Org#269 squashed our pre-PR work into 403e598 and our integration branch carried the same code in unsquashed form; the 3-way merge kept both copies.

Also drop a redundant `#include "gguf.h"` (lines 23 and 25 of the same file). Harmless thanks to include guards but ugly merge residue.

Build proceeds past cmake configure with the dedupe.

…nch in-tree

Collapses 134 commits of `integration/props-uv-squared-clean` onto current main as one reviewable change. Most of the underlying server-side work already landed via separate PRs: thinking-budget v2 + multi-dialect reasoning + sidecars (Luce-Org#269), dflash→server rename + optimizations/ grouping (Luce-Org#281), qwen35moe hybrid CPU/CUDA expert split (Luce-Org#262), and a stream of smaller fixes from bragi over the last week. What remained in integration is everything *above* the server: the host-side runner, the container image, the benchmark/profile evidence pipeline, the harness for driving real clients, and the luce-bench framework itself.

## What changed

### Docker + host wrapper

- Dockerfile (CUDA 12.8 base; copies server/, lucebox/, harness/, luce-bench/ into one image; wires `python -m lucebench.cli` as the `benchmark` entrypoint subcommand).
- `lucebox.sh` (~470 lines of host bash, zero deps beyond docker + nvidia-smi): `check`, `configure`, `pull`, `download-models`, `serve`, `install`/`start`/`status`/`logs` (user-systemd), `print-run`, `benchmark`, `profile`.
- `.github/workflows/docker.yml` builds + pushes `ghcr.io/luce-org/lucebox-hub` tags (`:cuda12`, `:vX.Y.Z-cuda12`, `:X.Y-cuda12`, `:sha-<short>-cuda12`).
- `server/scripts/entrypoint.sh` resolves draft GGUF by target architecture (gemma4 → gemma drafter, qwen3.6 → dflash-draft-3.6); warns when multiple targets are present in models/.

### lucebox Python package (in-container CLI)

- `lucebox/` workspace member: `cli.py`, `autotune.py` (VRAM-tiered tier selection + per-host config writeback), `config.py` (typed TOML), `download.py`, `docker_run.py`, `host_check.py`, `host_facts.py`, `profile.py` (profile sweep across DFLASH_MAX_CTX × DFLASH_BUDGET, KV cache types, pFlash modes, lazy-draft, prefix-cache slots), `smoke.py`, `types.py`.
- `lucebox/tests/` for the typed surfaces.
- Level1/Level2/Level3 profile gates; sweep results merged back into `~/.lucebox/config.toml` only after capability + ds4-eval/agentic-tools/agentic-session validation gates pass.

### luce-bench in monorepo

- `luce-bench/` workspace member at v0.2.4: the standalone bench framework (areas: ds4-eval, code, longctx, agent, forge; sweep + per-host snapshot output; v0.2.4 includes the forge area's EvalConfig + run_scenario signature realignment).
- `[tool.uv.sources] luce-bench = { workspace = true }` replaces the prior git tag pin.
- `.github/workflows/release-luce-bench.yml` publishes to PyPI from the monorepo on `luce-bench-v*` tags (trusted publisher, `pypi` environment).

### harness workspace

- `harness/` workspace member: client adapters (`claude_code`, `codex`, `opencode`, `hermes`, `pi`, `openclaw`), `client_test_runner.py`, `benchmarks/run_lucebox_vs_llamacpp.sh`, prompts. `lucebox profile` delegates the actual bench runs to harness.

### Bench + profile evidence

- `server/docs/BENCHMARK_SNAPSHOT_SPEC.md`: schema for tuning/profile artifacts. Snapshots themselves live in the standalone `luce-bench-baselines` repo (out of this tree).

### Misc

- Updated CI workflow path filters for `server/` (post-rename).
- README's "Quick start" section, hardware coverage table, env var reference table; minor edits to optimizations READMEs.
- model card sidecar updates landed alongside Luce-Org#269 but kept here at current values (qwen3.6, gemma-4-26b-a4b, gemma-4-31b, laguna-xs.2, `_schema.json`).

## Out of scope / follow-ups

- 31b backend wiring beyond what `share/model_cards/gemma-4-31b-it.json` shipped (working empirically @ 24GB on sindri AR-only; 26b spec-decode path already proven).
- gemma4 MoE expert split (howard0su's PR Luce-Org#262 territory; merged but not applied to gemma4 yet).
- Multi-Token Prediction (upstream PR #23398, draft).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cubic-dev-ai Bot reviewed May 24, 2026

Comment thread dflash/src/server/http_server.cpp

Comment thread docs/specs/thinking-budget.md Outdated

easel force-pushed the pr/server branch 5 times, most recently from 63bd7a8 to 9b8fbb8 Compare May 24, 2026 00:57

easel force-pushed the pr/server branch from 9b8fbb8 to 0b9ad53 Compare May 24, 2026 01:22

easel force-pushed the pr/server branch 6 times, most recently from 952d04c to cb33fb2 Compare May 24, 2026 02:57

easel force-pushed the pr/server branch 12 times, most recently from 269f4ef to 109cebd Compare May 24, 2026 16:18

easel added a commit to easel/lucebox-hub that referenced this pull request May 25, 2026

Merge pr/server (e245b6b) — budget-signaling overhaul lands on PR Luc…

c9ecdf4

…e-Org#269

easel force-pushed the pr/server branch from 6660749 to 97ec0d3 Compare May 25, 2026 22:31

easel force-pushed the pr/server branch from 97ec0d3 to 403e598 Compare May 26, 2026 16:08

davide221 mentioned this pull request May 26, 2026

dflash: integrate BudgetHook with spec-decode (PR #269 follow-up) #279

Open

davide221 merged commit 8b139ce into Luce-Org:main May 26, 2026
3 checks passed

This was referenced May 27, 2026

docker, cli, smoke, bench and autotune first-run dx #226

Closed

feat(lucebox): docker stack + CLI + bench/profile + harness + luce-bench in-tree #285

Closed

davide221 merged 2 commits into Luce-Org:main from easel:pr/server