feat(cpp-server): thinking-budget v2 + multi-dialect reasoning aliases + spec#269

Merged
davide221 merged 2 commits into Luce-Org:main from easel:pr/server
May 26, 2026

@easel easel commented May 23, 2026

Collaborator

Adds per-request thinking-budget controls, multi-dialect reasoning emission, and JSON model-card sidecars to the native C++ dflash server.

What changed

  • Thinking-budget mechanism (Level 2 BudgetHook + watchdog). Backend AR / spec-decode injects the model's </think> close-token sequence once when n_gen − committed ≤ hard_limit_reply_budget — KV-continuous, mid-stream, in-process, uniform across streaming and non-streaming. Degenerate-decode watchdog detects post-close runaway and aborts cleanly, surfaced as a flag in finish_details and bench output. Sidecar thinking_terminator_hint ↔ tokenizer resolves think_close_token_ids at startup so the hook is family-aware without hardcoded archs. Resolution order: CLI > sidecar > family fallback > hard fallback (default hard_limit_reply_budget = 4096; terse models override down).
  • Multi-dialect reasoning emission. SSE emitter splits reasoning ↔ content for OpenAI Chat (reasoning_content delta), Anthropic (separate thinking / text content-block lifecycle), and Responses API (reasoning stripped per Codex r1 P2). Qwen3.6 <think> / </think> special-token ids are forwarded as text into the emitter; gemma4 <|channel> / <channel|> are mapped onto the same channel so all archs share one state machine. first_content_token_index() derives the natural-close split from the REASONING→CONTENT transition; leading <think> opener is detected up-front so thinking_tokens accounts correctly for the Qwen3.6 streamed-thinking path.
  • /props endpoint. Wholesale model_card (verbatim sidecar JSON, validates against share/model_cards/_schema.json) + budget_envelope (effective think_max_tokens, default_max_tokens, hard_limit_reply_budget, effort_tiers, model_card_source label) + runtime fields (chunk, target_device, draft_device, speculative_enabled, fa_window, ddtree_budget, kv_cache_k/v, runtime_backend). Captured by server_main at startup so the handler doesn't crack BackendArgs.
  • Model-card sidecars. share/model_cards/{qwen3.6-27b,gemma-4-26b-a4b-it,gemma-4-31b-it,laguna-xs.2}.json each ship max_tokens, complex_problem_max_tokens, hard_limit_reply_budget, thinking_terminator_hint, sampling defaults, and reasoning_effort_tiers as applicable. share/model_cards/_schema.json provides the JSON Schema for validation. model_card.cpp resolver does keyword + family-fallback lookup, with the startup banner reporting the resolved model_card_source so operators can confirm which envelope is in force.
  • Spec docs. docs/specs/thinking-budget.md (mechanism, resolution order, CLI surface, finish_details / close_kind contract), docs/specs/model-cards.md (sidecar field reference, family fallback table, ship/override guidance), docs/specs/props-endpoint.md + docs/specs/openapi-props.yaml (shape + OpenAPI 3 schema for the /props payload).
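
The resolution order above (CLI > sidecar > family fallback > hard fallback) can be sketched as a chain of optionals. This is a minimal illustration, not the code in model_card.cpp: `ResolvedBudget`, `resolve_hard_limit_reply_budget`, and the source labels are hypothetical names, and only the 4096 hard fallback is taken from the description.

```cpp
#include <optional>
#include <string>

// Hypothetical result type: the chosen value plus a label naming its source.
struct ResolvedBudget {
    int value;
    std::string source;
};

// Hard fallback stated in the PR description (hard_limit_reply_budget = 4096).
constexpr int kHardFallbackReplyBudget = 4096;

// Walk the precedence chain CLI > sidecar > family fallback > hard fallback.
// Each argument is empty when that layer did not supply a value.
inline ResolvedBudget resolve_hard_limit_reply_budget(
    std::optional<int> cli,
    std::optional<int> sidecar,
    std::optional<int> family_fallback) {
    if (cli) return {*cli, "cli"};
    if (sidecar) return {*sidecar, "sidecar"};
    if (family_fallback) return {*family_fallback, "family"};
    return {kHardFallbackReplyBudget, "hard-fallback"};
}
```

Returning the source label next to the value mirrors the idea of reporting model_card_source in the startup banner, so an operator can see which layer won.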

Validation

  • cmake --build dflash/build -j completes cleanly (warnings only on pre-existing -Wunused-result fread calls in unrelated test code).
  • ctest -V -R server_unit → 1542 assertions, 0 failures, including:
    • SSE emitter reasoning split across OpenAI Chat / Anthropic / Responses.
    • first_content_token_index across natural-close, never-closed, content-only, and Qwen3.6 streamed-thinking (new regression test added in this PR — leading <think> opener no longer captures fci=0).
    • /props body shape (wholesale sidecar + family-fallback null).
    • usage.timings across all three response shapes.
  • Confirmed the regression test catches the underlying bug: reverting dflash/src/server/sse_emitter.cpp to its pre-fix state makes test_emitter_first_content_index_qwen36_streaming_thinking fail its em.first_content_token_index() > 0 assertion.

Out of scope / follow-ups

  • 31b backend wiring. share/model_cards/gemma-4-31b-it.json ships but the dense-30.7B backend path is not yet integrated; the sidecar is in place for when it lands.
  • gemma4 MoE expert-split. gemma-4-26b-a4b-it runs end-to-end at this PR's level of integration, but the MoE expert-split optimization belongs to howard0su's PR #262 (Split MoE weights between CPU & CUDA, support qwen35moe models).
  • MTP (Multi-Token Prediction). Tracked by upstream PR #23398; orthogonal to the thinking-budget / sidecar work here.

@cubic-dev-ai cubic-dev-ai Bot left a comment

Contributor

2 issues found across 9 files

Comment thread dflash/src/server/http_server.cpp
Comment thread docs/specs/thinking-budget.md Outdated
@easel easel force-pushed the pr/server branch 5 times, most recently from 63bd7a8 to 9b8fbb8 on May 24, 2026 00:57
easel added a commit to easel/lucebox-hub that referenced this pull request May 24, 2026
`maybe_force_close`'s `remaining = n_gen - committed_now` mixed two
coordinate systems: `committed_now` is the absolute KV position
(= prompt_len + tokens generated this AR call), while `n_gen` is
gen-only. The subtraction treated prompt-length tokens as if they
were generated output, firing `--hard-limit-reply-budget` `prompt_len`
tokens early on every prompted request. For the spec-decode → AR
tail-off path, `n_gen` is remapped to remaining-budget, so the
subtraction could go negative and force-close immediately as AR took
over — defeating the hard-limit reply budget.

Fix: capture `committed_at_entry = committed` at entry, compute
`generated = committed_now - committed_at_entry`, then
`remaining = n_gen - generated`. Both call sites (post-prefill AR
fallback at line 919 and spec-decode tail-off at line 985) now have
correct budget arithmetic in the "generated since entry" frame.

Caught by codex review on PR Luce-Org#269.
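
The corrected arithmetic from the commit above can be sketched as follows. The struct and function names are hypothetical; only `n_gen`, `committed_at_entry`, `committed_now`, `generated`, and `remaining` follow the commit message, and the `<=` trigger follows the PR description.

```cpp
#include <cassert>

// Hypothetical snapshot of the counters named in the commit message.
struct BudgetFrame {
    int n_gen;               // generation budget for this AR call (gen-only)
    int committed_at_entry;  // absolute KV position captured on entry
};

// Tokens left in the budget, measured in the "generated since entry" frame.
inline int remaining_budget(const BudgetFrame& f, int committed_now) {
    assert(committed_now >= f.committed_at_entry);
    const int generated = committed_now - f.committed_at_entry;
    return f.n_gen - generated;
}

// The hook fires once the remaining budget drops to the reply budget.
inline bool should_force_close(const BudgetFrame& f, int committed_now,
                               int hard_limit_reply_budget) {
    return remaining_budget(f, committed_now) <= hard_limit_reply_budget;
}
```

With a 500-token prompt and `n_gen = 1000`, the old `n_gen - committed_now` would already read 500 at entry, while this version reads 1000 and only reaches a 256-token reply budget after 744 generated tokens.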
easel added a commit to easel/lucebox-hub that referenced this pull request May 24, 2026
…tracked split

`finish_details.thinking_tokens` and
`usage.completion_tokens_details.reasoning_tokens` used
`phase1_tokens.size()` directly. When the model naturally emits
`</think>` mid-stream and continues writing visible content, all of
those post-close tokens were still being counted as thinking, and
`content_tokens` was reporting only the phase-2 output — so the split
mis-matched the `reasoning_content` / `content` strings the same
response carried.

Fix: expose `SseEmitter::mode()` and have the non-streaming
`feed_tokens` lambda increment per-mode counters based on the
emitter's mode BEFORE each `emit_token` call. Tokens that trigger
the `</think>` transition are attributed to REASONING (they carry
the close tag); the next token is the first CONTENT. Use those
counters for both `finish_details.thinking_tokens`/`content_tokens`
and `usage.completion_tokens_details.reasoning_tokens`.

Caught by codex review on PR Luce-Org#269.
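
A minimal sketch of the per-mode counting described above, assuming a single-string close tag and a toy emitter. `ToyEmitter`, `SplitCounts`, and `count_by_mode` are illustrative stand-ins, not the real `SseEmitter` or `feed_tokens` lambda.

```cpp
#include <string>
#include <vector>

enum class Mode { Reasoning, Content };

// Toy stand-in for SseEmitter: starts in Reasoning and flips to Content
// once it has consumed the close tag.
class ToyEmitter {
public:
    Mode mode() const { return mode_; }
    void emit_token(const std::string& tok) {
        if (mode_ == Mode::Reasoning && tok == "</think>") mode_ = Mode::Content;
    }

private:
    Mode mode_ = Mode::Reasoning;
};

struct SplitCounts {
    int thinking_tokens = 0;
    int content_tokens = 0;
};

// Attribute each token to the emitter's mode BEFORE it is emitted, so the
// close tag itself counts as reasoning and the next token is the first content.
inline SplitCounts count_by_mode(const std::vector<std::string>& tokens) {
    ToyEmitter em;
    SplitCounts c;
    for (const auto& tok : tokens) {
        if (em.mode() == Mode::Reasoning) {
            ++c.thinking_tokens;
        } else {
            ++c.content_tokens;
        }
        em.emit_token(tok);
    }
    return c;
}
```

Sampling the mode before the emit call is what keeps the counters aligned with the `reasoning_content` / `content` strings: sampling after it would attribute the close tag to content.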
easel added a commit to easel/lucebox-hub that referenced this pull request May 24, 2026
`stats()` and `full_stats()` read `lifetime_hits_`, `full_lifetime_hits_`,
and `full_disk_bytes_` from a client thread while the daemon thread
mutates them via `lookup()` / `lookup_full()`. In C++ that's a
data-race / UB, not just a tearable value (codex review).

Fix: declare the three counters `std::atomic<int64_t>`, use
`fetch_add(1, memory_order_relaxed)` on writes and `.load(relaxed)`
on reads. Relaxed ordering is sufficient — no synchronization with
other state is required.

`entries_.size()` / `full_entries_.size()` are still read without a
lock; that's a single integer load on all libstdc++/libc++ targets
and matches the Python impl's tear-tolerant introspection contract.
Acceptable for an ops dashboard; documented in the spec header
comment alongside `stats()`/`full_stats()`.

Caught by codex review on PR Luce-Org#269.
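
The counter change can be illustrated with a small class. Only the counter names `lifetime_hits_` and `full_disk_bytes_` come from the commit; the class and its method names are hypothetical.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical stats holder: writers run on the daemon thread, readers on a
// client thread, and no other state is published through these counters.
class CacheStats {
public:
    void record_hit() {
        lifetime_hits_.fetch_add(1, std::memory_order_relaxed);
    }
    void add_disk_bytes(std::int64_t n) {
        full_disk_bytes_.fetch_add(n, std::memory_order_relaxed);
    }
    std::int64_t lifetime_hits() const {
        return lifetime_hits_.load(std::memory_order_relaxed);
    }
    std::int64_t full_disk_bytes() const {
        return full_disk_bytes_.load(std::memory_order_relaxed);
    }

private:
    std::atomic<std::int64_t> lifetime_hits_{0};
    std::atomic<std::int64_t> full_disk_bytes_{0};
};
```

Relaxed ordering guarantees each counter is read and written atomically but imposes no ordering between the two counters, which is acceptable when each value is reported independently.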
easel added a commit to easel/lucebox-hub that referenced this pull request May 24, 2026
`/v1/messages` and `/v1/messages/count_tokens` carried near-identical
copies of Anthropic's `system`-field normalization — both must accept
string or typed-block-array shapes, strip Claude Code's
`x-anthropic-billing-header:` blocks, then prepend the remainder as a
`{role:"system", content:...}` message. Duplicate logic risked drift
between generation and token counting.

Extract `normalize_anthropic_system(body, messages)` and replace both
call sites. Net change: -13 lines, one canonical implementation.

Caught by cubic review on PR Luce-Org#269.
@easel easel force-pushed the pr/server branch 6 times, most recently from 952d04c to cb33fb2 on May 24, 2026 02:57
easel added a commit to easel/lucebox-hub that referenced this pull request May 24, 2026
…ffort + per-request controls

Folds in the codex re-review fixes and new design surface from PR Luce-Org#269
(pr/server tip cb33fb2):

- share/model_cards/qwen3.6-27b.json sidecar (32768 / 81920 from the HF
  card) + README. Server reads at startup via GGUF general.name lookup.
- dflash/src/server/model_card.{h,cpp}: sidecar parser, per-family
  fallback table (Qwen3.5/3.6, Gemma4, Laguna), hard fallback to
  antirez/ds4 ds4_eval.c reference values, effort-tier formulas.
- ServerConfig: EffortTiers, SamplingDefaults, model_card_source_label.
  CLI flags --reasoning-effort-{low,medium,high,x-high,max} override
  per-tier values; banner reports source per knob.
- Request parser: thinking.budget_tokens + thinking.reply_budget
  (clamped to server ceiling, never looser); 5-tier reasoning.effort
  with x-high/max as dflash extensions of OpenAI's 3-tier vocab;
  unknown effort tier falls back to high. §4.3 combined precedence:
  budget_tokens wins, effort still selects defaults.
- GenerateResult::budget_forced_close flag: backends signal when L2
  injected </think>; HTTP server uses it to attribute close_kind="hard"
  (was previously a text-grep against decoded phase1 which couldn't
  distinguish injected from natural close).
- L1 phase-2 reprompt: emit only </think> through the
  REASONING→CONTENT transition (the prior "\n\nFinal answer: "
  scaffolding leak prefixed every hard-closed response).
- Gemma4 tail-off: seed out_tokens with last_tok when empty before the
  iter-0 do_decode call (eliminates UB on small-budget requests).
- --max-tokens legacy CLI flag now wired as documented alias for
  --default-max-tokens (was parsed to a dead field).
- Spec rewritten to ~500 lines covering: model-card resolution order,
  sidecar JSON shape with explicit reasoning_effort_tiers override,
  5-tier effort vocab, per-request clamp rules, why client controls
  are bounded rather than full overrides.

CMakeLists.txt adds model_card.cpp to dflash_server + test_server_unit
targets. test_server_unit 1426/1426 PASS on pr/server build.
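
The per-request rules listed above (clamp to the server ceiling, unknown effort tier falls back to high, budget_tokens wins over effort) might look like the sketch below. Function names are hypothetical, the tier strings are assumed to match the CLI flag suffixes, and flooring negative requests at 0 is an assumption not stated in the commit.

```cpp
#include <algorithm>
#include <optional>
#include <string>

enum class Effort { Low, Medium, High, XHigh, Max };

// Unknown tier names fall back to High, as the commit describes.
inline Effort parse_effort(const std::string& s) {
    if (s == "low") return Effort::Low;
    if (s == "medium") return Effort::Medium;
    if (s == "high") return Effort::High;
    if (s == "x-high") return Effort::XHigh;
    if (s == "max") return Effort::Max;
    return Effort::High;
}

// A client value may tighten the server ceiling but never loosen it.
// Assumes server_ceiling >= 0; negative requests are floored at 0 (assumption).
inline int clamp_to_ceiling(int requested, int server_ceiling) {
    return std::clamp(requested, 0, server_ceiling);
}

// Combined precedence: an explicit budget_tokens wins; otherwise the default
// selected by the effort tier applies. Either way the result is clamped.
inline int effective_think_budget(std::optional<int> budget_tokens,
                                  int tier_default, int server_ceiling) {
    return clamp_to_ceiling(budget_tokens.value_or(tier_default), server_ceiling);
}
```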
@easel easel force-pushed the pr/server branch 12 times, most recently from 269f4ef to 109cebd on May 24, 2026 16:18
easel added a commit to easel/lucebox-hub that referenced this pull request May 25, 2026
…e] log

Snapshot/bench tooling can now read the full effective runtime config
off /props.runtime in one shot. Three new fields under runtime:

  chunk          — prefill chunk size (bargs.chunk)
  target_device  — resolved target-model placement ("auto:0", "cuda:0")
  draft_device   — resolved draft-model placement, null when no draft

Why: of 98 historical bench snapshots, only 1 had populated server_info
(post-c35a8a4). We just discovered that downstream ops have been
forcing --cache-type-k/v q4_0 since 2026-05-23 while the binary auto-
default is tq3_0 (when max_ctx>6144 on CUDA). Without /props capturing
the full runtime config, that 2-day divergence was invisible to
forensics. Adding chunk/target_device/draft_device closes the rest of
the introspection gap.

Companion: qwen35 do_ar_decode now emits an `[ar-decode] tokens=N
time=Ts speed=X tok/s` line to stderr per request — parity with the
existing `[spec-decode]` summary. AR fires when sampling needs logit
processing (temp != 0), which the existing log path didn't cover.

Spec updates:
  docs/specs/openapi-props.yaml — Runtime schema extended; example
    payload updated.
  docs/specs/props-endpoint.md — new §4.16 documenting the full
    runtime block (the six pre-existing + three new).

Test: test_props_runtime_shape pins the full field set + null-draft
case. 1497 assertions pass.

This is the PR-side port of the same patch on integration tip 3ed5062.
The integration tip also carries data snapshots (perf-sweep) and
operational docs (KV q4 vs tq3 A/B design, run-request addenda) which
are not in scope for PR Luce-Org#269.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
easel added a commit to easel/lucebox-hub that referenced this pull request May 25, 2026
…-Org#269

PR Luce-Org#269 now carries chunk + target_device + draft_device fields in
/props.runtime, the [ar-decode] log emission, openapi/props-endpoint
spec updates, and test_props_runtime_shape. Same surface as the
integration commit 3ed5062.

This merge keeps the two branches aligned on the shared code/spec
content. Integration retains its data snapshots + operational docs
that are out of PR Luce-Org#269's scope.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
easel added a commit to easel/lucebox-hub that referenced this pull request May 25, 2026
…1_cap fix

Same surface as integration tip 069c412, code-only port — drops the
operational snapshot + run-request docs that don't belong on PR Luce-Org#269.

Three layers, all engine-owned:

Phase A — hard_limit_reply_budget made per-model
  share/model_cards/_schema.json: optional `hard_limit_reply_budget`
  share/model_cards/qwen3.6-27b.json: declared at 4096 (verbose-post-close)
  dflash/src/server/model_card.cpp: read field from sidecar, family
    fallback for qwen35/36/3 → 4096
  docs/specs/{thinking-budget.md,model-cards.md}: documented
  BUG FIX in http_server.cpp: phase1_cap = min(think_max + reply_budget,
    max_output). Without the `+ reply_budget` term, force-close fires
    at the FIRST AR iter when hard_limit equals think_max. Latent bug
    exposed by the hard_limit bump.

Phase B — soft-limit negotiated close (spec §5.3)
  BudgetHook gains soft_limit_remaining + soft_limit_close_rank
  ModelCard / ServerConfig / sidecar / schema: soft_limit_reply_budget
    (default 0 = disabled) + soft_limit_close_rank (default 8)
  qwen35_backend do_ar_decode: pre-hard-limit branch peeks top-K of
    logits; if `</think>` ranks in top-K AND remaining is in the
    (hard, soft] window, accept the close. Ports ds4_eval.c:3030's
    two-tier strategy.
  GenerateResult.budget_soft_close → close_kind="soft" in finish_details
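
A sketch of the Phase A cap and the Phase B soft-close test, under simplifying assumptions: the close sequence is treated as a single token id, rank is counted over a plain logits vector, and all function names are hypothetical rather than taken from qwen35_backend.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Phase A: the reasoning cap leaves room for the reply budget.
inline int phase1_cap(int think_max, int reply_budget, int max_output) {
    return std::min(think_max + reply_budget, max_output);
}

// Rank of token `id` (0 = most likely): the number of strictly larger logits.
inline int token_rank(const std::vector<float>& logits, std::size_t id) {
    int rank = 0;
    for (float v : logits) {
        if (v > logits[id]) ++rank;
    }
    return rank;
}

// Phase B: accept a close when `remaining` lies in the (hard, soft] window
// and the close token ranks inside the top K. soft_limit == 0 disables it.
inline bool accept_soft_close(const std::vector<float>& logits,
                              std::size_t close_id, int remaining,
                              int hard_limit, int soft_limit, int close_rank_k) {
    if (soft_limit <= 0) return false;
    if (remaining <= hard_limit || remaining > soft_limit) return false;
    return token_rank(logits, close_id) < close_rank_k;
}
```

With think_max equal to the reply budget at 4096 and a large max_output, the cap is 8192; dropping the `+ reply_budget` term would cap phase 1 at 4096, which is the first-iteration force-close described above.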

Phase C — chat-template thinking_preamble (spec §3.3 / §5.4)
  Sidecar fields thinking_preamble (template with {think_max}/{reply_max}
    substitution) + thinking_preamble_format (comment|directive|none)
  Qwen3.6 sidecar ships an HTML-style comment preamble — Qwen3.6 has
    no trained native reasoning-effort signal, so we communicate
    budget via a comment in the reasoning region. Model self-plans.
  chat_template.cpp render_chat_template: injects preamble after the
    opening `<think>` when enable_thinking=true. Caller substitutes
    placeholders with EFFECTIVE per-request values (after §4.4
    clamping), not server ceilings.
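
The placeholder substitution in Phase C can be sketched as plain string replacement. `replace_all` and `render_thinking_preamble` are hypothetical helpers, and any template text passed in is the caller's; only the `{think_max}` and `{reply_max}` placeholder names come from the commit.

```cpp
#include <cstddef>
#include <string>

// Replace every occurrence of `key` in `text` with `value`.
inline void replace_all(std::string& text, const std::string& key,
                        const std::string& value) {
    if (key.empty()) return;
    std::size_t pos = 0;
    while ((pos = text.find(key, pos)) != std::string::npos) {
        text.replace(pos, key.size(), value);
        pos += value.size();  // skip past the inserted value
    }
}

// Fill the preamble template with the effective per-request values.
inline std::string render_thinking_preamble(std::string tmpl, int think_max,
                                            int reply_max) {
    replace_all(tmpl, "{think_max}", std::to_string(think_max));
    replace_all(tmpl, "{reply_max}", std::to_string(reply_max));
    return tmpl;
}
```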

Single-case demonstration (recNu3MXkvWUzHZr9 GPQA Diamond, correct=B):
  BEFORE        given=C ✗ thk=3585 wall=244s  (reply clipped by force-close)
  AFTER-A       given=B ✓ thk=4097 wall=323s  (the fix)
  AFTER-A+B+C   given=B ✓ thk=0   wall=48s    (model self-closed, 6.7× faster)

Bench/snapshot tooling (post c35a8a4 + 3ed5062) wholesale-captures
/props.budget_envelope so the per-config knobs land in result.json
without further bench edits.

Test surface: test_server_unit 1497 assertions / 0 failures across all
phases (Phase A, A+B, A+B+C). No props_schema bump needed — all new
fields are additive under budget_envelope + model_card.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
easel added a commit to easel/lucebox-hub that referenced this pull request May 25, 2026
… (PR Luce-Org#269 candidate)

The GEMMA4 case in chat_template.cpp didn't reference thinking at all,
so:

  - With enable_thinking=false: model self-emits its own
    <|channel>thought\n...<channel|> sequence and the rendered text
    (e.g. the literal "thought" token + newlines) leaks into visible
    content. Reproduced on bragi qwen via plain prompts like
    "What is 2+2?" returning '\n4//thought\n\n4\nthought\n\n4'.

  - With enable_thinking=true: nothing in the prompt signals the model
    that it should route through the channel-thought block, so
    reasoning_content stays empty and the model never enters proper
    thinking mode.

Fix per the chat template embedded in the Gemma4 GGUF metadata
(google/gemma-4-26B-A4B-it):

  1. enable_thinking=true → emit `<|think|>\n` at top of system turn
  2. add_generation_prompt + !enable_thinking → append
     `<|channel>thought\n<channel|>` after `<|turn>model\n` so the
     model SKIPS its own thought channel (the Qwen3 `<think></think>`
     guard analog)
  3. system turn now wraps system content properly instead of
     prepending it inline to the first user message (which broke when
     the first message wasn't user)
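
Item 2 of the fix can be illustrated in isolation. The function name is hypothetical and this covers only the generation-prompt suffix, not the full chat_template.cpp rendering; the literal strings are the ones quoted in the list above.

```cpp
#include <string>

// Build the generation prompt for the model turn. When thinking is disabled,
// append an empty thought channel so the model skips emitting its own.
inline std::string gemma4_generation_prompt(bool enable_thinking) {
    std::string out = "<|turn>model\n";
    if (!enable_thinking) {
        out += "<|channel>thought\n<channel|>";
    }
    return out;
}
```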

cherry-pick: PR Luce-Org#269 (thinking-budget v2 series) — strictly a
thinking-routing fix, no decode-path or model-arch change.

Repro: bragi /v1/chat/completions with gemma-4-26b-A4B-it Q4_K_M;
post-fix the same prompts should return clean content without any
"thought" substring leaking.

Run-request: dflash/docs/run-requests/bragi-gemma4-laguna-config-issues.md
(item #1 and #2; #3 the smoke-mc ggml_mul_mat crash is a separate
fix targeted at an independent PR).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
easel added a commit to easel/lucebox-hub that referenced this pull request May 25, 2026
After landing the chat-template fix (f1d30f2) and rebuilding, the
//thought leakage is mostly gone (Hello-world, Continue sequence, real
prompts all return clean content). But a separate hard crash surfaces:

  GGML_ASSERT(ggml_can_mul_mat(a, b)) failed at ggml.c:3243
  → container exits with 139 immediately

Reproduces deterministically with a single 169-token prompt (the first
HumanEval case). Trivial prompts (<20 tokens) work fine. The crash is
INDEPENDENT of the chat-template work — happens with no system message
either. This is a tensor-shape mismatch in the gemma4 forward path
(Gemma4Backend / gemma4_decode.cu), needs sindri's eye on which
mul_mat call is asserting.

This belongs in an independent PR, not the Luce-Org#269 thinking-budget series.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
easel added a commit to easel/lucebox-hub that referenced this pull request May 25, 2026
Update gemma4 RR to reflect that the chat-template fix (f1d30f2) is
landed in PR Luce-Org#269 and the remaining ggml_can_mul_mat crash is being
investigated by erik/Claude — will ship as an encapsulated independent
PR. Add bisection notes: structure-specific (HumanEval prompt at 169
tokens crashes; 466 tokens of plain repeated text runs clean), single
user message reproduces, 24 candidate mul_mat call sites in
gemma4_graph.cpp. Stripped binary → next step is debug-symbol build or
per-mul-mat name-tagging.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…l-card sidecars

Adds per-request thinking-budget controls, multi-dialect reasoning
emission, and JSON model-card sidecars to the native C++ dflash server.

## Thinking-budget mechanism

- Level 2 BudgetHook: backend AR / spec-decode injects the model's
  `</think>` close-token sequence once when `n_gen − committed
  ≤ hard_limit_reply_budget`. KV-continuous, mid-stream, in-process —
  applies uniformly to streaming and non-streaming requests.
- Degenerate-decode watchdog: detects post-close repetition / runaway
  decode and aborts cleanly, surfacing the flag in finish_details and
  bench output.
- Sidecar `thinking_terminator_hint` field: per-family close-token
  text (e.g. `</think>\n\n` for qwen3, `<channel|>\n\n` for gemma4)
  resolved at startup from the model card; `think_close_token_ids` is
  populated by tokenizing the hint, so the BudgetHook is family-aware
  without hardcoded archs.
- Resolution order documented in spec §3: CLI > sidecar > family
  fallback > hard fallback. Hard fallback `hard_limit_reply_budget`
  defaults to 4096 (raised from 512); terse models should override
  down via sidecar.

## Multi-dialect reasoning emission

- SSE emitter splits reasoning ↔ content for OpenAI Chat
  (`reasoning_content` delta), Anthropic (thinking/text content
  blocks with separate lifecycle), and Responses API (reasoning
  stripped per Codex r1 P2).
- Qwen3.6 `<think>` / `</think>` special token ids are forwarded as
  text into the emitter; gemma4 `<|channel>` / `<channel|>` are
  mapped onto the same channel so all archs share one state machine.
- `first_content_token_index()` derives the natural-close split from
  the REASONING→CONTENT transition; leading `<think>` opener is
  detected before fci capture so thinking_tokens accounts correctly
  for the Qwen3.6 streamed-thinking path.

## /props endpoint

- Wholesale model_card (verbatim sidecar JSON, validates against
  share/model_cards/_schema.json) + budget_envelope (effective
  think_max_tokens, default_max_tokens, hard_limit_reply_budget,
  effort_tiers, model_card_source label) + runtime fields (chunk,
  target_device, draft_device, speculative_enabled, fa_window,
  ddtree_budget, kv_cache_k/v, runtime_backend). Captured by
  server_main at startup so the handler doesn't crack BackendArgs.

## Model-card sidecars

- share/model_cards/{qwen3.6-27b,gemma-4-26b-a4b-it,gemma-4-31b-it,
  laguna-xs.2}.json — each ships max_tokens, complex_problem_max_tokens,
  hard_limit_reply_budget, thinking_terminator_hint, sampling defaults,
  and reasoning_effort_tiers as applicable.
- share/model_cards/_schema.json — JSON Schema for sidecar
  validation, exercised by server_main loader and shipped for
  third-party authors.
- model_card.cpp resolver: keyword + family-fallback lookup, with
  startup banner reporting the resolved `model_card_source` so
  operators can confirm which envelope is in force.

## Spec docs

- docs/specs/thinking-budget.md — mechanism, resolution order, CLI
  surface, finish_details/close_kind contract.
- docs/specs/model-cards.md — sidecar field reference, family
  fallback table, ship/override guidance.
- docs/specs/props-endpoint.md + docs/specs/openapi-props.yaml —
  shape and OpenAPI 3 schema for the /props payload.

## Tests

- test_server_unit gains coverage of the SSE emitter reasoning split
  (OpenAI / Anthropic / Responses), first_content_token_index across
  the natural-close, never-closed, content-only, and Qwen3.6
  streamed-thinking paths, /props body shape (wholesale sidecar +
  family-fallback null), and usage.timings across all three response
  shapes. 1542 assertions, 0 failures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Resolves merge conflict in dflash/src/qwen35/qwen35_backend.cpp at the AR
fallback call site. PR Luce-Org#269 added a BudgetHook + forced_close/degenerate
output params to do_ar_decode(); upstream main introduced a
run_ar_decode_path() wrapper. Both symbols coexist — the conflict was
only at the call site inside generate_dflash(). Kept PR Luce-Org#269's direct
do_ar_decode() invocation so the thinking-budget force-close still fires
when spec-decode is unavailable; the upstream wrapper would have lost
the hook.

Tested on lucebox2 (RTX 3090, CUDA 12.6, sm_86):
  - cmake build clean for all PR Luce-Org#269 + PR Luce-Org#262 targets
  - dflash_server end-to-end on Qwen3.6-27B-Q4_K_M with reasoning.effort=low:
    BudgetHook fired at committed=770/1000 remaining=256 == hard_limit=256
    finish_details.close_kind=hard, degenerate watchdog triggered cleanly

Co-Authored-By: WOZCODE <contact@withwoz.com>
@davide221 davide221 merged commit 8b139ce into Luce-Org:main May 26, 2026
3 checks passed
easel added a commit to easel/lucebox-hub that referenced this pull request May 26, 2026
Brings in 52 upstream commits since merge-base 8c23234 (2 days ago).
The headline is PR Luce-Org#269 (`403e598 feat(cpp-server): thinking-budget
v2 + multi-dialect reasoning + model-card sidecars`) — the squashed
version of our own thinking-budget v2 work that we'd been carrying
as 50+ small commits on this branch. Plus:

- PR Luce-Org#262 howard0su/powerinfer: hybrid MoE spec-decode with DFlash
  draft, GPU/CPU FFN overlap, persistent pre-FFN graph for DeltaNet,
  4-5x decode speedup, MoE perf telemetry
- PR Luce-Org#263 weicj/feat-cpp-server-pflash-draft-placement: mixed-backend
  PFlash phase split
- 648a6e2 perf: GPU-resident hybrid decode (eliminate PCIe round-trips)
- d12ddde fix: dynamic placement uses --max-ctx instead of hardcoded
  8192

Conflict resolution: where this branch has post-Luce-Org#269 refinements
(gemma4 timings via 4e9abda, degenerate-decode watchdog via c2d725f
/ 8538ff9, laguna chat-template fix via 92f84cd, transition cue via
16bb31e, thinking-budget force-close fix via b86342d, etc.), keep
ours. Where the only divergence is "this branch has 50 small commits
that 403e598 squashes," accept the merged result.

# Conflicts:
#	dflash/scripts/server.py
#	dflash/src/qwen35/qwen35_backend.cpp
#	dflash/src/qwen35/qwen35_backend.h
#	dflash/src/server/http_server.h
#	dflash/src/server/sse_emitter.cpp
#	dflash/test/test_server_unit.cpp
#	share/model_cards/laguna-xs.2.json
easel added a commit to easel/lucebox-hub that referenced this pull request May 27, 2026
server_main.cpp had two identical 119-line blocks of the thinking-budget
v2 model-card resolution code (general.name / general.architecture
read, resolve_model_card(), ServerConfig application, tier clamping).
g++ errored out on redeclarations:

    redeclaration of 'std::string general_name'
    redeclaration of 'std::string general_arch'
    redeclaration of 'dflash::common::ModelCard card'
    redeclaration of 'const int tier_ceiling'
    conflicting declaration 'auto clamp_tier'

The duplicate was a merge artifact from 1df9099 (luce-org/main into
integration/-clean post-rename). Upstream PR Luce-Org#269 squashed our pre-PR
work into 403e598 and our integration branch carried the same code in
unsquashed form; the 3-way merge kept both copies.

Also drop a redundant `#include "gguf.h"` (lines 23 and 25 of the same
file). Harmless thanks to include guards but ugly merge residue.

Build proceeds past cmake configure with the dedupe.
easel added a commit to easel/lucebox-hub that referenced this pull request May 27, 2026
…nch in-tree

Collapses 134 commits of `integration/props-uv-squared-clean` onto current
main as one reviewable change. Most of the underlying server-side work
already landed via separate PRs: thinking-budget v2 + multi-dialect
reasoning + sidecars (Luce-Org#269), dflash→server rename + optimizations/ grouping
(Luce-Org#281), qwen35moe hybrid CPU/CUDA expert split (Luce-Org#262), and a stream of
smaller fixes from bragi over the last week. What remained in integration
is everything *above* the server: the host-side runner, the container
image, the benchmark/profile evidence pipeline, the harness for driving
real clients, and the luce-bench framework itself.

## What changed

### Docker + host wrapper
- Dockerfile (CUDA 12.8 base; copies server/, lucebox/, harness/, luce-bench/
  into one image; wires `python -m lucebench.cli` as the `benchmark`
  entrypoint subcommand).
- `lucebox.sh` (~470 lines of host bash, zero deps beyond docker + nvidia-smi):
  `check`, `configure`, `pull`, `download-models`, `serve`, `install`/`start`/
  `status`/`logs` (user-systemd), `print-run`, `benchmark`, `profile`.
- `.github/workflows/docker.yml` builds + pushes `ghcr.io/luce-org/lucebox-hub`
  tags (`:cuda12`, `:vX.Y.Z-cuda12`, `:X.Y-cuda12`, `:sha-<short>-cuda12`).
- `server/scripts/entrypoint.sh` resolves draft GGUF by target architecture
  (gemma4 → gemma drafter, qwen3.6 → dflash-draft-3.6); warns when multiple
  targets are present in models/.

### lucebox Python package (in-container CLI)
- `lucebox/` workspace member: `cli.py`, `autotune.py` (VRAM-tiered tier
  selection + per-host config writeback), `config.py` (typed TOML),
  `download.py`, `docker_run.py`, `host_check.py`, `host_facts.py`,
  `profile.py` (profile sweep across DFLASH_MAX_CTX × DFLASH_BUDGET, KV
  cache types, pFlash modes, lazy-draft, prefix-cache slots), `smoke.py`,
  `types.py`.
- `lucebox/tests/` for the typed surfaces.
- Level1/Level2/Level3 profile gates; sweep results merged back into
  `~/.lucebox/config.toml` only after capability + ds4-eval/agentic-tools/
  agentic-session validation gates pass.

### luce-bench in monorepo
- `luce-bench/` workspace member at v0.2.4 — the standalone bench framework
  (areas: ds4-eval, code, longctx, agent, forge; sweep + per-host snapshot
  output; v0.2.4 includes the forge area's EvalConfig + run_scenario
  signature realignment).
- `[tool.uv.sources] luce-bench = { workspace = true }` replaces the prior
  git tag pin.
- `.github/workflows/release-luce-bench.yml` publishes to PyPI from the
  monorepo on `luce-bench-v*` tags (trusted publisher, `pypi` environment).

### harness workspace
- `harness/` workspace member: client adapters (`claude_code`, `codex`,
  `opencode`, `hermes`, `pi`, `openclaw`), `client_test_runner.py`,
  `benchmarks/run_lucebox_vs_llamacpp.sh`, prompts. `lucebox profile`
  delegates the actual bench runs to harness.

### Bench + profile evidence
- `server/docs/BENCHMARK_SNAPSHOT_SPEC.md` — schema for tuning/profile
  artifacts. Snapshots themselves live in the standalone
  `luce-bench-baselines` repo (out of this tree).

### Misc
- Updated CI workflow path filters for `server/` (post-rename).
- README's "Quick start" section, hardware coverage table, env var
  reference table; minor edits to optimizations READMEs.
- model card sidecar updates landed alongside Luce-Org#269 but kept here at
  current values (qwen3.6, gemma-4-26b-a4b, gemma-4-31b, laguna-xs.2,
  `_schema.json`).

## Out of scope / follow-ups
- 31b backend wiring beyond what `share/model_cards/gemma-4-31b-it.json`
  shipped (working empirically @ 24GB on sindri AR-only; 26b spec-decode
  path already proven).
- gemma4 MoE expert split (howard0su's PR Luce-Org#262 territory; merged but not
  applied to gemma4 yet).
- Multi-Token Prediction (upstream PR #23398, draft).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
easel added a commit to easel/lucebox-hub that referenced this pull request May 27, 2026
…nch in-tree

Collapses 134 commits of `integration/props-uv-squared-clean` onto current
main as one reviewable change. Most of the underlying server-side work
already landed via separate PRs: thinking-budget v2 + multi-dialect
reasoning + sidecars (Luce-Org#269), dflash→server rename + optimizations/ grouping
(Luce-Org#281), qwen35moe hybrid CPU/CUDA expert split (Luce-Org#262), and a stream of
smaller fixes from bragi over the last week. What remained in integration
is everything *above* the server: the host-side runner, the container
image, the benchmark/profile evidence pipeline, the harness for driving
real clients, and the luce-bench framework itself.

## What changed

### Docker + host wrapper
- Dockerfile (CUDA 12.8 base; copies server/, lucebox/, harness/, luce-bench/
  into one image; wires `python -m lucebench.cli` as the `benchmark`
  entrypoint subcommand).
- `lucebox.sh` (~470 lines of host bash, zero deps beyond docker + nvidia-smi):
  `check`, `configure`, `pull`, `download-models`, `serve`, `install`/`start`/
  `status`/`logs` (user-systemd), `print-run`, `benchmark`, `profile`.
- `.github/workflows/docker.yml` builds + pushes `ghcr.io/luce-org/lucebox-hub`
  tags (`:cuda12`, `:vX.Y.Z-cuda12`, `:X.Y-cuda12`, `:sha-<short>-cuda12`).
- `server/scripts/entrypoint.sh` resolves draft GGUF by target architecture
  (gemma4 → gemma drafter, qwen3.6 → dflash-draft-3.6); warns when multiple
  targets are present in models/.
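
The draft resolution described in the last bullet can be sketched roughly as follows. This is an illustrative Python sketch only: the real logic lives in the bash `entrypoint.sh`, and the function name `resolve_draft`, the `DRAFT_BY_ARCH` table, and the "use the first target" tie-break are assumptions, not the script's actual behavior. Only the two architecture-to-drafter pairings come from the commit message, and `"gemma drafter"` is a placeholder label rather than a real filename.

```python
import warnings

# Pairings taken from the commit message; the values are labels, not GGUF paths.
DRAFT_BY_ARCH = {
    "gemma4": "gemma drafter",
    "qwen3.6": "dflash-draft-3.6",
}


def resolve_draft(target_archs):
    """Return the drafter label for the detected target architecture.

    target_archs: list of architecture names found in models/ (hypothetical input).
    Assumption: when several targets are present, warn and use the first one.
    """
    if not target_archs:
        raise ValueError("no target model found")
    if len(target_archs) > 1:
        warnings.warn(
            f"multiple targets present: {list(target_archs)}; using {target_archs[0]}"
        )
    arch = target_archs[0]
    if arch not in DRAFT_BY_ARCH:
        raise ValueError(f"no drafter known for architecture {arch!r}")
    return DRAFT_BY_ARCH[arch]
```

Keeping the pairing in one table means a new target family would need a single new entry rather than another branch.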

### lucebox Python package (in-container CLI)
- `lucebox/` workspace member: `cli.py`, `autotune.py` (VRAM-tiered tier
  selection + per-host config writeback), `config.py` (typed TOML),
  `download.py`, `docker_run.py`, `host_check.py`, `host_facts.py`,
  `profile.py` (profile sweep across DFLASH_MAX_CTX × DFLASH_BUDGET, KV
  cache types, pFlash modes, lazy-draft, prefix-cache slots), `smoke.py`,
  `types.py`.
- `lucebox/tests/` for the typed surfaces.
- Level1/Level2/Level3 profile gates; sweep results merged back into
  `~/.lucebox/config.toml` only after capability + ds4-eval/agentic-tools/
  agentic-session validation gates pass.
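
A minimal sketch of the sweep-then-gate flow described above, under stated assumptions: `sweep_grid` and `merge_if_validated` are hypothetical helper names, the config is modeled as a flat dict rather than the typed TOML in `config.py`, and the numeric values in the usage note are made up for illustration. Only the two swept variables and the gate names come from the commit message.

```python
from itertools import product


def sweep_grid(max_ctx_values, budget_values):
    """Build one candidate per (DFLASH_MAX_CTX, DFLASH_BUDGET) combination."""
    return [
        {"DFLASH_MAX_CTX": ctx, "DFLASH_BUDGET": budget}
        for ctx, budget in product(max_ctx_values, budget_values)
    ]


def merge_if_validated(config, candidate, gate_results):
    """Return a new config with the candidate merged in only if every gate passed.

    gate_results maps a gate name to True/False. An empty mapping counts as
    "not validated", so nothing is written back without evidence.
    """
    if gate_results and all(gate_results.values()):
        return {**config, **candidate}
    return dict(config)
```

For example, `sweep_grid([8192, 16384], [16, 32])` yields four candidates, and a candidate is merged only when `capability`, `ds4-eval`, `agentic-tools`, and `agentic-session` all report `True`.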

### luce-bench in monorepo
- `luce-bench/` workspace member at v0.2.4 — the standalone bench framework
  (areas: ds4-eval, code, longctx, agent, forge; sweep + per-host snapshot
  output; v0.2.4 includes the forge area's EvalConfig + run_scenario
  signature realignment).
- `[tool.uv.sources] luce-bench = { workspace = true }` replaces the prior
  git tag pin.
- `.github/workflows/release-luce-bench.yml` publishes to PyPI from the
  monorepo on `luce-bench-v*` tags (trusted publisher, `pypi` environment).

### harness workspace
- `harness/` workspace member: client adapters (`claude_code`, `codex`,
  `opencode`, `hermes`, `pi`, `openclaw`), `client_test_runner.py`,
  `benchmarks/run_lucebox_vs_llamacpp.sh`, prompts. `lucebox profile`
  delegates the actual bench runs to harness.

### Bench + profile evidence
- `server/docs/BENCHMARK_SNAPSHOT_SPEC.md` — schema for tuning/profile
  artifacts. Snapshots themselves live in the standalone
  `luce-bench-baselines` repo (out of this tree).

### Misc
- Updated CI workflow path filters for `server/` (post-rename).
- README's "Quick start" section, hardware coverage table, env var
  reference table; minor edits to optimizations READMEs.
- Model card sidecar updates landed alongside Luce-Org#269 and are kept
  here at their current values (qwen3.6, gemma-4-26b-a4b, gemma-4-31b,
  laguna-xs.2, `_schema.json`).

## Out of scope / follow-ups
- 31b backend wiring beyond what `share/model_cards/gemma-4-31b-it.json`
  shipped (working empirically @ 24GB on sindri AR-only; 26b spec-decode
  path already proven).
- gemma4 MoE expert split (howard0su's PR Luce-Org#262 territory; merged but not
  applied to gemma4 yet).
- Multi-Token Prediction (upstream PR #23398, draft).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>