docker, cli, smoke, bench and autotune first-run dx#226

Closed
easel wants to merge 135 commits into
Luce-Org:main from
easel:integration/props-uv-squared-clean

Conversation

@easel easel commented May 19, 2026

Collaborator

No description provided.

@easel easel marked this pull request as draft May 19, 2026 13:13

@cubic-dev-ai cubic-dev-ai Bot left a comment

Contributor

14 issues found across 37 files


Comment thread docker-bake.hcl
Comment thread README.md Outdated
Comment thread dflash/scripts/lucebox_bench.py Outdated
Comment thread dflash/scripts/entrypoint.sh Outdated
Comment thread dflash/scripts/bench_http_frontiers.py Outdated
Comment thread dflash/scripts/bench_http_capability.py Outdated
Comment thread server/scripts/entrypoint.sh Outdated
Comment thread dflash/scripts/entrypoint.sh Outdated
Comment thread lucebox.sh Outdated
Comment thread lucebox.sh Outdated
@easel easel force-pushed the integration/props-uv-squared-clean branch 3 times, most recently from 4d38d50 to 4621867 Compare May 20, 2026 16:46
easel commented May 20, 2026

Collaborator Author

Addressed the review issues identified by cubic in commit 067f4ac.

Local verification:
- uv run --frozen --extra dev ruff check .
- uv run --frozen --extra dev python -m mypy --package lucebox
- uv run --frozen --with pytest pytest dflash/scripts/test_lucebox_bench.py dflash/scripts/test_server.py lucebox/tests -q
- bash -n lucebox.sh && bash -n dflash/scripts/entrypoint.sh
- docker buildx bake --print cuda12
- docker buildx bake --print cuda12-local

@easel easel force-pushed the integration/props-uv-squared-clean branch 2 times, most recently from 0067f9b to 743da47 Compare May 20, 2026 22:30
@easel easel force-pushed the integration/props-uv-squared-clean branch from 5c6b502 to 84ddd04 Compare May 22, 2026 14:59
easel and others added 17 commits May 22, 2026 15:49
Closes two of the three feature gaps between dflash/scripts/server.py
(Python, reference impl) and dflash/src/server/dflash_server (C++,
production runtime), as outlined in the migration plan.

## /props introspection
Wire /props in http_server.cpp::handle_client. JSON shape matches
server.py:1221-1312 key-for-key so cross-server consumers (autotune
sweeps, dashboards, lucebox profile/snapshot) see a stable contract.

ServerConfig grows the introspection inputs (arch, model_path,
draft_path, kv_cache_k/v, runtime_backend, fa_window, ddtree_budget,
speculative_enabled, target_sharding, tokenizer_id) populated by
server_main before HttpServer construction.

PrefixCache and ToolMemory gain stats() / full_stats() accessors with
the same lockless-snapshot semantics the Python impl documents — a
mutation under daemon_lock can tear in_use vs lifetime_hits across the
read pair; acceptable for /props.

PROPS_SCHEMA = 1 (matches Python's current schema). Bump only on
breaking changes: field renamed, removed, or semantics-changed.

## /v1/messages/count_tokens
Reuses the Anthropic message-parsing path from /v1/messages, short-
circuits after tokenization with {"input_tokens": N}. <1s on a hot
server (no generation).

## Tests
Integration coverage in dflash/scripts/test_server_integration.py:
- TestProps: top-level keys, server block shape, speculative_mode
  consistency, runtime backend, arch-gated capabilities, API endpoint
  registry, prefix_cache/tool_replay shapes.
- TestCountTokens: simple count, scaling with message length, system
  block handling, <1s budget assertion.

Thinking-budget surface (--think-max-tokens flag, finish_details,
thinking_opt_in tracking) is the second PR in this migration —
intentionally separated so the algorithm-design review (Level 1/2/3
fidelity question) doesn't block /props from landing.

Validated via docker buildx bake cuda12 — image Luce-Org#43 built clean with
CUDA layers cached.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…_details)

Closes the third feature gap in the Python→C++ server migration: the
thinking-budget wire surface so consumers (lucebox bench, dashboards)
can opt in to the envelope and see the close-info block on responses.

## What ships

- `--think-max-tokens` CLI flag (default 10000): cap on the phase-1
  reasoning generation when a request opts in.
- `--default-max-tokens` CLI flag (default 16000, matches
  antirez/ds4 ds4_eval.c's `.max_tokens`): combined cap used when a
  request omits max_tokens.
- `thinking: {type: "enabled"}` is now tracked as a presence-opt-in
  (`ParsedRequest.thinking_opt_in`) so the server can condition
  response shape on it.
- `finish_details` block on /v1/chat/completions responses when
  thinking is opted in. Fields match docs/specs/thinking-budget.md:43-58
  and server.py:2272 (close_kind, thinking_tokens, content_tokens,
  total_tokens).

## What's deferred (intentional)

The phase-1/phase-2 reprompt MECHANISM (port of server.py:2141-2196)
is not in this PR. close_kind always reports "natural" for now —
the C++ server doesn't yet force-close on hard_limit. The Python
server's existing behavior continues to be the reference impl.

Why ship the surface first:
- Unblocks consumers (the lucebox bench can stop sending custom
  envelope fields and just send standard OpenAI + this opt-in).
- Lets us land /props first without algorithm-design review on this PR.
- The Level 1 vs Level 2 vs Level 3 fidelity question (phase-1/phase-2
  reprompt vs true mid-stream force-close vs full eval_think_close_info
  reporting) is a separate design conversation — the wire shape stays
  the same regardless of which Level lands.

## Tests

Integration coverage in test_server_integration.py:
- TestThinkingBudget.test_finish_details_present_when_thinking_opted_in:
  asserts the block is emitted with valid types when
  `thinking:{type:enabled}` is sent; invariant
  `thinking_tokens + content_tokens == total_tokens`.
- TestThinkingBudget.test_finish_details_absent_when_thinking_not_opted_in:
  asserts the block is NOT emitted on a plain chat request, matching
  Python's server.py:2271 conditional.
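
For readers who want to check the block from a client, the shape and invariant asserted by these tests can be sketched in a few lines of Python. This is a minimal illustration, not the test code from the PR: `check_finish_details` is a hypothetical helper, and only the four field names come from the text above.

```python
from typing import Any, Mapping

# Field names taken from the commit message; the helper itself is hypothetical.
INT_FIELDS = ("thinking_tokens", "content_tokens", "total_tokens")


def check_finish_details(details: Mapping[str, Any]) -> None:
    """Raise AssertionError if the block violates the documented shape."""
    assert isinstance(details.get("close_kind"), str), "close_kind must be a string"
    for name in INT_FIELDS:
        value = details.get(name)
        # bool is a subclass of int in Python, so reject it explicitly.
        assert isinstance(value, int) and not isinstance(value, bool), name
        assert value >= 0, name
    # Invariant stated in the test description above.
    assert (
        details["thinking_tokens"] + details["content_tokens"]
        == details["total_tokens"]
    ), "token split does not add up"
```

A client can call this on `response["finish_details"]` only when it sent the thinking opt-in, since the block is absent otherwise.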

## Follow-ups (separate PRs)

- Level 1 phase-1/phase-2 reprompt: port server.py:2141-2196 into
  worker_loop. Sets close_kind="hard" when phase-1 didn't emit
  </think> within think_max_tokens. ~1 day.
- Level 2 true in-process force-close: backend-level sampler hook,
  matches ds4_eval.c:3027-3056 hard_limit_reply_budget semantics.
  Adds --soft-limit-reply-budget / --hard-limit-reply-budget flags.
  Closes the OpenRouter→native pass-rate gap (~20 pts). ~2-3 days.

Validated via docker buildx bake cuda12 — image Luce-Org#43 built clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bundles three pieces:

## Level 1 thinking-budget mechanism (port of server.py:2141-2196)

When a request opts in via `thinking: {type: "enabled"}` and the model
fails to emit `</think>` within `--think-max-tokens`, the server now
caps phase-1, decodes the reasoning, appends `</think>\n\nFinal answer: `,
re-prefills, and runs phase-2 for `max_tokens - phase1_emitted` tokens.

Non-streaming only. Streaming phase-2 is a follow-up (needs SSE flush
+ re-open). finish_details.close_kind now flips "natural" → "hard" when
phase-2 fires; thinking_tokens / content_tokens report the real split.

This brings C++ to Python parity for thinking-budget enforcement and
should close most of the 26-percentage-point gap to the Python server's
ds4-eval pass rate that the bench has been measuring.

## Codex review fixes (against 3f600f9 + 8d6ff04)

1. /props `speculative_mode` was reporting "dflash" based on arch
   capability instead of `--ddtree`-active state, contradicting the
   `speculative.enabled` block in the same response. Now keyed on
   `config.speculative_enabled`.

2. `--default-max-tokens` was a dead flag: the request parser fell
   back to `config_.max_tokens` (legacy 4096) when clients omitted
   max_tokens, so the new 16000 default was never applied. Parser now
   reads `default_max_tokens` directly. Side effect: entrypoint.sh
   doesn't need to explicitly pass `--default-max-tokens` either; the
   16000 default is now what the server uses out of the box, restoring
   parity with the Python server's documented default.

3. Phase-2 reprompt could exceed `max_ctx` when the prompt already
   sits near the boundary, because ph2_prompt grew by phase1_tokens +
   closing_ids but ph2_gen_len was only clamped by remaining budget.
   Added a second clamp against `max_ctx - ph2_prompt.size() - 20`.
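
The two clamps described in item 3 can be written out as a small Python sketch. The function name and the floor at zero are assumptions made for illustration; only the `max_tokens - phase1_emitted` budget and the `max_ctx - ph2_prompt.size() - 20` bound come from the text.

```python
def phase2_gen_len(
    max_tokens: int,
    phase1_emitted: int,
    max_ctx: int,
    ph2_prompt_len: int,
    safety_margin: int = 20,
) -> int:
    """Tokens phase-2 may generate under both clamps (hypothetical helper)."""
    # First clamp: whatever is left of the request's combined budget.
    remaining_budget = max_tokens - phase1_emitted
    # Second clamp: room left in the context window after the grown prompt.
    ctx_room = max_ctx - ph2_prompt_len - safety_margin
    # Flooring at zero is an assumption; the commit does not say what happens
    # when no room is left.
    return max(0, min(remaining_budget, ctx_room))
```

With a prompt far from the boundary the budget clamp wins; near the boundary the context clamp takes over, which is the case the review flagged.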

## Tests

`TestThinkingBudget.test_close_kind_natural_when_model_self_closes`
(easy prompt, model self-closes well under cap; asserts
close_kind=="natural"). Hard-close test deferred — requires a server
launched with very low `--think-max-tokens`, which the integration
suite doesn't currently parametrize.

## What's not in this commit

- `finish_details` block stays as-is. Codex's review didn't flag it
  as a problematic protocol invention despite a targeted prompt;
  reconsidering it is a separate cleanup if needed.
- Streaming phase-2 — Level 1.5 follow-up.
- Level 2 (in-process force-close via backend sampler hook) — next PR.
- Test coverage gaps around phase-2 hard-close path, multi-arch
  thinking-tag mapping (Gemma4 `<|channel>`), stop-sequence behavior
  across phases. Tracked for the follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… limit

`semantic_hint_present` in bench_http_capability.py parsed all `-?\d+`
matches in the model output via `int(match.group(0))`. A degenerate
model emission of a many-thousand-digit number trips Python 3.11+'s
default 4300-digit int() limit and crashes the bench mid-run.

Cap match length at 20 digits before parsing — real answers never
exceed that, and longer runs can't ever equal an expected answer
anyway. Surfaced as a hard crash on row 8 of the local lucebox v2
ds4-eval run yesterday.
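
A minimal sketch of the guard, assuming the same `-?\d+` pattern; `extract_ints` and `MAX_DIGITS` are illustrative names, not the identifiers used in bench_http_capability.py.

```python
import re

INT_PATTERN = re.compile(r"-?\d+")
MAX_DIGITS = 20  # cap from the commit message


def extract_ints(text: str) -> list[int]:
    """Parse integer candidates, skipping absurdly long digit runs."""
    values = []
    for match in INT_PATTERN.finditer(text):
        token = match.group(0)
        # Count digits only, so a leading minus sign does not eat the cap.
        if len(token.lstrip("-")) > MAX_DIGITS:
            continue  # int() on a run past the 4300-digit limit raises ValueError
        values.append(int(token))
    return values
```

Skipping the match rather than truncating it keeps the comparison honest: a truncated prefix of a runaway number could otherwise equal an expected answer by accident.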

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The C++ chat template was matching Qwen3.5's behavior (let the model
decide whether to emit <think>) when enable_thinking=true. Qwen3.6's
official chat_template.jinja actually pre-opens the thinking block:

  enable_thinking=true  → suffix `assistant\n<think>\n`
  enable_thinking=false → suffix `assistant\n<think>\n\n</think>\n\n`

Without this prefix, requests that opted into thinking via
`thinking:{type:enabled}` or `chat_template_kwargs.enable_thinking=true`
silently stayed in non-thinking mode on Qwen3.6 — the model would
answer directly without reasoning, no </think> tag ever appeared, and
the Level 1 phase-2 reprompt mechanism never fired because the
started_in_thinking flag never flipped true.

Verified against transformers.AutoTokenizer for Qwen/Qwen3.6-27B:

  >>> tok.apply_chat_template(msgs, enable_thinking=True,
  ...                         add_generation_prompt=True, tokenize=False)[-20:]
  '<|im_start|>assistant\\n<think>\\n'

Hot fix — caught while smoke-probing the freshly-rebaked C++ image
before running the full ds4-eval bench. Without this, the bench would
have produced numbers no better than non-thinking mode.
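
The two suffixes can be expressed as a small Python function for reference. This is a sketch of the behaviour described above, not the C++ template code; the function name is hypothetical and the `<|im_start|>` prefix is taken from the tokenizer output shown above.

```python
def qwen36_generation_suffix(enable_thinking: bool) -> str:
    """Generation-prompt suffix as described for Qwen3.6 (illustrative)."""
    prefix = "<|im_start|>assistant\n"
    if enable_thinking:
        # Thinking block is pre-opened; the model continues inside it.
        return prefix + "<think>\n"
    # Non-thinking mode gets an empty, already-closed thinking block.
    return prefix + "<think>\n\n</think>\n\n"
```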

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a standalone benchmark script that compares Q4_0 vs TQ3_0 KV
cache at 32K / 64K / 128K context lengths, measuring both prefill
time and DFlash decode tok/s.

The point isn't throughput (TQ3 is slightly slower than Q4_0 at
short contexts) — it's the memory saving: TQ3_0 (3.5 bpv) uses 22%
less KV than Q4_0 (4.5 bpv), enabling longer contexts on the same
VRAM budget. Automatically uses layer-segmented prefill
(DFLASH27B_LAYER_PREFILL=1) for prompts over 8K tokens to reduce
peak activation memory.
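
The 22% figure follows directly from the two bits-per-value numbers. A short Python check, with a hypothetical helper name, also gives the corresponding context headroom if the KV cache were the only consumer of the freed memory (a simplifying assumption):

```python
def kv_saving(bpv_new: float, bpv_base: float) -> float:
    """Fraction of KV-cache memory saved by the lower bits-per-value format."""
    return 1.0 - bpv_new / bpv_base


saving = kv_saving(3.5, 4.5)   # TQ3_0 vs Q4_0
extra_context = 4.5 / 3.5 - 1  # extra tokens for the same KV budget

print(f"KV saving: {saving:.1%}")               # KV saving: 22.2%
print(f"Context headroom: {extra_context:.1%}")  # Context headroom: 28.6%
```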

Ported from feat/setup-results-uv@c725758; the README/RESULTS.md
changes from that commit were dropped (clean has richer/newer
versions). Only the new bench script is retained.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ax_tokens banner

Three fixes prepping for the phase-2 trigger reproducibility matrix
(codex review plan step 1-3):

## http_server.cpp — phase-2 gate-input logging
Per-request stderr line printing the inputs to the phase-2 trigger
condition: thinking_opt_in, started_in_thinking, stream,
client_disconnected, phase1_tokens.size(), result.ok, req.max_output,
phase1_cap, ph2_gen_len_est, close_in_phase1 (decode), and the last
~10 tokens of effective_prompt decoded. Lets us correlate
probe-vs-bench, cache-on-vs-off, pflash-on-vs-off behavior when
phase-2 fires inconsistently. Strip after Level 1 is stable.

## entrypoint.sh — draft path resolution (F1)
The native dflash_server expects --draft to be a FILE path. The
Python server's resolve_draft() walked the dir to find a GGUF; the
C++ entrypoint was passing the directory directly, producing
`draft load: mmap: No such device` on container startup. Resolve
DFLASH_DRAFT to the largest dflash-draft-*.gguf inside the dir before
passing to dflash_server. Removes the need for users to set
DFLASH_DRAFT to a specific file path explicitly.

Also includes the in-flight migration plumbing that was sitting in WT:
- entrypoint header comment updated to "execs the native dflash_server"
- `DFLASH_SERVER_BIN` env default
- existence check now targets `$DFLASH_SERVER_BIN` instead of `$DFLASH_BIN`

## server_main.cpp — max_tokens banner truth + --prefix-cache-slots
The startup banner printed `max_tokens = 4096` (legacy sconfig.max_tokens
default) even though the request parser actually defaults to
default_max_tokens=16000. Now prints default_max_tokens with a label
explaining it's the request-omit default, plus a separate line for
think_max_tokens (the phase-1 cap when opted in). Codex review #2.

Also includes the in-flight --prefix-cache-slots CLI flag + parser
branch + startup-log line that was sitting in WT (used by entrypoint.sh
to forward $DFLASH_PREFIX_CACHE_SLOTS to the C++ server).

## Note on excluded WT
Dockerfile and lucebox/lucebox/docker_run.py also have in-flight
changes from this WT but they're orthogonal to the phase-2 diagnostic
work — leaving them unstaged for whoever owns them to commit
separately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Matches dflash/scripts/bench_ds4_eval.py's DS4_EVAL_MAX_TOKENS = 16000,
which mirrors antirez/ds4 ds4_eval.c's .max_tokens default. At 4096 the
combined reasoning+reply budget truncated mid-CoT on harder cases —
AIME2025 wall times of 60-400s with garbage numeric answers in the May-21
interrupted trace are consistent with reasoning running into the cap.
Local dev builds typically only need the host's compute capability,
not the full 6-arch fat-binary the release image carries. The Dockerfile
already accepted DFLASH_CUDA_ARCHES as a build arg; this just promotes
it to a bake variable so a single env-var override skips the 5-6×
CUDA template recompile.

Example (RTX 5090 Laptop = sm_120, ~3min instead of ~20):

  DFLASH_CUDA_ARCHES=120 docker buildx bake cuda12-local --load

Default unchanged (75;80;86;89;90;120) so CI + release images get
full coverage without setting anything.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t 15488

Plumbs a new env var through the launch chain so the server's thinking-phase
cap matches upstream antirez/ds4 ds4_eval.c by default. The dflash_server
binary defaults --think-max-tokens to 10000 internally; nothing in
lucebox/docker_run.py + entrypoint.sh was setting it, leaving a 35% gap vs
upstream (15488 = 16000 max_tokens - 512 hard_limit_reply_budget).
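
The numbers in this paragraph can be reproduced with a few lines of Python; the constant names are illustrative only.

```python
MAX_TOKENS = 16000              # combined cap, matches ds4_eval.c
HARD_LIMIT_REPLY_BUDGET = 512   # tokens reserved for the visible reply
OLD_THINK_MAX = 10000           # dflash_server's internal default

think_max = MAX_TOKENS - HARD_LIMIT_REPLY_BUDGET
gap = (think_max - OLD_THINK_MAX) / think_max

print(think_max)       # 15488
print(f"{gap:.0%}")    # 35%
```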

For ds4-eval this manifests as truncated reasoning on AIME / hard GPQA cases
— the May-21 partial trace's nonsense AIME answers (case 51's 1e80-shaped
output) were consistent with hitting the 10000 cap mid-CoT.

Chain:
- types.py: add think_max field to DflashRuntime (default 15488)
- config.py: parse think_max from [dflash] section
- docker_run.py: emit DFLASH_THINK_MAX in server_run_spec + benchmark_run_spec
- profile.py: track think_max in runtime_tunables + live_tunables
- entrypoint.sh: default DFLASH_THINK_MAX=15488, pass --think-max-tokens N

Tests updated for the prior max_tokens 4096 -> 16000 fix that also moved.
The size-sorted resolver in 3e8323d picked model.safetensors (3.4GB
HF raw weights) over dflash-draft-3.6-q4_k_m.gguf (1GB DFlash draft)
because the safetensors file is bigger. dflash_server's spec-decode
verify then crashes with `verify_batch: embed failed (n=16)` because
the safetensors isn't in the format the DFlash draft path expects.

Replace size-sort with priority-ordered pattern match:
  1. dflash-draft-*.gguf  (the canonical DFlash quantized draft)
  2. *.gguf               (any other GGUF — last-resort)
  3. model.safetensors    (HF raw — only if no GGUF at all)
  4. *.safetensors
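
The priority order above can be sketched in Python as follows. The real resolver lives in entrypoint.sh; this version only illustrates the ordering, and the alphabetical tie-break within a pattern is an assumption.

```python
from fnmatch import fnmatchcase
from typing import Iterable, Optional

# Patterns in priority order, copied from the list above.
DRAFT_PATTERNS = (
    "dflash-draft-*.gguf",
    "*.gguf",
    "model.safetensors",
    "*.safetensors",
)


def resolve_draft(filenames: Iterable[str]) -> Optional[str]:
    """Return the first file matching the highest-priority pattern."""
    names = list(filenames)
    for pattern in DRAFT_PATTERNS:
        matches = sorted(n for n in names if fnmatchcase(n, pattern))
        if matches:
            # Tie-break inside one pattern is alphabetical (assumed).
            return matches[0]
    return None
```

File size plays no role here, so a 3.4GB `model.safetensors` can no longer outrank the 1GB GGUF draft.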

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Qwen3.6's <think> (id 248068) and </think> (id 248069) are single
special tokens in the added_tokens vocab. Both http_server.cpp paths
(streaming on_token + non-streaming feed_tokens) had explicit handling
for Gemma4's <|channel>/<channel|> but NOT for Qwen's variants. The
generic "skip <...>" filter silently dropped the Qwen tokens, so the
emitter never saw the reasoning→content transition.

Symptom: ds4-eval scored 7/92 against the C++ server. 29 cases had
close_kind="natural" + content_tokens=0 — model emitted </think>
token 248069 but emitter dropped it, all 15488 tokens ended up in
reasoning_content with empty visible content. Bench's answer extractor
saw `content=""` and reported `given=? format=False`.

After fix: when model emits token 248068/248069, forward the text form
("<think>" / "</think>\n") into the emitter so parse_reasoning splits
the response correctly. The phase-2 trigger's close-detection (which
already uses decode() and sees </think> from special-token decoding)
keeps working identically — only the response builder changes.
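
A Python sketch of the mapping, assuming the token IDs quoted above. The generic filter shown here is a simplified stand-in for the "skip <...>" rule, whose exact form in http_server.cpp is not given in this message.

```python
# Token IDs and text forms taken from the commit message.
THINK_TOKEN_TEXT = {
    248068: "<think>",
    248069: "</think>\n",
}


def text_for_emitter(token_id: int, decoded_text: str) -> str:
    """Text to forward to the emitter for one generated token (illustrative)."""
    # Thinking markers must reach the emitter so it can split
    # reasoning from content.
    if token_id in THINK_TOKEN_TEXT:
        return THINK_TOKEN_TEXT[token_id]
    # Stand-in for the generic special-token filter (assumed shape).
    if decoded_text.startswith("<") and decoded_text.endswith(">"):
        return ""
    return decoded_text
```

The check on the token ID has to come before the generic filter; with the order reversed, the markers are dropped exactly as in the bug described above.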

Also explains why close_kind="hard" cases (phase-2 fired) DID pass:
the phase-2 emit_token("</think>\n\nFinal answer: ") manually injects
the close text, bypassing the missing-mapping bug.

Validated against the failing case's logs: model emits 248069, decoder
returns "</think>", emitter now transitions and content gets populated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a "Response shape — multi-dialect aliasing" section to v2 that
formalizes what dflash should emit for reasoning across DeepSeek /
OpenRouter / Anthropic / OpenAI-shaped clients.

Motivation: 2026-05-23 ds4-eval cross-server comparison against
OpenRouter qwen/qwen3.6-27b showed the bench discarding ~4000 reasoning
tokens because OR emits them under message.reasoning while dflash and
the bench use message.reasoning_content. Same data, different field
name — current dflash impl is correct but parochial.

Plan:
- Keep reasoning_content as primary (no break).
- Add reasoning as a flat alias.
- Add reasoning_details as a typed-block list — single-block today, room
  for phase-1/phase-2 splits later.
- Surface phase1_tokens / finish_details.thinking_tokens into
  usage.completion_tokens_details.reasoning_tokens to match OpenAI o1/o3
  shape and OR's normalized location.
- Patch bench_http_capability.py fallback chain to read all three.

Comparison table, full example response, per-field notes, motivation,
and implementation status (all four pieces "planned, not yet shipped")
included. No code change in this commit.
Implements antirez/ds4 ds4_eval.c's hard_limit_reply_budget at the
backend sampling layer. When a thinking-enabled request approaches
the budget boundary, the next sampled token is overridden with the
`</think>` close-tag token, giving the model the remaining budget
to write a visible answer with KV state continuous.

Beats Level 1 phase-2 reprompt because the model never sees a fresh
prefill — its reasoning context stays in KV cache and the answer
flows naturally after the injected close.

## Changes

- `common/model_backend.h`: new BudgetHook struct (close_token_id +
  hard_limit_remaining), opt-in field on GenerateRequest. Other
  backends ignore it (default-constructed = disabled).
- `qwen35_backend.{h,cpp}`: do_ar_decode honors budget_hook. Override
  fires once per generation (budget_close_injected flag prevents
  double-injection on subsequent loop iterations). Logs to stderr
  with [budget-hook] tag when it fires.
- `qwen35_backend::generate`: when budget_hook is set, routes through
  AR instead of spec-decode. Spec-decode integration is a follow-up
  (the perf hit is acceptable since this only affects thinking turns).
  Non-thinking turns still get full spec-decode throughput.
- `server_main.cpp`: new `--hard-limit-reply-budget N` flag (default
  512, matches ds4_eval.c). At startup, tokenizes "</think>" and
  caches the token ID in ServerConfig.think_close_token_id (only
  populated if it's a single token — Qwen3.6 = 248069). Multi-token
  close tags disable Level 2 with a warning, falling back to Level 1.
- `http_server.{h,cpp}`: ServerConfig grows hard_limit_reply_budget
  + think_close_token_id. worker_loop wires BudgetHook into
  GenerateRequest when the request opts into thinking AND the server
  has both knobs set.
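
The override can be illustrated with a toy decode loop in Python. This is a sketch under stated assumptions, not the do_ar_decode implementation: the trigger condition (remaining budget at or below `hard_limit_remaining`) is inferred from the description, `sample_next` is a hypothetical sampler callback, and EOS handling is omitted.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class BudgetHook:
    close_token_id: int        # single-token "</think>" ID
    hard_limit_remaining: int  # tokens reserved for the visible reply


def decode_with_budget(
    sample_next: Callable[[List[int]], int],
    max_tokens: int,
    hook: Optional[BudgetHook] = None,
) -> List[int]:
    out: List[int] = []
    closed = False  # set once the close tag appears, natural or injected
    while len(out) < max_tokens:
        token = sample_next(out)
        if hook is not None and not closed:
            if token == hook.close_token_id:
                closed = True  # model closed on its own; never inject
            elif max_tokens - len(out) <= hook.hard_limit_remaining:
                token = hook.close_token_id  # override the sampled token
                closed = True
        out.append(token)
    return out
```

Because the override replaces one sampled token in place, the sequence generated so far stays in the KV cache and the model simply continues after the close tag, which is the property contrasted with the Level 1 reprompt above.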

## What it gives us

The model continues generating after the synthetic </think>, so its
visible content includes the actual answer in the budget remainder.
Mirrors ds4_eval.c's mid-stream injection exactly. Level 1 phase-2
reprompt remains as a fallback when force-close didn't fire (e.g.
model closed </think> early on its own — budget never tightened).

## What's deferred

- Spec-decode integration of BudgetHook. Current implementation
  trades spec-decode throughput for correctness on thinking turns.
  Adding the hook to do_spec_decode requires careful sequencing
  around the verify-and-accept loop (codex flagged this in plan
  review) — separate PR.
- Soft-close (voluntary close when </think> is in top-K). Needs
  top-K logits exposure from the verifier. Level 2.5.
- Laguna and gemma4 backends — only qwen35 wired. Per codex's "start
  qwen35 only, benchmark, then port" guidance.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dflash/scripts/server.py was a parallel FastAPI implementation that
shadowed the C++ dflash_server. Keeping it invites bifurcation — any
fix to one had to be mirrored to the other, in practice always behind.

Drops:
- dflash/scripts/server.py (3484 lines, FastAPI app)
- dflash/scripts/test_server.py (imports 14+ symbols from server)
- dflash/scripts/test_prefix_cache.py (imports build_app from server)

Updates lucebox/lucebox/profile.py to remove test_server.py from the
pytest invocation and registered script_paths.

Out of scope (will break at runtime, fix in a follow-up):
- dflash/scripts/test_multi_turn_prefix_cache.py
- dflash/scripts/test_server_prefix_cache.py
- dflash/scripts/test_full_compress_cache.py
- dflash/scripts/bench_agent_loop.py
- dflash/scripts/bench_daemon.py
- dflash/scripts/bench_server.py
- dflash/scripts/parity_laguna.py
- dflash/scripts/quality_ab_simple.py
- dflash/scripts/quality_humaneval_plus.py
These reference server.py via subprocess.run / SERVER_SCRIPT path
constants. They'll fail at exec time, not import time, so they don't
block the deletion. Retargeting them to dflash_server is left as a
separate task — the bifurcation pressure is removed.

C++ source comments still reference server.py as historical context for
parity (e.g. http_server.cpp:1376 "see server.py:2271-2281"). Those
are notes, not load-bearing.
OPENAI_CHAT branch in http_server.cpp now emits the reasoning text under
three keys (reasoning_content, reasoning, reasoning_details), plus surfaces
the phase-1 token count under usage.completion_tokens_details.reasoning_tokens.
Implements docs/specs/thinking-budget.md "Response shape — multi-dialect
aliasing".

Bench (bench_http_capability.py) reads reasoning via fallback chain:
reasoning_content -> reasoning -> reasoning_details[].text. Makes cross-server
runs (sindri/vidar/OpenRouter) directly comparable; OR's reasoning is no
longer silently discarded.
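
The fallback chain can be sketched as a small Python function. `extract_reasoning` is an illustrative name; joining multiple `reasoning_details` blocks is an assumption, since the spec currently emits a single block.

```python
from typing import Any, Mapping


def extract_reasoning(message: Mapping[str, Any]) -> str:
    """Read reasoning text from whichever dialect the server used."""
    # Flat string fields first: dflash's primary key, then the alias.
    for key in ("reasoning_content", "reasoning"):
        value = message.get(key)
        if isinstance(value, str) and value:
            return value
    # Typed-block list last; concatenate the text of each block (assumed).
    details = message.get("reasoning_details")
    if isinstance(details, list):
        return "".join(
            block["text"]
            for block in details
            if isinstance(block, dict) and isinstance(block.get("text"), str)
        )
    return ""
```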

Also drops a stale 'see server.py:2271-2281' comment now that the Python
server is gone.
Pulls in PR Luce-Org#260 (howard0su): fix(server): normalize Codex Responses
tool-call follow-ups, which improves the sse_emitter THINK_OPEN/CLOSE
extraction parser. Also picks up frequency/presence penalty sampling,
test tightening, and gemma4 HIP runtime fixes.

# Conflicts:
#	dflash/src/server/http_server.cpp
easel and others added 9 commits May 25, 2026 21:44
…ant>

Two related fixes that were keeping bragi's laguna unable to bench:

1. Chat template was DeepSeek-V3 tokens (<|begin▁of▁sentence|> /
   <|User|> / <|Assistant|>) — those token strings don't exist in
   laguna's vocab so the model saw replacement-character garbage in
   its prompt and degenerated into echoing the user message with
   <���Assistant���> artifacts. Real format (verified against
   poolside/Laguna-XS.2/chat_template.jinja):

     〈|EOS|〉<system>
     {content}
     </system>
     <user>
     {content}
     </user>
     <assistant>
     <think>      ← if enable_thinking
     </think>     ← if NOT enable_thinking (empty think block)

   Default system message used when one isn't supplied, matching the
   upstream template's "You are a helpful, conversationally-fluent
   assistant made by Poolside..." default.

2. laguna_target_loader.cpp left eos_chat_id = -1 when the GGUF only
   ships tokenizer.ggml.eos_token_id (id 2 = 〈|EOS|〉) without an
   eot_token_id. With eos_chat_id=-1 the decoder check
   `next == eos_chat_id` never matches and the model emits its turn
   </assistant> mark, sees nothing stops it, and re-greets the user.
   Default to 24 (= </assistant>, the chat-template EOT) when the
   GGUF doesn't supply eot — matches the constant already in
   laguna_internal.h.
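
The fallback in item 2 amounts to the following logic, shown here as a Python sketch rather than the loader's C++. The metadata key for eot is assumed to follow the same `tokenizer.ggml.*` naming as the eos key quoted above, and both function names are hypothetical.

```python
from typing import Any, Mapping

LAGUNA_DEFAULT_EOT_ID = 24  # </assistant>, per the commit message


def resolve_eos_chat_id(metadata: Mapping[str, Any]) -> int:
    """Pick the chat end-of-turn ID, falling back when the GGUF omits it."""
    eot = metadata.get("tokenizer.ggml.eot_token_id")  # key name assumed
    if isinstance(eot, int) and eot >= 0:
        return eot
    return LAGUNA_DEFAULT_EOT_ID


def should_stop(next_token: int, eos_id: int, eos_chat_id: int) -> bool:
    """Stop on either the document EOS or the chat end-of-turn token."""
    return next_token in (eos_id, eos_chat_id)
```

With the old value of -1, `should_stop` could never fire on `</assistant>`, which is why the model ran past its own turn.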

Validated post-fix: 17×23 prompt returns "Answer: 391" cleanly
(was repeating answer endlessly with the prior bugs).

Sindri/integration: change is in the same chat_template.cpp file
041f491 refactored (thinking_preamble removed); LAGUNA case rewritten
in place. No conflict with 7786b35 (Phase-2 removal) — that touched
http_server only.
The locked file was stale relative to pyproject.toml — uv sync --frozen
failed at docker build, and the fallback uv sync (re-resolve) hit
nvidia-cudnn-cu12 / cu128 version conflicts. Local uv sync produced
this updated lock; committing so docker builds succeed without
needing --no-cache hacks.
… partial

Three snapshots from the 2026-05-25 sweep matrix:

- bragi gemma-4-26b nothink: 72/92 = 78.3%, 42 min wall, 99.6 tok/s
  agg (wall-based). Beats OR-hosted nothink (73.9%) on both quality
  and speed. Run at temp=1.0/top_p=0.95/top_k=64 via the new
  --sampling-from-card flag.

- bragi gemma-4-26b think: 75/92 = 81.5%, 145 min wall, 101.4 tok/s
  agg. +3.2pp over local nothink, +10.5pp over OR-hosted think (71%).
  The new <channel|>\n\n transition cue + 4096 hard_limit_reply_budget
  let thinking-mode runs actually finalize (vs the pre-fix
  hits-length-and-degenerates behavior).

- bragi gemma-4-31b nothink: partial — 8/8 PASS before container
  OOMed at case 9 (697-token prompt's prefill graph needed 123 MiB
  beyond the 23.4/24 GB ceiling). Server-log per-case rate ~19 tok/s.
  Full 31b sweep is impractical on bragi-class consumer 24GB hardware
  (math: 19GB model + 1.6GB draft + ~390 KB/token KV cache = ~5k
  context max).  Per discussion: 31b deferred until larger card
  available.
OR provider routing for poolside/laguna-xs.2:free was ignoring our
existing `thinking:{type}` and `chat_template_kwargs.enable_thinking`
fields. Empirically validated only top-level `reasoning_effort`
(OpenAI-shape) propagates — `reasoning_effort: "none"` cleanly
disables reasoning_tokens emission. Other shapes
(`reasoning:{effort:"minimal"}`, `reasoning:{exclude:true}`,
`extra_body.chat_template_kwargs.enable_thinking:false`) all left
reasoning enabled. Added to run_case body so OR runs no longer
silently keep reasoning on — without this, the 2026-05-24 fill-matrix
laguna data showed identical 55/92 for "think" and "nothink" (both
were actually thinking).
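
A sketch of the request-body change in Python, assuming an OpenAI-shaped chat body. `build_request_body` is a hypothetical stand-in for the relevant part of run_case; what the bench sends in think mode is not stated here, so the sketch leaves the field unset in that case.

```python
from typing import Any, Dict, List


def build_request_body(
    model: str, messages: List[Dict[str, str]], think: bool
) -> Dict[str, Any]:
    """Chat request body with the top-level knob that OR was seen to honour."""
    body: Dict[str, Any] = {"model": model, "messages": messages}
    if not think:
        # Top-level, OpenAI-shaped field; nested variants were ignored.
        body["reasoning_effort"] = "none"
    return body
```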

Laguna sweep results, ds4-eval-92, temp=0.6/top_p=0.95/top_k=50,
parallel=1:

- bragi LOCAL Laguna-XS.2-Q4_K_M (with the chat-template fix
  92f84cd + eos_chat_id=24 fallback + Poolside speculator):
  42/92 = 45.7%, true nothink (thk=0 every row), ~118 tok/s decode.

- OR poolside/laguna-xs.2:free BF16 (with reasoning_effort=none):
  47/92 = 51.1%, true nothink (reasoning_tokens=0).

The ~5pp gap is the Q4_K_M quantization vs OR's BF16, plus any
provider-side post-decode handling. Within sampling noise on 92
cases. Confirms the laguna stack on bragi is now correctly tuned.
Closes out the laguna comparison stack. Both backends, both modes,
sequential (parallel=1) for fair comparison:

                    nothink           think
  bragi LOCAL Q4    42/92 = 45.7%    48/92 = 52.2%
  OR :free BF16     47/92 = 51.1%    53/92 = 57.6%
  Δ (OR - bragi)    +5.4pp           +5.4pp
  Δ (think - noth)  bragi +6.5pp     OR +6.5pp

The Q4→BF16 quantization gap is identical (5.4pp) in both modes,
confirming it's pure quant loss rather than a chat-template or
eos handling difference. The nothink→think lift is identical
(6.5pp) in both backends, confirming laguna's thinking mechanism
provides consistent uplift regardless of quantization.
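
The table's percentages and deltas can be re-derived from the raw pass counts. In case terms, the quantization gap is 5 cases in both modes and the thinking lift is 6 cases on both backends. A short Python check, with illustrative names:

```python
TOTAL_CASES = 92
PASSES = {
    ("bragi", "nothink"): 42,
    ("bragi", "think"): 48,
    ("or", "nothink"): 47,
    ("or", "think"): 53,
}


def pct(cases: int) -> float:
    """Cases out of 92 as a percentage, rounded to one decimal."""
    return round(100 * cases / TOTAL_CASES, 1)


# OR minus bragi, per mode (in cases).
quant_gap = {m: PASSES[("or", m)] - PASSES[("bragi", m)] for m in ("nothink", "think")}
# think minus nothink, per backend (in cases).
think_lift = {b: PASSES[(b, "think")] - PASSES[(b, "nothink")] for b in ("bragi", "or")}

print(quant_gap, pct(5))   # {'nothink': 5, 'think': 5} 5.4
print(think_lift, pct(6))  # {'bragi': 6, 'or': 6} 6.5
```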

Verifies the 2026-05-25 laguna fix stack works end-to-end:
- chat template (92f84cd) renders <system>/<user>/<assistant>/<think>
  correctly (vs the prior DeepSeek-token garbage)
- eos_chat_id=24 fallback stops the model cleanly mid-stream
- reasoning_effort knob (417009d) lets OR truly disable thinking
- DFlash speculator load works via existing safetensors path
Single AIME case (aime2025-02, correct=588) at 6 budget points
shows the diminishing-returns curve clearly:

- B=512/1024: model can't reach answer before force-close
- B=2048: sweet spot — model self-closes, gets 588 ✓
- B=4096..16384: identical correct answer at 2-3× wall time

Reply phase length is constant ~4k tokens regardless of budget
(hard_limit_reply_budget=4096 governs that). Decode rate constant
~101 tok/s — wall scales linearly with reasoning tokens emitted.

Closes out the original /loop directive's "visibility into the
optimal config for thinking budget" item. Full summary in
_summary.md.
The bench was sending temperature/top_p explicitly with hardcoded
greedy values (temp=0, top_p=1.0) in every request, defeating the
server's card-fallback path at http_server.cpp:761-765:

    req.sampler.temp = body.value("temperature",
        sd.has_temperature ? sd.temperature : 0.0f);

When the bench includes the field, body.value() never reads the
card default. Effect: gemma4 (card recommends temp=1.0/top_p=0.95/
top_k=64) was forced to greedy on every bench, triggering its known
`- - - -` degenerate-decode collapse (see docs/experiments/gemma4-26b-
thinking-control-2026-05-25.md). The just-completed sindri sweep
shows 3.3% pass on think mode and 47.8% on nothink vs bragi's
81.5% / 78.3% — entirely explained by this miss.

Inconsistent prior behaviour: top_k was already conditional (sent
only when > 0). temperature and top_p were not. This patch makes all
three consistent — sent only when explicitly set via flags. Default
behaviour now is "let the server pick", which means card defaults
for dflash, provider defaults for OpenRouter/Anthropic.
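
Both sides of this interaction can be sketched in Python. `sampling_fields` illustrates the client rule (send a field only when explicitly set) and `server_temperature` mimics the `body.value(...)` fallback quoted above; both names are hypothetical.

```python
from typing import Any, Dict, Optional


def sampling_fields(
    temperature: Optional[float] = None,
    top_p: Optional[float] = None,
    top_k: int = 0,
) -> Dict[str, Any]:
    """Client side: include a sampling field only when it was set."""
    fields: Dict[str, Any] = {}
    if temperature is not None:  # an explicit 0.0 is still sent
        fields["temperature"] = temperature
    if top_p is not None:
        fields["top_p"] = top_p
    if top_k > 0:
        fields["top_k"] = top_k
    return fields


def server_temperature(
    body: Dict[str, Any], card_temperature: Optional[float] = None
) -> float:
    """Server side: request value wins; else card default; else 0.0."""
    default = card_temperature if card_temperature is not None else 0.0
    return body.get("temperature", default)
```

With an empty body the card value of 1.0 applies; with the old hardcoded `temperature: 0` in the body, the card default is never consulted.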

--sampling-from-card is kept as a deprecated no-op so existing
scripts don't break (logs a notice if passed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The luce-bench package (https://github.com/easel/luce-bench) v0.2.3
now ships all of the HTTP-based capability benchmarks (ds4-eval,
HumanEval, longctx, agent, forge) plus the thinking-control probe
as a stdlib-only, uvx-installable framework with 53 tests. Folding
luce-hub's duplicate copies in deletes ~20k lines of code that we
were maintaining in two places.

Deleted (Tier 1 — pure duplicates of luce-bench):
  bench_ds4_eval, bench_humaneval, bench_he_http, bench_longctx,
  bench_http_capability (1602 LOC), bench_http_frontiers,
  probe_thinking_control, bench_daemon, bench_server,
  lucebox_bench (987 LOC), test_lucebox_bench (622 LOC),
  bench_agentic_session, bench_agentic_tools

Deleted (Tier 2 — vestigial native-binary dev-cycle benches; the
test_dflash binary itself stays and the run.py / placement / quality
/ parity / examples scripts that drive it are unaffected):
  bench_he, bench_llm, bench_agent, bench_long_ctx,
  bench_agent_loop, bench_agent_cases

Deleted (vendored fixtures now in luce-bench):
  fixtures/forge_eval/  (forge-guardrails 0.7.1 runtime + scenarios)
  fixtures/humaneval/cases.json
  fixtures/ds4_eval_cases.json

Updated:
  pyproject.toml — add `luce-bench` dep pinned to git tag v0.2.3
    (switch to PyPI version pin once trusted-publishing lands); drop
    deleted-file entries from ruff `include` list.
  dflash/pyproject.toml — drop the `eval` extra (anthropic SDK now
    lives behind luce-bench's `[forge]` extra).
  dflash/scripts/entrypoint.sh — `benchmark` subcommand now execs
    `python -m lucebench.cli` so `docker run image benchmark …`
    keeps working with the new framework.
  lucebox/lucebox/profile.py — drop 6 StepDefinitions that
    subprocessed deleted scripts (benchmark.http_frontiers,
    quality.capability_smoke, quality.ds4_eval, quality.capability_long,
    quality.agentic_tools, benchmark.agentic_session). Profile
    registry shrinks from 9 → 3 steps (health.props,
    benchmark.autotune_latest, test.python_unit pointing only at
    the lucebox/tests suite now).
  lucebox/tests/test_profile.py — point removed step IDs at remaining
    steps or delete the test (ds4_eval-specific argv test).
  .gitignore — exclude .claude/ (session worktrees + agent scratch).

Validation:
  uv sync                             → luce-bench==0.2.3 from git
  uv run pytest lucebox/tests -q      → 19/19 passed
  uv run ruff check                   → All checks passed
  uv run ruff format --check (touched) → clean

Sweep continuity: the in-flight 26b --think sweep (luce-bench writing
to baselines/bragi-rtx5090laptop-gemma4-26b-2026-05-26-sweep-think/)
keeps running; no GPU disruption.

Future cleanups (not in this commit):
  - PyPI publish luce-bench, switch `tool.uv.sources` to a plain
    version pin instead of git tag.
  - dflash/docs/{experiments,run-requests,RESULTS}.md still cite
    `bench_*.py` by name — those are historical refs that don't
    affect any runtime, can be swept on a docs-only pass.
The luce-bench-baselines repo (https://github.com/easel/luce-bench-baselines)
is now the canonical home for all benchmark snapshot data. All 41
snapshot dirs (plus the handful of top-level SUMMARY/profile/log files)
that lived under dflash/docs/tuning-snapshots/ have been mirrored over
in luce-bench-baselines commit 067e7c9; this commit drops them from
lucebox-hub so future clones don't carry 43MB of historical data the
server build doesn't need.

New sweeps already land in luce-bench-baselines directly via:

  uvx --from luce-bench luce-bench --sweep --name <host-model-date> \
    --out-dir /path/to/luce-bench-baselines \
    --base-url http://<server>:8080

.gitignore picks up the path so an accidental write here stays
untracked.

Mirror is verified: every dir/file removed here exists at the same
name in the baselines repo.
easel added 2 commits May 26, 2026 17:45
The intent shipped in 925d41f's commit message but the file change
got lost — the cat >> happened after the `git add` for that commit.
Adding it now as a one-line follow-up so any future write to that
path stays untracked (snapshots live in luce-bench-baselines).
Brings in 52 upstream commits since merge-base 8c23234 (2 days ago).
The headline is PR Luce-Org#269 (`403e598 feat(cpp-server): thinking-budget
v2 + multi-dialect reasoning + model-card sidecars`) — the squashed
version of our own thinking-budget v2 work that we'd been carrying
as 50+ small commits on this branch. Plus:

- PR Luce-Org#262 howard0su/powerinfer: hybrid MoE spec-decode with DFlash
  draft, GPU/CPU FFN overlap, persistent pre-FFN graph for DeltaNet,
  4-5x decode speedup, MoE perf telemetry
- PR Luce-Org#263 weicj/feat-cpp-server-pflash-draft-placement: mixed-backend
  PFlash phase split
- 648a6e2 perf: GPU-resident hybrid decode (eliminate PCIe round-trips)
- d12ddde fix: dynamic placement uses --max-ctx instead of hardcoded
  8192

Conflict resolution: where this branch has post-Luce-Org#269 refinements
(gemma4 timings via 4e9abda, degenerate-decode watchdog via c2d725f
/ 8538ff9, laguna chat-template fix via 92f84cd, transition cue via
16bb31e, thinking-budget force-close fix via b86342d, etc.), keep
ours. Where the only divergence is "this branch has 50 small commits
that 403e598 squashes," accept the merged result.

# Conflicts:
#	dflash/scripts/server.py
#	dflash/src/qwen35/qwen35_backend.cpp
#	dflash/src/qwen35/qwen35_backend.h
#	dflash/src/server/http_server.h
#	dflash/src/server/sse_emitter.cpp
#	dflash/test/test_server_unit.cpp
#	share/model_cards/laguna-xs.2.json
@davide221

Contributor

@easel great work! I would like to promote to main asap. Can you rebase and integrate with harness autorun scripts?

easel and others added 14 commits May 26, 2026 19:52
…name)

Brings in PR Luce-Org#281 (chore: rename dflash→server, pflash+megakernel
→ optimizations/) + small docs polish 080f89b.

Our lucebox/ Python package (added by us in 2560086, never upstream)
is untouched. Our docs additions under dflash/docs/* are migrated to
server/docs/*. Our deletions of bench scripts confirmed against the
new server/scripts/* paths.

Workspace members in pyproject.toml: ["server", "lucebox",
"optimizations/megakernel", "optimizations/pflash"] — preserving
our lucebox member alongside upstream's renamed paths.

# Conflicts:
#	README.md
#	pyproject.toml
#	server/docs/BENCHMARK_SNAPSHOT_SPEC.md
#	server/docs/experiments/cache-impact-2026-05-24.md
#	server/docs/experiments/gemma4-26b-thinking-control-2026-05-25.md
#	server/docs/experiments/kv-cache-q4-vs-tq3-2026-05-25.md
#	server/docs/experiments/thinking-control-protocol.md
#	server/docs/experiments/thinking-mechanism-explainer.md
#	server/docs/run-requests/area-swe-bench-integration.md
#	server/docs/run-requests/bragi-gemma4-laguna-config-issues.md
#	server/docs/run-requests/forge-vs-vidar-ds4f.md
#	server/docs/run-requests/luce-dflash-think-92.md
#	server/docs/run-requests/qwen36-budget-signaling-overhaul.md
#	server/docs/run-requests/qwen36-hard-limit-reply-budget-bump.md
#	server/docs/run-requests/sindri-rtx3090ti-qwen36-nothink-92.md
#	server/scripts/bench_agent.py
#	server/scripts/bench_agent_loop.py
#	server/scripts/bench_daemon.py
#	server/scripts/bench_he.py
#	server/scripts/bench_he_http.py
#	server/scripts/bench_llm.py
#	server/scripts/bench_server.py
#	server/scripts/entrypoint.sh
#	server/scripts/fixtures/agent_cases/cases.json
#	server/scripts/server.py
#	server/scripts/test_prefix_cache.py
#	server/scripts/test_server.py
v0.2.4 fixes the broken `--area forge` path that was raising
`TypeError: EvalConfig.__init__() got an unexpected keyword argument
'client_factory'` against any HTTP backend. Upstream commit
easel/luce-bench@59e01fc realigns areas/forge.py with the refactored
EvalConfig dataclass + run_scenario(client, scenario, config) signature.

No other behaviour changes in v0.2.4.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The standalone github.com/easel/luce-bench repo was an awkward split
for what is really part of the same engineering surface as the server.
Bringing it in-tree so PRs, CI, and reviews live alongside the server
work that the benches exercise.

Layout:
- luce-bench/ is now a uv workspace member alongside server, lucebox,
  optimizations/megakernel, optimizations/pflash.
- `[tool.uv.sources] luce-bench = { workspace = true }` replaces the
  prior `git = "https://github.com/easel/luce-bench.git", tag = "v0.2.4"`
  pin. Future bumps land here directly.
- luce-bench keeps its own pyproject.toml with `name = "luce-bench"`
  and a `version` it manages independently of the monorepo's release
  cadence — PyPI sees the same package name.

Release flow (.github/workflows/release-luce-bench.yml):
- Triggered on tag pushes matching `luce-bench-v*` (e.g. luce-bench-v0.2.5).
- Asserts the tag's version suffix matches `luce-bench/pyproject.toml`.
- Builds wheel + sdist from luce-bench/, publishes via PyPI trusted
  publishing (OIDC) under the `pypi` environment. Set up the trusted
  publisher in the PyPI project once: repo=easel/lucebox-hub,
  workflow=release-luce-bench.yml, environment=pypi.

The standalone repo will be archived (read-only) — existing tag pins
keep resolving for anyone consuming v0.2.4 or older from there. Files
copied at v0.2.4 (commit easel/luce-bench@59e01fc): src/, tests/,
README.md, NOTICE, LICENSE, pyproject.toml.

53 lucebench tests pass against the in-tree workspace install.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The bench-script cleanup (89b1dfe) left profile.py hollowed out — its
6 benchmark StepDefinitions were subprocessing scripts that no longer
exist, leaving only 3 trivial probes (health.props, autotune-report
read, pytest). The snapshot/hash/dedup machinery was sitting on a
near-empty registry.

The whole point of `lucebox profile` is "capture performance snapshots
that feed autotune" — that's exactly what luce-bench now produces.
Wire the framework to it instead of deleting it.

Adds 4 luce-bench-driven StepDefinitions to the registry:
  - benchmark.code      — `lucebench.cli --area code` (HumanEval, 10 cases)
  - benchmark.longctx   — `lucebench.cli --area longctx` (6 cases)
  - benchmark.agent     — `lucebench.cli --area agent` (4 cases, mt=4096)
  - quality.ds4_eval    — `lucebench.cli --area ds4-eval --think
                          --max-tokens 16000 --timeout 1800`
                          (full 92-case suite, score-only)

Each one builds its argv via the new `_luce_bench_area_argv()` factory.
Output JSON lands in the framework-owned dest dir so the existing
snapshot/hash/dedup pipeline ingests it without changes. The framework's
content-addressed hash (by hardware + model + tunables) is unchanged —
re-running profile with the same config short-circuits to the cached
snapshot; changing any tunable forces a fresh capture.
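
To illustrate the content-addressed behaviour described above, here is a hedged sketch of a snapshot key. The real framework's inputs and hash function are not shown in this commit; `snapshot_key`, the field names, and the choice of SHA-256 over canonical JSON are all assumptions:

```python
import hashlib
import json
from typing import Any


def snapshot_key(
    hardware: dict[str, Any], model: str, tunables: dict[str, Any]
) -> str:
    """Derive a stable key from hardware + model + tunables."""
    payload = {"hardware": hardware, "model": model, "tunables": tunables}
    # sort_keys and fixed separators make the serialization canonical, so
    # dict ordering never changes the key.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Two runs with identical inputs map to the same key (cache hit), while changing any tunable yields a different key and therefore a fresh capture.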

Test coverage:
  - test_luce_bench_argv_shape_for_each_area — verifies argv shape for
    all 4 luce-bench steps + the per-step knobs (think mode, max_tokens,
    timeout, model). Catches regressions in the argv builder without
    needing a live server.

Plus a small Dockerfile fix: UV_PYTHON_INSTALL_DIR=/opt/uv/python so
the venv's python interpreter is world-readable. Default location is
`/root/.local/share/uv/python/` which non-root container UIDs cannot
traverse — broke `lucebox.sh check` and every other host-wrapper
subcommand. The container runs as the host UID for bind-mount sanity
(config files in $HOME stay user-owned), so the interpreter has to
live somewhere world-traversable.

Lucebox tests: 20/20 pass. Ruff + format clean.
PR Luce-Org#281 moved dflash/ → server/. The pull_request `paths:` filter
still targeted dflash/* — so PRs touching the C++ server code
wouldn't trigger the Docker prebuild sanity check. Repoint to
server/ so CI catches Dockerfile / source regressions before merge.
server_main.cpp had two identical 119-line blocks of the thinking-budget
v2 model-card resolution code (general.name / general.architecture
read, resolve_model_card(), ServerConfig application, tier clamping).
g++ errored out on redeclarations:

    redeclaration of 'std::string general_name'
    redeclaration of 'std::string general_arch'
    redeclaration of 'dflash::common::ModelCard card'
    redeclaration of 'const int tier_ceiling'
    conflicting declaration 'auto clamp_tier'

The duplicate was a merge artifact from 1df9099 (luce-org/main into
integration/-clean post-rename). Upstream PR Luce-Org#269 squashed our pre-PR
work into 403e598 and our integration branch carried the same code in
unsquashed form; the 3-way merge kept both copies.

Also drop a redundant `#include "gguf.h"` (lines 23 and 25 of the same
file). Harmless thanks to include guards but ugly merge residue.

Build proceeds past cmake configure with the dedupe.
GitHub Actions only picks up workflows from the repo-root
`.github/workflows/`; the nested `luce-bench/.github/workflows/ci.yml`
was inherited from the standalone repo but never fires here. Its
publish job is also superseded by `release-luce-bench.yml`.

The nested `.gitignore` mostly duplicated root entries; moved its one
unique pattern (`luce-bench/snapshots/` for --sweep output) into the
root `.gitignore`.

Also fixes a stale `-> "_RecordingAnthropicClient"` forward reference
in areas/forge.py that the root ruff configuration flags (the class
is in scope where the annotation is evaluated; the quotes are dead).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Commit 490ff95 absorbed luce-bench into the monorepo as a uv workspace
member, but didn't update the Dockerfile. The runtime stage's `uv
sync` would fail looking for /opt/lucebox-hub/luce-bench/pyproject.toml
because:

  1. The builder stage's COPY block (~line 115-120) listed lucebox,
     optimizations/pflash, optimizations/megakernel — but not
     luce-bench. uv sync needs every workspace member's pyproject.toml
     to be present in the build context.
  2. The runtime stage's COPY --from=builder block (~line 168) only
     pulled lucebox, server, optimizations across — not luce-bench.

Add the two COPYs (builder source + runtime install) so the workspace
resolution path is complete. No changes to cmake stage; CUDA build
cache should still hit.
Wraps luce-bench in the same start-server → run-client → save-logs →
stop-server pattern as the other harness/clients/run_*.sh launchers
(run_codex.sh, run_claude_code.sh, etc.). Since luce-bench is just an
HTTP client of /v1/chat/completions, it fits the existing client
abstraction natively.

Why: operators get a uniform way to invoke luce-bench ("did this
server change break it?") alongside real-client smoke tests. A
regression in luce-bench surfaces in the harness sweep matrix the
same way an OpenCode or Hermes regression does — same launcher
contract, same logs, same finish_report shape.

Defaults: --no-think (4x faster on gemma-4-26b per the 2026-05-26
think/nothink comparison), full sweep mode (--sweep), 300s per-case
timeout, single-thread.

Knobs (env):
  LUCEBENCH_AREA        single area override (else --sweep)
  LUCEBENCH_THINK       1 → --think, 0 → --no-think (default 0)
  LUCEBENCH_MAX_TOKENS  per-request decode cap override
  LUCEBENCH_TIMEOUT     per-case wall timeout (default 300s)
  LUCEBENCH_PARALLEL    in-flight concurrency (default 1)
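
The env-to-argv mapping for the knobs above might be sketched as follows. The launcher itself is a shell script; this Python version only illustrates the defaults. `lucebench_args` is a hypothetical name, and `--parallel` is an assumed flag name (the others appear elsewhere in this PR):

```python
import os
from typing import Mapping, Optional


def lucebench_args(env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Translate LUCEBENCH_* variables into bench CLI arguments."""
    env = os.environ if env is None else env
    args: list[str] = []
    area = env.get("LUCEBENCH_AREA", "")
    # A single area overrides the default full sweep.
    args += ["--area", area] if area else ["--sweep"]
    think = env.get("LUCEBENCH_THINK", "0") == "1"
    args.append("--think" if think else "--no-think")
    max_tokens = env.get("LUCEBENCH_MAX_TOKENS", "")
    if max_tokens:  # only override the decode cap when asked to
        args += ["--max-tokens", max_tokens]
    args += ["--timeout", env.get("LUCEBENCH_TIMEOUT", "300")]
    # "--parallel" is an assumed flag name for the concurrency knob.
    args += ["--parallel", env.get("LUCEBENCH_PARALLEL", "1")]
    return args
```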

All harness/common.sh knobs apply (MODEL_SERVER, LUCEBOX_SERVER_BACKEND,
MAX_CTX, BUDGET, EXTRA_SERVER_ARGS, etc.).

Output: $LOG_DIR/lucebench-{area,sweep}.{json,md} + lucebench.out
(stdout/stderr) + server.log. Slots into the existing run-dir layout
under /workspace/lucebox-client-harness-runs/<stamp>/.

Docs entry added to harness/clients/README.md.
Promotes harness/ from "loose shell scripts + one stdlib Python file"
to a proper uv workspace member that owns the "run X against a Lucebox
server" abstraction. Both lucebox profile and harness/clients/*.sh
launchers now go through the same Python entry point.

What's new:

- harness/pyproject.toml — name = "harness", dependencies = ["luce-bench"].
  Stdlib-only at runtime (luce-bench itself is stdlib-only; anthropic
  only via [forge] extra). Fresh test boxes can install with zero
  external wheel downloads.

- harness/harness/ package:
  - bench.run_bench(base_url, area, ...) — Python function form of
    harness/clients/run_lucebench.sh. Composes the lucebench.cli argv
    internally. Single source of truth for "run a luce-bench area
    against a server", returns the parsed JSON snapshot.
  - clients/claude_code.launch(base_url, model, prompt, interactive)
    + claude_env() helper — the env-var contract that points Claude
    Code at a Lucebox server (ANTHROPIC_BASE_URL, telemetry-off
    knobs, NONSTREAMING_FALLBACK kill). Used by both the new
    `lucebox claude` subcommand and the existing
    harness/clients/run_claude_code.sh wrapper.

- lucebox/lucebox/cli.py — new `claude` subcommand. Probes for a live
  /health, looks up base URL, exec's claude on the host with the
  right env via harness.clients.claude_code.launch. Interactive by
  default (full TUI); `--prompt` makes it a one-shot run.

- lucebox/lucebox/profile.py — _luce_bench_area_argv now delegates
  to `python -m harness.bench` instead of `python -m lucebench.cli`.
  All four bench StepDefinitions (code, longctx, agent, ds4-eval)
  ride that path. Framework still owns the JSON snapshot path via
  --json-out (new arg added to harness.bench too).

- pyproject.toml — adds harness to workspace members + sources,
  declares harness as a root dep so uv sync installs it.

Tests:
  - lucebox/tests/test_profile.py — updated argv-shape assertion for
    every bench step: expects `harness.bench` instead of `lucebench.cli`.
  - All 20 lucebox tests pass. luce-bench's 53 tests still pass.
  - Ruff + format clean across lucebox/ + harness/.

Validation (live, against newly-rebuilt gemma-4-26b server):
  - lucebox-hub:cuda12 image rebuilt with permission fix +
    workspace luce-bench + harness COPY (separate Dockerfile commit).
  - `docker run lucebox-hub:cuda12 serve` brought gemma up in ~15s.
  - `python -m harness.bench --area forge --base-url http://localhost:8080`
    ran 30 cases against gemma in both think + nothink modes
    (0/30 both, ValidationError — separate luce-bench[forge]
    adapter bug, not a regression of this PR).

This is the "interface luce-bench through the harness" shape the
README hinted at — one Python module, one CLI, one shell wrapper,
all converging on the same env-config + argv-building logic.
      + Dockerfile COPY harness + run_lucebench.sh via harness.bench

Builds on 7bbf9af (harness as workspace member). All six harness/clients/
launchers now exist as both shell wrappers AND Python modules with the
same launch() contract. Five new `lucebox <client>` Typer subcommands.

New launchers in `harness.harness.clients`:
  - codex      (writes config.toml; Responses API wire format)
  - opencode   (writes opencode.json; AI-SDK OpenAI-compatible provider)
  - hermes     (writes config.yaml + .env; chat_completions wire format)
  - pi         (writes agent/{settings,models}.json; openai-responses)
  - openclaw   (writes JSON config patch merged at startup)

Each module:
  - Resolves binary via $<X>_BIN env, $PATH, or test-box convention
    ($CLIENT_WORK_DIR/clients/<x>/...) — shared `_common.find_bin()`
  - Writes per-run config into a tempdir so the user's real client
    state is untouched
  - exec()s with stdio inheritance (interactive TUI) or stdin from
    /dev/null + optional timeout (non-interactive `--prompt`)
  - Provides a `main()` for ad-hoc CLI use (harness-codex, etc.)
  - Stdlib-only at runtime
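
The three-step binary resolution listed above can be sketched like this. It is not the actual `_common.find_bin()`: the signature is assumed, and because the test-box path is elided in the commit message, the caller supplies the trailing part as the hypothetical `fallback_rel` argument:

```python
import os
import shutil
from pathlib import Path
from typing import Mapping, Optional


def find_bin(
    name: str,
    fallback_rel: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Resolve a client binary: $<NAME>_BIN, then $PATH, then the test box."""
    env = os.environ if env is None else env
    # 1. Explicit override, e.g. CODEX_BIN=/opt/codex/bin/codex
    override = env.get(f"{name.upper()}_BIN")
    if override:
        return override
    # 2. Ordinary PATH lookup
    on_path = shutil.which(name, path=env.get("PATH"))
    if on_path:
        return on_path
    # 3. Test-box convention under $CLIENT_WORK_DIR/clients/<name>/
    work_dir = env.get("CLIENT_WORK_DIR")
    if work_dir and fallback_rel:
        candidate = Path(work_dir) / "clients" / name / fallback_rel
        if candidate.is_file() and os.access(candidate, os.X_OK):
            return str(candidate)
    return None
```

Returning `None` instead of raising leaves it to each launcher to decide how to report a missing client.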

Five new `lucebox <client>` Typer subcommands in lucebox/lucebox/cli.py:
  - `lucebox codex|opencode|hermes|pi|openclaw [--prompt P] [--url U] [--model M]`
  - Shared `_detect_server_url()` probes the standard localhost/docker
    bases for /health, picks the first responder
  - Shared `_exec_client()` does the launcher dispatch + typer.Exit
    translation
  - Each subcommand is ~10 lines: import the launcher, call the helper
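
A rough sketch of the "first responder" probe behind `_detect_server_url()`. The real helper's signature and candidate list are not shown in this commit, so the caller passes the bases in, and `health_ok` / `detect_server_url` are illustrative names:

```python
import http.client
import urllib.error
import urllib.request
from typing import Callable, Iterable, Optional


def health_ok(base_url: str, timeout: float = 1.0) -> bool:
    """Return True when GET <base_url>/health answers with a 2xx status."""
    url = f"{base_url.rstrip('/')}/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Refused connection, timeout, HTTP error, or malformed URL.
        return False


def detect_server_url(
    candidates: Iterable[str],
    probe: Callable[[str], bool] = health_ok,
) -> Optional[str]:
    """Return the first candidate base URL whose probe succeeds."""
    for base in candidates:
        if probe(base):
            return base
    return None
```

For example, `detect_server_url(["http://localhost:8080"])` returns that URL when a server is listening there and `None` otherwise. Injecting `probe` keeps the ordering logic testable without a live server.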

Dockerfile:
  - COPY harness /src/harness in the builder stage (alongside lucebox,
    luce-bench, optimizations/{pflash,megakernel}) so uv sync resolves
    the workspace member
  - COPY --from=builder /src/harness /opt/lucebox-hub/harness in the
    runtime stage so profile.py inside the container can
    `python -m harness.bench` (the path it uses since 7bbf9af)

harness/clients/run_lucebench.sh:
  - Switch the underlying call from `python -m lucebench.cli` to
    `python -m harness.bench` for consistency with profile.py's
    delegation path. Both go through harness.bench now.

Tests: 20/20 lucebox tests still pass. Ruff + format clean.

Out of scope (deferred):
  - openwebui + openwebui-tools — separate web service lifecycle
    (start-server-in-background, poll for ready, etc.) — port as a
    follow-up if needed
  - lucebox.sh host-side dispatch — currently `lucebox.sh <client>` routes
    into the container (where the client binary isn't installed); need a
    host-side `cmd_client` that runs the client binary on the host. Works
    today via `uv run python -m lucebox <client>` directly.
…/LICENSE)

uv builds harness inside an isolated sandbox at /src/harness/, where the
parent ../LICENSE file is not visible. hatchling errored out:

    OSError: License file does not exist: ../LICENSE
    hint: `harness` was included because `lucebox-hub` (v0.0.0) depends on `harness`

Switch to inline `license = { text = "Apache-2.0" }` (same pattern lucebox/
and server/ use). Matches PyPA 2025 guidance and avoids the sandbox-path
trap. The text of LICENSE itself stays at repo root.
Old auto-detect had a hardcoded Qwen3.6 preference: when both
Qwen3.6-27B-Q4_K_M.gguf and gemma-4-26b-a4b-it-Q4_K_M.gguf were
present in models/, Qwen always won, silently. This hid a real bug
during the 2026-05-27 matrix run — the container was supposed to
serve gemma4 but ran qwen for 10 hours of sweep-think before the
operator noticed bench numbers were wrong (and the wrong-target draft
slowed decode 4× because the qwen draft GGUF was q4_k_m vs gemma's
q8_0, causing every think-mode case to hit the 300 s bench timeout).

New behavior:
  - Find ALL .gguf candidates ≥5 GB outside models/draft/ (the size
    threshold cleanly excludes draft GGUFs without parsing GGUF
    arch metadata).
  - 0 candidates → die with clear message + DFLASH_TARGET hint
  - 1 candidate → use it, log "Auto-detected target: <name>"
  - 2+ candidates → WARN with the full list (marking the choice with
    *), tell the operator to pin DFLASH_TARGET=<path>, then pick the
    first alphabetically (deterministic across runs).
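
The selection rules above can be sketched in Python. The actual entrypoint's implementation is not shown in this commit. Assumptions here: "5 GB" is read as 5 GiB, the search is recursive, and "alphabetically" means a case-insensitive sort on the filename (a plain byte-order sort would put `Qwen...` ahead of `gemma...`, which contradicts the stated trade-off). `pick_target` is a hypothetical name:

```python
import sys
from pathlib import Path

MIN_TARGET_BYTES = 5 * 1024**3  # assumption: "5 GB" read as 5 GiB


def pick_target(models_dir: Path, min_bytes: int = MIN_TARGET_BYTES) -> Path:
    """Pick the target GGUF, warning loudly when the choice is ambiguous."""
    draft_dir = models_dir / "draft"
    candidates = sorted(
        (
            p
            for p in models_dir.rglob("*.gguf")
            if draft_dir not in p.parents and p.stat().st_size >= min_bytes
        ),
        key=lambda p: p.name.lower(),  # case-insensitive, deterministic
    )
    if not candidates:
        raise SystemExit(
            f"No target .gguf found under {models_dir}; set DFLASH_TARGET=<path>"
        )
    chosen = candidates[0]
    if len(candidates) == 1:
        print(f"Auto-detected target: {chosen.name}", file=sys.stderr)
    else:
        print("WARN: multiple targets found; pin DFLASH_TARGET=<path>", file=sys.stderr)
        for p in candidates:
            marker = "*" if p == chosen else " "
            print(f"  {marker} {p}", file=sys.stderr)
    return chosen
```

The size threshold is a parameter only so the rule can be exercised with small files.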

Trade-off: gemma4 wins over qwen3.6 alphabetically. That's not a value
judgment — it's just deterministic. The point is the warn-loudly path,
not the choice of which model wins by default. Operators with both
models present MUST set DFLASH_TARGET to skip the warning.

The hardcoded Qwen3.6 family preference path is gone — fundamentally
the wrong shape (silently picking based on a name pattern). If we
want a "preferred model" knob later it should be DFLASH_PREFERRED_TARGET
or similar, with the same warn-when-multiple-candidates rule.
@easel

easel commented May 27, 2026

Collaborator Author

Superseded by a flattened single-commit branch feat/lucebox-docker — same scope (docker stack + lucebox CLI + bench/profile + harness + luce-bench in-tree), but collapsed onto current main since most of the original branch's server-side work landed separately via #269 / #281 / #262.

New PR forthcoming.
