docker, cli, smoke, bench and autotune first-run dx#226
14 issues found across 37 files
4d38d50 to 4621867
Addressed the review issues identified by cubic in commit 067f4ac.

Local verification:
- uv run --frozen --extra dev ruff check .
- uv run --frozen --extra dev python -m mypy --package lucebox
- uv run --frozen --with pytest pytest dflash/scripts/test_lucebox_bench.py dflash/scripts/test_server.py lucebox/tests -q
- bash -n lucebox.sh && bash -n dflash/scripts/entrypoint.sh
- docker buildx bake --print cuda12
- docker buildx bake --print cuda12-local
0067f9b to 743da47
5c6b502 to 84ddd04
Closes two of the three feature gaps between dflash/scripts/server.py
(Python, reference impl) and dflash/src/server/dflash_server (C++,
production runtime), as outlined in the migration plan.
## /props introspection
Wire /props in http_server.cpp::handle_client. JSON shape matches
server.py:1221-1312 key-for-key so cross-server consumers (autotune
sweeps, dashboards, lucebox profile/snapshot) see a stable contract.
ServerConfig grows the introspection inputs (arch, model_path,
draft_path, kv_cache_k/v, runtime_backend, fa_window, ddtree_budget,
speculative_enabled, target_sharding, tokenizer_id) populated by
server_main before HttpServer construction.
PrefixCache and ToolMemory gain stats() / full_stats() accessors with
the same lockless-snapshot semantics the Python impl documents — a
mutation under daemon_lock can tear in_use vs lifetime_hits across the
read pair; acceptable for /props.
PROPS_SCHEMA = 1 (matches Python's current schema). Bump only on
breaking changes: field renamed, removed, or semantics-changed.
## /v1/messages/count_tokens
Reuses the Anthropic message-parsing path from /v1/messages, short-
circuits after tokenization with {"input_tokens": N}. <1s on a hot
server (no generation).
## Tests
Integration coverage in dflash/scripts/test_server_integration.py:
- TestProps: top-level keys, server block shape, speculative_mode
consistency, runtime backend, arch-gated capabilities, API endpoint
registry, prefix_cache/tool_replay shapes.
- TestCountTokens: simple count, scaling with message length, system
block handling, <1s budget assertion.
Thinking-budget surface (--think-max-tokens flag, finish_details,
thinking_opt_in tracking) is the second PR in this migration —
intentionally separated so the algorithm-design review (Level 1/2/3
fidelity question) doesn't block /props from landing.
Validated via docker buildx bake cuda12 — image Luce-Org#43 built clean with
CUDA layers cached.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…_details)
Closes the third feature gap in the Python→C++ server migration: the
thinking-budget wire surface so consumers (lucebox bench, dashboards)
can opt in to the envelope and see the close-info block on responses.
## What ships
- `--think-max-tokens` CLI flag (default 10000): cap on the phase-1
reasoning generation when a request opts in.
- `--default-max-tokens` CLI flag (default 16000, matches
antirez/ds4 ds4_eval.c's `.max_tokens`): combined cap used when a
request omits max_tokens.
- `thinking: {type: "enabled"}` is now tracked as a presence-opt-in
(`ParsedRequest.thinking_opt_in`) so the server can condition
response shape on it.
- `finish_details` block on /v1/chat/completions responses when
thinking is opted in. Fields match docs/specs/thinking-budget.md:43-58
and server.py:2272 (close_kind, thinking_tokens, content_tokens,
total_tokens).
## What's deferred (intentional)
The phase-1/phase-2 reprompt MECHANISM (port of server.py:2141-2196)
is not in this PR. close_kind always reports "natural" for now —
the C++ server doesn't yet force-close on hard_limit. The Python
server's existing behavior continues to be the reference impl.
Why ship the surface first:
- Unblocks consumers (the lucebox bench can stop sending custom
envelope fields and just send standard OpenAI + this opt-in).
- Lets us land /props first without algorithm-design review on this PR.
- The Level 1 vs Level 2 vs Level 3 fidelity question (phase-1/phase-2
reprompt vs true mid-stream force-close vs full eval_think_close_info
reporting) is a separate design conversation — the wire shape stays
the same regardless of which Level lands.
## Tests
Integration coverage in test_server_integration.py:
- TestThinkingBudget.test_finish_details_present_when_thinking_opted_in:
asserts the block is emitted with valid types when
`thinking:{type:enabled}` is sent; invariant
`thinking_tokens + content_tokens == total_tokens`.
- TestThinkingBudget.test_finish_details_absent_when_thinking_not_opted_in:
asserts the block is NOT emitted on a plain chat request, matching
Python's server.py:2271 conditional.
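The invariant asserted by the tests above can be sketched as a small client-side check. This is an illustration only: `validate_finish_details` is a hypothetical helper, not code from the integration suite, and the allowed `close_kind` set ("natural", "hard") is assumed from the values named in these commit messages.

```python
from typing import Any

# Assumption: the only close_kind values are the two named in the commit messages.
ALLOWED_CLOSE_KINDS = {"natural", "hard"}
COUNT_FIELDS = ("thinking_tokens", "content_tokens", "total_tokens")


def validate_finish_details(block: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the block looks valid."""
    problems: list[str] = []
    if block.get("close_kind") not in ALLOWED_CLOSE_KINDS:
        problems.append(f"unexpected close_kind: {block.get('close_kind')!r}")
    for name in COUNT_FIELDS:
        value = block.get(name)
        # bool is a subclass of int, so reject it explicitly.
        if not isinstance(value, int) or isinstance(value, bool) or value < 0:
            problems.append(f"{name} must be a non-negative integer, got {value!r}")
    if not problems:
        split = block["thinking_tokens"] + block["content_tokens"]
        if split != block["total_tokens"]:
            problems.append("thinking_tokens + content_tokens != total_tokens")
    return problems
```

A consumer could run this on every response that opted in and skip it when the block is absent.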
## Follow-ups (separate PRs)
- Level 1 phase-1/phase-2 reprompt: port server.py:2141-2196 into
worker_loop. Sets close_kind="hard" when phase-1 didn't emit
</think> within think_max_tokens. ~1 day.
- Level 2 true in-process force-close: backend-level sampler hook,
matches ds4_eval.c:3027-3056 hard_limit_reply_budget semantics.
Adds --soft-limit-reply-budget / --hard-limit-reply-budget flags.
Closes the OpenRouter→native pass-rate gap (~20 pts). ~2-3 days.
Validated via docker buildx bake cuda12 — image Luce-Org#43 built clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bundles three pieces:
## Level 1 thinking-budget mechanism (port of server.py:2141-2196)
When a request opts in via `thinking: {type: "enabled"}` and the model
fails to emit `</think>` within `--think-max-tokens`, the server now
caps phase-1, decodes the reasoning, appends `</think>\n\nFinal answer: `,
re-prefills, and runs phase-2 for `max_tokens - phase1_emitted` tokens.
Non-streaming only. Streaming phase-2 is a follow-up (needs SSE flush
+ re-open). finish_details.close_kind now flips "natural" → "hard" when
phase-2 fires; thinking_tokens / content_tokens report the real split.
This brings C++ to Python parity for thinking-budget enforcement and
should close most of the 26-percentage-point gap to the Python server's
ds4-eval pass rate that the bench has been measuring.
## Codex review fixes (against 3f600f9 + 8d6ff04)
1. /props `speculative_mode` was reporting "dflash" based on arch
capability instead of `--ddtree`-active state, contradicting the
`speculative.enabled` block in the same response. Now keyed on
`config.speculative_enabled`.
2. `--default-max-tokens` was a dead flag: the request parser fell
back to `config_.max_tokens` (legacy 4096) when clients omitted
max_tokens, so the new 16000 default was never applied. Parser now
reads `default_max_tokens` directly. Side effect: entrypoint.sh
doesn't need to explicitly pass `--default-max-tokens` either; the
16000 default is now what the server uses out of the box, restoring
parity with the Python server's documented default.
3. Phase-2 reprompt could exceed `max_ctx` when the prompt already
sits near the boundary, because ph2_prompt grew by phase1_tokens +
closing_ids but ph2_gen_len was only clamped by remaining budget.
Added a second clamp against `max_ctx - ph2_prompt.size() - 20`.
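A minimal sketch of the two clamps described in fix 3, written in Python for illustration (the real code is C++ in http_server.cpp). The function and parameter names are hypothetical; flooring the result at zero is an assumption, since the commit message does not say what happens when no room is left.

```python
CTX_SAFETY_MARGIN = 20  # the "- 20" from the second clamp


def phase2_gen_len(max_tokens: int, phase1_emitted: int,
                   max_ctx: int, ph2_prompt_len: int) -> int:
    """Number of tokens phase-2 may generate after the reprompt."""
    # First clamp: what is left of the combined request budget.
    remaining_budget = max_tokens - phase1_emitted
    # Second clamp: what still fits in the context window after the
    # grown phase-2 prompt (original prompt + phase-1 tokens + closing ids).
    ctx_room = max_ctx - ph2_prompt_len - CTX_SAFETY_MARGIN
    # Assumption: never return a negative length.
    return max(0, min(remaining_budget, ctx_room))
```

With a prompt sitting near the boundary, the context clamp wins: `phase2_gen_len(16000, 10000, 16384, 15000)` is limited by the 1364 tokens of remaining context rather than the 6000 tokens of remaining budget.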
## Tests
`TestThinkingBudget.test_close_kind_natural_when_model_self_closes`
(easy prompt, model self-closes well under cap; asserts
close_kind=="natural"). Hard-close test deferred — requires a server
launched with very low `--think-max-tokens`, which the integration
suite doesn't currently parametrize.
## What's not in this commit
- `finish_details` block stays as-is. Codex's review didn't flag it
as a problematic protocol invention despite a targeted prompt;
reconsidering it is a separate cleanup if needed.
- Streaming phase-2 — Level 1.5 follow-up.
- Level 2 (in-process force-close via backend sampler hook) — next PR.
- Test coverage gaps around phase-2 hard-close path, multi-arch
thinking-tag mapping (Gemma4 `<|channel>`), stop-sequence behavior
across phases. Tracked for the follow-up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… limit

`semantic_hint_present` in bench_http_capability.py parsed all `-?\d+` matches in the model output via `int(match.group(0))`. A degenerate model emission of a many-thousand-digit number trips the default 4300-digit limit on int() string conversion in Python 3.11+ and crashes the bench mid-run.

Cap match length at 20 digits before parsing: real answers never exceed that, and longer runs can never equal an expected answer anyway.

Surfaced as a hard crash on row 8 of the local lucebox v2 ds4-eval run yesterday.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
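The digit cap described above can be sketched as follows. This is not the bench's actual code: `parse_candidate_ints` is a hypothetical stand-in for the parsing step inside `semantic_hint_present`, and only the `-?\d+` pattern and the 20-digit cap come from the commit message.

```python
import re

MAX_DIGITS = 20  # cap from the commit message
INT_PATTERN = re.compile(r"-?\d+")


def parse_candidate_ints(text: str) -> list[int]:
    """Parse integers found in model output, skipping absurdly long digit runs."""
    values: list[int] = []
    for match in INT_PATTERN.finditer(text):
        token = match.group(0)
        # Check the length before calling int(), so a many-thousand-digit
        # run never reaches the interpreter's int/str conversion limit.
        if len(token.lstrip("-")) > MAX_DIGITS:
            continue
        values.append(int(token))
    return values
```

Skipping is safe here because a run longer than 20 digits cannot match any expected answer, so dropping it changes no verdict.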
The C++ chat template was matching Qwen3.5's behavior (let the model
decide whether to emit <think>) when enable_thinking=true. Qwen3.6's
official chat_template.jinja actually pre-opens the thinking block:
enable_thinking=true → suffix `assistant\n<think>\n`
enable_thinking=false → suffix `assistant\n<think>\n\n</think>\n\n`
Without this prefix, requests that opted into thinking via
`thinking:{type:enabled}` or `chat_template_kwargs.enable_thinking=true`
silently stayed in non-thinking mode on Qwen3.6 — the model would
answer directly without reasoning, no </think> tag ever appeared, and
the Level 1 phase-2 reprompt mechanism never fired because the
started_in_thinking flag never flipped true.
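The two suffixes can be sketched as a small Python helper. The suffix strings are the ones quoted above; the function names are hypothetical, and deriving `started_in_thinking` from the rendered prompt tail is an assumption about how the C++ server sets that flag.

```python
def qwen36_generation_suffix(enable_thinking: bool) -> str:
    """Generation-prompt suffix, per the Qwen3.6 behavior described above."""
    if enable_thinking:
        # Pre-open the thinking block so the model starts inside <think>.
        return "assistant\n<think>\n"
    # Emit an empty, already-closed thinking block for non-thinking mode.
    return "assistant\n<think>\n\n</think>\n\n"


def started_in_thinking(rendered_prompt: str) -> bool:
    """Assumed check: the prompt ends with an open, unclosed <think> tag."""
    return rendered_prompt.rstrip().endswith("<think>")
```

Under this sketch the flag is true only for the pre-opened form, which is the condition the phase-2 reprompt depends on.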
Verified against transformers.AutoTokenizer for Qwen/Qwen3.6-27B:
>>> tok.apply_chat_template(msgs, enable_thinking=True,
... add_generation_prompt=True, tokenize=False)[-30:]
'<|im_start|>assistant\n<think>\n'
Hot fix — caught while smoke-probing the freshly-rebaked C++ image
before running the full ds4-eval bench. Without this, the bench would
have produced numbers no better than non-thinking mode.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a standalone benchmark script that compares Q4_0 vs TQ3_0 KV cache at 32K / 64K / 128K context lengths, measuring both prefill time and DFlash decode tok/s.

The point isn't throughput (TQ3_0 is slightly slower than Q4_0 at short contexts); it's the memory saving: TQ3_0 (3.5 bpv) uses 22% less KV than Q4_0 (4.5 bpv), enabling longer contexts on the same VRAM budget.

Automatically uses layer-segmented prefill (DFLASH27B_LAYER_PREFILL=1) for prompts over 8K tokens to reduce peak activation memory.

Ported from feat/setup-results-uv@c725758; the README/RESULTS.md changes from that commit were dropped (clean has richer/newer versions). Only the new bench script is retained.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
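The 22% figure follows directly from the two bits-per-value numbers. A small worked check, assuming KV cache size scales linearly with bits per value and ignoring any fixed overhead (function names are illustrative only):

```python
def kv_saving_fraction(baseline_bpv: float, candidate_bpv: float) -> float:
    """Fraction of KV memory saved by moving from baseline to candidate."""
    return (baseline_bpv - candidate_bpv) / baseline_bpv


def context_gain(baseline_bpv: float, candidate_bpv: float) -> float:
    """How many times more tokens fit in the same KV byte budget."""
    return baseline_bpv / candidate_bpv


saving = kv_saving_fraction(4.5, 3.5)  # Q4_0 -> TQ3_0
gain = context_gain(4.5, 3.5)
print(f"KV saving: {saving:.1%}, context gain: {gain:.2f}x")
```

Under the same linear assumption, the saving corresponds to roughly 1.29× the context length for a fixed KV budget.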
…ax_tokens banner

Three fixes prepping for the phase-2 trigger reproducibility matrix (codex review plan step 1-3):

## http_server.cpp: phase-2 gate-input logging

Per-request stderr line printing the inputs to the phase-2 trigger condition: thinking_opt_in, started_in_thinking, stream, client_disconnected, phase1_tokens.size(), result.ok, req.max_output, phase1_cap, ph2_gen_len_est, close_in_phase1 (decode), and the last ~10 tokens of effective_prompt decoded. Lets us correlate probe-vs-bench, cache-on-vs-off, pflash-on-vs-off behavior when phase-2 fires inconsistently. Strip after Level 1 is stable.

## entrypoint.sh: draft path resolution (F1)

The native dflash_server expects --draft to be a FILE path. The Python server's resolve_draft() walked the dir to find a GGUF; the C++ entrypoint was passing the directory directly, producing `draft load: mmap: No such device` on container startup. Resolve DFLASH_DRAFT to the largest dflash-draft-*.gguf inside the dir before passing to dflash_server. Removes the need for users to set DFLASH_DRAFT to a specific file path explicitly.

Also includes the in-flight migration plumbing that was sitting in WT:
- entrypoint header comment updated to "execs the native dflash_server"
- `DFLASH_SERVER_BIN` env default
- existence check now targets `$DFLASH_SERVER_BIN` instead of `$DFLASH_BIN`

## server_main.cpp: max_tokens banner truth + --prefix-cache-slots

The startup banner printed `max_tokens = 4096` (legacy sconfig.max_tokens default) even though the request parser actually defaults to default_max_tokens=16000. Now prints default_max_tokens with a label explaining it's the request-omit default, plus a separate line for think_max_tokens (the phase-1 cap when opted in). Codex review #2.

Also includes the in-flight --prefix-cache-slots CLI flag + parser branch + startup-log line that was sitting in WT (used by entrypoint.sh to forward $DFLASH_PREFIX_CACHE_SLOTS to the C++ server).

## Note on excluded WT

Dockerfile and lucebox/lucebox/docker_run.py also have in-flight changes from this WT, but they're orthogonal to the phase-2 diagnostic work; leaving them unstaged for whoever owns them to commit separately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Matches dflash/scripts/bench_ds4_eval.py's DS4_EVAL_MAX_TOKENS = 16000, which mirrors antirez/ds4 ds4_eval.c's .max_tokens default. At 4096 the combined reasoning+reply budget truncated mid-CoT on harder cases — AIME2025 wall times of 60-400s with garbage numeric answers in the May-21 interrupted trace are consistent with reasoning running into the cap.
Local dev builds typically only need the host's compute capability, not the full 6-arch fat-binary the release image carries. The Dockerfile already accepted DFLASH_CUDA_ARCHES as a build arg; this just promotes it to a bake variable so a single env-var override skips the 5-6× CUDA template recompile.

Example (RTX 5090 Laptop = sm_120, ~3min instead of ~20):

    DFLASH_CUDA_ARCHES=120 docker buildx bake cuda12-local --load

Default unchanged (75;80;86;89;90;120) so CI + release images get full coverage without setting anything.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t 15488

Plumbs a new env var through the launch chain so the server's thinking-phase cap matches upstream antirez/ds4 ds4_eval.c by default. The dflash_server binary defaults --think-max-tokens to 10000 internally; nothing in lucebox/docker_run.py + entrypoint.sh was setting it, leaving a 35% gap vs upstream (15488 = 16000 max_tokens - 512 hard_limit_reply_budget).

For ds4-eval this manifests as truncated reasoning on AIME / hard GPQA cases: the May-21 partial trace's nonsense AIME answers (case 51's 1e80-shaped output) were consistent with hitting the 10000 cap mid-CoT.

Chain:
- types.py: add think_max field to DflashRuntime (default 15488)
- config.py: parse think_max from [dflash] section
- docker_run.py: emit DFLASH_THINK_MAX in server_run_spec + benchmark_run_spec
- profile.py: track think_max in runtime_tunables + live_tunables
- entrypoint.sh: default DFLASH_THINK_MAX=15488, pass --think-max-tokens N

Tests updated for the prior max_tokens 4096 -> 16000 fix that also moved.
The size-sorted resolver in 3e8323d picked model.safetensors (3.4GB HF raw weights) over dflash-draft-3.6-q4_k_m.gguf (1GB DFlash draft) because the safetensors file is bigger. dflash_server's spec-decode verify then crashes with `verify_batch: embed failed (n=16)` because the safetensors file isn't in the format the DFlash draft path expects.

Replace size-sort with priority-ordered pattern match:
1. dflash-draft-*.gguf (the canonical DFlash quantized draft)
2. *.gguf (any other GGUF, last resort)
3. model.safetensors (HF raw, only if no GGUF at all)
4. *.safetensors

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
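The priority-ordered match can be illustrated in Python (the actual resolver lives in entrypoint.sh, so this is a sketch of the logic, not the shipped code). `pick_draft` is a hypothetical name, and breaking ties within one pattern by sorted filename is an assumption; the commit message does not specify a tie-break.

```python
from fnmatch import fnmatchcase
from typing import Optional

# Patterns in priority order, as listed in the commit message.
DRAFT_PATTERNS = (
    "dflash-draft-*.gguf",
    "*.gguf",
    "model.safetensors",
    "*.safetensors",
)


def pick_draft(filenames: list[str]) -> Optional[str]:
    """Return the best draft candidate, or None when nothing matches."""
    for pattern in DRAFT_PATTERNS:
        matches = sorted(name for name in filenames if fnmatchcase(name, pattern))
        if matches:
            # Assumed tie-break: first name in sorted order.
            return matches[0]
    return None
```

On the failing directory from the commit message, `pick_draft(["model.safetensors", "dflash-draft-3.6-q4_k_m.gguf"])` now selects the GGUF regardless of file size.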
Qwen3.6's <think> (id 248068) and </think> (id 248069) are single
special tokens in the added_tokens vocab. Both http_server.cpp paths
(streaming on_token + non-streaming feed_tokens) had explicit handling
for Gemma4's <|channel>/<channel|> but NOT for Qwen's variants. The
generic "skip <...>" filter silently dropped the Qwen tokens, so the
emitter never saw the reasoning→content transition.
Symptom: ds4-eval scored 7/92 against the C++ server. 29 cases had
close_kind="natural" + content_tokens=0 — model emitted </think>
token 248069 but emitter dropped it, all 15488 tokens ended up in
reasoning_content with empty visible content. Bench's answer extractor
saw `content=""` and reported `given=? format=False`.
After fix: when model emits token 248068/248069, forward the text form
("<think>" / "</think>\n") into the emitter so parse_reasoning splits
the response correctly. The phase-2 trigger's close-detection (which
already uses decode() and sees </think> from special-token decoding)
keeps working identically — only the response builder changes.
Also explains why close_kind="hard" cases (phase-2 fired) DID pass:
the phase-2 emit_token("</think>\n\nFinal answer: ") manually injects
the close text, bypassing the missing-mapping bug.
Validated against the failing case's logs: model emits 248069, decoder
returns "</think>", emitter now transitions and content gets populated.
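The mapping can be sketched in Python as follows. The token ids and forwarded text forms are the ones stated above; `emitter_text` is a hypothetical name, and the angle-bracket test is a simplified stand-in for the generic "skip <...>" filter in http_server.cpp, whose exact rule is not shown in this commit message.

```python
from typing import Optional

# Qwen3.6 thinking tags: special-token id -> text forwarded to the emitter.
QWEN36_THINK_TOKEN_TEXT = {
    248068: "<think>",
    248069: "</think>\n",
}


def emitter_text(token_id: int, decoded: str) -> Optional[str]:
    """Text to forward to the emitter, or None when the token is dropped."""
    forced = QWEN36_THINK_TOKEN_TEXT.get(token_id)
    if forced is not None:
        # Thinking tags must reach the emitter so it can split
        # reasoning from visible content.
        return forced
    # Assumed shape of the generic filter: drop other <...> special tokens.
    if decoded.startswith("<") and decoded.endswith(">"):
        return None
    return decoded
```

The key point is ordering: the explicit mapping is consulted before the generic filter, so the two thinking tags are no longer swallowed.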
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a "Response shape — multi-dialect aliasing" section to v2 that formalizes what dflash should emit for reasoning across DeepSeek / OpenRouter / Anthropic / OpenAI-shaped clients.

Motivation: the 2026-05-23 ds4-eval cross-server comparison against OpenRouter qwen/qwen3.6-27b showed the bench discarding ~4000 reasoning tokens because OR emits them under message.reasoning while dflash and the bench use message.reasoning_content. Same data, different field name: the current dflash impl is correct but parochial.

Plan:
- Keep reasoning_content as primary (no break).
- Add reasoning as a flat alias.
- Add reasoning_details as a typed-block list: single-block today, room for phase-1/phase-2 splits later.
- Surface phase1_tokens / finish_details.thinking_tokens into usage.completion_tokens_details.reasoning_tokens to match OpenAI o1/o3 shape and OR's normalized location.
- Patch bench_http_capability.py fallback chain to read all three.

Comparison table, full example response, per-field notes, motivation, and implementation status (all four pieces "planned, not yet shipped") included. No code change in this commit.
Implements antirez/ds4 ds4_eval.c's hard_limit_reply_budget at the
backend sampling layer. When a thinking-enabled request approaches
the budget boundary, the next sampled token is overridden with the
`</think>` close-tag token, giving the model the remaining budget
to write a visible answer with KV state continuous.
Beats Level 1 phase-2 reprompt because the model never sees a fresh
prefill — its reasoning context stays in KV cache and the answer
flows naturally after the injected close.
## Changes
- `common/model_backend.h`: new BudgetHook struct (close_token_id +
hard_limit_remaining), opt-in field on GenerateRequest. Other
backends ignore it (default-constructed = disabled).
- `qwen35_backend.{h,cpp}`: do_ar_decode honors budget_hook. Override
fires once per generation (budget_close_injected flag prevents
double-injection on subsequent loop iterations). Logs to stderr
with [budget-hook] tag when it fires.
- `qwen35_backend::generate`: when budget_hook is set, routes through
AR instead of spec-decode. Spec-decode integration is a follow-up
(the perf hit is acceptable since this only affects thinking turns).
Non-thinking turns still get full spec-decode throughput.
- `server_main.cpp`: new `--hard-limit-reply-budget N` flag (default
512, matches ds4_eval.c). At startup, tokenizes "</think>" and
caches the token ID in ServerConfig.think_close_token_id (only
populated if it's a single token — Qwen3.6 = 248069). Multi-token
close tags disable Level 2 with a warning, falling back to Level 1.
- `http_server.{h,cpp}`: ServerConfig grows hard_limit_reply_budget
+ think_close_token_id. worker_loop wires BudgetHook into
GenerateRequest when the request opts into thinking AND the server
has both knobs set.
## What it gives us
The model continues generating after the synthetic </think>, so its
visible content includes the actual answer in the budget remainder.
Mirrors ds4_eval.c's mid-stream injection exactly. Level 1 phase-2
reprompt remains as a fallback when force-close didn't fire (e.g.
model closed </think> early on its own — budget never tightened).
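A minimal Python sketch of the force-close idea, for illustration only (the shipped code is C++ in do_ar_decode). The field names come from the BudgetHook description above; the trigger condition (inject once the remaining budget drops to `hard_limit_remaining`) and the tracking of a model self-close are assumptions about semantics that the commit message does not spell out. `decode_with_budget` and `sample_next` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class BudgetHook:
    close_token_id: int        # single-token id of the close tag
    hard_limit_remaining: int  # tokens reserved for the visible reply


def decode_with_budget(sample_next: Callable[[int], int],
                       max_tokens: int,
                       hook: BudgetHook) -> tuple[list[int], bool]:
    """Run a toy decode loop; return (tokens, whether the hook fired)."""
    out: list[int] = []
    closed = False    # close tag already emitted (by the model or the hook)
    injected = False
    for step in range(max_tokens):
        token = sample_next(step)
        remaining = max_tokens - step
        # Assumed trigger: reasoning is still open and only the reserved
        # reply budget is left, so override the sampled token.
        if not closed and remaining <= hook.hard_limit_remaining:
            token = hook.close_token_id
            injected = True
        if token == hook.close_token_id:
            closed = True  # guarantees the override fires at most once
        out.append(token)
    return out, injected
```

Because generation simply continues after the injected token, the sketch keeps a single token stream, which mirrors the "KV state continuous" property described above.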
## What's deferred
- Spec-decode integration of BudgetHook. Current implementation
trades spec-decode throughput for correctness on thinking turns.
Adding the hook to do_spec_decode requires careful sequencing
around the verify-and-accept loop (codex flagged this in plan
review) — separate PR.
- Soft-close (voluntary close when </think> is in top-K). Needs
top-K logits exposure from the verifier. Level 2.5.
- Laguna and gemma4 backends — only qwen35 wired. Per codex's "start
qwen35 only, benchmark, then port" guidance.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dflash/scripts/server.py was a parallel FastAPI implementation that shadowed the C++ dflash_server. Keeping it invites bifurcation: any fix to one had to be mirrored to the other, and in practice the mirror always lagged behind.

Drops:
- dflash/scripts/server.py (3484 lines, FastAPI app)
- dflash/scripts/test_server.py (imports 14+ symbols from server)
- dflash/scripts/test_prefix_cache.py (imports build_app from server)

Updates lucebox/lucebox/profile.py to remove test_server.py from the pytest invocation and registered script_paths.

Out of scope (will break at runtime, fix in a follow-up):
- dflash/scripts/test_multi_turn_prefix_cache.py
- dflash/scripts/test_server_prefix_cache.py
- dflash/scripts/test_full_compress_cache.py
- dflash/scripts/bench_agent_loop.py
- dflash/scripts/bench_daemon.py
- dflash/scripts/bench_server.py
- dflash/scripts/parity_laguna.py
- dflash/scripts/quality_ab_simple.py
- dflash/scripts/quality_humaneval_plus.py

These reference server.py via subprocess.run / SERVER_SCRIPT path constants. They'll fail at exec time, not import time, so they don't block the deletion. Retargeting them to dflash_server is left as a separate task; the bifurcation pressure is removed.

C++ source comments still reference server.py as historical context for parity (e.g. http_server.cpp:1376 "see server.py:2271-2281"). Those are notes, not load-bearing.
OPENAI_CHAT branch in http_server.cpp now emits the reasoning text under three keys (reasoning_content, reasoning, reasoning_details), plus surfaces the phase-1 token count under usage.completion_tokens_details.reasoning_tokens. Implements docs/specs/thinking-budget.md "Response shape — multi-dialect aliasing". Bench (bench_http_capability.py) reads reasoning via fallback chain: reasoning_content -> reasoning -> reasoning_details[].text. Makes cross-server runs (sindri/vidar/OpenRouter) directly comparable; OR's reasoning is no longer silently discarded. Also drops a stale 'see server.py:2271-2281' comment now that the Python server is gone.
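The fallback chain can be sketched as below. This is not the bench's actual code: `extract_reasoning` is a hypothetical name, and concatenating multiple `reasoning_details` blocks without a separator is an assumption (the message only says the list is single-block today).

```python
def extract_reasoning(message: dict) -> str:
    """Read reasoning text via reasoning_content -> reasoning -> reasoning_details[].text."""
    # The two flat string fields, in priority order.
    for key in ("reasoning_content", "reasoning"):
        value = message.get(key)
        if isinstance(value, str) and value:
            return value
    # Typed-block list: collect the text of every block that has one.
    details = message.get("reasoning_details")
    if isinstance(details, list):
        parts = [
            block["text"]
            for block in details
            if isinstance(block, dict) and isinstance(block.get("text"), str)
        ]
        return "".join(parts)  # assumed: no separator between blocks
    return ""
```

With this chain, a response that only carries `message.reasoning` (the OpenRouter shape) yields the same text as one that only carries `message.reasoning_content`.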
Pulls in PR Luce-Org#260 (howard0su): fix(server): normalize Codex Responses tool-call follow-ups, which improves the sse_emitter THINK_OPEN/CLOSE extraction parser. Also picks up frequency/presence penalty sampling, test tightening, and gemma4 HIP runtime fixes.

# Conflicts:
#	dflash/src/server/http_server.cpp
…ant>
Two related fixes that were keeping bragi's laguna unable to bench:
1. Chat template was DeepSeek-V3 tokens (<|begin▁of▁sentence|> /
<|User|> / <|Assistant|>) — those token strings don't exist in
laguna's vocab so the model saw replacement-character garbage in
its prompt and degenerated into echoing the user message with
<���Assistant���> artifacts. Real format (verified against
poolside/Laguna-XS.2/chat_template.jinja):
〈|EOS|〉<system>
{content}
</system>
<user>
{content}
</user>
<assistant>
<think> ← if enable_thinking
</think> ← if NOT enable_thinking (empty think block)
Default system message used when one isn't supplied, matching the
upstream template's "You are a helpful, conversationally-fluent
assistant made by Poolside..." default.
2. laguna_target_loader.cpp left eos_chat_id = -1 when the GGUF only
ships tokenizer.ggml.eos_token_id (id 2 = 〈|EOS|〉) without an
eot_token_id. With eos_chat_id=-1 the decoder check
`next == eos_chat_id` never matches, so the model emits its end-of-turn
</assistant> marker, nothing stops it, and it re-greets the user.
Default to 24 (= </assistant>, the chat-template EOT) when the
GGUF doesn't supply eot — matches the constant already in
laguna_internal.h.
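The fallback in fix 2 can be sketched in Python (the real change is in laguna_target_loader.cpp). The metadata key `tokenizer.ggml.eot_token_id` is an assumed spelling modeled on the `tokenizer.ggml.eos_token_id` key named above, and `resolve_eos_chat_id` is a hypothetical name.

```python
LAGUNA_EOT_FALLBACK_ID = 24  # </assistant>, the chat-template end-of-turn token


def resolve_eos_chat_id(metadata: dict) -> int:
    """Pick the token id that ends an assistant turn."""
    # Assumed key name for the optional end-of-turn id in GGUF metadata.
    eot = metadata.get("tokenizer.ggml.eot_token_id")
    if isinstance(eot, int) and eot >= 0:
        return eot
    # GGUF ships no eot id: fall back to the template's </assistant> token
    # instead of leaving the id at -1, which can never match a sampled token.
    return LAGUNA_EOT_FALLBACK_ID
```

A GGUF that only carries the plain EOS id (2) therefore still gets a usable stop condition for chat turns.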
Validated post-fix: 17×23 prompt returns "Answer: 391" cleanly
(was repeating answer endlessly with the prior bugs).
Sindri/integration: change is in the same chat_template.cpp file
041f491 refactored (thinking_preamble removed); LAGUNA case rewritten
in place. No conflict with 7786b35 (Phase-2 removal) — that touched
http_server only.
The locked file was stale relative to pyproject.toml — uv sync --frozen failed at docker build, and the fallback uv sync (re-resolve) hit nvidia-cudnn-cu12 / cu128 version conflicts. Local uv sync produced this updated lock; committing so docker builds succeed without needing --no-cache hacks.
… partial

Three snapshots from the 2026-05-25 sweep matrix:
- bragi gemma-4-26b nothink: 72/92 = 78.3%, 42 min wall, 99.6 tok/s agg (wall-based). Beats OR-hosted nothink (73.9%) on both quality and speed. Run at temp=1.0/top_p=0.95/top_k=64 via the new --sampling-from-card flag.
- bragi gemma-4-26b think: 75/92 = 81.5%, 145 min wall, 101.4 tok/s agg. +3.2pp over local nothink, +10.5pp over OR-hosted think (71%). The new <channel|>\n\n transition cue + 4096 hard_limit_reply_budget let thinking-mode runs actually finalize (vs the pre-fix hits-length-and-degenerates behavior).
- bragi gemma-4-31b nothink: partial. 8/8 PASS before container OOMed at case 9 (697-token prompt's prefill graph needed 123 MiB beyond the 23.4/24 GB ceiling). Server-log per-case rate ~19 tok/s. Full 31b sweep is impractical on bragi-class consumer 24GB hardware (math: 19GB model + 1.6GB draft + ~390 KB/token KV cache = ~5k context max).

Per discussion: 31b deferred until larger card available.
OR provider routing for poolside/laguna-xs.2:free was ignoring our
existing `thinking:{type}` and `chat_template_kwargs.enable_thinking`
fields. Empirically validated only top-level `reasoning_effort`
(OpenAI-shape) propagates — `reasoning_effort: "none"` cleanly
disables reasoning_tokens emission. Other shapes
(`reasoning:{effort:"minimal"}`, `reasoning:{exclude:true}`,
`extra_body.chat_template_kwargs.enable_thinking:false`) all left
reasoning enabled. Added to run_case body so OR runs no longer
silently keep reasoning on — without this, the 2026-05-24 fill-matrix
laguna data showed identical 55/92 for "think" and "nothink" (both
were actually thinking).
Laguna sweep results, ds4-eval-92, temp=0.6/top_p=0.95/top_k=50,
parallel=1:
- bragi LOCAL Laguna-XS.2-Q4_K_M (with the chat-template fix
92f84cd + eos_chat_id=24 fallback + Poolside speculator):
42/92 = 45.7%, true nothink (thk=0 every row), ~118 tok/s decode.
- OR poolside/laguna-xs.2:free BF16 (with reasoning_effort=none):
47/92 = 51.1%, true nothink (reasoning_tokens=0).
The ~5pp gap is the Q4_K_M quantization vs OR's BF16, plus any
provider-side post-decode handling. Within sampling noise on 92
cases. Confirms the laguna stack on bragi is now correctly tuned.
Closes out the laguna comparison stack. Both backends, both modes,
sequential (parallel=1) for fair comparison:
                          nothink          think
    bragi LOCAL Q4        42/92 = 45.7%    48/92 = 52.2%
    OR :free BF16         47/92 = 51.1%    53/92 = 57.6%
    Δ (OR - bragi)        +5.4pp           +5.4pp
    Δ (think - nothink)   bragi +6.5pp     OR +6.5pp
The Q4→BF16 quantization gap is identical (5.4pp) in both modes,
confirming it's pure quant loss rather than a chat-template or
eos handling difference. The nothink→think lift is identical
(6.5pp) in both backends, confirming laguna's thinking mechanism
provides consistent uplift regardless of quantization.
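The percentage-point deltas in the comparison can be reproduced from the raw pass counts. A small worked check (helper names are illustrative):

```python
TOTAL_CASES = 92  # ds4-eval-92


def pass_rate_pct(passed: int, total: int = TOTAL_CASES) -> float:
    """Pass rate as a percentage."""
    return 100.0 * passed / total


def gap_pp(a_passed: int, b_passed: int, total: int = TOTAL_CASES) -> float:
    """Difference between two pass rates, in percentage points."""
    return pass_rate_pct(a_passed, total) - pass_rate_pct(b_passed, total)


print(f"OR - bragi, nothink: {gap_pp(47, 42):+.1f}pp")
print(f"OR - bragi, think:   {gap_pp(53, 48):+.1f}pp")
print(f"think - nothink, bragi: {gap_pp(48, 42):+.1f}pp")
print(f"think - nothink, OR:    {gap_pp(53, 47):+.1f}pp")
```

In raw counts, the 5.4pp gap is 5 cases out of 92 and the 6.5pp lift is 6 cases out of 92, which is worth keeping in mind when reading the identical deltas.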
Verifies the 2026-05-25 laguna fix stack works end-to-end:
- chat template (92f84cd) renders <system>/<user>/<assistant>/<think>
correctly (vs the prior DeepSeek-token garbage)
- eos_chat_id=24 fallback stops the model cleanly mid-stream
- reasoning_effort knob (417009d) lets OR truly disable thinking
- DFlash speculator load works via existing safetensors path
Single AIME case (aime2025-02, correct=588) at 6 budget points shows the diminishing-returns curve clearly:
- B=512/1024: model can't reach answer before force-close
- B=2048: sweet spot; model self-closes, gets 588 ✓
- B=4096..16384: identical correct answer at 2-3× wall time

Reply phase length is constant ~4k tokens regardless of budget (hard_limit_reply_budget=4096 governs that). Decode rate constant ~101 tok/s; wall scales linearly with reasoning tokens emitted.

Closes out the original /loop directive's "visibility into the optimal config for thinking budget" item. Full summary in _summary.md.
The bench was sending temperature/top_p explicitly with hardcoded
greedy values (temp=0, top_p=1.0) in every request, defeating the
server's card-fallback path at http_server.cpp:761-765:
req.sampler.temp = body.value("temperature",
sd.has_temperature ? sd.temperature : 0.0f);
When the bench includes the field, the body.value() never reads the
card default. Effect: gemma4 (card recommends temp=1.0/top_p=0.95/
top_k=64) was forced to greedy on every bench, triggering its known
`- - - -` degenerate-decode collapse (see docs/experiments/gemma4-26b-
thinking-control-2026-05-25.md). The just-completed sindri sweep
shows 3.3% pass on think mode and 47.8% on nothink vs bragi's
81.5% / 78.3% — entirely explained by this miss.
Inconsistent prior behaviour: top_k was already conditional (sent
only when > 0). temperature and top_p were not. This patch makes all
three consistent — sent only when explicitly set via flags. Default
behaviour now is "let the server pick", which means card defaults
for dflash, provider defaults for OpenRouter/Anthropic.
--sampling-from-card is kept as a deprecated no-op so existing
scripts don't break (logs a notice if passed).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
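The "send only when explicitly set" rule for the three sampling fields can be sketched as follows. `sampling_fields` is a hypothetical helper, not the bench's actual function; using `None` to mean "flag not passed" is an assumption about how the CLI defaults are represented.

```python
from typing import Optional


def sampling_fields(temperature: Optional[float] = None,
                    top_p: Optional[float] = None,
                    top_k: Optional[int] = None) -> dict:
    """Build only the sampling keys the user explicitly asked for."""
    fields: dict = {}
    if temperature is not None:
        fields["temperature"] = temperature
    if top_p is not None:
        fields["top_p"] = top_p
    # top_k keeps its prior rule: sent only when > 0.
    if top_k is not None and top_k > 0:
        fields["top_k"] = top_k
    return fields


# No flags: the request body carries no sampling keys, so the server's
# card-fallback path (or the provider default) decides.
body = {"messages": [{"role": "user", "content": "hi"}], **sampling_fields()}
```

An explicit `temperature=0.0` is still sent under this sketch, so greedy decoding remains available when someone asks for it by flag.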
The luce-bench package (https://github.com/easel/luce-bench) v0.2.3 now ships all of the HTTP-based capability benchmarks (ds4-eval, HumanEval, longctx, agent, forge) plus the thinking-control probe as a stdlib-only, uvx-installable framework with 53 tests. Folding luce-hub's duplicate copies in deletes ~20k lines of code that we were maintaining in two places.

Deleted (Tier 1: pure duplicates of luce-bench):
  bench_ds4_eval, bench_humaneval, bench_he_http, bench_longctx, bench_http_capability (1602 LOC), bench_http_frontiers, probe_thinking_control, bench_daemon, bench_server, lucebox_bench (987 LOC), test_lucebox_bench (622 LOC), bench_agentic_session, bench_agentic_tools

Deleted (Tier 2: vestigial native-binary dev-cycle benches; the test_dflash binary itself stays and the run.py / placement / quality / parity / examples scripts that drive it are unaffected):
  bench_he, bench_llm, bench_agent, bench_long_ctx, bench_agent_loop, bench_agent_cases

Deleted (vendored fixtures now in luce-bench):
  fixtures/forge_eval/ (forge-guardrails 0.7.1 runtime + scenarios)
  fixtures/humaneval/cases.json
  fixtures/ds4_eval_cases.json

Updated:
- pyproject.toml: add `luce-bench` dep pinned to git tag v0.2.3 (switch to PyPI version pin once trusted-publishing lands); drop deleted-file entries from ruff `include` list.
- dflash/pyproject.toml: drop the `eval` extra (anthropic SDK now lives behind luce-bench's `[forge]` extra).
- dflash/scripts/entrypoint.sh: `benchmark` subcommand now execs `python -m lucebench.cli` so `docker run image benchmark …` keeps working with the new framework.
- lucebox/lucebox/profile.py: drop 6 StepDefinitions that subprocessed deleted scripts (benchmark.http_frontiers, quality.capability_smoke, quality.ds4_eval, quality.capability_long, quality.agentic_tools, benchmark.agentic_session). Profile registry shrinks from 9 → 3 steps (health.props, benchmark.autotune_latest, test.python_unit pointing only at the lucebox/tests suite now).
- lucebox/tests/test_profile.py: point removed step IDs at remaining steps or delete the test (ds4_eval-specific argv test).
- .gitignore: exclude .claude/ (session worktrees + agent scratch).

Validation:
  uv sync → luce-bench==0.2.3 from git
  uv run pytest lucebox/tests -q → 19/19 passed
  uv run ruff check → All checks passed
  uv run ruff format --check (touched) → clean

Sweep continuity: the in-flight 26b --think sweep (luce-bench writing to baselines/bragi-rtx5090laptop-gemma4-26b-2026-05-26-sweep-think/) keeps running; no GPU disruption.

Future cleanups (not in this commit):
- PyPI publish luce-bench, switch `tool.uv.sources` to a plain version pin instead of git tag.
- dflash/docs/{experiments,run-requests,RESULTS}.md still cite `bench_*.py` by name; those are historical refs that don't affect any runtime and can be swept on a docs-only pass.
The luce-bench-baselines repo (https://github.com/easel/luce-bench-baselines)
is now the canonical home for all benchmark snapshot data. All 41
snapshot dirs (plus the handful of top-level SUMMARY/profile/log files)
that lived under dflash/docs/tuning-snapshots/ have been mirrored over in
luce-bench-baselines commit 067e7c9; this commit drops them from
lucebox-hub so future clones don't carry 43MB of historical data the
server build doesn't need.
New sweeps already land in luce-bench-baselines directly via:
    uvx --from luce-bench luce-bench --sweep --name <host-model-date> \
        --out-dir /path/to/luce-bench-baselines \
        --base-url http://<server>:8080
.gitignore picks up the path so an accidental write here stays untracked.
Mirror is verified: every dir/file removed here exists at the same name
in the baselines repo.
The intent shipped in 925d41f's commit message but the file change got lost — the cat >> happened after the `git add` for that commit. Adding it now as a one-line follow-up so any future write to that path stays untracked (snapshots live in luce-bench-baselines).
Brings in 52 upstream commits since merge-base 8c23234 (2 days ago). The
headline is PR Luce-Org#269 (`403e598 feat(cpp-server): thinking-budget v2
+ multi-dialect reasoning + model-card sidecars`), the squashed version
of our own thinking-budget v2 work that we'd been carrying as 50+ small
commits on this branch.
Plus:
- PR Luce-Org#262 howard0su/powerinfer: hybrid MoE spec-decode with
  DFlash draft, GPU/CPU FFN overlap, persistent pre-FFN graph for
  DeltaNet, 4-5x decode speedup, MoE perf telemetry
- PR Luce-Org#263 weicj/feat-cpp-server-pflash-draft-placement:
  mixed-backend PFlash phase split
- 648a6e2 perf: GPU-resident hybrid decode (eliminate PCIe round-trips)
- d12ddde fix: dynamic placement uses --max-ctx instead of hardcoded 8192
Conflict resolution: where this branch has post-Luce-Org#269 refinements
(gemma4 timings via 4e9abda, degenerate-decode watchdog via c2d725f /
8538ff9, laguna chat-template fix via 92f84cd, transition cue via
16bb31e, thinking-budget force-close fix via b86342d, etc.), keep ours.
Where the only divergence is "this branch has 50 small commits that
403e598 squashes," accept the merged result.
# Conflicts:
#   dflash/scripts/server.py
#   dflash/src/qwen35/qwen35_backend.cpp
#   dflash/src/qwen35/qwen35_backend.h
#   dflash/src/server/http_server.h
#   dflash/src/server/sse_emitter.cpp
#   dflash/test/test_server_unit.cpp
#   share/model_cards/laguna-xs.2.json
|
@easel great work! I would like to promote to main asap. Can you rebase and integrate with harness autorun scripts? |
…name)
Brings in PR Luce-Org#281 (chore: rename dflash→server, pflash+megakernel
→ optimizations/) + small docs polish 080f89b.
Our lucebox/ Python package (added by us in 2560086, never upstream) is
untouched. Our docs additions under dflash/docs/* are migrated to
server/docs/*. Our deletions of bench scripts confirmed against the new
server/scripts/* paths.
Workspace members in pyproject.toml:
["server", "lucebox", "optimizations/megakernel", "optimizations/pflash"],
preserving our lucebox member alongside upstream's renamed paths.
# Conflicts:
#   README.md
#   pyproject.toml
#   server/docs/BENCHMARK_SNAPSHOT_SPEC.md
#   server/docs/experiments/cache-impact-2026-05-24.md
#   server/docs/experiments/gemma4-26b-thinking-control-2026-05-25.md
#   server/docs/experiments/kv-cache-q4-vs-tq3-2026-05-25.md
#   server/docs/experiments/thinking-control-protocol.md
#   server/docs/experiments/thinking-mechanism-explainer.md
#   server/docs/run-requests/area-swe-bench-integration.md
#   server/docs/run-requests/bragi-gemma4-laguna-config-issues.md
#   server/docs/run-requests/forge-vs-vidar-ds4f.md
#   server/docs/run-requests/luce-dflash-think-92.md
#   server/docs/run-requests/qwen36-budget-signaling-overhaul.md
#   server/docs/run-requests/qwen36-hard-limit-reply-budget-bump.md
#   server/docs/run-requests/sindri-rtx3090ti-qwen36-nothink-92.md
#   server/scripts/bench_agent.py
#   server/scripts/bench_agent_loop.py
#   server/scripts/bench_daemon.py
#   server/scripts/bench_he.py
#   server/scripts/bench_he_http.py
#   server/scripts/bench_llm.py
#   server/scripts/bench_server.py
#   server/scripts/entrypoint.sh
#   server/scripts/fixtures/agent_cases/cases.json
#   server/scripts/server.py
#   server/scripts/test_prefix_cache.py
#   server/scripts/test_server.py
v0.2.4 fixes the broken `--area forge` path that was raising `TypeError: EvalConfig.__init__() got an unexpected keyword argument 'client_factory'` against any HTTP backend. Upstream commit easel/luce-bench@59e01fc realigns areas/forge.py with the refactored EvalConfig dataclass + run_scenario(client, scenario, config) signature. No other behaviour changes in v0.2.4. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The standalone github.com/easel/luce-bench repo was an awkward split
for what is really part of the same engineering surface as the server.
Bringing it in-tree so PRs, CI, and reviews live alongside the server
work that the benches exercise.
Layout:
- luce-bench/ is now a uv workspace member alongside server, lucebox,
optimizations/megakernel, optimizations/pflash.
- `[tool.uv.sources] luce-bench = { workspace = true }` replaces the
prior `git = "https://github.com/easel/luce-bench.git", tag = "v0.2.4"`
pin. Future bumps land here directly.
- luce-bench keeps its own pyproject.toml with `name = "luce-bench"`
and a `version` it manages independently of the monorepo's release
cadence — PyPI sees the same package name.
Release flow (.github/workflows/release-luce-bench.yml):
- Triggered on tag pushes matching `luce-bench-v*` (e.g. luce-bench-v0.2.5).
- Asserts the tag's version suffix matches `luce-bench/pyproject.toml`.
- Builds wheel + sdist from luce-bench/, publishes via PyPI trusted
publishing (OIDC) under the `pypi` environment. Set up the trusted
publisher in the PyPI project once: repo=easel/lucebox-hub,
workflow=release-luce-bench.yml, environment=pypi.
The standalone repo will be archived (read-only) — existing tag pins
keep resolving for anyone consuming v0.2.4 or older from there. Files
copied at v0.2.4 (commit easel/luce-bench@59e01fc): src/, tests/,
README.md, NOTICE, LICENSE, pyproject.toml.
53 lucebench tests pass against the in-tree workspace install.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The bench-script cleanup (89b1dfe) left profile.py hollowed out: its 6
benchmark StepDefinitions were subprocessing scripts that no longer
exist, leaving only 3 trivial probes (health.props, autotune-report
read, pytest). The snapshot/hash/dedup machinery was sitting on a
near-empty registry.
The whole point of `lucebox profile` is "capture performance snapshots
that feed autotune", and that's exactly what luce-bench now produces.
Wire the framework to it instead of deleting it.
Adds 4 luce-bench-driven StepDefinitions to the registry:
- benchmark.code: `lucebench.cli --area code` (HumanEval, 10 cases)
- benchmark.longctx: `lucebench.cli --area longctx` (6 cases)
- benchmark.agent: `lucebench.cli --area agent` (4 cases, mt=4096)
- quality.ds4_eval: `lucebench.cli --area ds4-eval --think
  --max-tokens 16000 --timeout 1800` (full 92-case suite, score-only)
Each one builds its argv via the new `_luce_bench_area_argv()` factory.
Output JSON lands in the framework-owned dest dir so the existing
snapshot/hash/dedup pipeline ingests it without changes.
The framework's content-addressed hash (by hardware + model + tunables)
is unchanged: re-running profile with the same config short-circuits to
the cached snapshot; changing any tunable forces a fresh capture.
Test coverage:
- test_luce_bench_argv_shape_for_each_area: verifies argv shape for
  all 4 luce-bench steps + the per-step knobs (think mode, max_tokens,
  timeout, model). Catches regressions in the argv builder without
  needing a live server.
Plus a small Dockerfile fix: UV_PYTHON_INSTALL_DIR=/opt/uv/python so the
venv's python interpreter is world-readable. Default location is
`/root/.local/share/uv/python/`, which non-root container UIDs cannot
traverse; that broke `lucebox.sh check` and every other host-wrapper
subcommand. The container runs as the host UID for bind-mount sanity
(config files in $HOME stay user-owned), so the interpreter has to live
somewhere world-traversable.
Lucebox tests: 20/20 pass.
Ruff + format clean.
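To illustrate the content-addressed hash (by hardware + model + tunables) that this message relies on, here is a minimal sketch. It is not the profile.py implementation: `snapshot_key`, the field layout, and the choice of SHA-256 over canonical JSON are all assumptions made for the example.

```python
import hashlib
import json


def snapshot_key(hardware: dict, model: dict, tunables: dict) -> str:
    """Derive a stable key from the three inputs (hypothetical layout).

    Canonical JSON (sorted keys, fixed separators) makes the key
    independent of dict insertion order.
    """
    payload = {"hardware": hardware, "model": model, "tunables": tunables}
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def needs_capture(key: str, cached_keys: set[str]) -> bool:
    """A known key short-circuits to the cached snapshot; a new one captures."""
    return key not in cached_keys
```

Under this scheme an identical config maps to the same key and is skipped, while changing any single tunable yields a different key and forces a fresh capture.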
PR Luce-Org#281 moved dflash/ → server/. The pull_request `paths:` filter still targeted dflash/* — so PRs touching the C++ server code wouldn't trigger the Docker prebuild sanity check. Repoint to server/ so CI catches Dockerfile / source regressions before merge.
server_main.cpp had two identical 119-line blocks of the thinking-budget
v2 model-card resolution code (general.name / general.architecture
read, resolve_model_card(), ServerConfig application, tier clamping).
g++ errored out on redeclarations:
redeclaration of 'std::string general_name'
redeclaration of 'std::string general_arch'
redeclaration of 'dflash::common::ModelCard card'
redeclaration of 'const int tier_ceiling'
conflicting declaration 'auto clamp_tier'
The duplicate was a merge artifact from 1df9099 (luce-org/main into
integration/-clean post-rename). Upstream PR Luce-Org#269 squashed our pre-PR
work into 403e598 and our integration branch carried the same code in
unsquashed form; the 3-way merge kept both copies.
Also drop a redundant `#include "gguf.h"` (lines 23 and 25 of the same
file). Harmless thanks to include guards but ugly merge residue.
Build proceeds past cmake configure with the dedupe.
GitHub Actions only picks up workflows from the repo-root `.github/workflows/`; the nested `luce-bench/.github/workflows/ci.yml` was inherited from the standalone repo but never fires here. Its publish job is also superseded by `release-luce-bench.yml`. The nested `.gitignore` mostly duplicated root entries; moved its one unique pattern (`luce-bench/snapshots/` for --sweep output) into the root `.gitignore`. Also fixes a stale `-> "_RecordingAnthropicClient"` forward reference in areas/forge.py that the root ruff configuration flags (the class is in scope where the annotation is evaluated; the quotes are dead). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR 490ff95 absorbed luce-bench into the monorepo as a uv workspace
member, but didn't update the Dockerfile. The runtime stage's `uv sync`
would fail looking for /opt/lucebox-hub/luce-bench/pyproject.toml
because:
1. The builder stage's COPY block (~line 115-120) listed lucebox,
   optimizations/pflash, optimizations/megakernel, but not luce-bench.
   uv sync needs every workspace member's pyproject.toml to be present
   in the build context.
2. The runtime stage's COPY --from=builder block (~line 168) only
   pulled lucebox, server, optimizations across, not luce-bench.
Add the two COPYs (builder source + runtime install) so the workspace
resolution path is complete. No changes to cmake stage; CUDA build cache
should still hit.
Wraps luce-bench in the same start-server → run-client → save-logs →
stop-server pattern as the other harness/clients/run_*.sh launchers
(run_codex.sh, run_claude_code.sh, etc.). Since luce-bench is just an
HTTP client of /v1/chat/completions, it fits the existing client
abstraction natively.
Why: operators get a uniform way to invoke luce-bench ("did this
server change break it?") alongside real-client smoke tests. A
regression in luce-bench surfaces in the harness sweep matrix the
same way an OpenCode or Hermes regression does — same launcher
contract, same logs, same finish_report shape.
Defaults: --no-think (4x faster on gemma-4-26b per the 2026-05-26
think/nothink comparison), full sweep mode (--sweep), 300s per-case
timeout, single-thread.
Knobs (env):
LUCEBENCH_AREA single area override (else --sweep)
LUCEBENCH_THINK 1 → --think, 0 → --no-think (default 0)
LUCEBENCH_MAX_TOKENS per-request decode cap override
LUCEBENCH_TIMEOUT per-case wall timeout (default 300s)
LUCEBENCH_PARALLEL in-flight concurrency (default 1)
All harness/common.sh knobs apply (MODEL_SERVER, LUCEBOX_SERVER_BACKEND,
MAX_CTX, BUDGET, EXTRA_SERVER_ARGS, etc.).
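The env-knob-to-argv mapping above can be sketched as a small Python function. This is an illustration, not the body of run_lucebench.sh: `lucebench_argv` is a hypothetical name, and `--parallel` is an assumed flag spelling, since the message names the LUCEBENCH_PARALLEL knob but not the CLI flag it maps to. The other flags (`--area`, `--sweep`, `--think`, `--no-think`, `--max-tokens`, `--timeout`) appear elsewhere in this PR.

```python
from collections.abc import Mapping


def lucebench_argv(env: Mapping[str, str]) -> list[str]:
    """Translate LUCEBENCH_* knobs into luce-bench CLI arguments."""
    argv: list[str] = []

    # A single-area override replaces the default full sweep.
    area = env.get("LUCEBENCH_AREA")
    argv += ["--area", area] if area else ["--sweep"]

    # Default 0 selects --no-think; 1 selects --think.
    think = env.get("LUCEBENCH_THINK", "0") == "1"
    argv.append("--think" if think else "--no-think")

    # The decode cap is only passed when explicitly overridden.
    max_tokens = env.get("LUCEBENCH_MAX_TOKENS")
    if max_tokens:
        argv += ["--max-tokens", max_tokens]

    argv += ["--timeout", env.get("LUCEBENCH_TIMEOUT", "300")]
    # ASSUMPTION: `--parallel` is a guessed flag name for the concurrency knob.
    argv += ["--parallel", env.get("LUCEBENCH_PARALLEL", "1")]
    return argv
```

Calling `lucebench_argv(os.environ)` would reproduce the documented defaults when no knob is set.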
Output: $LOG_DIR/lucebench-{area,sweep}.{json,md} + lucebench.out
(stdout/stderr) + server.log. Slots into the existing run-dir layout
under /workspace/lucebox-client-harness-runs/<stamp>/.
Docs entry added to harness/clients/README.md.
Promotes harness/ from "loose shell scripts + one stdlib Python file"
to a proper uv workspace member that owns the "run X against a Lucebox
server" abstraction. Both lucebox profile and harness/clients/*.sh
launchers now go through the same Python entry point.
What's new:
- harness/pyproject.toml — name = "harness", dependencies = ["luce-bench"].
Stdlib-only at runtime (luce-bench itself is stdlib-only; anthropic
only via [forge] extra). Fresh test boxes can install with zero
external wheel downloads.
- harness/harness/ package:
- bench.run_bench(base_url, area, ...) — Python function form of
harness/clients/run_lucebench.sh. Composes the lucebench.cli argv
internally. Single source of truth for "run a luce-bench area
against a server", returns the parsed JSON snapshot.
- clients/claude_code.launch(base_url, model, prompt, interactive)
+ claude_env() helper — the env-var contract that points Claude
Code at a Lucebox server (ANTHROPIC_BASE_URL, telemetry-off
knobs, NONSTREAMING_FALLBACK kill). Used by both the new
`lucebox claude` subcommand and the existing
harness/clients/run_claude_code.sh wrapper.
- lucebox/lucebox/cli.py — new `claude` subcommand. Probes for a live
/health, looks up base URL, exec's claude on the host with the
right env via harness.clients.claude_code.launch. Interactive by
default (full TUI); `--prompt` makes it a one-shot run.
- lucebox/lucebox/profile.py — _luce_bench_area_argv now delegates
to `python -m harness.bench` instead of `python -m lucebench.cli`.
All four bench StepDefinitions (code, longctx, agent, ds4-eval)
ride that path. Framework still owns the JSON snapshot path via
--json-out (new arg added to harness.bench too).
- pyproject.toml — adds harness to workspace members + sources,
declares harness as a root dep so uv sync installs it.
Tests:
- lucebox/tests/test_profile.py — updated argv-shape assertion for
every bench step: expects `harness.bench` instead of `lucebench.cli`.
- All 20 lucebox tests pass. luce-bench's 53 tests still pass.
- Ruff + format clean across lucebox/ + harness/.
Validation (live, against newly-rebuilt gemma-4-26b server):
- lucebox-hub:cuda12 image rebuilt with permission fix +
workspace luce-bench + harness COPY (separate Dockerfile commit).
- `docker run lucebox-hub:cuda12 serve` brought gemma up in ~15s.
- `python -m harness.bench --area forge --base-url http://localhost:8080`
ran 30 cases against gemma in both think + nothink modes
(0/30 both, ValidationError — separate luce-bench[forge]
adapter bug, not a regression of this PR).
This is the "interface luce-bench through the harness" shape the
README hinted at — one Python module, one CLI, one shell wrapper,
all converging on the same env-config + argv-building logic.
+ Dockerfile COPY harness + run_lucebench.sh via harness.bench
Builds on 7bbf9af (harness as workspace member). All six
harness/clients/ launchers now exist as both shell wrappers AND Python
modules with the same launch() contract. Five new `lucebox <client>`
Typer subcommands.
New launchers in `harness.harness.clients`:
- codex (writes config.toml; Responses API wire format)
- opencode (writes opencode.json; AI-SDK OpenAI-compatible provider)
- hermes (writes config.yaml + .env; chat_completions wire format)
- pi (writes agent/{settings,models}.json; openai-responses)
- openclaw (writes JSON config patch merged at startup)
Each module:
- Resolves binary via $<X>_BIN env, $PATH, or test-box convention
  ($CLIENT_WORK_DIR/clients/<x>/...) using the shared
  `_common.find_bin()`
- Writes per-run config into a tempdir so the user's real client
  state is untouched
- exec()s with stdio inheritance (interactive TUI) or stdin from
  /dev/null + optional timeout (non-interactive `--prompt`)
- Provides a `main()` for ad-hoc CLI use (harness-codex, etc.)
- Stdlib-only at runtime
Five new `lucebox <client>` Typer subcommands in lucebox/lucebox/cli.py:
- `lucebox codex|opencode|hermes|pi|openclaw [--prompt P] [--url U] [--model M]`
- Shared `_detect_server_url()` probes the standard localhost/docker
  bases for /health, picks the first responder
- Shared `_exec_client()` does the launcher dispatch + typer.Exit
  translation
- Each subcommand is ~10 lines: import the launcher, call the helper
Dockerfile:
- COPY harness /src/harness in the builder stage (alongside lucebox,
  luce-bench, optimizations/{pflash,megakernel}) so uv sync resolves
  the workspace member
- COPY --from=builder /src/harness /opt/lucebox-hub/harness in the
  runtime stage so profile.py inside the container can
  `python -m harness.bench` (the path it uses since 7bbf9af)
harness/clients/run_lucebench.sh:
- Switch the underlying call from `python -m lucebench.cli` to
  `python -m harness.bench` for consistency with profile.py's
  delegation path. Both go through harness.bench now.
Tests: 20/20 lucebox tests still pass. Ruff + format clean.
Out of scope (deferred):
- openwebui + openwebui-tools: separate web service lifecycle
  (start-server-in-background, poll for ready, etc.); port as a
  follow-up if needed
- lucebox.sh host-side dispatch: currently `lucebox.sh <client>`
  routes into the container (where the client binary isn't installed);
  need a host-side `cmd_client` that runs the client binary on the
  host. Works today via `uv run python -m lucebox <client>` directly.
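The "probe the candidate bases for /health, pick the first responder" behaviour of `_detect_server_url()` could look roughly like the sketch below. It is a stdlib-only illustration rather than the code in lucebox/lucebox/cli.py: the function names, the 1-second timeout, and the candidate list passed by the caller are assumptions (the message does not enumerate the "standard" bases).

```python
import urllib.error
import urllib.request
from collections.abc import Callable, Iterable
from typing import Optional


def health_ok(base_url: str, timeout: float = 1.0) -> bool:
    """Return True if GET <base_url>/health answers with HTTP 200."""
    url = f"{base_url.rstrip('/')}/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, DNS failure, or malformed URL.
        return False


def detect_server_url(
    bases: Iterable[str],
    probe: Callable[[str], bool] = health_ok,
) -> Optional[str]:
    """Return the first base URL whose /health probe succeeds, else None."""
    for base in bases:
        if probe(base):
            return base
    return None
```

Keeping the probe injectable lets a test exercise the ordering logic without a live server, for example `detect_server_url(["http://localhost:8080"], probe=lambda b: True)`.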
…/LICENSE)
uv builds harness inside an isolated sandbox at /src/harness/, where the
parent ../LICENSE file is not visible. hatchling errored out:
OSError: License file does not exist: ../LICENSE
hint: `harness` was included because `lucebox-hub` (v0.0.0) depends on `harness`
Switch to inline `license = { text = "Apache-2.0" }` (same pattern lucebox/
and server/ use). Matches PyPA 2025 guidance and avoids the sandbox-path
trap. The text of LICENSE itself stays at repo root.
Old auto-detect had a hardcoded Qwen3.6 preference: when both
Qwen3.6-27B-Q4_K_M.gguf and gemma-4-26b-a4b-it-Q4_K_M.gguf were
present in models/, Qwen always won, silently. This hid a real bug
during the 2026-05-27 matrix run — the container was supposed to
serve gemma4 but ran qwen for 10 hours of sweep-think before the
operator noticed bench numbers were wrong (and the wrong-target draft
slowed decode 4× because the qwen draft GGUF was q4_k_m vs gemma's
q8_0, causing every think-mode case to hit the 300 s bench timeout).
New behavior:
- Find ALL .gguf candidates ≥5 GB outside models/draft/ (the size
threshold cleanly excludes draft GGUFs without parsing GGUF
arch metadata).
- 0 candidates → die with clear message + DFLASH_TARGET hint
- 1 candidate → use it, log "Auto-detected target: <name>"
- 2+ candidates → WARN with the full list (marking the choice with
*), tell the operator to pin DFLASH_TARGET=<path>, then pick the
first alphabetically (deterministic across runs).
Trade-off: gemma4 wins over qwen3.6 alphabetically. That's not a value
judgment — it's just deterministic. The point is the warn-loudly path,
not the choice of which model wins by default. Operators with both
models present MUST set DFLASH_TARGET to skip the warning.
The hardcoded Qwen3.6 family preference path is gone — fundamentally
the wrong shape (silently picking based on a name pattern). If we
want a "preferred model" knob later it should be DFLASH_PREFERRED_TARGET
or similar, with the same warn-when-multiple-candidates rule.
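The new selection rule can be sketched in Python as follows. This illustrates the described behaviour, not the actual auto-detect script: the function names are hypothetical, sizes are passed in rather than read from disk so the logic stays testable, "5 GB" is taken as decimal, and the sort is case-insensitive on the assumption that this is what makes gemma4 sort ahead of Qwen3.6 (a plain byte-order sort would put the uppercase "Q" first).

```python
from pathlib import PurePosixPath

MIN_TARGET_BYTES = 5 * 10**9  # ASSUMPTION: "5 GB" read as decimal gigabytes


def target_candidates(files, min_bytes=MIN_TARGET_BYTES):
    """Filter (path relative to models/, size in bytes) pairs to targets."""
    keep = []
    for rel_path, size in files:
        path = PurePosixPath(rel_path)
        if path.suffix != ".gguf":
            continue
        if path.parts and path.parts[0] == "draft":  # skip models/draft/
            continue
        if size >= min_bytes:
            keep.append(rel_path)
    # ASSUMPTION: case-insensitive ordering, so "gemma..." precedes "Qwen...".
    return sorted(keep, key=str.casefold)


def pick_target(files, log=print):
    """Apply the 0 / 1 / 2+ candidate rules and return the chosen path."""
    candidates = target_candidates(files)
    if not candidates:
        raise SystemExit("no target GGUF found; set DFLASH_TARGET=<path>")
    chosen = candidates[0]
    if len(candidates) == 1:
        log(f"Auto-detected target: {chosen}")
    else:
        log("WARN: multiple target candidates; pin DFLASH_TARGET=<path>")
        for name in candidates:
            log(f"  {'*' if name == chosen else ' '} {name}")
    return chosen
```

The choice stays deterministic across runs because it depends only on the sorted candidate names, never on directory listing order.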
|
Superseded by a flattened single-commit branch. New PR forthcoming. |