docker, cli, smoke, bench and autotune first-run dx by easel · Pull Request #226 · Luce-Org/lucebox-hub

easel · 2026-05-19T13:13:35Z

No description provided.

cubic-dev-ai

14 issues found across 37 files


easel · 2026-05-20T18:26:58Z

Addressed the review issues identified by cubic in commit 067f4ac.

Local verification:
- uv run --frozen --extra dev ruff check .
- uv run --frozen --extra dev python -m mypy --package lucebox
- uv run --frozen --with pytest pytest dflash/scripts/test_lucebox_bench.py dflash/scripts/test_server.py lucebox/tests -q
- bash -n lucebox.sh && bash -n dflash/scripts/entrypoint.sh
- docker buildx bake --print cuda12
- docker buildx bake --print cuda12-local

Closes two of the three feature gaps between dflash/scripts/server.py (Python, reference impl) and dflash/src/server/dflash_server (C++, production runtime), as outlined in the migration plan.

## /props introspection

Wire /props in http_server.cpp::handle_client. JSON shape matches server.py:1221-1312 key-for-key so cross-server consumers (autotune sweeps, dashboards, lucebox profile/snapshot) see a stable contract.

ServerConfig grows the introspection inputs (arch, model_path, draft_path, kv_cache_k/v, runtime_backend, fa_window, ddtree_budget, speculative_enabled, target_sharding, tokenizer_id) populated by server_main before HttpServer construction.

PrefixCache and ToolMemory gain stats() / full_stats() accessors with the same lockless-snapshot semantics the Python impl documents: a mutation under daemon_lock can tear in_use vs lifetime_hits across the read pair, which is acceptable for /props.

PROPS_SCHEMA = 1 (matches Python's current schema). Bump only on breaking changes: field renamed, removed, or semantics-changed.

## /v1/messages/count_tokens

Reuses the Anthropic message-parsing path from /v1/messages, short-circuits after tokenization with {"input_tokens": N}. <1s on a hot server (no generation).

## Tests

Integration coverage in dflash/scripts/test_server_integration.py:
- TestProps: top-level keys, server block shape, speculative_mode consistency, runtime backend, arch-gated capabilities, API endpoint registry, prefix_cache/tool_replay shapes.
- TestCountTokens: simple count, scaling with message length, system block handling, <1s budget assertion.

Thinking-budget surface (--think-max-tokens flag, finish_details, thinking_opt_in tracking) is the second PR in this migration, intentionally separated so the algorithm-design review (Level 1/2/3 fidelity question) doesn't block /props from landing.

Validated via docker buildx bake cuda12: image Luce-Org#43 built clean with CUDA layers cached.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…_details) Closes the third feature gap in the Python→C++ server migration: the thinking-budget wire surface so consumers (lucebox bench, dashboards) can opt in to the envelope and see the close-info block on responses. ## What ships - `--think-max-tokens` CLI flag (default 10000): cap on the phase-1 reasoning generation when a request opts in. - `--default-max-tokens` CLI flag (default 16000, matches antirez/ds4 ds4_eval.c's `.max_tokens`): combined cap used when a request omits max_tokens. - `thinking: {type: "enabled"}` is now tracked as a presence-opt-in (`ParsedRequest.thinking_opt_in`) so the server can condition response shape on it. - `finish_details` block on /v1/chat/completions responses when thinking is opted in. Fields match docs/specs/thinking-budget.md:43-58 and server.py:2272 (close_kind, thinking_tokens, content_tokens, total_tokens). ## What's deferred (intentional) The phase-1/phase-2 reprompt MECHANISM (port of server.py:2141-2196) is not in this PR. close_kind always reports "natural" for now — the C++ server doesn't yet force-close on hard_limit. The Python server's existing behavior continues to be the reference impl. Why ship the surface first: - Unblocks consumers (the lucebox bench can stop sending custom envelope fields and just send standard OpenAI + this opt-in). - Lets us land /props first without algorithm-design review on this PR. - The Level 1 vs Level 2 vs Level 3 fidelity question (phase-1/phase-2 reprompt vs true mid-stream force-close vs full eval_think_close_info reporting) is a separate design conversation — the wire shape stays the same regardless of which Level lands. ## Tests Integration coverage in test_server_integration.py: - TestThinkingBudget.test_finish_details_present_when_thinking_opted_in: asserts the block is emitted with valid types when `thinking:{type:enabled}` is sent; invariant `thinking_tokens + content_tokens == total_tokens`. 
- TestThinkingBudget.test_finish_details_absent_when_thinking_not_opted_in: asserts the block is NOT emitted on a plain chat request, matching Python's server.py:2271 conditional. ## Follow-ups (separate PRs) - Level 1 phase-1/phase-2 reprompt: port server.py:2141-2196 into worker_loop. Sets close_kind="hard" when phase-1 didn't emit </think> within think_max_tokens. ~1 day. - Level 2 true in-process force-close: backend-level sampler hook, matches ds4_eval.c:3027-3056 hard_limit_reply_budget semantics. Adds --soft-limit-reply-budget / --hard-limit-reply-budget flags. Closes the OpenRouter→native pass-rate gap (~20 pts). ~2-3 days. Validated via docker buildx bake cuda12 — image Luce-Org#43 built clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
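The `finish_details` checks described in the tests above could be expressed roughly as follows. This is a minimal sketch, not the integration test itself: the helper name `check_finish_details` is hypothetical, and the set of `close_kind` values ("natural", "hard") is taken from this commit text rather than from the spec file.

```python
# Hypothetical validator for a finish_details block; field names come from
# the commit message, the helper itself is illustrative only.
VALID_CLOSE_KINDS = {"natural", "hard"}  # assumed from the commit text
COUNT_FIELDS = ("thinking_tokens", "content_tokens", "total_tokens")


def check_finish_details(fd: dict) -> list[str]:
    """Return a list of problems; an empty list means the block looks valid."""
    problems = []
    if fd.get("close_kind") not in VALID_CLOSE_KINDS:
        problems.append("close_kind must be one of %s" % sorted(VALID_CLOSE_KINDS))
    for key in COUNT_FIELDS:
        value = fd.get(key)
        # bool is a subclass of int in Python, so reject it explicitly.
        if not isinstance(value, int) or isinstance(value, bool) or value < 0:
            problems.append(f"{key} must be a non-negative integer")
    # Only check the sum invariant once all counts are known to be integers.
    if not problems and fd["thinking_tokens"] + fd["content_tokens"] != fd["total_tokens"]:
        problems.append("thinking_tokens + content_tokens != total_tokens")
    return problems
```

A consumer would run this only on responses to requests that sent `thinking: {type: "enabled"}`, since the block is absent otherwise.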

Bundles three pieces:

## Level 1 thinking-budget mechanism (port of server.py:2141-2196)

When a request opts in via `thinking: {type: "enabled"}` and the model fails to emit `</think>` within `--think-max-tokens`, the server now caps phase-1, decodes the reasoning, appends `</think>\n\nFinal answer: `, re-prefills, and runs phase-2 for `max_tokens - phase1_emitted` tokens.

Non-streaming only. Streaming phase-2 is a follow-up (needs SSE flush + re-open).

finish_details.close_kind now flips "natural" → "hard" when phase-2 fires; thinking_tokens / content_tokens report the real split.

This brings C++ to Python parity for thinking-budget enforcement and should close most of the 26-percentage-point gap to the Python server's ds4-eval pass rate that the bench has been measuring.

## Codex review fixes (against 3f600f9 + 8d6ff04)

1. /props `speculative_mode` was reporting "dflash" based on arch capability instead of `--ddtree`-active state, contradicting the `speculative.enabled` block in the same response. Now keyed on `config.speculative_enabled`.
2. `--default-max-tokens` was a dead flag: the request parser fell back to `config_.max_tokens` (legacy 4096) when clients omitted max_tokens, so the new 16000 default was never applied. Parser now reads `default_max_tokens` directly. Side effect: entrypoint.sh doesn't need to explicitly pass `--default-max-tokens` either; the 16000 default is now what the server uses out of the box, restoring parity with the Python server's documented default.
3. Phase-2 reprompt could exceed `max_ctx` when the prompt already sits near the boundary, because ph2_prompt grew by phase1_tokens + closing_ids but ph2_gen_len was only clamped by remaining budget. Added a second clamp against `max_ctx - ph2_prompt.size() - 20`.

## Tests

`TestThinkingBudget.test_close_kind_natural_when_model_self_closes` (easy prompt, model self-closes well under cap; asserts close_kind=="natural"). Hard-close test deferred: it requires a server launched with very low `--think-max-tokens`, which the integration suite doesn't currently parametrize.

## What's not in this commit

- `finish_details` block stays as-is. Codex's review didn't flag it as a problematic protocol invention despite a targeted prompt; reconsidering it is a separate cleanup if needed.
- Streaming phase-2: Level 1.5 follow-up.
- Level 2 (in-process force-close via backend sampler hook): next PR.
- Test coverage gaps around phase-2 hard-close path, multi-arch thinking-tag mapping (Gemma4 `<|channel>`), stop-sequence behavior across phases. Tracked for the follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
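The double clamp from review fix 3 can be illustrated with a few lines of arithmetic. This is a sketch under stated assumptions, not the C++ code: the function name `phase2_gen_len` is hypothetical, and flooring the result at zero is an assumption the commit message does not state.

```python
# Illustrative arithmetic for the phase-2 generation length; names are
# hypothetical and mirror the commit text, not the actual http_server.cpp.
def phase2_gen_len(max_tokens: int, phase1_emitted: int,
                   max_ctx: int, ph2_prompt_len: int,
                   safety_margin: int = 20) -> int:
    # First clamp: what is left of the combined reasoning+reply budget.
    budget_remaining = max_tokens - phase1_emitted
    # Second clamp: what still fits in the context window after the grown
    # phase-2 prompt, minus the margin of 20 quoted in the commit.
    ctx_remaining = max_ctx - ph2_prompt_len - safety_margin
    # Assumption: never return a negative length.
    return max(0, min(budget_remaining, ctx_remaining))
```

With `max_tokens=16000`, `phase1_emitted=10000`, `max_ctx=16384` and a 14000-token phase-2 prompt, the budget alone would allow 6000 tokens, but the context clamp reduces that to 2364.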

… limit `semantic_hint_present` in bench_http_capability.py parsed all `-?\d+` matches in the model output via `int(match.group(0))`. A degenerate model emission of a many-thousand-digit number trips Python 3.11+'s default 4300-digit int() limit and crashes the bench mid-run. Cap match length at 20 digits before parsing — real answers never exceed that, and longer runs can't ever equal an expected answer anyway. Surfaced as a hard crash on row 8 of the local lucebox v2 ds4-eval run yesterday. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
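The guard described above might look like the following. This is a minimal sketch assuming the same `-?\d+` pattern; the helper name `extract_ints` is hypothetical and is not the actual `semantic_hint_present` code.

```python
import re

MAX_DIGITS = 20  # cap quoted in the commit message
_INT_RE = re.compile(r"-?\d+")


def extract_ints(text: str, max_digits: int = MAX_DIGITS) -> list[int]:
    """Parse integer-looking runs, skipping any that are too long to matter."""
    values = []
    for match in _INT_RE.finditer(text):
        token = match.group(0)
        # Count digits only, so a leading minus sign does not use up the cap.
        if len(token.lstrip("-")) > max_digits:
            # Skipped before int() is called, so CPython's str-to-int digit
            # limit (Python 3.11+) is never reached.
            continue
        values.append(int(token))
    return values
```

Checking the length before calling `int()` avoids both the `ValueError` and the cost of converting a number that could not equal an expected answer anyway.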

The C++ chat template was matching Qwen3.5's behavior (let the model decide whether to emit <think>) when enable_thinking=true. Qwen3.6's official chat_template.jinja actually pre-opens the thinking block:

    enable_thinking=true  → suffix `assistant\n<think>\n`
    enable_thinking=false → suffix `assistant\n<think>\n\n</think>\n\n`

Without this prefix, requests that opted into thinking via `thinking:{type:enabled}` or `chat_template_kwargs.enable_thinking=true` silently stayed in non-thinking mode on Qwen3.6: the model would answer directly without reasoning, no </think> tag ever appeared, and the Level 1 phase-2 reprompt mechanism never fired because the started_in_thinking flag never flipped true.

Verified against transformers.AutoTokenizer for Qwen/Qwen3.6-27B:

    >>> tok.apply_chat_template(msgs, enable_thinking=True,
    ...     add_generation_prompt=True, tokenize=False)[-20:]
    '<|im_start|>assistant\n<think>\n'

Hot fix, caught while smoke-probing the freshly-rebaked C++ image before running the full ds4-eval bench. Without this, the bench would have produced numbers no better than non-thinking mode.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
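The two generation-prompt suffixes can be written out as a small helper. This is a Python sketch of the behavior the commit describes, not the C++ chat_template.cpp code; the function name is hypothetical.

```python
# Sketch of the Qwen3.6 generation-prompt suffix described above.
def qwen36_generation_suffix(enable_thinking: bool) -> str:
    prefix = "<|im_start|>assistant\n"
    if enable_thinking:
        # Pre-open the thinking block so the model starts inside <think>.
        return prefix + "<think>\n"
    # Emit an empty, already-closed think block to skip reasoning.
    return prefix + "<think>\n\n</think>\n\n"
```

Because the thinking case ends inside an open `<think>` block, a server can treat any such request as having started in thinking mode without waiting for the model to emit the tag.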

Adds a standalone benchmark script that compares Q4_0 vs TQ3_0 KV cache at 32K / 64K / 128K context lengths, measuring both prefill time and DFlash decode tok/s. The point isn't throughput (TQ3 is slightly slower than Q4_0 at short contexts) — it's the memory saving: TQ3_0 (3.5 bpv) uses 22% less KV than Q4_0 (4.5 bpv), enabling longer contexts on the same VRAM budget. Automatically uses layer-segmented prefill (DFLASH27B_LAYER_PREFILL=1) for prompts over 8K tokens to reduce peak activation memory. Ported from feat/setup-results-uv@c725758; the README/RESULTS.md changes from that commit were dropped (clean has richer/newer versions). Only the new bench script is retained. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
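The 22% figure follows directly from the bits-per-value numbers quoted above; the short calculation below reproduces it. The function names are illustrative, and the "extra context" figure is a derived consequence that ignores every other VRAM consumer.

```python
# Arithmetic behind the KV-cache saving quoted in the commit message.
def kv_saving(bpv_new: float, bpv_old: float) -> float:
    """Fraction of KV memory saved when moving from bpv_old to bpv_new."""
    return 1.0 - bpv_new / bpv_old


def context_gain(bpv_new: float, bpv_old: float) -> float:
    """Extra tokens that fit in the same KV budget, as a fraction."""
    return bpv_old / bpv_new - 1.0


saving = kv_saving(3.5, 4.5)   # TQ3_0 vs Q4_0, about 0.222
gain = context_gain(3.5, 4.5)  # about 0.286
```

So a KV budget that holds 100K tokens at Q4_0 would hold roughly 128K at TQ3_0, all else equal.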

…ax_tokens banner

Three fixes prepping for the phase-2 trigger reproducibility matrix (codex review plan step 1-3):

## http_server.cpp: phase-2 gate-input logging

Per-request stderr line printing the inputs to the phase-2 trigger condition: thinking_opt_in, started_in_thinking, stream, client_disconnected, phase1_tokens.size(), result.ok, req.max_output, phase1_cap, ph2_gen_len_est, close_in_phase1 (decode), and the last ~10 tokens of effective_prompt decoded. Lets us correlate probe-vs-bench, cache-on-vs-off, pflash-on-vs-off behavior when phase-2 fires inconsistently. Strip after Level 1 is stable.

## entrypoint.sh: draft path resolution (F1)

The native dflash_server expects --draft to be a FILE path. The Python server's resolve_draft() walked the dir to find a GGUF; the C++ entrypoint was passing the directory directly, producing `draft load: mmap: No such device` on container startup. Resolve DFLASH_DRAFT to the largest dflash-draft-*.gguf inside the dir before passing to dflash_server. Removes the need for users to set DFLASH_DRAFT to a specific file path explicitly.

Also includes the in-flight migration plumbing that was sitting in WT:
- entrypoint header comment updated to "execs the native dflash_server"
- `DFLASH_SERVER_BIN` env default
- existence check now targets `$DFLASH_SERVER_BIN` instead of `$DFLASH_BIN`

## server_main.cpp: max_tokens banner truth + --prefix-cache-slots

The startup banner printed `max_tokens = 4096` (legacy sconfig.max_tokens default) even though the request parser actually defaults to default_max_tokens=16000. Now prints default_max_tokens with a label explaining it's the request-omit default, plus a separate line for think_max_tokens (the phase-1 cap when opted in). Codex review #2.

Also includes the in-flight --prefix-cache-slots CLI flag + parser branch + startup-log line that was sitting in WT (used by entrypoint.sh to forward $DFLASH_PREFIX_CACHE_SLOTS to the C++ server).

## Note on excluded WT

Dockerfile and lucebox/lucebox/docker_run.py also have in-flight changes from this WT, but they're orthogonal to the phase-2 diagnostic work, so they are left unstaged for whoever owns them to commit separately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Matches dflash/scripts/bench_ds4_eval.py's DS4_EVAL_MAX_TOKENS = 16000, which mirrors antirez/ds4 ds4_eval.c's .max_tokens default. At 4096 the combined reasoning+reply budget truncated mid-CoT on harder cases — AIME2025 wall times of 60-400s with garbage numeric answers in the May-21 interrupted trace are consistent with reasoning running into the cap.

Local dev builds typically only need the host's compute capability, not the full 6-arch fat-binary the release image carries. The Dockerfile already accepted DFLASH_CUDA_ARCHES as a build arg; this just promotes it to a bake variable so a single env-var override skips the 5-6× CUDA template recompile.

Example (RTX 5090 Laptop = sm_120, ~3min instead of ~20):

    DFLASH_CUDA_ARCHES=120 docker buildx bake cuda12-local --load

Default unchanged (75;80;86;89;90;120) so CI + release images get full coverage without setting anything.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…t 15488 Plumbs a new env var through the launch chain so the server's thinking-phase cap matches upstream antirez/ds4 ds4_eval.c by default. The dflash_server binary defaults --think-max-tokens to 10000 internally; nothing in lucebox/docker_run.py + entrypoint.sh was setting it, leaving a 35% gap vs upstream (15488 = 16000 max_tokens - 512 hard_limit_reply_budget). For ds4-eval this manifests as truncated reasoning on AIME / hard GPQA cases — the May-21 partial trace's nonsense AIME answers (case 51's 1e80-shaped output) were consistent with hitting the 10000 cap mid-CoT. Chain: - types.py: add think_max field to DflashRuntime (default 15488) - config.py: parse think_max from [dflash] section - docker_run.py: emit DFLASH_THINK_MAX in server_run_spec + benchmark_run_spec - profile.py: track think_max in runtime_tunables + live_tunables - entrypoint.sh: default DFLASH_THINK_MAX=15488, pass --think-max-tokens N Tests updated for the prior max_tokens 4096 -> 16000 fix that also moved.
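The 15488 default and the 35% gap both come from simple arithmetic on the numbers in this message, reproduced here as a check. The variable names are illustrative only.

```python
# Numbers quoted in the commit message.
MAX_TOKENS = 16000              # combined reasoning + reply budget
HARD_LIMIT_REPLY_BUDGET = 512   # tokens reserved for the visible reply
OLD_THINK_MAX = 10000           # dflash_server's internal default

# Thinking-phase cap that leaves room for the reply budget.
think_max = MAX_TOKENS - HARD_LIMIT_REPLY_BUDGET

# Shortfall of the old default relative to that cap.
gap = (think_max - OLD_THINK_MAX) / think_max
```

Here `think_max` evaluates to 15488 and `gap` to about 0.354, which matches the "35% gap" in the text when measured against the upstream value.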

The size-sorted resolver in 3e8323d picked model.safetensors (3.4GB HF raw weights) over dflash-draft-3.6-q4_k_m.gguf (1GB DFlash draft) because the safetensors file is bigger. dflash_server's spec-decode verify then crashes with `verify_batch: embed failed (n=16)` because the safetensors isn't in the format the DFlash draft path expects.

Replace size-sort with priority-ordered pattern match:
1. dflash-draft-*.gguf (the canonical DFlash quantized draft)
2. *.gguf (any other GGUF, last-resort)
3. model.safetensors (HF raw, only if no GGUF at all)
4. *.safetensors

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
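The priority order above can be sketched in Python as follows. The real resolver lives in entrypoint.sh, so this is only an illustration: the function name `resolve_draft` is borrowed from the removed Python server, and picking the alphabetically first match within a tier is an assumption the commit does not specify.

```python
from pathlib import Path
from typing import Optional

# Priority tiers from the commit message, highest first.
DRAFT_PATTERNS = (
    "dflash-draft-*.gguf",
    "*.gguf",
    "model.safetensors",
    "*.safetensors",
)


def resolve_draft(path: str) -> Optional[Path]:
    """Return a draft file for `path`, which may be a file or a directory."""
    candidate = Path(path)
    if candidate.is_file():
        return candidate
    if not candidate.is_dir():
        return None
    for pattern in DRAFT_PATTERNS:
        # Assumption: break ties within a tier by name, not by size.
        matches = sorted(candidate.glob(pattern))
        if matches:
            return matches[0]
    return None
```

With this ordering a directory holding both `model.safetensors` and `dflash-draft-3.6-q4_k_m.gguf` resolves to the GGUF regardless of file size.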

Qwen3.6's <think> (id 248068) and </think> (id 248069) are single special tokens in the added_tokens vocab. Both http_server.cpp paths (streaming on_token + non-streaming feed_tokens) had explicit handling for Gemma4's <|channel>/<channel|> but NOT for Qwen's variants. The generic "skip <...>" filter silently dropped the Qwen tokens, so the emitter never saw the reasoning→content transition. Symptom: ds4-eval scored 7/92 against the C++ server. 29 cases had close_kind="natural" + content_tokens=0 — model emitted </think> token 248069 but emitter dropped it, all 15488 tokens ended up in reasoning_content with empty visible content. Bench's answer extractor saw `content=""` and reported `given=? format=False`. After fix: when model emits token 248068/248069, forward the text form ("<think>" / "</think>\n") into the emitter so parse_reasoning splits the response correctly. The phase-2 trigger's close-detection (which already uses decode() and sees </think> from special-token decoding) keeps working identically — only the response builder changes. Also explains why close_kind="hard" cases (phase-2 fired) DID pass: the phase-2 emit_token("</think>\n\nFinal answer: ") manually injects the close text, bypassing the missing-mapping bug. Validated against the failing case's logs: model emits 248069, decoder returns "</think>", emitter now transitions and content gets populated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
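The fix amounts to checking an explicit special-token map before the generic tag filter runs. The sketch below shows that ordering in Python; the token ids and text forms come from the commit message, while the function name and the shape of the generic `<...>` filter are assumptions, not the http_server.cpp implementation.

```python
# Token ids and forwarded text quoted in the commit message.
THINK_TOKEN_TEXT = {
    248068: "<think>",
    248069: "</think>\n",
}


def text_for_emitter(token_id: int, decoded: str) -> str:
    """Decide what text the response emitter should see for one token."""
    forced = THINK_TOKEN_TEXT.get(token_id)
    if forced is not None:
        # Forward the text form so the reasoning/content split can happen.
        return forced
    # Assumed stand-in for the generic "skip <...>" special-token filter.
    if decoded.startswith("<") and decoded.endswith(">"):
        return ""
    return decoded
```

The ordering matters: if the generic filter ran first, the close tag would be dropped and every token would stay in `reasoning_content`, which is the symptom described above.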

Adds a "Response shape — multi-dialect aliasing" section to v2 that formalizes what dflash should emit for reasoning across DeepSeek / OpenRouter / Anthropic / OpenAI-shaped clients. Motivation: 2026-05-23 ds4-eval cross-server comparison against OpenRouter qwen/qwen3.6-27b showed the bench discarding ~4000 reasoning tokens because OR emits them under message.reasoning while dflash and the bench use message.reasoning_content. Same data, different field name — current dflash impl is correct but parochial. Plan: - Keep reasoning_content as primary (no break). - Add reasoning as a flat alias. - Add reasoning_details as a typed-block list — single-block today, room for phase-1/phase-2 splits later. - Surface phase1_tokens / finish_details.thinking_tokens into usage.completion_tokens_details.reasoning_tokens to match OpenAI o1/o3 shape and OR's normalized location. - Patch bench_http_capability.py fallback chain to read all three. Comparison table, full example response, per-field notes, motivation, and implementation status (all four pieces "planned, not yet shipped") included. No code change in this commit.

Implements antirez/ds4 ds4_eval.c's hard_limit_reply_budget at the backend sampling layer. When a thinking-enabled request approaches the budget boundary, the next sampled token is overridden with the `</think>` close-tag token, giving the model the remaining budget to write a visible answer with KV state continuous. Beats Level 1 phase-2 reprompt because the model never sees a fresh prefill — its reasoning context stays in KV cache and the answer flows naturally after the injected close. ## Changes - `common/model_backend.h`: new BudgetHook struct (close_token_id + hard_limit_remaining), opt-in field on GenerateRequest. Other backends ignore it (default-constructed = disabled). - `qwen35_backend.{h,cpp}`: do_ar_decode honors budget_hook. Override fires once per generation (budget_close_injected flag prevents double-injection on subsequent loop iterations). Logs to stderr with [budget-hook] tag when it fires. - `qwen35_backend::generate`: when budget_hook is set, routes through AR instead of spec-decode. Spec-decode integration is a follow-up (the perf hit is acceptable since this only affects thinking turns). Non-thinking turns still get full spec-decode throughput. - `server_main.cpp`: new `--hard-limit-reply-budget N` flag (default 512, matches ds4_eval.c). At startup, tokenizes "</think>" and caches the token ID in ServerConfig.think_close_token_id (only populated if it's a single token — Qwen3.6 = 248069). Multi-token close tags disable Level 2 with a warning, falling back to Level 1. - `http_server.{h,cpp}`: ServerConfig grows hard_limit_reply_budget + think_close_token_id. worker_loop wires BudgetHook into GenerateRequest when the request opts into thinking AND the server has both knobs set. ## What it gives us The model continues generating after the synthetic </think>, so its visible content includes the actual answer in the budget remainder. Mirrors ds4_eval.c's mid-stream injection exactly. 
Level 1 phase-2 reprompt remains as a fallback when force-close didn't fire (e.g. model closed </think> early on its own — budget never tightened). ## What's deferred - Spec-decode integration of BudgetHook. Current implementation trades spec-decode throughput for correctness on thinking turns. Adding the hook to do_spec_decode requires careful sequencing around the verify-and-accept loop (codex flagged this in plan review) — separate PR. - Soft-close (voluntary close when </think> is in top-K). Needs top-K logits exposure from the verifier. Level 2.5. - Laguna and gemma4 backends — only qwen35 wired. Per codex's "start qwen35 only, benchmark, then port" guidance. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
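A toy decode loop makes the single-shot override easier to follow. This is a simulation under stated assumptions, not the qwen35 backend: the trigger condition (remaining budget at or below `hard_limit_remaining` while the think block is still open) is a guess at what "approaches the budget boundary" means, and all names other than `close_token_id` and `hard_limit_remaining` are hypothetical.

```python
from typing import Callable


def decode_with_budget_hook(sample_next: Callable[[list[int]], int],
                            max_tokens: int,
                            close_token_id: int,
                            hard_limit_remaining: int,
                            eos_id: int) -> tuple[list[int], bool]:
    """Autoregressive loop with a one-time forced close-tag injection."""
    out: list[int] = []
    close_seen = False   # model (or hook) has already closed the think block
    injected = False     # guards against double injection
    while len(out) < max_tokens:
        token = sample_next(out)
        remaining = max_tokens - len(out)
        # Assumed trigger: still thinking and only the reply budget is left.
        if not close_seen and not injected and remaining <= hard_limit_remaining:
            token = close_token_id
            injected = True
        if token == close_token_id:
            close_seen = True
        if token == eos_id:
            break
        out.append(token)
    return out, injected
```

Because the loop keeps sampling from the same token history after the injected close, the reply is generated on continuous state, which is the property the commit contrasts with the phase-2 reprompt.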

dflash/scripts/server.py was a parallel FastAPI implementation that shadowed the C++ dflash_server. Keeping it invites bifurcation — any fix to one had to be mirrored to the other, in practice always behind. Drops: - dflash/scripts/server.py (3484 lines, FastAPI app) - dflash/scripts/test_server.py (imports 14+ symbols from server) - dflash/scripts/test_prefix_cache.py (imports build_app from server) Updates lucebox/lucebox/profile.py to remove test_server.py from the pytest invocation and registered script_paths. Out of scope (will break at runtime, fix in a follow-up): - dflash/scripts/test_multi_turn_prefix_cache.py - dflash/scripts/test_server_prefix_cache.py - dflash/scripts/test_full_compress_cache.py - dflash/scripts/bench_agent_loop.py - dflash/scripts/bench_daemon.py - dflash/scripts/bench_server.py - dflash/scripts/parity_laguna.py - dflash/scripts/quality_ab_simple.py - dflash/scripts/quality_humaneval_plus.py These reference server.py via subprocess.run / SERVER_SCRIPT path constants. They'll fail at exec time, not import time, so they don't block the deletion. Retargeting them to dflash_server is left as a separate task — the bifurcation pressure is removed. C++ source comments still reference server.py as historical context for parity (e.g. http_server.cpp:1376 "see server.py:2271-2281"). Those are notes, not load-bearing.

OPENAI_CHAT branch in http_server.cpp now emits the reasoning text under three keys (reasoning_content, reasoning, reasoning_details), plus surfaces the phase-1 token count under usage.completion_tokens_details.reasoning_tokens. Implements docs/specs/thinking-budget.md "Response shape — multi-dialect aliasing". Bench (bench_http_capability.py) reads reasoning via fallback chain: reasoning_content -> reasoning -> reasoning_details[].text. Makes cross-server runs (sindri/vidar/OpenRouter) directly comparable; OR's reasoning is no longer silently discarded. Also drops a stale 'see server.py:2271-2281' comment now that the Python server is gone.
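The fallback chain on the bench side could be sketched like this. It assumes a chat-completions style `message` dict; the helper name is hypothetical, and concatenating multiple `reasoning_details` blocks without a separator is an assumption.

```python
# Sketch of the reasoning fallback chain:
# reasoning_content -> reasoning -> reasoning_details[].text
def extract_reasoning(message: dict) -> str:
    for key in ("reasoning_content", "reasoning"):
        value = message.get(key)
        if isinstance(value, str) and value:
            return value
    # Typed-block list: collect the text of each block that has one.
    details = message.get("reasoning_details") or []
    parts = [block.get("text") for block in details if isinstance(block, dict)]
    return "".join(p for p in parts if isinstance(p, str))
```

For a response that only carries `reasoning` (the field name the commit attributes to OpenRouter), the first key misses and the second returns the text, so the tokens are no longer discarded.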

Pulls in PR Luce-Org#260 (howard0su): fix(server): normalize Codex Responses tool-call follow-ups, which improves the sse_emitter THINK_OPEN/CLOSE extraction parser. Also picks up frequency/presence penalty sampling, test tightening, and gemma4 HIP runtime fixes. # Conflicts: # dflash/src/server/http_server.cpp

…ant>

Two related fixes that were keeping bragi's laguna unable to bench:

1. Chat template was DeepSeek-V3 tokens (<｜begin▁of▁sentence｜> / <｜User｜> / <｜Assistant｜>). Those token strings don't exist in laguna's vocab, so the model saw replacement-character garbage in its prompt and degenerated into echoing the user message with <��Assistant��> artifacts. Real format (verified against poolside/Laguna-XS.2/chat_template.jinja):

       〈|EOS|〉<system> {content} </system> <user> {content} </user> <assistant>
       <think>    ← if enable_thinking
       </think>   ← if NOT enable_thinking (empty think block)

   Default system message used when one isn't supplied, matching the upstream template's "You are a helpful, conversationally-fluent assistant made by Poolside..." default.

2. laguna_target_loader.cpp left eos_chat_id = -1 when the GGUF only ships tokenizer.ggml.eos_token_id (id 2 = 〈|EOS|〉) without an eot_token_id. With eos_chat_id=-1 the decoder check `next == eos_chat_id` never matches, so the model emits its turn-ending </assistant> mark, nothing stops it, and it re-greets the user. Default to 24 (= </assistant>, the chat-template EOT) when the GGUF doesn't supply eot; this matches the constant already in laguna_internal.h.

Validated post-fix: 17×23 prompt returns "Answer: 391" cleanly (was repeating answer endlessly with the prior bugs).

Sindri/integration: change is in the same chat_template.cpp file 041f491 refactored (thinking_preamble removed); LAGUNA case rewritten in place. No conflict with 7786b35 (Phase-2 removal), which touched http_server only.

The locked file was stale relative to pyproject.toml — uv sync --frozen failed at docker build, and the fallback uv sync (re-resolve) hit nvidia-cudnn-cu12 / cu128 version conflicts. Local uv sync produced this updated lock; committing so docker builds succeed without needing --no-cache hacks.

… partial Three snapshots from the 2026-05-25 sweep matrix: - bragi gemma-4-26b nothink: 72/92 = 78.3%, 42 min wall, 99.6 tok/s agg (wall-based). Beats OR-hosted nothink (73.9%) on both quality and speed. Run at temp=1.0/top_p=0.95/top_k=64 via the new --sampling-from-card flag. - bragi gemma-4-26b think: 75/92 = 81.5%, 145 min wall, 101.4 tok/s agg. +3.2pp over local nothink, +10.5pp over OR-hosted think (71%). The new <channel|>\n\n transition cue + 4096 hard_limit_reply_budget let thinking-mode runs actually finalize (vs the pre-fix hits-length-and-degenerates behavior). - bragi gemma-4-31b nothink: partial — 8/8 PASS before container OOMed at case 9 (697-token prompt's prefill graph needed 123 MiB beyond the 23.4/24 GB ceiling). Server-log per-case rate ~19 tok/s. Full 31b sweep is impractical on bragi-class consumer 24GB hardware (math: 19GB model + 1.6GB draft + ~390 KB/token KV cache = ~5k context max). Per discussion: 31b deferred until larger card available.

OR provider routing for poolside/laguna-xs.2:free was ignoring our existing `thinking:{type}` and `chat_template_kwargs.enable_thinking` fields. Empirically validated only top-level `reasoning_effort` (OpenAI-shape) propagates — `reasoning_effort: "none"` cleanly disables reasoning_tokens emission. Other shapes (`reasoning:{effort:"minimal"}`, `reasoning:{exclude:true}`, `extra_body.chat_template_kwargs.enable_thinking:false`) all left reasoning enabled. Added to run_case body so OR runs no longer silently keep reasoning on — without this, the 2026-05-24 fill-matrix laguna data showed identical 55/92 for "think" and "nothink" (both were actually thinking). Laguna sweep results, ds4-eval-92, temp=0.6/top_p=0.95/top_k=50, parallel=1: - bragi LOCAL Laguna-XS.2-Q4_K_M (with the chat-template fix 92f84cd + eos_chat_id=24 fallback + Poolside speculator): 42/92 = 45.7%, true nothink (thk=0 every row), ~118 tok/s decode. - OR poolside/laguna-xs.2:free BF16 (with reasoning_effort=none): 47/92 = 51.1%, true nothink (reasoning_tokens=0). The ~5pp gap is the Q4_K_M quantization vs OR's BF16, plus any provider-side post-decode handling. Within sampling noise on 92 cases. Confirms the laguna stack on bragi is now correctly tuned.

Closes out the laguna comparison stack. Both backends, both modes, sequential (parallel=1) for fair comparison:

                        nothink          think
    bragi LOCAL Q4      42/92 = 45.7%    48/92 = 52.2%
    OR :free BF16       47/92 = 51.1%    53/92 = 57.6%
    Δ (OR - bragi)      +5.4pp           +5.4pp
    Δ (think - noth)    bragi +6.5pp     OR +6.5pp

The Q4→BF16 quantization gap is identical (5.4pp) in both modes, confirming it's pure quant loss rather than a chat-template or eos handling difference. The nothink→think lift is identical (6.5pp) in both backends, confirming laguna's thinking mechanism provides consistent uplift regardless of quantization.

Verifies the 2026-05-25 laguna fix stack works end-to-end:
- chat template (92f84cd) renders <system>/<user>/<assistant>/<think> correctly (vs the prior DeepSeek-token garbage)
- eos_chat_id=24 fallback stops the model cleanly mid-stream
- reasoning_effort knob (417009d) lets OR truly disable thinking
- DFlash speculator load works via existing safetensors path

Single AIME case (aime2025-02, correct=588) at 6 budget points shows the diminishing-returns curve clearly: - B=512/1024: model can't reach answer before force-close - B=2048: sweet spot — model self-closes, gets 588 ✓ - B=4096..16384: identical correct answer at 2-3× wall time Reply phase length is constant ~4k tokens regardless of budget (hard_limit_reply_budget=4096 governs that). Decode rate constant ~101 tok/s — wall scales linearly with reasoning tokens emitted. Closes out the original /loop directive's "visibility into the optimal config for thinking budget" item. Full summary in _summary.md.

The bench was sending temperature/top_p explicitly with hardcoded greedy values (temp=0, top_p=1.0) in every request, defeating the server's card-fallback path at http_server.cpp:761-765: req.sampler.temp = body.value("temperature", sd.has_temperature ? sd.temperature : 0.0f); When the bench includes the field, the body.value() never reads the card default. Effect: gemma4 (card recommends temp=1.0/top_p=0.95/ top_k=64) was forced to greedy on every bench, triggering its known `- - - -` degenerate-decode collapse (see docs/experiments/gemma4-26b- thinking-control-2026-05-25.md). The just-completed sindri sweep shows 3.3% pass on think mode and 47.8% on nothink vs bragi's 81.5% / 78.3% — entirely explained by this miss. Inconsistent prior behaviour: top_k was already conditional (sent only when > 0). temperature and top_p were not. This patch makes all three consistent — sent only when explicitly set via flags. Default behaviour now is "let the server pick", which means card defaults for dflash, provider defaults for OpenRouter/Anthropic. --sampling-from-card is kept as a deprecated no-op so existing scripts don't break (logs a notice if passed). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
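The "send only when explicitly set" rule can be sketched as a request-body helper. This is an illustration rather than the bench's actual code: the function name and the use of `None` to mean "flag not passed" are assumptions.

```python
from typing import Optional


def sampling_fields(temperature: Optional[float] = None,
                    top_p: Optional[float] = None,
                    top_k: Optional[int] = None) -> dict:
    """Build only the sampling keys the caller explicitly asked for."""
    fields: dict = {}
    if temperature is not None:
        # An explicit 0.0 (greedy) is still sent; only "unset" is omitted.
        fields["temperature"] = temperature
    if top_p is not None:
        fields["top_p"] = top_p
    if top_k is not None and top_k > 0:
        # Matches the prior top_k behaviour: sent only when > 0.
        fields["top_k"] = top_k
    return fields
```

Merging the result into the request body with `body.update(sampling_fields(...))` leaves the keys absent by default, so a server-side lookup such as the `body.value("temperature", ...)` call quoted above falls through to the card default.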

The luce-bench package (https://github.com/easel/luce-bench) v0.2.3 now ships all of the HTTP-based capability benchmarks (ds4-eval, HumanEval, longctx, agent, forge) plus the thinking-control probe as a stdlib-only, uvx-installable framework with 53 tests. Folding luce-hub's duplicate copies in deletes ~20k lines of code that we were maintaining in two places. Deleted (Tier 1 — pure duplicates of luce-bench): bench_ds4_eval, bench_humaneval, bench_he_http, bench_longctx, bench_http_capability (1602 LOC), bench_http_frontiers, probe_thinking_control, bench_daemon, bench_server, lucebox_bench (987 LOC), test_lucebox_bench (622 LOC), bench_agentic_session, bench_agentic_tools Deleted (Tier 2 — vestigial native-binary dev-cycle benches; the test_dflash binary itself stays and the run.py / placement / quality / parity / examples scripts that drive it are unaffected): bench_he, bench_llm, bench_agent, bench_long_ctx, bench_agent_loop, bench_agent_cases Deleted (vendored fixtures now in luce-bench): fixtures/forge_eval/ (forge-guardrails 0.7.1 runtime + scenarios) fixtures/humaneval/cases.json fixtures/ds4_eval_cases.json Updated: pyproject.toml — add `luce-bench` dep pinned to git tag v0.2.3 (switch to PyPI version pin once trusted-publishing lands); drop deleted-file entries from ruff `include` list. dflash/pyproject.toml — drop the `eval` extra (anthropic SDK now lives behind luce-bench's `[forge]` extra). dflash/scripts/entrypoint.sh — `benchmark` subcommand now execs `python -m lucebench.cli` so `docker run image benchmark …` keeps working with the new framework. lucebox/lucebox/profile.py — drop 6 StepDefinitions that subprocessed deleted scripts (benchmark.http_frontiers, quality.capability_smoke, quality.ds4_eval, quality.capability_long, quality.agentic_tools, benchmark.agentic_session). Profile registry shrinks from 9 → 3 steps (health.props, benchmark.autotune_latest, test.python_unit pointing only at the lucebox/tests suite now). 
lucebox/tests/test_profile.py — point removed step IDs at remaining steps or delete the test (ds4_eval-specific argv test). .gitignore — exclude .claude/ (session worktrees + agent scratch). Validation: uv sync → luce-bench==0.2.3 from git uv run pytest lucebox/tests -q → 19/19 passed uv run ruff check → All checks passed uv run ruff format --check (touched) → clean Sweep continuity: the in-flight 26b --think sweep (luce-bench writing to baselines/bragi-rtx5090laptop-gemma4-26b-2026-05-26-sweep-think/) keeps running; no GPU disruption. Future cleanups (not in this commit): - PyPI publish luce-bench, switch `tool.uv.sources` to a plain version pin instead of git tag. - dflash/docs/{experiments,run-requests,RESULTS}.md still cite `bench_*.py` by name — those are historical refs that don't affect any runtime, can be swept on a docs-only pass.

The luce-bench-baselines repo (https://github.com/easel/luce-bench-baselines) is now the canonical home for all benchmark snapshot data. All 41 snapshot dirs (plus the handful of top-level SUMMARY/profile/log files) that lived under dflash/docs/tuning-snapshots/ have been mirrored over in luce-bench-baselines commit 067e7c9; this commit drops them from lucebox-hub so future clones don't carry 43MB of historical data the server build doesn't need.

New sweeps already land in luce-bench-baselines directly via:

    uvx --from luce-bench luce-bench --sweep --name <host-model-date> \
      --out-dir /path/to/luce-bench-baselines \
      --base-url http://<server>:8080

.gitignore picks up the path so an accidental write here stays untracked.

Mirror is verified: every dir/file removed here exists at the same name in the baselines repo.

The intent shipped in 925d41f's commit message but the file change got lost — the cat >> happened after the `git add` for that commit. Adding it now as a one-line follow-up so any future write to that path stays untracked (snapshots live in luce-bench-baselines).

Brings in 52 upstream commits since merge-base 8c23234 (2 days ago). The headline is PR Luce-Org#269 (`403e598 feat(cpp-server): thinking-budget v2 + multi-dialect reasoning + model-card sidecars`), the squashed version of our own thinking-budget v2 work that we'd been carrying as 50+ small commits on this branch.

Plus:
- PR Luce-Org#262 howard0su/powerinfer: hybrid MoE spec-decode with DFlash draft, GPU/CPU FFN overlap, persistent pre-FFN graph for DeltaNet, 4-5x decode speedup, MoE perf telemetry
- PR Luce-Org#263 weicj/feat-cpp-server-pflash-draft-placement: mixed-backend PFlash phase split
- 648a6e2 perf: GPU-resident hybrid decode (eliminate PCIe round-trips)
- d12ddde fix: dynamic placement uses --max-ctx instead of hardcoded 8192

Conflict resolution: where this branch has post-Luce-Org#269 refinements (gemma4 timings via 4e9abda, degenerate-decode watchdog via c2d725f / 8538ff9, laguna chat-template fix via 92f84cd, transition cue via 16bb31e, thinking-budget force-close fix via b86342d, etc.), keep ours. Where the only divergence is "this branch has 50 small commits that 403e598 squashes," accept the merged result.

# Conflicts:
#	dflash/scripts/server.py
#	dflash/src/qwen35/qwen35_backend.cpp
#	dflash/src/qwen35/qwen35_backend.h
#	dflash/src/server/http_server.h
#	dflash/src/server/sse_emitter.cpp
#	dflash/test/test_server_unit.cpp
#	share/model_cards/laguna-xs.2.json

davide221 · 2026-05-26T21:56:38Z

@easel great work! I would like to promote to main asap. Can you rebase and integrate with harness autorun scripts?

…name)

Brings in PR Luce-Org#281 (chore: rename dflash→server, pflash+megakernel → optimizations/) + small docs polish 080f89b.

Our lucebox/ Python package (added by us in 2560086, never upstream) is untouched. Our docs additions under dflash/docs/* are migrated to server/docs/*. Our deletions of bench scripts are confirmed against the new server/scripts/* paths.

Workspace members in pyproject.toml: ["server", "lucebox", "optimizations/megakernel", "optimizations/pflash"], preserving our lucebox member alongside upstream's renamed paths.

# Conflicts:
#	README.md
#	pyproject.toml
#	server/docs/BENCHMARK_SNAPSHOT_SPEC.md
#	server/docs/experiments/cache-impact-2026-05-24.md
#	server/docs/experiments/gemma4-26b-thinking-control-2026-05-25.md
#	server/docs/experiments/kv-cache-q4-vs-tq3-2026-05-25.md
#	server/docs/experiments/thinking-control-protocol.md
#	server/docs/experiments/thinking-mechanism-explainer.md
#	server/docs/run-requests/area-swe-bench-integration.md
#	server/docs/run-requests/bragi-gemma4-laguna-config-issues.md
#	server/docs/run-requests/forge-vs-vidar-ds4f.md
#	server/docs/run-requests/luce-dflash-think-92.md
#	server/docs/run-requests/qwen36-budget-signaling-overhaul.md
#	server/docs/run-requests/qwen36-hard-limit-reply-budget-bump.md
#	server/docs/run-requests/sindri-rtx3090ti-qwen36-nothink-92.md
#	server/scripts/bench_agent.py
#	server/scripts/bench_agent_loop.py
#	server/scripts/bench_daemon.py
#	server/scripts/bench_he.py
#	server/scripts/bench_he_http.py
#	server/scripts/bench_llm.py
#	server/scripts/bench_server.py
#	server/scripts/entrypoint.sh
#	server/scripts/fixtures/agent_cases/cases.json
#	server/scripts/server.py
#	server/scripts/test_prefix_cache.py
#	server/scripts/test_server.py

v0.2.4 fixes the broken `--area forge` path that was raising `TypeError: EvalConfig.__init__() got an unexpected keyword argument 'client_factory'` against any HTTP backend. Upstream commit easel/luce-bench@59e01fc realigns areas/forge.py with the refactored EvalConfig dataclass + run_scenario(client, scenario, config) signature. No other behaviour changes in v0.2.4. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The standalone github.com/easel/luce-bench repo was an awkward split for what is really part of the same engineering surface as the server. Bringing it in-tree so PRs, CI, and reviews live alongside the server work that the benches exercise.

Layout:
- luce-bench/ is now a uv workspace member alongside server, lucebox, optimizations/megakernel, optimizations/pflash.
- `[tool.uv.sources] luce-bench = { workspace = true }` replaces the prior `git = "https://github.com/easel/luce-bench.git", tag = "v0.2.4"` pin. Future bumps land here directly.
- luce-bench keeps its own pyproject.toml with `name = "luce-bench"` and a `version` it manages independently of the monorepo's release cadence; PyPI sees the same package name.

Release flow (.github/workflows/release-luce-bench.yml):
- Triggered on tag pushes matching `luce-bench-v*` (e.g. luce-bench-v0.2.5).
- Asserts the tag's version suffix matches `luce-bench/pyproject.toml`.
- Builds wheel + sdist from luce-bench/, publishes via PyPI trusted publishing (OIDC) under the `pypi` environment.

Set up the trusted publisher in the PyPI project once: repo=easel/lucebox-hub, workflow=release-luce-bench.yml, environment=pypi.

The standalone repo will be archived (read-only); existing tag pins keep resolving for anyone consuming v0.2.4 or older from there.

Files copied at v0.2.4 (commit easel/luce-bench@59e01fc): src/, tests/, README.md, NOTICE, LICENSE, pyproject.toml. 53 lucebench tests pass against the in-tree workspace install.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The bench-script cleanup (89b1dfe) left profile.py hollowed out: its 6 benchmark StepDefinitions were subprocessing scripts that no longer exist, leaving only 3 trivial probes (health.props, autotune-report read, pytest). The snapshot/hash/dedup machinery was sitting on a near-empty registry.

The whole point of `lucebox profile` is "capture performance snapshots that feed autotune", and that's exactly what luce-bench now produces. Wire the framework to it instead of deleting it.

Adds 4 luce-bench-driven StepDefinitions to the registry:
- benchmark.code: `lucebench.cli --area code` (HumanEval, 10 cases)
- benchmark.longctx: `lucebench.cli --area longctx` (6 cases)
- benchmark.agent: `lucebench.cli --area agent` (4 cases, mt=4096)
- quality.ds4_eval: `lucebench.cli --area ds4-eval --think --max-tokens 16000 --timeout 1800` (full 92-case suite, score-only)

Each one builds its argv via the new `_luce_bench_area_argv()` factory. Output JSON lands in the framework-owned dest dir so the existing snapshot/hash/dedup pipeline ingests it without changes.

The framework's content-addressed hash (by hardware + model + tunables) is unchanged: re-running profile with the same config short-circuits to the cached snapshot; changing any tunable forces a fresh capture.

Test coverage:
- test_luce_bench_argv_shape_for_each_area: verifies argv shape for all 4 luce-bench steps + the per-step knobs (think mode, max_tokens, timeout, model). Catches regressions in the argv builder without needing a live server.

Plus a small Dockerfile fix: UV_PYTHON_INSTALL_DIR=/opt/uv/python so the venv's python interpreter is world-readable. Default location is `/root/.local/share/uv/python/`, which non-root container UIDs cannot traverse; that broke `lucebox.sh check` and every other host-wrapper subcommand. The container runs as the host UID for bind-mount sanity (config files in $HOME stay user-owned), so the interpreter has to live somewhere world-traversable.

Lucebox tests: 20/20 pass. Ruff + format clean.
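The content-addressed short-circuit described above can be illustrated with a small sketch. The real key derivation in lucebox/lucebox/profile.py is not shown in this PR, so the SHA-256 choice, the canonical-JSON encoding, and the names `snapshot_key` and `capture_or_reuse` are all assumptions made for illustration.

```python
import hashlib
import json
from typing import Any, Callable


def snapshot_key(hardware: dict[str, Any], model: str, tunables: dict[str, Any]) -> str:
    """Derive a stable key from hardware + model + tunables.

    Canonical JSON (sorted keys, fixed separators) makes the key independent
    of dict insertion order, so equal configs always hash the same.
    """
    payload = json.dumps(
        {"hardware": hardware, "model": model, "tunables": tunables},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def capture_or_reuse(
    cache: dict[str, Any],
    key: str,
    capture: Callable[[], Any],
) -> Any:
    """Return the cached snapshot for key, running capture() only on a miss."""
    if key not in cache:
        cache[key] = capture()
    return cache[key]
```

With this shape, changing any tunable yields a different key and therefore a fresh capture, while an unchanged config reuses the stored snapshot.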

PR Luce-Org#281 moved dflash/ → server/. The pull_request `paths:` filter still targeted dflash/* — so PRs touching the C++ server code wouldn't trigger the Docker prebuild sanity check. Repoint to server/ so CI catches Dockerfile / source regressions before merge.

server_main.cpp had two identical 119-line blocks of the thinking-budget v2 model-card resolution code (general.name / general.architecture read, resolve_model_card(), ServerConfig application, tier clamping). g++ errored out on redeclarations:

    redeclaration of 'std::string general_name'
    redeclaration of 'std::string general_arch'
    redeclaration of 'dflash::common::ModelCard card'
    redeclaration of 'const int tier_ceiling'
    conflicting declaration 'auto clamp_tier'

The duplicate was a merge artifact from 1df9099 (luce-org/main into integration/-clean post-rename). Upstream PR Luce-Org#269 squashed our pre-PR work into 403e598 and our integration branch carried the same code in unsquashed form; the 3-way merge kept both copies.

Also drop a redundant `#include "gguf.h"` (lines 23 and 25 of the same file). Harmless thanks to include guards but ugly merge residue.

Build proceeds past cmake configure with the dedupe.

GitHub Actions only picks up workflows from the repo-root `.github/workflows/`; the nested `luce-bench/.github/workflows/ci.yml` was inherited from the standalone repo but never fires here. Its publish job is also superseded by `release-luce-bench.yml`. The nested `.gitignore` mostly duplicated root entries; moved its one unique pattern (`luce-bench/snapshots/` for --sweep output) into the root `.gitignore`. Also fixes a stale `-> "_RecordingAnthropicClient"` forward reference in areas/forge.py that the root ruff configuration flags (the class is in scope where the annotation is evaluated; the quotes are dead). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

PR 490ff95 absorbed luce-bench into the monorepo as a uv workspace member, but didn't update the Dockerfile. The runtime stage's `uv sync` would fail looking for /opt/lucebox-hub/luce-bench/pyproject.toml because:

1. The builder stage's COPY block (~line 115-120) listed lucebox, optimizations/pflash, optimizations/megakernel, but not luce-bench. uv sync needs every workspace member's pyproject.toml to be present in the build context.
2. The runtime stage's COPY --from=builder block (~line 168) only pulled lucebox, server, optimizations across, not luce-bench.

Add the two COPYs (builder source + runtime install) so the workspace resolution path is complete. No changes to cmake stage; CUDA build cache should still hit.

Wraps luce-bench in the same start-server → run-client → save-logs → stop-server pattern as the other harness/clients/run_*.sh launchers (run_codex.sh, run_claude_code.sh, etc.). Since luce-bench is just an HTTP client of /v1/chat/completions, it fits the existing client abstraction natively.

Why: operators get a uniform way to invoke luce-bench ("did this server change break it?") alongside real-client smoke tests. A regression in luce-bench surfaces in the harness sweep matrix the same way an OpenCode or Hermes regression does: same launcher contract, same logs, same finish_report shape.

Defaults: --no-think (4x faster on gemma-4-26b per the 2026-05-26 think/nothink comparison), full sweep mode (--sweep), 300s per-case timeout, single-thread.

Knobs (env):
  LUCEBENCH_AREA        single area override (else --sweep)
  LUCEBENCH_THINK       1 → --think, 0 → --no-think (default 0)
  LUCEBENCH_MAX_TOKENS  per-request decode cap override
  LUCEBENCH_TIMEOUT     per-case wall timeout (default 300s)
  LUCEBENCH_PARALLEL    in-flight concurrency (default 1)

All harness/common.sh knobs apply (MODEL_SERVER, LUCEBOX_SERVER_BACKEND, MAX_CTX, BUDGET, EXTRA_SERVER_ARGS, etc.).

Output: $LOG_DIR/lucebench-{area,sweep}.{json,md} + lucebench.out (stdout/stderr) + server.log. Slots into the existing run-dir layout under /workspace/lucebox-client-harness-runs/<stamp>/.

Docs entry added to harness/clients/README.md.
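As an illustration of how the env knobs above might translate into a luce-bench command line, here is a minimal Python sketch. The launcher itself is a shell script, so this is not its code. The flags `--area`, `--sweep`, `--think`, `--no-think`, `--max-tokens` and `--timeout` appear elsewhere in this PR; the `--parallel` flag name and the helper name `lucebench_argv` are assumptions.

```python
from collections.abc import Mapping


def lucebench_argv(env: Mapping[str, str]) -> list[str]:
    """Map LUCEBENCH_* env knobs to an argv list, applying the stated defaults."""
    argv = ["python", "-m", "lucebench.cli"]

    # LUCEBENCH_AREA: single area override, else full sweep mode.
    area = env.get("LUCEBENCH_AREA")
    argv += ["--area", area] if area else ["--sweep"]

    # LUCEBENCH_THINK: 1 → --think, anything else → --no-think (default 0).
    argv.append("--think" if env.get("LUCEBENCH_THINK", "0") == "1" else "--no-think")

    # LUCEBENCH_MAX_TOKENS: only passed when the operator sets it.
    max_tokens = env.get("LUCEBENCH_MAX_TOKENS")
    if max_tokens:
        argv += ["--max-tokens", max_tokens]

    # LUCEBENCH_TIMEOUT: per-case wall timeout in seconds (default 300).
    argv += ["--timeout", env.get("LUCEBENCH_TIMEOUT", "300")]

    # LUCEBENCH_PARALLEL: in-flight concurrency (default 1).
    # The flag name below is hypothetical; the PR names only the env knob.
    argv += ["--parallel", env.get("LUCEBENCH_PARALLEL", "1")]
    return argv
```

Passing `os.environ` gives the live configuration, while a plain dict makes the mapping easy to unit-test without a server.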

Promotes harness/ from "loose shell scripts + one stdlib Python file" to a proper uv workspace member that owns the "run X against a Lucebox server" abstraction. Both lucebox profile and harness/clients/*.sh launchers now go through the same Python entry point.

What's new:
- harness/pyproject.toml: name = "harness", dependencies = ["luce-bench"]. Stdlib-only at runtime (luce-bench itself is stdlib-only; anthropic only via [forge] extra). Fresh test boxes can install with zero external wheel downloads.
- harness/harness/ package:
  - bench.run_bench(base_url, area, ...): Python function form of harness/clients/run_lucebench.sh. Composes the lucebench.cli argv internally. Single source of truth for "run a luce-bench area against a server", returns the parsed JSON snapshot.
  - clients/claude_code.launch(base_url, model, prompt, interactive) + claude_env() helper: the env-var contract that points Claude Code at a Lucebox server (ANTHROPIC_BASE_URL, telemetry-off knobs, NONSTREAMING_FALLBACK kill). Used by both the new `lucebox claude` subcommand and the existing harness/clients/run_claude_code.sh wrapper.
- lucebox/lucebox/cli.py: new `claude` subcommand. Probes for a live /health, looks up base URL, exec's claude on the host with the right env via harness.clients.claude_code.launch. Interactive by default (full TUI); `--prompt` makes it a one-shot run.
- lucebox/lucebox/profile.py: _luce_bench_area_argv now delegates to `python -m harness.bench` instead of `python -m lucebench.cli`. All four bench StepDefinitions (code, longctx, agent, ds4-eval) ride that path. Framework still owns the JSON snapshot path via --json-out (new arg added to harness.bench too).
- pyproject.toml: adds harness to workspace members + sources, declares harness as a root dep so uv sync installs it.

Tests:
- lucebox/tests/test_profile.py: updated argv-shape assertion for every bench step: expects `harness.bench` instead of `lucebench.cli`.
- All 20 lucebox tests pass. luce-bench's 53 tests still pass.
- Ruff + format clean across lucebox/ + harness/.

Validation (live, against newly-rebuilt gemma-4-26b server):
- lucebox-hub:cuda12 image rebuilt with permission fix + workspace luce-bench + harness COPY (separate Dockerfile commit).
- `docker run lucebox-hub:cuda12 serve` brought gemma up in ~15s.
- `python -m harness.bench --area forge --base-url http://localhost:8080` ran 30 cases against gemma in both think + nothink modes (0/30 both, ValidationError: a separate luce-bench[forge] adapter bug, not a regression of this PR).

This is the "interface luce-bench through the harness" shape the README hinted at: one Python module, one CLI, one shell wrapper, all converging on the same env-config + argv-building logic.

+ Dockerfile COPY harness + run_lucebench.sh via harness.bench

Builds on 7bbf9af (harness as workspace member). All six harness/clients/ launchers now exist as both shell wrappers AND Python modules with the same launch() contract. Five new `lucebox <client>` Typer subcommands.

New launchers in `harness.harness.clients`:
- codex (writes config.toml; Responses API wire format)
- opencode (writes opencode.json; AI-SDK OpenAI-compatible provider)
- hermes (writes config.yaml + .env; chat_completions wire format)
- pi (writes agent/{settings,models}.json; openai-responses)
- openclaw (writes JSON config patch merged at startup)

Each module:
- Resolves binary via $<X>_BIN env, $PATH, or test-box convention ($CLIENT_WORK_DIR/clients/<x>/...), using the shared `_common.find_bin()`
- Writes per-run config into a tempdir so the user's real client state is untouched
- exec()s with stdio inheritance (interactive TUI) or stdin from /dev/null + optional timeout (non-interactive `--prompt`)
- Provides a `main()` for ad-hoc CLI use (harness-codex, etc.)
- Stdlib-only at runtime

Five new `lucebox <client>` Typer subcommands in lucebox/lucebox/cli.py:
- `lucebox codex|opencode|hermes|pi|openclaw [--prompt P] [--url U] [--model M]`
- Shared `_detect_server_url()` probes the standard localhost/docker bases for /health, picks the first responder
- Shared `_exec_client()` does the launcher dispatch + typer.Exit translation
- Each subcommand is ~10 lines: import the launcher, call the helper

Dockerfile:
- COPY harness /src/harness in the builder stage (alongside lucebox, luce-bench, optimizations/{pflash,megakernel}) so uv sync resolves the workspace member
- COPY --from=builder /src/harness /opt/lucebox-hub/harness in the runtime stage so profile.py inside the container can `python -m harness.bench` (the path it uses since 7bbf9af)

harness/clients/run_lucebench.sh:
- Switch the underlying call from `python -m lucebench.cli` to `python -m harness.bench` for consistency with profile.py's delegation path. Both go through harness.bench now.

Tests: 20/20 lucebox tests still pass. Ruff + format clean.

Out of scope (deferred):
- openwebui + openwebui-tools: separate web service lifecycle (start-server-in-background, poll for ready, etc.); port as a follow-up if needed
- lucebox.sh host-side dispatch: currently `lucebox.sh <client>` routes into the container (where the client binary isn't installed); need a host-side `cmd_client` that runs the client binary on the host. Works today via `uv run python -m lucebox <client>` directly.

…/LICENSE)

uv builds harness inside an isolated sandbox at /src/harness/, where the parent ../LICENSE file is not visible. hatchling errored out:

    OSError: License file does not exist: ../LICENSE
    hint: `harness` was included because `lucebox-hub` (v0.0.0) depends on `harness`

Switch to inline `license = { text = "Apache-2.0" }` (same pattern lucebox/ and server/ use). Matches PyPA 2025 guidance and avoids the sandbox-path trap. The text of LICENSE itself stays at repo root.

Old auto-detect had a hardcoded Qwen3.6 preference: when both Qwen3.6-27B-Q4_K_M.gguf and gemma-4-26b-a4b-it-Q4_K_M.gguf were present in models/, Qwen always won, silently. This hid a real bug during the 2026-05-27 matrix run: the container was supposed to serve gemma4 but ran qwen for 10 hours of sweep-think before the operator noticed bench numbers were wrong (and the wrong-target draft slowed decode 4× because the qwen draft GGUF was q4_k_m vs gemma's q8_0, causing every think-mode case to hit the 300 s bench timeout).

New behavior:
- Find ALL .gguf candidates ≥5 GB outside models/draft/ (the size threshold cleanly excludes draft GGUFs without parsing GGUF arch metadata).
- 0 candidates → die with clear message + DFLASH_TARGET hint
- 1 candidate → use it, log "Auto-detected target: <name>"
- 2+ candidates → WARN with the full list (marking the choice with *), tell the operator to pin DFLASH_TARGET=<path>, then pick the first alphabetically (deterministic across runs).

Trade-off: gemma4 wins over qwen3.6 alphabetically. That's not a value judgment; it's just deterministic. The point is the warn-loudly path, not the choice of which model wins by default. Operators with both models present MUST set DFLASH_TARGET to skip the warning.

The hardcoded Qwen3.6 family preference path is gone: it was fundamentally the wrong shape (silently picking based on a name pattern). If we want a "preferred model" knob later it should be DFLASH_PREFERRED_TARGET or similar, with the same warn-when-multiple-candidates rule.
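The selection rules above can be restated as a short Python sketch. The real logic lives in the container entrypoint and is not reproduced here, so the function name `pick_target`, the decimal reading of "5 GB", and the case-insensitive sort are assumptions. The sort choice is inferred from the stated outcome: a plain byte-order sort would place uppercase "Qwen" before lowercase "gemma".

```python
from collections.abc import Callable
from pathlib import Path

MIN_TARGET_BYTES = 5 * 10**9  # "≥5 GB"; decimal vs binary GB is an assumption


def pick_target(
    models_dir: Path,
    min_bytes: int = MIN_TARGET_BYTES,
    warn: Callable[[str], None] = print,
) -> Path:
    """Pick a target GGUF from models_dir, warning loudly on ambiguity."""
    draft_dir = models_dir / "draft"
    candidates = sorted(
        (
            p
            for p in models_dir.rglob("*.gguf")
            # Skip anything under models/draft/ and anything below the size floor.
            if draft_dir not in p.parents and p.stat().st_size >= min_bytes
        ),
        key=lambda p: p.name.lower(),  # case-insensitive alphabetical order
    )
    if not candidates:
        raise SystemExit(
            f"No target .gguf of at least {min_bytes} bytes under {models_dir}; "
            "set DFLASH_TARGET=<path>"
        )
    chosen = candidates[0]
    if len(candidates) == 1:
        warn(f"Auto-detected target: {chosen.name}")
    else:
        warn("WARNING: multiple target candidates found; pin DFLASH_TARGET=<path>")
        for p in candidates:
            marker = "*" if p == chosen else " "
            warn(f"  {marker} {p}")
    return chosen
```

The size floor is a parameter so the rule can be exercised with tiny fixture files instead of multi-gigabyte models.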

easel · 2026-05-27T17:59:18Z

Superseded by a flattened single-commit branch feat/lucebox-docker — same scope (docker stack + lucebox CLI + bench/profile + harness + luce-bench in-tree), but collapsed onto current main since most of the original branch's server-side work landed separately via #269 / #281 / #262.

New PR forthcoming.

easel marked this pull request as draft May 19, 2026 13:13

cubic-dev-ai Bot reviewed May 19, 2026


easel force-pushed the integration/props-uv-squared-clean branch 3 times, most recently from 4d38d50 to 4621867 Compare May 20, 2026 16:46

easel force-pushed the integration/props-uv-squared-clean branch 2 times, most recently from 0067f9b to 743da47 Compare May 20, 2026 22:30

easel added 3 commits May 22, 2026 10:54

feat(dflash): align server props and thinking controls

5b67cf2

feat(lucebox): add release CLI and Docker prebuilds

2560086

feat(lucebox): add benchmark and profile evidence suite

84ddd04

easel force-pushed the integration/props-uv-squared-clean branch from 5c6b502 to 84ddd04 Compare May 22, 2026 14:59

easel mentioned this pull request May 22, 2026

feat(dflash): add /props introspection endpoint #190

Closed

2 tasks

easel and others added 17 commits May 22, 2026 15:49

easel and others added 9 commits May 25, 2026 21:44

davide221 mentioned this pull request May 26, 2026

chore(repo): rename dflash→server, group pflash+megakernel under optimizations/ #281

Merged

easel added 2 commits May 26, 2026 17:45

easel and others added 14 commits May 26, 2026 19:52

chore(deps): merge luce-bench[forge] + harness workspace member

9599f91

easel closed this May 27, 2026

easel mentioned this pull request May 27, 2026

feat(lucebox): docker stack + CLI + bench/profile + harness + luce-bench in-tree #285

Closed


docker, cli, smoke, bench and autotune first-run dx #226

easel wants to merge 135 commits into Luce-Org:main from easel:integration/props-uv-squared-clean

2 participants
