feat(server): derive model name from GGUF; default port 1236, ctx 128K by easel · Pull Request #47 · Luce-Org/lucebox-hub

easel · 2026-04-27T18:34:14Z

Summary

/v1/models now returns the GGUF filename stem (e.g. Qwen3.5-27B-Q4_K_M) instead of the hardcoded luce-dflash
Default port changed to 1236
Default --max-ctx changed to 131072 (128K validated on RTX 5090 with TQ3_0 KV cache)
Startup doc updated to use uv run
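A minimal sketch of the first three summary items, assuming the model id is simply the GGUF filename with its `.gguf` suffix dropped. The names `model_id_from_gguf`, `models_response`, `DEFAULT_PORT`, `DEFAULT_MAX_CTX`, and the `owned_by` value are hypothetical and are not taken from the PR's code:

```python
from pathlib import Path

# Defaults named in this PR (the constant names are illustrative assumptions).
DEFAULT_PORT = 1236
DEFAULT_MAX_CTX = 131072  # 128K tokens = 128 * 1024


def model_id_from_gguf(gguf_path: str) -> str:
    """Return the GGUF filename stem, i.e. the filename without '.gguf'."""
    return Path(gguf_path).stem


def models_response(gguf_path: str) -> dict:
    """Build an OpenAI-style /v1/models payload with a single model entry."""
    return {
        "object": "list",
        "data": [
            {
                "id": model_id_from_gguf(gguf_path),
                "object": "model",
                "owned_by": "local",  # assumed placeholder value
            }
        ],
    }
```

`Path.stem` strips only the final suffix, so the dots inside a name such as `Qwen3.5-27B-Q4_K_M.gguf` are preserved.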

Test plan

uv run scripts/server.py starts on port 1236
GET /v1/models returns the GGUF stem as the model id
Chat completions work at 128K context
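The second test-plan item could be checked with a short script along these lines. It is a sketch only: it assumes the server listens on 127.0.0.1 at the new default port and returns an OpenAI-style list payload, and the helper names `extract_model_ids` and `fetch_model_ids` are hypothetical:

```python
import json
import urllib.request


def extract_model_ids(payload: dict) -> list:
    """Pull the 'id' field out of each entry of an OpenAI-style model list."""
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_model_ids(base_url: str = "http://127.0.0.1:1236") -> list:
    """GET /v1/models from a running server and return the model ids."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=10) as resp:
        return extract_model_ids(json.load(resp))
```

With the server running, `fetch_model_ids()` should return a single id equal to the stem of the loaded GGUF file.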

🤖 Generated with Claude Code

…, ctx 128K

/v1/models now returns the GGUF stem (e.g. Qwen3.5-27B-Q4_K_M) instead of the hardcoded "luce-dflash". Default port changed to 1236, max_ctx to 131072 (validated on RTX 5090 with TQ3_0 KV cache), and startup updated to use uv.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- src/dflash/server/ replaces scripts/server.py + scripts/server_tools.py
- server/parsing.py: Qwen3.x tool-call + reasoning parsers (pure fns, no deps)
- server/schemas.py: pydantic models for OpenAI + Anthropic APIs
- Entry point: dflash-server = "dflash.server:main" (uv run dflash-server)
- tests/: 33 tests covering parsing and all HTTP endpoints
- Consolidates tool calling, streaming state machine, GGUF model name, port 1236, 128K ctx defaults from server_tools.py into single package

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
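To illustrate what a dependency-free reasoning parser of the kind listed for server/parsing.py might look like, here is a minimal sketch. It is not the PR's implementation: the function name `split_reasoning` is hypothetical, and it assumes the model wraps its reasoning in `<think>` ... `</think>` tags ahead of the visible answer:

```python
from typing import Tuple

OPEN_TAG = "<think>"
CLOSE_TAG = "</think>"


def split_reasoning(text: str) -> Tuple[str, str]:
    """Split raw model output into (reasoning, visible_content)."""
    stripped = text.lstrip()
    head, sep, tail = stripped.partition(CLOSE_TAG)
    if sep:
        # Closing tag found: everything before it is reasoning.
        if head.startswith(OPEN_TAG):
            head = head[len(OPEN_TAG):]
        return head.strip(), tail.strip()
    if stripped.startswith(OPEN_TAG):
        # Reasoning was opened but never closed: no visible content yet.
        return stripped[len(OPEN_TAG):].strip(), ""
    # No reasoning tags at all: the whole text is visible content.
    return "", text.strip()
```

Keeping the parser a pure function of its input string makes it easy to unit-test without starting the HTTP server.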

easel · 2026-04-27T20:46:08Z

Closing - content absorbed into PR #43 (feat/setup-results-uv) which now contains all server/uv changes as a single clean commit rebased on top of the 5090 hardware support PR (#48).

- Single condition: ee7 (EARLY_EXIT_N=7, SCORE_LAYERS=7) + --prefill-skip-park at 32K
- Server startup confirmed OK, no cuMemSetAccess crash, peak VRAM 16.6 GB
- NIAH/drafter_fwd data incomplete: safety classifier down for >20 min during 30 min budget
- Corrected bench script (run_skip_park_32k.py) ready for re-run
- Re-run: python3 dflash/bench/run_skip_park_32k.py from worktree

…rg#47)

Three real bugs in the spec-decode → AR tail-off path, caught by audit:

## Bug 1: stale cache_.last_tok seed

cache_.last_tok is set during do_prefill (qwen35_backend.cpp:701) and NEVER updated by the spec-decode commit loop; the local `last_tok` is the authoritative most-recently-committed token. When spec-decode tailed off to AR, AR's first-token block read cache_.last_tok, which still held the prefill's argmax (= prediction for position prompt_len, stale by `n_generated` positions). AR's seed was wrong → garbage continuation.

Fix: sync `cache_.last_tok = last_tok` at the tail-off site before calling do_ar_decode.

## Bug 2: AR first-token block re-emits + advances past KV state

do_ar_decode's first-token block was designed for the post-prefill entry: it pushes cache_.last_tok (the prefill's predicted continuation) to out_tokens, emits it, and increments `committed`. For the spec-decode tail-off case, that token has ALREADY been emitted by spec-decode and pushed to out_tokens, so running the first-token block again duplicates the emission AND advances `committed` past where the KV cache actually has state. The next AR iteration then embeds the duplicated token at a position with empty/garbage KV, producing nonsense.

Fix: skip the first-token block when out_tokens is non-empty (auto-detect continuation mode). Loop starts at i=0 in continuation mode (produces n_gen iterations) vs i=1 in standalone mode (1 first-token + n_gen-1 loop iterations = n_gen total). Heuristic chosen over an explicit param to avoid signature churn.

## Bug 3: tail-off passed original n_gen, would overshoot budget

Spec-decode tail-off was passing `n_gen` unchanged (e.g. 16000) to do_ar_decode. With Bug 2 fixed (skip first-token), AR's loop would then iterate n_gen=16000 MORE times, for a total committed = spec_count + 16000 ≈ 31K tokens, 2× the request budget.

Fix: pass `n_gen = need_commit_budget` (the remaining decode tokens in the request budget). With this AND continuation mode (skip first token), AR produces exactly `need_commit_budget` more tokens, force-close fires immediately on the first iter (remaining is at/below hard_limit), and the model gets ~512 budgeted tokens to write the visible answer.

All three bugs were latent: bench was still producing an 80% PASS rate because most cases close </think> early and never hit the tail-off path. The cases that DID tail off were generating garbage, but in a way that often still happened to extract a parseable answer.

Audit task Luce-Org#47 done: bugs caught, fixed in one place each, minimal surface change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
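The three fixes can be pictured with a toy Python model of the tail-off path. This is a sketch under stated assumptions, not the C++ code in qwen35_backend.cpp: `next_token` is a hypothetical stand-in for one AR forward step, `tail_off_to_ar` is an invented wrapper, and the remaining budget is assumed to be the request budget minus the tokens spec-decode already committed.

```python
from typing import Callable, List


def do_ar_decode(seed_tok: int, n_gen: int, out_tokens: List[int],
                 next_token: Callable[[int], int]) -> int:
    """Append up to n_gen tokens to out_tokens; return how many were committed."""
    if n_gen <= 0:
        return 0
    committed = 0
    last_tok = seed_tok
    # Bug 2 fix: a non-empty out_tokens means the seed was already emitted.
    continuation = len(out_tokens) > 0
    if not continuation:
        # Standalone (post-prefill) entry: emit the seed as the first token.
        out_tokens.append(last_tok)
        committed += 1
    start = 0 if continuation else 1
    for _ in range(start, n_gen):
        last_tok = next_token(last_tok)
        out_tokens.append(last_tok)
        committed += 1
    return committed


def tail_off_to_ar(cache_last_tok: int, spec_tokens: List[int],
                   request_budget: int,
                   next_token: Callable[[int], int]) -> List[int]:
    """Continue with AR decoding after spec-decode stops early."""
    out_tokens = list(spec_tokens)
    # Bug 1 fix: seed AR from the most recently committed token,
    # not from the stale value cached at prefill time.
    last_tok = out_tokens[-1] if out_tokens else cache_last_tok
    # Bug 3 fix: pass only the remaining budget, not the original n_gen.
    need_commit_budget = request_budget - len(out_tokens)
    do_ar_decode(last_tok, need_commit_budget, out_tokens, next_token)
    return out_tokens
```

With a toy step such as `lambda t: t + 1`, the continuation produces no duplicated token and the total length equals the request budget, which is the invariant the three fixes restore.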


easel and others added 2 commits April 27, 2026 14:33

easel closed this Apr 27, 2026

noonghunna mentioned this pull request May 4, 2026

Multi-GPU support: OOM error with dual RTX 3090 #42

Open

easel mentioned this pull request May 24, 2026

feat(cpp-server): thinking-budget v2 + multi-dialect reasoning aliases + spec #269

Merged

dusterbloom mentioned this pull request May 28, 2026

feat(pflash): prefill compress up to 128k -> 2-12× prefill (content-dependent), decode at parity #274

Open

easel wants to merge 2 commits into Luce-Org:main from easel:feat/server-defaults
