
feat(server): derive model name from GGUF; default port 1236, ctx 128K#47

Closed
easel wants to merge 2 commits into
Luce-Org:mainfrom
easel:feat/server-defaults

Conversation


@easel easel commented Apr 27, 2026

Collaborator

Summary

  • /v1/models now returns the GGUF filename stem (e.g. Qwen3.5-27B-Q4_K_M) instead of the hardcoded luce-dflash
  • Default port changed to 1236
  • Default --max-ctx changed to 131072 (128K validated on RTX 5090 with TQ3_0 KV cache)
  • Startup doc updated to use uv run
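A minimal sketch of the defaults described above, assuming the model id is simply the GGUF path with its directory and `.gguf` suffix removed. The `--model` and `--port` flag names and the helper names are hypothetical illustrations; only `--max-ctx`, port 1236, and 131072 come from this PR.

```python
import argparse
from pathlib import Path

DEFAULT_PORT = 1236          # new default port from this PR
DEFAULT_MAX_CTX = 128 * 1024  # 131072 tokens, the new --max-ctx default


def model_id_from_gguf(gguf_path: str) -> str:
    """Return the filename stem, e.g. 'Qwen3.5-27B-Q4_K_M' for '.../Qwen3.5-27B-Q4_K_M.gguf'."""
    # Path.stem strips only the final suffix, so the dots inside the name survive.
    return Path(gguf_path).stem


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI; the real server's flags may differ."""
    parser = argparse.ArgumentParser(description="sketch of the server defaults")
    parser.add_argument("--model", required=True, help="path to a .gguf file (assumed flag)")
    parser.add_argument("--port", type=int, default=DEFAULT_PORT)
    parser.add_argument("--max-ctx", type=int, default=DEFAULT_MAX_CTX)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["--model", "/models/Qwen3.5-27B-Q4_K_M.gguf"])
    print(model_id_from_gguf(args.model), args.port, args.max_ctx)
```

Using the stem rather than a hardcoded string means clients that list `/v1/models` see which quantization is actually loaded.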

Test plan

  • uv run scripts/server.py starts on port 1236
  • GET /v1/models returns the GGUF stem as the model id
  • Chat completions work at 128K context

🤖 Generated with Claude Code

easel and others added 2 commits April 27, 2026 14:33
…, ctx 128K

/v1/models now returns the GGUF stem (e.g. Qwen3.5-27B-Q4_K_M) instead of
the hardcoded "luce-dflash". Default port changed to 1236, max_ctx to 131072
(validated on RTX 5090 with TQ3_0 KV cache), and startup updated to use uv.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- src/dflash/server/ replaces scripts/server.py + scripts/server_tools.py
- server/parsing.py: Qwen3.x tool-call + reasoning parsers (pure fns, no deps)
- server/schemas.py: pydantic models for OpenAI + Anthropic APIs
- Entry point: dflash-server = "dflash.server:main" (uv run dflash-server)
- tests/: 33 tests covering parsing and all HTTP endpoints
- Consolidates tool calling, streaming state machine, GGUF model name,
  port 1236, 128K ctx defaults from server_tools.py into single package

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
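The reasoning parser mentioned in the commit above is described as a pure function with no dependencies. A rough sketch of what such a function could look like, assuming the model wraps its reasoning in `<think>` ... `</think>` (only the closing tag appears in this thread; the opening tag, the function name, and the return shape are assumptions, not the actual `server/parsing.py` API):

```python
from typing import Tuple

THINK_OPEN = "<think>"    # assumed opening marker
THINK_CLOSE = "</think>"  # closing marker referenced later in this thread


def split_reasoning(text: str) -> Tuple[str, str]:
    """Split raw model output into (reasoning, visible_answer).

    If no closing tag is present, this sketch treats the whole text as the
    visible answer; a real parser might instead treat it as unfinished reasoning.
    """
    head, sep, tail = text.partition(THINK_CLOSE)
    if not sep:
        return "", text.strip()
    reasoning = head.strip()
    if reasoning.startswith(THINK_OPEN):
        reasoning = reasoning[len(THINK_OPEN):].strip()
    return reasoning, tail.strip()


if __name__ == "__main__":
    print(split_reasoning("<think>check the units</think>The answer is 42."))
```

Keeping the parser free of I/O and framework imports is what makes it cheap to cover with unit tests.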

easel commented Apr 27, 2026

Collaborator Author

Closing - content absorbed into PR #43 (feat/setup-results-uv) which now contains all server/uv changes as a single clean commit rebased on top of the 5090 hardware support PR (#48).

@easel easel closed this Apr 27, 2026
dusterbloom added a commit to dusterbloom/lucebox-hub that referenced this pull request May 21, 2026
- Single condition: ee7 (EARLY_EXIT_N=7, SCORE_LAYERS=7) + --prefill-skip-park at 32K
- Server startup confirmed OK, no cuMemSetAccess crash, peak VRAM 16.6 GB
- NIAH/drafter_fwd data incomplete: safety classifier down for >20 min during 30 min budget
- Corrected bench script (run_skip_park_32k.py) ready for re-run
- Re-run: python3 dflash/bench/run_skip_park_32k.py from worktree
dusterbloom added a commit to dusterbloom/lucebox-hub that referenced this pull request May 23, 2026
- Single condition: ee7 (EARLY_EXIT_N=7, SCORE_LAYERS=7) + --prefill-skip-park at 32K
- Server startup confirmed OK, no cuMemSetAccess crash, peak VRAM 16.6 GB
- NIAH/drafter_fwd data incomplete: safety classifier down for >20 min during 30 min budget
- Corrected bench script (run_skip_park_32k.py) ready for re-run
- Re-run: python3 dflash/bench/run_skip_park_32k.py from worktree
easel added a commit to easel/lucebox-hub that referenced this pull request May 24, 2026
…rg#47)

Three real bugs in the spec-decode → AR tail-off path, caught by audit:

## Bug 1: stale cache_.last_tok seed

cache_.last_tok is set during do_prefill (qwen35_backend.cpp:701) and
NEVER updated by the spec-decode commit loop — local `last_tok` is the
authoritative most-recently-committed token. When spec-decode tailed
off to AR, AR's first-token block read cache_.last_tok which still
held the prefill's argmax (= prediction for position prompt_len, way
stale by `n_generated` positions). AR's seed was wrong → garbage
continuation.

Fix: sync `cache_.last_tok = last_tok` at the tail-off site before
calling do_ar_decode.

## Bug 2: AR first-token block re-emits + advances past KV state

do_ar_decode's first-token block was designed for the post-prefill
entry: it pushes cache_.last_tok (the prefill's predicted continuation)
to out_tokens, emits it, and increments `committed`. For the spec-decode tail-off
case, that token has ALREADY been emitted by spec-decode and pushed to
out_tokens — running the first-token block again duplicates the
emission AND advances `committed` past where the KV cache actually has
state. The next AR iteration then embeds the duplicated token at a
position with empty/garbage KV, producing nonsense.

Fix: skip the first-token block when out_tokens is non-empty (auto-
detect continuation mode). Loop starts at i=0 in continuation mode
(produces n_gen iterations) vs i=1 in standalone mode (1 first-token
+ n_gen-1 loop iterations = n_gen total). Heuristic chosen over an
explicit param to avoid signature churn.
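The loop-count bookkeeping for the two modes can be illustrated with a toy Python model. This is not the C++ `do_ar_decode`; the function names and the fake "next token" rule are hypothetical, and the sketch only mirrors the control flow described above (skip the first-token block when `out_tokens` is non-empty, start the loop at 0 instead of 1).

```python
from typing import List


def fake_next_token(tok: int) -> int:
    """Stand-in for a real forward pass: just returns tok + 1."""
    return tok + 1


def ar_decode_sketch(n_gen: int, out_tokens: List[int], last_tok: int) -> List[int]:
    """Toy model of the described control flow; returns tokens emitted by this call."""
    emitted: List[int] = []
    continuation = len(out_tokens) > 0  # auto-detect continuation mode

    if not continuation:
        # Standalone (post-prefill) entry: emit the prefill's predicted token first.
        out_tokens.append(last_tok)
        emitted.append(last_tok)
        start = 1
    else:
        # Tail-off entry: last_tok was already emitted by spec-decode, so skip it.
        start = 0

    tok = last_tok
    for _ in range(start, n_gen):
        tok = fake_next_token(tok)
        out_tokens.append(tok)
        emitted.append(tok)
    return emitted


if __name__ == "__main__":
    print(ar_decode_sketch(4, [], 10))        # standalone entry
    print(ar_decode_sketch(4, [10, 11], 11))  # continuation after spec-decode
```

In both modes the call emits exactly `n_gen` tokens, and in continuation mode the already-emitted seed token is not duplicated.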

## Bug 3: tail-off passed original n_gen, would overshoot budget

Spec-decode tail-off was passing `n_gen` unchanged (e.g. 16000) to
do_ar_decode. With Bug 2 fixed (skip first-token), AR's loop would
then iterate n_gen=16000 MORE times — total committed = spec_count +
16000 ≈ 31K tokens, 2× the request budget.

Fix: pass `n_gen = need_commit_budget` (the remaining decode tokens
in the request budget). With this AND continuation mode (skip first
token), AR produces exactly `need_commit_budget` more tokens, force-
close fires immediately on the first iter (remaining is at/below
hard_limit), and the model gets ~512 budgeted tokens to write the
visible answer.
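The budget arithmetic behind Bug 3 can be checked with a few lines of Python. The helper names are hypothetical, the `spec_count` of 15000 is an illustrative value chosen to match the "≈ 31K" figure above, and computing the remaining budget as a plain subtraction is an assumption about how `need_commit_budget` is derived.

```python
def remaining_budget(request_budget: int, spec_count: int) -> int:
    """Tokens still allowed after spec-decode has committed spec_count (assumed formula)."""
    return max(0, request_budget - spec_count)


def total_committed(spec_count: int, ar_n_gen: int) -> int:
    """Total tokens committed when AR runs ar_n_gen iterations in continuation mode."""
    return spec_count + ar_n_gen


if __name__ == "__main__":
    request_budget = 16000  # n_gen from the example above
    spec_count = 15000      # illustrative: tokens committed before tail-off

    # Before the fix: the original n_gen is passed through unchanged.
    print(total_committed(spec_count, request_budget))

    # After the fix: only the remaining budget is passed to AR.
    print(total_committed(spec_count, remaining_budget(request_budget, spec_count)))
```

With these numbers the unfixed path commits 31000 tokens, while the fixed path stops at the 16000-token request budget.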

All three bugs were latent: the bench was still producing an 80% PASS rate
because most cases close </think> early and never hit the tail-off
path. The cases that DID tail off were generating garbage, but in a
way that often still happened to yield a parseable answer.

Audit task Luce-Org#47 done — bugs caught, fixed in one place each, minimal
surface change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 26, 2026
- Single condition: ee7 (EARLY_EXIT_N=7, SCORE_LAYERS=7) + --prefill-skip-park at 32K
- Server startup confirmed OK, no cuMemSetAccess crash, peak VRAM 16.6 GB
- NIAH/drafter_fwd data incomplete: safety classifier down for >20 min during 30 min budget
- Corrected bench script (run_skip_park_32k.py) ready for re-run
- Re-run: python3 dflash/bench/run_skip_park_32k.py from worktree