feat(server): derive model name from GGUF; default port 1236, ctx 128K#47
Closed
easel wants to merge 2 commits into
…, ctx 128K

/v1/models now returns the GGUF stem (e.g. Qwen3.5-27B-Q4_K_M) instead of the hardcoded "luce-dflash". The default port is now 1236 and the default max_ctx is 131072 (validated on RTX 5090 with TQ3_0 KV cache). Startup now uses uv.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
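As a minimal sketch of the stem derivation described above: the function name `model_name_from_gguf`, the example path, and the fallback behaviour are assumptions for illustration, not the PR's actual code.

```python
from pathlib import Path


def model_name_from_gguf(gguf_path: str, fallback: str = "luce-dflash") -> str:
    """Return the filename stem of a GGUF path, or the fallback if there is none.

    Hypothetical helper: only the last suffix (".gguf") is stripped, so dots
    inside the model name (e.g. "Qwen3.5") are preserved.
    """
    stem = Path(gguf_path).stem
    return stem or fallback


# Example path is made up; only the stem matches the PR's example.
print(model_name_from_gguf("/models/Qwen3.5-27B-Q4_K_M.gguf"))
```

Falling back to the old hardcoded name when no path is available is one possible design; the PR does not say what happens in that case.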
- src/dflash/server/ replaces scripts/server.py + scripts/server_tools.py
- server/parsing.py: Qwen3.x tool-call + reasoning parsers (pure fns, no deps)
- server/schemas.py: pydantic models for OpenAI + Anthropic APIs
- Entry point: dflash-server = "dflash.server:main" (uv run dflash-server)
- tests/: 33 tests covering parsing and all HTTP endpoints
- Consolidates tool calling, streaming state machine, GGUF model name, port 1236, 128K ctx defaults from server_tools.py into single package

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
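To illustrate what a dependency-free reasoning parser of the kind listed above might look like, here is a rough sketch. It assumes the model wraps its reasoning in `<think>…</think>` tags; the function name `split_reasoning` and the handling of a missing close tag are hypothetical and not taken from server/parsing.py.

```python
from typing import Tuple


def split_reasoning(text: str) -> Tuple[str, str]:
    """Split model output into (reasoning, visible_answer).

    Assumption: reasoning is everything before the first "</think>".
    If the close tag never appears, the whole text is treated as the answer.
    """
    open_tag, close_tag = "<think>", "</think>"
    head, sep, tail = text.partition(close_tag)
    if not sep:
        # No close tag found: nothing is classified as reasoning.
        return "", text.strip()
    reasoning = head.replace(open_tag, "", 1).strip()
    return reasoning, tail.strip()


reasoning, answer = split_reasoning("<think>check the port</think>\nIt is 1236.")
print(reasoning)
print(answer)
```

Keeping the parser a pure function over strings means it can be unit-tested without starting the server.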
dusterbloom
added a commit
to dusterbloom/lucebox-hub
that referenced
this pull request
May 21, 2026
- Single condition: ee7 (EARLY_EXIT_N=7, SCORE_LAYERS=7) + --prefill-skip-park at 32K
- Server startup confirmed OK, no cuMemSetAccess crash, peak VRAM 16.6 GB
- NIAH/drafter_fwd data incomplete: safety classifier down for >20 min during 30 min budget
- Corrected bench script (run_skip_park_32k.py) ready for re-run
- Re-run: python3 dflash/bench/run_skip_park_32k.py from worktree
dusterbloom
added a commit
to dusterbloom/lucebox-hub
that referenced
this pull request
May 23, 2026
- Single condition: ee7 (EARLY_EXIT_N=7, SCORE_LAYERS=7) + --prefill-skip-park at 32K
- Server startup confirmed OK, no cuMemSetAccess crash, peak VRAM 16.6 GB
- NIAH/drafter_fwd data incomplete: safety classifier down for >20 min during 30 min budget
- Corrected bench script (run_skip_park_32k.py) ready for re-run
- Re-run: python3 dflash/bench/run_skip_park_32k.py from worktree
easel
added a commit
to easel/lucebox-hub
that referenced
this pull request
May 24, 2026
…rg#47)

Three real bugs in the spec-decode → AR tail-off path, caught by audit:

## Bug 1: stale cache_.last_tok seed

cache_.last_tok is set during do_prefill (qwen35_backend.cpp:701) and NEVER updated by the spec-decode commit loop; the local `last_tok` is the authoritative most-recently-committed token. When spec-decode tailed off to AR, AR's first-token block read cache_.last_tok, which still held the prefill's argmax (the prediction for position prompt_len, stale by `n_generated` positions). AR's seed was wrong → garbage continuation.

Fix: sync `cache_.last_tok = last_tok` at the tail-off site before calling do_ar_decode.

## Bug 2: AR first-token block re-emits + advances past KV state

do_ar_decode's first-token block was designed for the post-prefill entry: it pushes cache_.last_tok (the prefill's predicted continuation) to out_tokens, emits it, and increments `committed`. In the spec-decode tail-off case, that token has ALREADY been emitted by spec-decode and pushed to out_tokens. Running the first-token block again duplicates the emission AND advances `committed` past where the KV cache actually has state. The next AR iteration then embeds the duplicated token at a position with empty/garbage KV, producing nonsense.

Fix: skip the first-token block when out_tokens is non-empty (auto-detect continuation mode). The loop starts at i=0 in continuation mode (n_gen iterations) vs i=1 in standalone mode (1 first-token + n_gen-1 loop iterations = n_gen total). A heuristic was chosen over an explicit param to avoid signature churn.

## Bug 3: tail-off passed original n_gen, would overshoot budget

Spec-decode tail-off was passing `n_gen` unchanged (e.g. 16000) to do_ar_decode. With Bug 2 fixed (skip first-token), AR's loop would then iterate n_gen=16000 MORE times: total committed = spec_count + 16000 ≈ 31K tokens, 2× the request budget.

Fix: pass `n_gen = need_commit_budget` (the remaining decode tokens in the request budget). With this AND continuation mode (skip first token), AR produces exactly `need_commit_budget` more tokens, force-close fires immediately on the first iteration (remaining is at/below hard_limit), and the model gets ~512 budgeted tokens to write the visible answer.

All three bugs were latent: the bench was still producing an 80% PASS rate because most cases close `</think>` early and never hit the tail-off path. The cases that DID tail off were generating garbage, but in a way that often still happened to yield a parseable answer.

Audit task Luce-Org#47 done: bugs caught, fixed in one place each, minimal surface change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
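The iteration counts in the Bug 2 and Bug 3 fixes can be checked with a small Python model of the loop. This is only a sketch of the control flow described in the commit message, not the C++ in do_ar_decode: `ar_decode`, `next_token`, and the toy "model" that returns the previous token plus one are all hypothetical, and KV-cache state is not modelled.

```python
from typing import Callable, List


def ar_decode(out_tokens: List[int], seed_tok: int, n_gen: int,
              next_token: Callable[[int], int]) -> int:
    """Append decoded tokens to out_tokens; return how many this call committed."""
    committed = 0
    last = seed_tok
    # Auto-detect continuation mode: tokens were already emitted upstream.
    continuation = len(out_tokens) > 0
    if not continuation:
        # Standalone entry: the seed is the prefill's prediction, not yet emitted.
        out_tokens.append(last)
        committed += 1
    start = 0 if continuation else 1
    for _ in range(start, n_gen):
        last = next_token(last)
        out_tokens.append(last)
        committed += 1
    return committed


toy_model = lambda tok: tok + 1  # stand-in for a real forward pass

# Standalone: 1 first token + (n_gen - 1) loop iterations = n_gen total.
fresh: List[int] = []
print(ar_decode(fresh, 100, 4, toy_model), fresh)

# Tail-off: seed from the last committed token and pass only the remaining budget.
request_budget = 10
spec_out = [1, 2, 3, 4, 5, 6]  # tokens already committed by spec-decode
need_commit_budget = request_budget - len(spec_out)
print(ar_decode(spec_out, spec_out[-1], need_commit_budget, toy_model), len(spec_out))
```

In this model, passing the original `n_gen` instead of `need_commit_budget` in the tail-off call would commit `request_budget` additional tokens on top of the six already present, which is the overshoot Bug 3 describes.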
easel
pushed a commit
to easel/lucebox-hub
that referenced
this pull request
May 26, 2026
- Single condition: ee7 (EARLY_EXIT_N=7, SCORE_LAYERS=7) + --prefill-skip-park at 32K
- Server startup confirmed OK, no cuMemSetAccess crash, peak VRAM 16.6 GB
- NIAH/drafter_fwd data incomplete: safety classifier down for >20 min during 30 min budget
- Corrected bench script (run_skip_park_32k.py) ready for re-run
- Re-run: python3 dflash/bench/run_skip_park_32k.py from worktree
Summary

- `/v1/models` now returns the GGUF filename stem (e.g. `Qwen3.5-27B-Q4_K_M`) instead of the hardcoded `luce-dflash`
- Default port changed to `1236`
- Default `--max-ctx` changed to `131072` (128K validated on RTX 5090 with TQ3_0 KV cache)
- Startup updated to use `uv run`

Test plan

- `uv run scripts/server.py` starts on port 1236
- `GET /v1/models` returns the GGUF stem as the model id

🤖 Generated with Claude Code
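The second test-plan item could be checked from a script along these lines. This is a sketch under assumptions: it presumes the server listens on 127.0.0.1 and that the response follows the OpenAI list shape (`{"data": [{"id": ...}]}`); the helper names `model_ids` and `fetch_model_ids` are made up for illustration.

```python
import json
from urllib.request import urlopen


def model_ids(payload: dict) -> list:
    """Extract model ids from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_model_ids(base_url: str = "http://127.0.0.1:1236") -> list:
    """Query a running server; the host is an assumption, the port is the new default."""
    with urlopen(f"{base_url}/v1/models", timeout=5) as resp:
        return model_ids(json.load(resp))


# Offline check of the parsing step with a hand-written payload.
sample = {"object": "list", "data": [{"id": "Qwen3.5-27B-Q4_K_M", "object": "model"}]}
print(model_ids(sample))
```

With the server running, `fetch_model_ids()` should return a list containing the GGUF stem if the change behaves as described.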