refactor(server): proper Python package with tool-calling, tests, uv setup by easel · Pull Request #43 · Luce-Org/lucebox-hub

easel · 2026-04-27T14:50:28Z

Summary

Replaces ad-hoc scripts/server*.py with a proper dflash Python package under src/dflash/server/
OpenAI-compatible API with modular parsing, schemas, and tool-calling support
pytest coverage for parsing and server endpoints
pyproject.toml with [build-system], dflash-server entry point, and uv.lock
Removes old scripts/server.py, scripts/server_tools.py, scripts/test_server.py

Base: PR #48 (fix/consumer-blackwell-auto-detect) — RTX 5090 hardware support

Test plan

uv sync — dflash-server entry point installs without warnings
pytest dflash/tests/ passes
dflash-server --model <gguf> starts and serves /v1/chat/completions

🤖 Generated with Claude Code

huggingface-cli is deprecated; the current CLI is `hf`. Updates all three occurrences in README.md, the setup script log messages, and the CONTRIBUTING.md dependency table. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Adds pyproject.toml declaring all script dependencies (transformers, numpy, gguf, fastapi, uvicorn, jinja2, pytest, httpx) with torch gated behind an [oracle] optional. Use `uv sync` to install. Replaces pipx with uv in setup_system.sh: installs uv via the official astral.sh installer for $SUDO_USER, then uses `uv tool install` for hf. Updates README quick-start server block and CONTRIBUTING.md to use `uv sync` / `uv run` instead of manual venv + pip. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

sm_86 fails on non-Ampere GPUs (e.g. RTX 5090 is sm_120). Using CMAKE_CUDA_ARCHITECTURES=native lets CMake auto-detect the installed GPU so the build works on any supported card. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Adds per-prompt and summary results from RTX 5090 Laptop (sm_120, CUDA 13.2): HumanEval 87.30 tok/s 3.64×, GSM8K 70.92 tok/s 2.98×, Math500 72.97 tok/s 3.07×. Lower absolute AR (~24 vs ~38 tok/s) due to laptop power limits; speedup ratio holds and HumanEval improves to 3.64× at AL 8.49. Also adds `datasets` to pyproject.toml (required by bench_llm.py) and updates the Reproducibility section to use `uv run`. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Wire TQ3_0 (TurboQuant 3.5bpv) into dflash's custom graph builder, enabling 22% KV memory reduction vs Q4_0 with identical decode speed. Changes: - Add DFLASH27B_KV_TQ3 env override (--kv-tq3 flag in run.py) - Store kv_k_type in TargetCache for downstream use - Pad cache allocation to 256-aligned for TQ3_0 FA stride requirements - Apply FWHT rotation to Q before FA, un-rotate output from V space - Pad kv_len_padded to 256-aligned for TQ3_0 (FA vec kernel requirement) - Update test_dflash stride padding for TQ3_0 Requires llama.cpp submodule with TQ3_0 support (see Luce-Org/llama.cpp PR #1)

PR #1 on Luce-Org/llama.cpp-dflash-ggml (TQ3_0 KV cache) merged at 1823460. Bump from feature branch tip 1372283 to canonical merge commit so hub tracks luce-dflash head. Co-Authored-By: WOZCODE <contact@withwoz.com>

- server.py / server_tools.py auto-enable DFLASH27B_KV_TQ3 (was Q4) for max_ctx > 6144; explicit DFLASH27B_KV_Q4=1 still wins. - Quickstart 128K example and bench_daemon docstring switched to TQ3. - README narrative bumps long-context ceiling to 256K per PR #1 on llama.cpp-dflash-ggml (TQ3_0 = 3.5 bpv, 22% smaller than Q4_0). - Remove "TurboQuant KV cache" from Contributing roadmap (shipped). Behavior change: servers auto-enable path previously defaulted to Q4_0 and now defaults to TQ3_0 above the 6144 context threshold. Co-Authored-By: WOZCODE <contact@withwoz.com>

TQ3_0 KV cache raises the RTX 3090 ceiling to 262144 tokens per PR #1 on Luce-Org/llama.cpp-dflash-ggml. The README quickstart example still showed `up to 131072` and was labeled "128K context mode:". Co-Authored-By: WOZCODE <contact@withwoz.com>

Persistent daemon shipped in Luce-Org#7 (feat(dflash): implement persistent daemon mode for server.py). The bullet under Scope and limits was still claiming per-request respawn and ~10 s first-token latency. Co-Authored-By: WOZCODE <contact@withwoz.com>

Co-Authored-By: WOZCODE <contact@withwoz.com>

…ADME TQ3_0 (3.5 bpv) raises the 24 GB RTX 3090 ceiling from 128K to 256K per PR #1 on Luce-Org/llama.cpp-dflash-ggml. Keep the 134.78 tok/s Q4_0 benchmark as a historical reference point at 128K. Co-Authored-By: WOZCODE <contact@withwoz.com>

Experiments and benchmarks remain RTX 3090 (Ampere). README now documents that dflash builds on RTX 5090 (sm_120, CUDA 12.8+) and GB10 / DGX Spark / Jetson Thor (sm_121, CUDA 12.9+) with no source changes, since dflash/CMakeLists.txt already auto-adds those archs. - Drop -DCMAKE_CUDA_ARCHITECTURES=86 from dflash quickstart so CMake auto-selection actually kicks in for newer GPUs. - Add 'Running on other GPUs' subsection with compat table, verify snippet, DGX Spark quick start, and callouts for what will NOT auto-port (DDTree budget=22, Q4_0 KV ring, perf numbers). - Rewrite Requirements with per-arch minimum CUDA and a megakernel porting note (edit sm_XX + NUM_BLOCKS in setup.py, one block per SM).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

TQ3_0 costs ~0.7 AL and ~10 tok/s vs default at short contexts. Notes that a long-context sweep comparing TQ3 vs Q4_0 on the 5090 has not been run yet. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…hon3 These doc changes belong in the server/uv PR (Luce-Org#43), not in the hardware support PR. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

davide221 · 2026-04-28T08:45:32Z

@easel thanks for the contribution! I want to pinpoint that as the number of projects growth we will centralize the server management outside of any specific project folder in the next weeks.

CUDA 13.2 resolves CMAKE_CUDA_ARCHITECTURES=native to sm_120a on Blackwell machines, but consumer RTX 5090 (SM 12.0) lacks FP4 tensor cores and faults with CUDA_ERROR_ILLEGAL_INSTRUCTION when running sm_120a kernels. Fix: query nvidia-smi at cmake configure time to get the exact compute capability (e.g. "12.0" → "120") and set CMAKE_CUDA_ARCHITECTURES before add_subdirectory so ggml-cuda inherits the correct value. Also sets GGML_CUDA_BLACKWELL_CONSUMER=ON for SM 12.x targets, which (in the updated submodule) skips ggml's 12X→12Xa arch replacement and excludes mmq FP4 kernel instances that require sm_120a. Falls back to the compiler-version-based arch list when nvidia-smi is absent (CI, headless) so builds without a GPU still work. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…p script Add RESULTS.md with Q4_0 vs TQ3_0 long-context benchmarks (up to 256K tokens) on RTX 5090 Laptop GPU, plus bench_long_ctx.py sweep script, and README updates noting hardware requirements and benchmark methodology. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…hon3 These doc changes belong in the server/uv PR (Luce-Org#43), not in the hardware support PR. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…setup Replace ad-hoc scripts/server*.py with a proper dflash Python package: - src/dflash/server/ with modular parsing, schemas, and OpenAI-compatible API - tests/ with pytest coverage for parsing and server endpoints - pyproject.toml with build-system, entry point (dflash-server), and uv.lock Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- 'none': suppresses tools from apply_chat_template so model never sees them - 'required' / named-function dict: forwarded as tool_choice kwarg to the Qwen3.x chat template, which adds forcing instructions to the prompt - any other value: returns 400 {error:{code:'unsupported_parameter',param:'tool_choice'}} instead of silently ignoring it and burning the full token budget Fixes the conformance gap reported in lucebox-tool-support-2026-04-27.md. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ints Streams 8 typical code-agent prompts through /v1/chat/completions and reports TTFT, decode tok/s, and completion token count per prompt. Configurable URL so the same script can compare dflash vs LM Studio or any other endpoint. uv run scripts/bench_server.py # dflash :1236 uv run scripts/bench_server.py --url http://host:1234 # LM Studio uv run scripts/bench_server.py --repeat 3 --n-gen 256 # averaged Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

--target now defaults to the highest-version Qwen*.gguf found under models/, so dropping in a new model file (e.g. Qwen3.6) is picked up without a flag. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Re-resolved by uv; updates platform/python markers on cuda-bindings and nvidia-* optional deps. No version changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Track overall_tok_s = n_tok / total_wall alongside decode_tok_s - TTFT now fires on reasoning_content delta (Qwen3 thinking prefix) instead of only on content, fixing 0 tok/s reports in thinking mode - Remove enable_thinking:false override so both servers run equivalently Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

bench_server.py was 8 hand-written ~100-token code-agent prompts — useful for short-prompt decode tok/s but not representative of real agentic workloads (5K–12K token prompts, prefill-dominant). --replay loads a ddx-style sessions.jsonl (each row has a top-level `prompt` field) and reissues each as a single streaming chat completion against the configured server. Output adds a p_chars column and per-bucket aggregation (small <2K / medium 2K–8K / large >8K) so prefill-vs-decode trends are visible. Counterpart ddx beads filed for trace-schema gaps: - ddx-b3b9501a: latency_ms field is mislabeled (= cumulative elapsed_ms) - ddx-969c5500: agent-logs schema is metadata-only; needs verbose mode Counterpart agent beads filed: - agent-a8915e01: RotatingKVCache Quantization NYI not classified - agent-e5f0b894: reasoning-stall errors should be structured Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…PError mid-stream The 8b7ad4e refactor added overall_tok_s to ProbResult but missed the URLError handler, so any mid-stream provider failure crashed with TypeError. Also catch generic Exception so 502 / connection-reset errors become a row-level error rather than a fatal crash. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ctor) PR Luce-Org#13 (upstream) lowered the daemon default from 131072 → 16384 to avoid the FA-stride / VRAM-cliff trap documented in issue Luce-Org#10. The package refactor in 1a86289 inadvertently restored 131072 when moving the CLI into src/dflash/server/__init__.py. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…pt replay The previous --replay mode treated each session's prompt as a single one-shot LLM call. That's wrong: real agentic sessions accumulate tool results turn-over-turn, growing per-call input from ~5K chars at turn 1 to 60K-300K chars by the final turn. A first-turn-only replay understates the real workload by an order of magnitude. Synthetic tool results were the wrong fix — fidelity matters here. This commit drops the synthetic path entirely and adds --transcript: load a recorded Claude Code session jsonl from ~/.claude/projects/<workspace>/<uuid>.jsonl, walk it call by call, send the exact message prefix that was originally sent at each point, and measure TTFT + decode tok/s. Tool I/O comes from the recording so every server under test sees an identical per-call input distribution. uv run scripts/bench_server.py --transcript path/to/session.jsonl Repeat --transcript to bench multiple sessions. --max-calls caps work. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Introduce the text-only agent-code-text serving profile with 48K context, prompt admission limits, reserved generation headroom, safe default env wiring, and KV cache type passthroughs. Reject multimodal content for the profile and keep prefix cache disabled. Document the deferred prefix-cache design and implementation plan, and cover the profile behavior with server tests.

# Conflicts: # CONTRIBUTING.md # README.md # dflash/README.md # dflash/RESULTS.md # dflash/pyproject.toml # dflash/scripts/run.py # dflash/scripts/server.py # dflash/scripts/server_tools.py # dflash/scripts/setup_system.sh # dflash/src/internal.h # dflash/src/qwen35_target_graph.cpp # dflash/test/test_dflash.cpp

easel · 2026-05-05T13:49:30Z

@easel thanks for the contribution! I want to pinpoint that as the number of projects growth we will centralize the server management outside of any specific project folder in the next weeks.

LMK how I can help out here. I've got some notes in a README.md regarding the packages to install and have this pyproject.toml as well. I'd definitely recommend going with uv. I'm going to close this out for now to avoid the clutter but happen to re-wire it.

Closes two of the three feature gaps between dflash/scripts/server.py (Python, reference impl) and dflash/src/server/dflash_server (C++, production runtime), as outlined in the migration plan. ## /props introspection Wire /props in http_server.cpp::handle_client. JSON shape matches server.py:1221-1312 key-for-key so cross-server consumers (autotune sweeps, dashboards, lucebox profile/snapshot) see a stable contract. ServerConfig grows the introspection inputs (arch, model_path, draft_path, kv_cache_k/v, runtime_backend, fa_window, ddtree_budget, speculative_enabled, target_sharding, tokenizer_id) populated by server_main before HttpServer construction. PrefixCache and ToolMemory gain stats() / full_stats() accessors with the same lockless-snapshot semantics the Python impl documents — a mutation under daemon_lock can tear in_use vs lifetime_hits across the read pair; acceptable for /props. PROPS_SCHEMA = 1 (matches Python's current schema). Bump only on breaking changes: field renamed, removed, or semantics-changed. ## /v1/messages/count_tokens Reuses the Anthropic message-parsing path from /v1/messages, short- circuits after tokenization with {"input_tokens": N}. <1s on a hot server (no generation). ## Tests Integration coverage in dflash/scripts/test_server_integration.py: - TestProps: top-level keys, server block shape, speculative_mode consistency, runtime backend, arch-gated capabilities, API endpoint registry, prefix_cache/tool_replay shapes. - TestCountTokens: simple count, scaling with message length, system block handling, <1s budget assertion. Thinking-budget surface (--think-max-tokens flag, finish_details, thinking_opt_in tracking) is the second PR in this migration — intentionally separated so the algorithm-design review (Level 1/2/3 fidelity question) doesn't block /props from landing. Validated via docker buildx bake cuda12 — image Luce-Org#43 built clean with CUDA layers cached. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…_details) Closes the third feature gap in the Python→C++ server migration: the thinking-budget wire surface so consumers (lucebox bench, dashboards) can opt in to the envelope and see the close-info block on responses. ## What ships - `--think-max-tokens` CLI flag (default 10000): cap on the phase-1 reasoning generation when a request opts in. - `--default-max-tokens` CLI flag (default 16000, matches antirez/ds4 ds4_eval.c's `.max_tokens`): combined cap used when a request omits max_tokens. - `thinking: {type: "enabled"}` is now tracked as a presence-opt-in (`ParsedRequest.thinking_opt_in`) so the server can condition response shape on it. - `finish_details` block on /v1/chat/completions responses when thinking is opted in. Fields match docs/specs/thinking-budget.md:43-58 and server.py:2272 (close_kind, thinking_tokens, content_tokens, total_tokens). ## What's deferred (intentional) The phase-1/phase-2 reprompt MECHANISM (port of server.py:2141-2196) is not in this PR. close_kind always reports "natural" for now — the C++ server doesn't yet force-close on hard_limit. The Python server's existing behavior continues to be the reference impl. Why ship the surface first: - Unblocks consumers (the lucebox bench can stop sending custom envelope fields and just send standard OpenAI + this opt-in). - Lets us land /props first without algorithm-design review on this PR. - The Level 1 vs Level 2 vs Level 3 fidelity question (phase-1/phase-2 reprompt vs true mid-stream force-close vs full eval_think_close_info reporting) is a separate design conversation — the wire shape stays the same regardless of which Level lands. ## Tests Integration coverage in test_server_integration.py: - TestThinkingBudget.test_finish_details_present_when_thinking_opted_in: asserts the block is emitted with valid types when `thinking:{type:enabled}` is sent; invariant `thinking_tokens + content_tokens == total_tokens`. - TestThinkingBudget.test_finish_details_absent_when_thinking_not_opted_in: asserts the block is NOT emitted on a plain chat request, matching Python's server.py:2271 conditional. ## Follow-ups (separate PRs) - Level 1 phase-1/phase-2 reprompt: port server.py:2141-2196 into worker_loop. Sets close_kind="hard" when phase-1 didn't emit </think> within think_max_tokens. ~1 day. - Level 2 true in-process force-close: backend-level sampler hook, matches ds4_eval.c:3027-3056 hard_limit_reply_budget semantics. Adds --soft-limit-reply-budget / --hard-limit-reply-budget flags. Closes the OpenRouter→native pass-rate gap (~20 pts). ~2-3 days. Validated via docker buildx bake cuda12 — image Luce-Org#43 built clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

easel and others added 15 commits April 24, 2026 14:02

docs: replace huggingface-cli with hf

0213af0

huggingface-cli is deprecated; the current CLI is `hf`. Updates all three occurrences in README.md, the setup script log messages, and the CONTRIBUTING.md dependency table. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

chore(deps): bump llama.cpp submodule to luce-dflash merge tip

79aecbf

PR #1 on Luce-Org/llama.cpp-dflash-ggml (TQ3_0 KV cache) merged at 1823460. Bump from feature branch tip 1372283 to canonical merge commit so hub tracks luce-dflash head. Co-Authored-By: WOZCODE <contact@withwoz.com>

docs(dflash): drop Q5_K_M / Q6_K target roadmap bullet

34916da

Co-Authored-By: WOZCODE <contact@withwoz.com>

Update README.md

bf2275a

docs: add RTX 5090 / sm_120 to requirements and build comment

81701a8

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

results: add TQ3_0 short-context numbers for RTX 5090 Laptop

a33988f

TQ3_0 costs ~0.7 AL and ~10 tok/s vs default at short contexts. Notes that a long-context sweep comparing TQ3 vs Q4_0 on the 5090 has not been run yet. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

easel changed the title ~~build(dflash): uv setup, RTX 5090 results, long-context TQ3 sweep~~ build(dflash): uv setup, RTX 5090 results, long-context KV sweep Apr 27, 2026

easel force-pushed the feat/setup-results-uv branch 2 times, most recently from f7a2a4a to 533c025 Compare April 27, 2026 20:45

easel mentioned this pull request Apr 27, 2026

feat(server): derive model name from GGUF; default port 1236, ctx 128K #47

Closed

3 tasks

easel changed the title ~~build(dflash): uv setup, RTX 5090 results, long-context KV sweep~~ refactor(server): proper Python package with tool-calling, tests, uv setup Apr 27, 2026

easel and others added 8 commits April 28, 2026 20:25

docs(dflash): revert hf/uv-run command changes to huggingface-cli/pyt…

e6dc0cd

…hon3 These doc changes belong in the server/uv PR (Luce-Org#43), not in the hardware support PR. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

test(server): add streaming coverage for tool_choice forwarding

a066f68

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(server): auto-detect target GGUF in models/

25cb6f7

--target now defaults to the highest-version Qwen*.gguf found under models/, so dropping in a new model file (e.g. Qwen3.6) is picked up without a flag. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

easel and others added 2 commits April 28, 2026 20:27

chore: refresh uv.lock dependency markers

09d1179

Re-resolved by uv; updates platform/python markers on cuda-bindings and nvidia-* optional deps. No version changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

easel force-pushed the feat/setup-results-uv branch from 5b1d56e to 8b7ad4e Compare April 29, 2026 00:30

easel and others added 6 commits April 28, 2026 20:38

easel closed this May 5, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor(server): proper Python package with tool-calling, tests, uv setup#43

refactor(server): proper Python package with tool-calling, tests, uv setup#43
easel wants to merge 31 commits into
Luce-Org:mainfrom
easel:feat/setup-results-uv

easel commented Apr 27, 2026 •

edited

Loading

Uh oh!

davide221 commented Apr 28, 2026

Uh oh!

easel commented May 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

easel commented Apr 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Uh oh!

davide221 commented Apr 28, 2026

Uh oh!

easel commented May 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

easel commented Apr 27, 2026 •

edited

Loading