refactor(server): proper Python package with tool-calling, tests, uv setup#43
Closed
easel wants to merge 31 commits into
Closed
refactor(server): proper Python package with tool-calling, tests, uv setup#43easel wants to merge 31 commits into
easel wants to merge 31 commits into
Conversation
huggingface-cli is deprecated; the current CLI is `hf`. Updates all three occurrences in README.md, the setup script log messages, and the CONTRIBUTING.md dependency table. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds pyproject.toml declaring all script dependencies (transformers, numpy, gguf, fastapi, uvicorn, jinja2, pytest, httpx) with torch gated behind an [oracle] optional. Use `uv sync` to install. Replaces pipx with uv in setup_system.sh: installs uv via the official astral.sh installer for $SUDO_USER, then uses `uv tool install` for hf. Updates README quick-start server block and CONTRIBUTING.md to use `uv sync` / `uv run` instead of manual venv + pip. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
sm_86 fails on non-Ampere GPUs (e.g. RTX 5090 is sm_120). Using CMAKE_CUDA_ARCHITECTURES=native lets CMake auto-detect the installed GPU so the build works on any supported card. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds per-prompt and summary results from RTX 5090 Laptop (sm_120, CUDA 13.2): HumanEval 87.30 tok/s 3.64×, GSM8K 70.92 tok/s 2.98×, Math500 72.97 tok/s 3.07×. Lower absolute AR (~24 vs ~38 tok/s) due to laptop power limits; speedup ratio holds and HumanEval improves to 3.64× at AL 8.49. Also adds `datasets` to pyproject.toml (required by bench_llm.py) and updates the Reproducibility section to use `uv run`. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Wire TQ3_0 (TurboQuant 3.5bpv) into dflash's custom graph builder, enabling 22% KV memory reduction vs Q4_0 with identical decode speed. Changes: - Add DFLASH27B_KV_TQ3 env override (--kv-tq3 flag in run.py) - Store kv_k_type in TargetCache for downstream use - Pad cache allocation to 256-aligned for TQ3_0 FA stride requirements - Apply FWHT rotation to Q before FA, un-rotate output from V space - Pad kv_len_padded to 256-aligned for TQ3_0 (FA vec kernel requirement) - Update test_dflash stride padding for TQ3_0 Requires llama.cpp submodule with TQ3_0 support (see Luce-Org/llama.cpp PR #1)
PR #1 on Luce-Org/llama.cpp-dflash-ggml (TQ3_0 KV cache) merged at 1823460. Bump from feature branch tip 1372283 to canonical merge commit so hub tracks luce-dflash head. Co-Authored-By: WOZCODE <contact@withwoz.com>
- server.py / server_tools.py auto-enable DFLASH27B_KV_TQ3 (was Q4) for max_ctx > 6144; explicit DFLASH27B_KV_Q4=1 still wins. - Quickstart 128K example and bench_daemon docstring switched to TQ3. - README narrative bumps long-context ceiling to 256K per PR #1 on llama.cpp-dflash-ggml (TQ3_0 = 3.5 bpv, 22% smaller than Q4_0). - Remove "TurboQuant KV cache" from Contributing roadmap (shipped). Behavior change: servers auto-enable path previously defaulted to Q4_0 and now defaults to TQ3_0 above the 6144 context threshold. Co-Authored-By: WOZCODE <contact@withwoz.com>
TQ3_0 KV cache raises the RTX 3090 ceiling to 262144 tokens per PR #1 on Luce-Org/llama.cpp-dflash-ggml. The README quickstart example still showed `up to 131072` and was labeled "128K context mode:". Co-Authored-By: WOZCODE <contact@withwoz.com>
Persistent daemon shipped in Luce-Org#7 (feat(dflash): implement persistent daemon mode for server.py). The bullet under Scope and limits was still claiming per-request respawn and ~10 s first-token latency. Co-Authored-By: WOZCODE <contact@withwoz.com>
Co-Authored-By: WOZCODE <contact@withwoz.com>
…ADME TQ3_0 (3.5 bpv) raises the 24 GB RTX 3090 ceiling from 128K to 256K per PR #1 on Luce-Org/llama.cpp-dflash-ggml. Keep the 134.78 tok/s Q4_0 benchmark as a historical reference point at 128K. Co-Authored-By: WOZCODE <contact@withwoz.com>
Experiments and benchmarks remain RTX 3090 (Ampere). README now documents that dflash builds on RTX 5090 (sm_120, CUDA 12.8+) and GB10 / DGX Spark / Jetson Thor (sm_121, CUDA 12.9+) with no source changes, since dflash/CMakeLists.txt already auto-adds those archs. - Drop -DCMAKE_CUDA_ARCHITECTURES=86 from dflash quickstart so CMake auto-selection actually kicks in for newer GPUs. - Add 'Running on other GPUs' subsection with compat table, verify snippet, DGX Spark quick start, and callouts for what will NOT auto-port (DDTree budget=22, Q4_0 KV ring, perf numbers). - Rewrite Requirements with per-arch minimum CUDA and a megakernel porting note (edit sm_XX + NUM_BLOCKS in setup.py, one block per SM).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
TQ3_0 costs ~0.7 AL and ~10 tok/s vs default at short contexts. Notes that a long-context sweep comparing TQ3 vs Q4_0 on the 5090 has not been run yet. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
f7a2a4a to
533c025
Compare
3 tasks
easel
added a commit
to easel/lucebox-hub
that referenced
this pull request
Apr 27, 2026
…hon3 These doc changes belong in the server/uv PR (Luce-Org#43), not in the hardware support PR. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Contributor
|
@easel thanks for the contribution! I want to pinpoint that as the number of projects growth we will centralize the server management outside of any specific project folder in the next weeks. |
CUDA 13.2 resolves CMAKE_CUDA_ARCHITECTURES=native to sm_120a on Blackwell machines, but consumer RTX 5090 (SM 12.0) lacks FP4 tensor cores and faults with CUDA_ERROR_ILLEGAL_INSTRUCTION when running sm_120a kernels. Fix: query nvidia-smi at cmake configure time to get the exact compute capability (e.g. "12.0" → "120") and set CMAKE_CUDA_ARCHITECTURES before add_subdirectory so ggml-cuda inherits the correct value. Also sets GGML_CUDA_BLACKWELL_CONSUMER=ON for SM 12.x targets, which (in the updated submodule) skips ggml's 12X→12Xa arch replacement and excludes mmq FP4 kernel instances that require sm_120a. Falls back to the compiler-version-based arch list when nvidia-smi is absent (CI, headless) so builds without a GPU still work. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…p script Add RESULTS.md with Q4_0 vs TQ3_0 long-context benchmarks (up to 256K tokens) on RTX 5090 Laptop GPU, plus bench_long_ctx.py sweep script, and README updates noting hardware requirements and benchmark methodology. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…hon3 These doc changes belong in the server/uv PR (Luce-Org#43), not in the hardware support PR. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…setup Replace ad-hoc scripts/server*.py with a proper dflash Python package: - src/dflash/server/ with modular parsing, schemas, and OpenAI-compatible API - tests/ with pytest coverage for parsing and server endpoints - pyproject.toml with build-system, entry point (dflash-server), and uv.lock Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- 'none': suppresses tools from apply_chat_template so model never sees them
- 'required' / named-function dict: forwarded as tool_choice kwarg to the
Qwen3.x chat template, which adds forcing instructions to the prompt
- any other value: returns 400 {error:{code:'unsupported_parameter',param:'tool_choice'}}
instead of silently ignoring it and burning the full token budget
Fixes the conformance gap reported in lucebox-tool-support-2026-04-27.md.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ints
Streams 8 typical code-agent prompts through /v1/chat/completions and reports
TTFT, decode tok/s, and completion token count per prompt. Configurable URL
so the same script can compare dflash vs LM Studio or any other endpoint.
uv run scripts/bench_server.py # dflash :1236
uv run scripts/bench_server.py --url http://host:1234 # LM Studio
uv run scripts/bench_server.py --repeat 3 --n-gen 256 # averaged
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
--target now defaults to the highest-version Qwen*.gguf found under models/, so dropping in a new model file (e.g. Qwen3.6) is picked up without a flag. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Re-resolved by uv; updates platform/python markers on cuda-bindings and nvidia-* optional deps. No version changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Track overall_tok_s = n_tok / total_wall alongside decode_tok_s - TTFT now fires on reasoning_content delta (Qwen3 thinking prefix) instead of only on content, fixing 0 tok/s reports in thinking mode - Remove enable_thinking:false override so both servers run equivalently Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
5b1d56e to
8b7ad4e
Compare
bench_server.py was 8 hand-written ~100-token code-agent prompts — useful for short-prompt decode tok/s but not representative of real agentic workloads (5K–12K token prompts, prefill-dominant). --replay loads a ddx-style sessions.jsonl (each row has a top-level `prompt` field) and reissues each as a single streaming chat completion against the configured server. Output adds a p_chars column and per-bucket aggregation (small <2K / medium 2K–8K / large >8K) so prefill-vs-decode trends are visible. Counterpart ddx beads filed for trace-schema gaps: - ddx-b3b9501a: latency_ms field is mislabeled (= cumulative elapsed_ms) - ddx-969c5500: agent-logs schema is metadata-only; needs verbose mode Counterpart agent beads filed: - agent-a8915e01: RotatingKVCache Quantization NYI not classified - agent-e5f0b894: reasoning-stall errors should be structured Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…PError mid-stream The 8b7ad4e refactor added overall_tok_s to ProbResult but missed the URLError handler, so any mid-stream provider failure crashed with TypeError. Also catch generic Exception so 502 / connection-reset errors become a row-level error rather than a fatal crash. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ctor) PR Luce-Org#13 (upstream) lowered the daemon default from 131072 → 16384 to avoid the FA-stride / VRAM-cliff trap documented in issue Luce-Org#10. The package refactor in 1a86289 inadvertently restored 131072 when moving the CLI into src/dflash/server/__init__.py. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…pt replay The previous --replay mode treated each session's prompt as a single one-shot LLM call. That's wrong: real agentic sessions accumulate tool results turn-over-turn, growing per-call input from ~5K chars at turn 1 to 60K-300K chars by the final turn. A first-turn-only replay understates the real workload by an order of magnitude. Synthetic tool results were the wrong fix — fidelity matters here. This commit drops the synthetic path entirely and adds --transcript: load a recorded Claude Code session jsonl from ~/.claude/projects/<workspace>/<uuid>.jsonl, walk it call by call, send the exact message prefix that was originally sent at each point, and measure TTFT + decode tok/s. Tool I/O comes from the recording so every server under test sees an identical per-call input distribution. uv run scripts/bench_server.py --transcript path/to/session.jsonl Repeat --transcript to bench multiple sessions. --max-calls caps work. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Introduce the text-only agent-code-text serving profile with 48K context, prompt admission limits, reserved generation headroom, safe default env wiring, and KV cache type passthroughs. Reject multimodal content for the profile and keep prefix cache disabled. Document the deferred prefix-cache design and implementation plan, and cover the profile behavior with server tests.
# Conflicts: # CONTRIBUTING.md # README.md # dflash/README.md # dflash/RESULTS.md # dflash/pyproject.toml # dflash/scripts/run.py # dflash/scripts/server.py # dflash/scripts/server_tools.py # dflash/scripts/setup_system.sh # dflash/src/internal.h # dflash/src/qwen35_target_graph.cpp # dflash/test/test_dflash.cpp
Collaborator
Author
LMK how I can help out here. I've got some notes in a README.md regarding the packages to install and have this pyproject.toml as well. I'd definitely recommend going with uv. I'm going to close this out for now to avoid the clutter but happen to re-wire it. |
easel
added a commit
to easel/lucebox-hub
that referenced
this pull request
May 22, 2026
Closes two of the three feature gaps between dflash/scripts/server.py
(Python, reference impl) and dflash/src/server/dflash_server (C++,
production runtime), as outlined in the migration plan.
## /props introspection
Wire /props in http_server.cpp::handle_client. JSON shape matches
server.py:1221-1312 key-for-key so cross-server consumers (autotune
sweeps, dashboards, lucebox profile/snapshot) see a stable contract.
ServerConfig grows the introspection inputs (arch, model_path,
draft_path, kv_cache_k/v, runtime_backend, fa_window, ddtree_budget,
speculative_enabled, target_sharding, tokenizer_id) populated by
server_main before HttpServer construction.
PrefixCache and ToolMemory gain stats() / full_stats() accessors with
the same lockless-snapshot semantics the Python impl documents — a
mutation under daemon_lock can tear in_use vs lifetime_hits across the
read pair; acceptable for /props.
PROPS_SCHEMA = 1 (matches Python's current schema). Bump only on
breaking changes: field renamed, removed, or semantics-changed.
## /v1/messages/count_tokens
Reuses the Anthropic message-parsing path from /v1/messages, short-
circuits after tokenization with {"input_tokens": N}. <1s on a hot
server (no generation).
## Tests
Integration coverage in dflash/scripts/test_server_integration.py:
- TestProps: top-level keys, server block shape, speculative_mode
consistency, runtime backend, arch-gated capabilities, API endpoint
registry, prefix_cache/tool_replay shapes.
- TestCountTokens: simple count, scaling with message length, system
block handling, <1s budget assertion.
Thinking-budget surface (--think-max-tokens flag, finish_details,
thinking_opt_in tracking) is the second PR in this migration —
intentionally separated so the algorithm-design review (Level 1/2/3
fidelity question) doesn't block /props from landing.
Validated via docker buildx bake cuda12 — image Luce-Org#43 built clean with
CUDA layers cached.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
easel
added a commit
to easel/lucebox-hub
that referenced
this pull request
May 22, 2026
…_details)
Closes the third feature gap in the Python→C++ server migration: the
thinking-budget wire surface so consumers (lucebox bench, dashboards)
can opt in to the envelope and see the close-info block on responses.
## What ships
- `--think-max-tokens` CLI flag (default 10000): cap on the phase-1
reasoning generation when a request opts in.
- `--default-max-tokens` CLI flag (default 16000, matches
antirez/ds4 ds4_eval.c's `.max_tokens`): combined cap used when a
request omits max_tokens.
- `thinking: {type: "enabled"}` is now tracked as a presence-opt-in
(`ParsedRequest.thinking_opt_in`) so the server can condition
response shape on it.
- `finish_details` block on /v1/chat/completions responses when
thinking is opted in. Fields match docs/specs/thinking-budget.md:43-58
and server.py:2272 (close_kind, thinking_tokens, content_tokens,
total_tokens).
## What's deferred (intentional)
The phase-1/phase-2 reprompt MECHANISM (port of server.py:2141-2196)
is not in this PR. close_kind always reports "natural" for now —
the C++ server doesn't yet force-close on hard_limit. The Python
server's existing behavior continues to be the reference impl.
Why ship the surface first:
- Unblocks consumers (the lucebox bench can stop sending custom
envelope fields and just send standard OpenAI + this opt-in).
- Lets us land /props first without algorithm-design review on this PR.
- The Level 1 vs Level 2 vs Level 3 fidelity question (phase-1/phase-2
reprompt vs true mid-stream force-close vs full eval_think_close_info
reporting) is a separate design conversation — the wire shape stays
the same regardless of which Level lands.
## Tests
Integration coverage in test_server_integration.py:
- TestThinkingBudget.test_finish_details_present_when_thinking_opted_in:
asserts the block is emitted with valid types when
`thinking:{type:enabled}` is sent; invariant
`thinking_tokens + content_tokens == total_tokens`.
- TestThinkingBudget.test_finish_details_absent_when_thinking_not_opted_in:
asserts the block is NOT emitted on a plain chat request, matching
Python's server.py:2271 conditional.
## Follow-ups (separate PRs)
- Level 1 phase-1/phase-2 reprompt: port server.py:2141-2196 into
worker_loop. Sets close_kind="hard" when phase-1 didn't emit
</think> within think_max_tokens. ~1 day.
- Level 2 true in-process force-close: backend-level sampler hook,
matches ds4_eval.c:3027-3056 hard_limit_reply_budget semantics.
Adds --soft-limit-reply-budget / --hard-limit-reply-budget flags.
Closes the OpenRouter→native pass-rate gap (~20 pts). ~2-3 days.
Validated via docker buildx bake cuda12 — image Luce-Org#43 built clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
scripts/server*.pywith a properdflashPython package undersrc/dflash/server/pyproject.tomlwith[build-system],dflash-serverentry point, anduv.lockscripts/server.py,scripts/server_tools.py,scripts/test_server.pyBase: PR #48 (
fix/consumer-blackwell-auto-detect) — RTX 5090 hardware supportTest plan
uv sync—dflash-serverentry point installs without warningspytest dflash/tests/passesdflash-server --model <gguf>starts and serves/v1/chat/completions🤖 Generated with Claude Code