feat(dflash): daemon scripts improvements (GPU split, Windows, defaults)#139
feat(dflash): daemon scripts improvements (GPU split, Windows, defaults)#139javierpazo wants to merge 1 commit into
Conversation
Bundle of small daemon-script improvements that piled up while
running Lucebox in production. The cleanup is net negative
(+260 / -583): most of it removes dead branches and stale defaults.
Per CONTRIBUTING ("one concern per PR"): I tried to split this into
three PRs but the changes are entangled across the same hunks of
`server.py` / `server_tools.py` (e.g. argparse table changed in one
place to add `--target-gpu`, `--prompt-file` and the new defaults
together). I kept it as one cohesive PR. **Happy to split into
three sequential PRs if you prefer**, listed below in the order I
would split them — let me know:
A. feat(dflash): add --target-gpu / --draft-gpu daemon flags
- server.py / server_tools.py: launcher flags exposed
- Pins CUDA_VISIBLE_DEVICES per process and translates back
to native ordinals when invoking test_dflash. So
`--target-gpu=1 --draft-gpu=0` produces
CUDA_VISIBLE_DEVICES=1,0 with ordinals 0/1 inside.
- DFLASH_TARGET_GPU / DFLASH_DRAFT_GPU env vars also honoured.
- Complements PR Luce-Org#122 (open, weicj) which adds the same
selection in the Python dual-GPU bench harness; this is the
daemon side.
B. chore(dflash): Windows-friendly daemon scripts
- server.py / server_tools.py / run.py / _prefill_hook.py:
argparse plumbing supports `--prompt-file <path>` so prompts
with newlines / non-ASCII work without fragile shell quoting.
- tokenize_prompt.py / detokenize.py: minor UTF-8 path fix.
- bench_he.py / bench_llm.py: tokenizer override exposed
(line with merged PR Luce-Org#93).
C. feat(dflash): default to Qwen3.6-27B target + DFlash drafter
- server.py / server_tools.py: when neither `--target-model`
nor `--draft-model` is given, defaults pick the Qwen3.6-27B
target + DFlash drafter with Luce-Org#79-style metadata.
- Behaviour is unchanged for any invocation that already
passes the flags explicitly.
D. (already covered by PR Luce-Org#135)
- --target-cache-slots and --draft-feature-mirror are exposed
to the daemon launcher; this is a small tail of Luce-Org#135 and
not strictly part of this PR's scope. If Luce-Org#135 lands first,
these lines rebase cleanly; if this PR lands first, Luce-Org#135's
slot scheduler has the wiring it needs at the daemon level.
Validation:
- py_compile for all 8 scripts.
- End-to-end manual launch:
python dflash/scripts/server.py \
--target-gpu 1 --draft-gpu 0 \
--target-cache-slots 2 --stream-tagged \
--prompt-file prompts/long.txt
on Windows MSVC + CUDA 12.x, RTX 6000 Ada (sm_89) + RTX 4090.
Daemon comes up, both GPUs are pinned correctly, the prompt
file is read without escaping issues, slot 0 and slot 1 admit
requests independently.
- bench_he.py / bench_llm.py with a non-default tokenizer flag
matches the behaviour PR Luce-Org#93 enabled.
This is a correctness / config / UX change, not a kernel perf
change. No kernel timings claimed.
Verification vs existing community PRs:
- COMP-COMPL with Luce-Org#110 (merged, "fix Sparce attention and Qwen
loader on windows", touches `qwen3_0p6b_loader.cpp`).
- COMP-COMPL with Luce-Org#126 (merged, "support Windows in
DflashClient", touches `pflash/pflash/dflash_client.py`).
- COMP-COMPL with Luce-Org#93 (merged, "make target tokenizer
overrideable in bench_he and bench_llm").
- COMP-COMPL with Luce-Org#122 (open, weicj, "CUDA/HIP mixed backend
placement" bench harness).
- COMP-COMPL with Luce-Org#132 (open, server.py for codex tooling).
Some surface overlap on `server.py`; happy to rebase if Luce-Org#132
lands first.
- No prior art for the daemon-side --target-gpu/--draft-gpu
plumbing or for the Windows --prompt-file path at this layer.
Author: Javier Pazo <xabicasa@gmail.com>
There was a problem hiding this comment.
5 issues found across 8 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="dflash/scripts/_prefill_hook.py">
<violation number="1" location="dflash/scripts/_prefill_hook.py:150">
P1: Tab-separated `compress` output is incompatible with the daemon parser, which only matches `compress ` and parses space-separated args.</violation>
</file>
<file name="dflash/scripts/server_tools.py">
<violation number="1" location="dflash/scripts/server_tools.py:60">
P2: Hardcoding the default binary under `build/Release` can break startup for non-Windows single-config builds that produce `build/test_dflash` instead.</violation>
</file>
<file name="dflash/scripts/server.py">
<violation number="1" location="dflash/scripts/server.py:48">
P2: Hardcoding the default binary to `build/Release/...` breaks the documented default layout (`build/test_dflash`) and causes startup to exit immediately when users rely on the default `--bin` path.</violation>
<violation number="2" location="dflash/scripts/server.py:195">
P1: Removed Laguna/no-draft dispatch makes the server always require a draft and always start in DDTree mode, breaking documented non-spec targets.</violation>
<violation number="3" location="dflash/scripts/server.py:426">
P1: Sampling parameters are now accepted by the API but never forwarded to the daemon, so requests with non-default temperature/top_p/top_k silently fall back to greedy decoding.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
| keep_x1000 = int(round(cfg.keep_ratio * 1000)) | ||
| daemon_stdin.write( | ||
| f"compress {path} {keep_x1000} {cfg.drafter_gguf}\n".encode("utf-8")) | ||
| f"compress\t{path}\t{keep_x1000}\t{cfg.drafter_gguf}\n".encode("utf-8")) |
There was a problem hiding this comment.
P1: Tab-separated compress output is incompatible with the daemon parser, which only matches compress and parses space-separated args.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dflash/scripts/_prefill_hook.py, line 150:
<comment>Tab-separated `compress` output is incompatible with the daemon parser, which only matches `compress ` and parses space-separated args.</comment>
<file context>
@@ -153,23 +140,21 @@ def compress_text_via_daemon(
keep_x1000 = int(round(cfg.keep_ratio * 1000))
daemon_stdin.write(
- f"compress {path} {keep_x1000} {cfg.drafter_gguf}\n".encode("utf-8"))
+ f"compress\t{path}\t{keep_x1000}\t{cfg.drafter_gguf}\n".encode("utf-8"))
daemon_stdin.flush()
compressed_ids = _drain_until_sentinel(r_pipe)
</file context>
| f"compress\t{path}\t{keep_x1000}\t{cfg.drafter_gguf}\n".encode("utf-8")) | |
| f"compress {path} {keep_x1000} {cfg.drafter_gguf}\n".encode("utf-8")) |
| "--fast-rollback", "--ddtree", f"--ddtree-budget={budget}", | ||
| f"--max-ctx={max_ctx}", | ||
| f"--stream-fd={stream_fd_val}"] | ||
| cmd = [bin_abs, str(target), str(draft), "--daemon", |
There was a problem hiding this comment.
P1: Removed Laguna/no-draft dispatch makes the server always require a draft and always start in DDTree mode, breaking documented non-spec targets.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dflash/scripts/server.py, line 195:
<comment>Removed Laguna/no-draft dispatch makes the server always require a draft and always start in DDTree mode, breaking documented non-spec targets.</comment>
<file context>
@@ -300,25 +185,24 @@ def build_app(target: Path, draft: Path | None, bin_path: Path, budget: int, max
- "--fast-rollback", "--ddtree", f"--ddtree-budget={budget}",
- f"--max-ctx={max_ctx}",
- f"--stream-fd={stream_fd_val}"]
+ cmd = [bin_abs, str(target), str(draft), "--daemon",
+ "--fast-rollback", "--ddtree", f"--ddtree-budget={budget}",
+ f"--max-ctx={max_ctx}",
</file context>
| if snap_prep: | ||
| cmd_line += f" snap={snap_prep[1]}:{snap_prep[0]}" | ||
| return cmd_line + _samp_suffix(req) + "\n", snap_prep | ||
| return cmd_line + "\n", snap_prep |
There was a problem hiding this comment.
P1: Sampling parameters are now accepted by the API but never forwarded to the daemon, so requests with non-default temperature/top_p/top_k silently fall back to greedy decoding.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dflash/scripts/server.py, line 426:
<comment>Sampling parameters are now accepted by the API but never forwarded to the daemon, so requests with non-default temperature/top_p/top_k silently fall back to greedy decoding.</comment>
<file context>
@@ -552,7 +423,7 @@ def _build_cmd_line(req, cur_bin, cur_ids, gen_len, prefix_cache,
if snap_prep:
cmd_line += f" snap={snap_prep[1]}:{snap_prep[0]}"
- return cmd_line + _samp_suffix(req) + "\n", snap_prep
+ return cmd_line + "\n", snap_prep
def _gen_len_for(prompt_len: int, max_tokens: int) -> int:
</file context>
| )) | ||
| DEFAULT_DRAFT_ROOT = ROOT / "models" / "draft" | ||
| DEFAULT_BIN = ROOT / "build" / "test_dflash" | ||
| DEFAULT_BIN = ROOT / "build" / "Release" / ("test_dflash" + (".exe" if sys.platform == "win32" else "")) |
There was a problem hiding this comment.
P2: Hardcoding the default binary under build/Release can break startup for non-Windows single-config builds that produce build/test_dflash instead.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dflash/scripts/server_tools.py, line 60:
<comment>Hardcoding the default binary under `build/Release` can break startup for non-Windows single-config builds that produce `build/test_dflash` instead.</comment>
<file context>
@@ -56,11 +57,23 @@
))
DEFAULT_DRAFT_ROOT = ROOT / "models" / "draft"
-DEFAULT_BIN = ROOT / "build" / "test_dflash"
+DEFAULT_BIN = ROOT / "build" / "Release" / ("test_dflash" + (".exe" if sys.platform == "win32" else ""))
DEFAULT_BUDGET = 22
MODEL_NAME = "luce-dflash"
</file context>
| )) | ||
| DEFAULT_DRAFT_ROOT = ROOT / "models" / "draft" | ||
| DEFAULT_BIN = ROOT / "build" / ("test_dflash" + (".exe" if sys.platform == "win32" else "")) | ||
| DEFAULT_BIN = ROOT / "build" / "Release" / ("test_dflash" + (".exe" if sys.platform == "win32" else "")) |
There was a problem hiding this comment.
P2: Hardcoding the default binary to build/Release/... breaks the documented default layout (build/test_dflash) and causes startup to exit immediately when users rely on the default --bin path.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dflash/scripts/server.py, line 48:
<comment>Hardcoding the default binary to `build/Release/...` breaks the documented default layout (`build/test_dflash`) and causes startup to exit immediately when users rely on the default `--bin` path.</comment>
<file context>
@@ -46,22 +45,25 @@
))
DEFAULT_DRAFT_ROOT = ROOT / "models" / "draft"
-DEFAULT_BIN = ROOT / "build" / ("test_dflash" + (".exe" if sys.platform == "win32" else ""))
+DEFAULT_BIN = ROOT / "build" / "Release" / ("test_dflash" + (".exe" if sys.platform == "win32" else ""))
DEFAULT_BUDGET = 22
MODEL_NAME = "luce-dflash"
</file context>
| DEFAULT_BIN = ROOT / "build" / "Release" / ("test_dflash" + (".exe" if sys.platform == "win32" else "")) | |
| DEFAULT_BIN = ROOT / "build" / ("test_dflash" + (".exe" if sys.platform == "win32" else "")) |
|
Closing this and re-cutting smaller PRs against fresh main. Upstream landed ~1500 new lines in server.py / test_server.py since I opened this (PRs #132, #145), and the cubic-dev-ai review flagged real regressions in my version (sampling params not forwarded, default no-draft dispatch removed, _prefill_hook.py tab-vs-spaces). Cleanest path is to drop this branch, take what's now in main as the new baseline, and re-cut only the small surfaces that still apply (e.g. GPU split daemon flags, --prompt-file). Will reopen as separate small PRs once rebased — thanks @cubic-dev-ai for the catches. |
@javierpazo Sounds like the right call. Main has diverged enough that rebasing this branch would just create more churn than it's worth. A few things to watch for when you re-cut:
If you want, I can take another pass on the individual PRs once they're up — should be quicker this time since each one will be focused. |
feat(dflash): daemon scripts improvements (GPU split, Windows, defaults)
Bundle of small daemon-script improvements that piled up while
running Lucebox in production. The cleanup is net negative
(+260 / -583): most of it removes dead branches and stale defaults.
Per CONTRIBUTING ("one concern per PR"): I tried to split this into
three PRs but the changes are entangled across the same hunks of
server.py/server_tools.py(e.g. argparse table changed in oneplace to add
--target-gpu,--prompt-fileand the new defaultstogether). I kept it as one cohesive PR. Happy to split into
three sequential PRs if you prefer, listed below in the order I
would split them — let me know:
A. feat(dflash): add --target-gpu / --draft-gpu daemon flags
- server.py / server_tools.py: launcher flags exposed
- Pins CUDA_VISIBLE_DEVICES per process and translates back
to native ordinals when invoking test_dflash. So
--target-gpu=1 --draft-gpu=0producesCUDA_VISIBLE_DEVICES=1,0 with ordinals 0/1 inside.
- DFLASH_TARGET_GPU / DFLASH_DRAFT_GPU env vars also honoured.
- Complements PR #122 (open, weicj) which adds the same
selection in the Python dual-GPU bench harness; this is the
daemon side.
B. chore(dflash): Windows-friendly daemon scripts
- server.py / server_tools.py / run.py / _prefill_hook.py:
argparse plumbing supports
--prompt-file <path>so promptswith newlines / non-ASCII work without fragile shell quoting.
- tokenize_prompt.py / detokenize.py: minor UTF-8 path fix.
- bench_he.py / bench_llm.py: tokenizer override exposed
(line with merged PR #93).
C. feat(dflash): default to Qwen3.6-27B target + DFlash drafter
- server.py / server_tools.py: when neither
--target-modelnor
--draft-modelis given, defaults pick the Qwen3.6-27Btarget + DFlash drafter with #79-style metadata.
- Behaviour is unchanged for any invocation that already
passes the flags explicitly.
D. (already covered by PR #135)
- --target-cache-slots and --draft-feature-mirror are exposed
to the daemon launcher; this is a small tail of #135 and
not strictly part of this PR's scope. If #135 lands first,
these lines rebase cleanly; if this PR lands first, #135's
slot scheduler has the wiring it needs at the daemon level.
Validation:
python dflash/scripts/server.py
--target-gpu 1 --draft-gpu 0
--target-cache-slots 2 --stream-tagged
--prompt-file prompts/long.txt
on Windows MSVC + CUDA 12.x, RTX 6000 Ada (sm_89) + RTX 4090.
Daemon comes up, both GPUs are pinned correctly, the prompt
file is read without escaping issues, slot 0 and slot 1 admit
requests independently.
matches the behaviour PR bench: make target tokenizer overrideable in bench_he and bench_llm #93 enabled.
This is a correctness / config / UX change, not a kernel perf
change. No kernel timings claimed.
Verification vs existing community PRs:
loader on windows", touches
qwen3_0p6b_loader.cpp).DflashClient", touches
pflash/pflash/dflash_client.py).overrideable in bench_he and bench_llm").
placement" bench harness).
Some surface overlap on
server.py; happy to rebase if make server.py capabale to run codex #132lands first.
plumbing or for the Windows --prompt-file path at this layer.
Author: Javier Pazo xabicasa@gmail.com