Skip to content

feat(dflash): daemon scripts improvements (GPU split, Windows, defaults)#139

Closed
javierpazo wants to merge 1 commit into
Luce-Org:mainfrom
javierpazo:xabicasa/dflash-daemon-scripts-improvements
Closed

feat(dflash): daemon scripts improvements (GPU split, Windows, defaults)#139
javierpazo wants to merge 1 commit into
Luce-Org:mainfrom
javierpazo:xabicasa/dflash-daemon-scripts-improvements

Conversation

@javierpazo

Copy link
Copy Markdown
Contributor

feat(dflash): daemon scripts improvements (GPU split, Windows, defaults)

Bundle of small daemon-script improvements that piled up while
running Lucebox in production. The cleanup is net negative
(+260 / -583): most of it removes dead branches and stale defaults.

Per CONTRIBUTING ("one concern per PR"): I tried to split this into
three PRs but the changes are entangled across the same hunks of
server.py / server_tools.py (e.g. argparse table changed in one
place to add --target-gpu, --prompt-file and the new defaults
together). I kept it as one cohesive PR. Happy to split into
three sequential PRs if you prefer
, listed below in the order I
would split them — let me know:

A. feat(dflash): add --target-gpu / --draft-gpu daemon flags
- server.py / server_tools.py: launcher flags exposed
- Pins CUDA_VISIBLE_DEVICES per process and translates back
to native ordinals when invoking test_dflash. So
--target-gpu=1 --draft-gpu=0 produces
CUDA_VISIBLE_DEVICES=1,0 with ordinals 0/1 inside.
- DFLASH_TARGET_GPU / DFLASH_DRAFT_GPU env vars also honoured.
- Complements PR #122 (open, weicj) which adds the same
selection in the Python dual-GPU bench harness; this is the
daemon side.

B. chore(dflash): Windows-friendly daemon scripts
- server.py / server_tools.py / run.py / _prefill_hook.py:
argparse plumbing supports --prompt-file <path> so prompts
with newlines / non-ASCII work without fragile shell quoting.
- tokenize_prompt.py / detokenize.py: minor UTF-8 path fix.
- bench_he.py / bench_llm.py: tokenizer override exposed
(line with merged PR #93).

C. feat(dflash): default to Qwen3.6-27B target + DFlash drafter
- server.py / server_tools.py: when neither --target-model
nor --draft-model is given, defaults pick the Qwen3.6-27B
target + DFlash drafter with #79-style metadata.
- Behaviour is unchanged for any invocation that already
passes the flags explicitly.

D. (already covered by PR #135)
- --target-cache-slots and --draft-feature-mirror are exposed
to the daemon launcher; this is a small tail of #135 and
not strictly part of this PR's scope. If #135 lands first,
these lines rebase cleanly; if this PR lands first, #135's
slot scheduler has the wiring it needs at the daemon level.

Validation:

  • py_compile for all 8 scripts.
  • End-to-end manual launch:
    python dflash/scripts/server.py
    --target-gpu 1 --draft-gpu 0
    --target-cache-slots 2 --stream-tagged
    --prompt-file prompts/long.txt
    on Windows MSVC + CUDA 12.x, RTX 6000 Ada (sm_89) + RTX 4090.
    Daemon comes up, both GPUs are pinned correctly, the prompt
    file is read without escaping issues, slot 0 and slot 1 admit
    requests independently.
  • bench_he.py / bench_llm.py with a non-default tokenizer flag
    matches the behaviour PR bench: make target tokenizer overrideable in bench_he and bench_llm #93 enabled.

This is a correctness / config / UX change, not a kernel perf
change. No kernel timings claimed.

Verification vs existing community PRs:

Author: Javier Pazo xabicasa@gmail.com

Bundle of small daemon-script improvements that piled up while
running Lucebox in production. The cleanup is net negative
(+260 / -583): most of it removes dead branches and stale defaults.

Per CONTRIBUTING ("one concern per PR"): I tried to split this into
three PRs but the changes are entangled across the same hunks of
`server.py` / `server_tools.py` (e.g. argparse table changed in one
place to add `--target-gpu`, `--prompt-file` and the new defaults
together). I kept it as one cohesive PR. **Happy to split into
three sequential PRs if you prefer**, listed below in the order I
would split them — let me know:

  A. feat(dflash): add --target-gpu / --draft-gpu daemon flags
     - server.py / server_tools.py: launcher flags exposed
     - Pins CUDA_VISIBLE_DEVICES per process and translates back
       to native ordinals when invoking test_dflash. So
       `--target-gpu=1 --draft-gpu=0` produces
       CUDA_VISIBLE_DEVICES=1,0 with ordinals 0/1 inside.
     - DFLASH_TARGET_GPU / DFLASH_DRAFT_GPU env vars also honoured.
     - Complements PR Luce-Org#122 (open, weicj) which adds the same
       selection in the Python dual-GPU bench harness; this is the
       daemon side.

  B. chore(dflash): Windows-friendly daemon scripts
     - server.py / server_tools.py / run.py / _prefill_hook.py:
       argparse plumbing supports `--prompt-file <path>` so prompts
       with newlines / non-ASCII work without fragile shell quoting.
     - tokenize_prompt.py / detokenize.py: minor UTF-8 path fix.
     - bench_he.py / bench_llm.py: tokenizer override exposed
       (line with merged PR Luce-Org#93).

  C. feat(dflash): default to Qwen3.6-27B target + DFlash drafter
     - server.py / server_tools.py: when neither `--target-model`
       nor `--draft-model` is given, defaults pick the Qwen3.6-27B
       target + DFlash drafter with Luce-Org#79-style metadata.
     - Behaviour is unchanged for any invocation that already
       passes the flags explicitly.

  D. (already covered by PR Luce-Org#135)
     - --target-cache-slots and --draft-feature-mirror are exposed
       to the daemon launcher; this is a small tail of Luce-Org#135 and
       not strictly part of this PR's scope. If Luce-Org#135 lands first,
       these lines rebase cleanly; if this PR lands first, Luce-Org#135's
       slot scheduler has the wiring it needs at the daemon level.

Validation:
  - py_compile for all 8 scripts.
  - End-to-end manual launch:
      python dflash/scripts/server.py \
        --target-gpu 1 --draft-gpu 0 \
        --target-cache-slots 2 --stream-tagged \
        --prompt-file prompts/long.txt
    on Windows MSVC + CUDA 12.x, RTX 6000 Ada (sm_89) + RTX 4090.
    Daemon comes up, both GPUs are pinned correctly, the prompt
    file is read without escaping issues, slot 0 and slot 1 admit
    requests independently.
  - bench_he.py / bench_llm.py with a non-default tokenizer flag
    matches the behaviour PR Luce-Org#93 enabled.

This is a correctness / config / UX change, not a kernel perf
change. No kernel timings claimed.

Verification vs existing community PRs:
  - COMP-COMPL with Luce-Org#110 (merged, "fix Sparce attention and Qwen
    loader on windows", touches `qwen3_0p6b_loader.cpp`).
  - COMP-COMPL with Luce-Org#126 (merged, "support Windows in
    DflashClient", touches `pflash/pflash/dflash_client.py`).
  - COMP-COMPL with Luce-Org#93 (merged, "make target tokenizer
    overrideable in bench_he and bench_llm").
  - COMP-COMPL with Luce-Org#122 (open, weicj, "CUDA/HIP mixed backend
    placement" bench harness).
  - COMP-COMPL with Luce-Org#132 (open, server.py for codex tooling).
    Some surface overlap on `server.py`; happy to rebase if Luce-Org#132
    lands first.
  - No prior art for the daemon-side --target-gpu/--draft-gpu
    plumbing or for the Windows --prompt-file path at this layer.

Author: Javier Pazo <xabicasa@gmail.com>

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

5 issues found across 8 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="dflash/scripts/_prefill_hook.py">

<violation number="1" location="dflash/scripts/_prefill_hook.py:150">
P1: Tab-separated `compress` output is incompatible with the daemon parser, which only matches `compress ` and parses space-separated args.</violation>
</file>

<file name="dflash/scripts/server_tools.py">

<violation number="1" location="dflash/scripts/server_tools.py:60">
P2: Hardcoding the default binary under `build/Release` can break startup for non-Windows single-config builds that produce `build/test_dflash` instead.</violation>
</file>

<file name="dflash/scripts/server.py">

<violation number="1" location="dflash/scripts/server.py:48">
P2: Hardcoding the default binary to `build/Release/...` breaks the documented default layout (`build/test_dflash`) and causes startup to exit immediately when users rely on the default `--bin` path.</violation>

<violation number="2" location="dflash/scripts/server.py:195">
P1: Removed Laguna/no-draft dispatch makes the server always require a draft and always start in DDTree mode, breaking documented non-spec targets.</violation>

<violation number="3" location="dflash/scripts/server.py:426">
P1: Sampling parameters are now accepted by the API but never forwarded to the daemon, so requests with non-default temperature/top_p/top_k silently fall back to greedy decoding.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

keep_x1000 = int(round(cfg.keep_ratio * 1000))
daemon_stdin.write(
f"compress {path} {keep_x1000} {cfg.drafter_gguf}\n".encode("utf-8"))
f"compress\t{path}\t{keep_x1000}\t{cfg.drafter_gguf}\n".encode("utf-8"))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1: Tab-separated compress output is incompatible with the daemon parser, which only matches compress and parses space-separated args.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dflash/scripts/_prefill_hook.py, line 150:

<comment>Tab-separated `compress` output is incompatible with the daemon parser, which only matches `compress ` and parses space-separated args.</comment>

<file context>
@@ -153,23 +140,21 @@ def compress_text_via_daemon(
         keep_x1000 = int(round(cfg.keep_ratio * 1000))
         daemon_stdin.write(
-            f"compress {path} {keep_x1000} {cfg.drafter_gguf}\n".encode("utf-8"))
+            f"compress\t{path}\t{keep_x1000}\t{cfg.drafter_gguf}\n".encode("utf-8"))
         daemon_stdin.flush()
         compressed_ids = _drain_until_sentinel(r_pipe)
</file context>
Suggested change
f"compress\t{path}\t{keep_x1000}\t{cfg.drafter_gguf}\n".encode("utf-8"))
f"compress {path} {keep_x1000} {cfg.drafter_gguf}\n".encode("utf-8"))

Comment thread dflash/scripts/server.py
"--fast-rollback", "--ddtree", f"--ddtree-budget={budget}",
f"--max-ctx={max_ctx}",
f"--stream-fd={stream_fd_val}"]
cmd = [bin_abs, str(target), str(draft), "--daemon",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1: Removed Laguna/no-draft dispatch makes the server always require a draft and always start in DDTree mode, breaking documented non-spec targets.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dflash/scripts/server.py, line 195:

<comment>Removed Laguna/no-draft dispatch makes the server always require a draft and always start in DDTree mode, breaking documented non-spec targets.</comment>

<file context>
@@ -300,25 +185,24 @@ def build_app(target: Path, draft: Path | None, bin_path: Path, budget: int, max
-               "--fast-rollback", "--ddtree", f"--ddtree-budget={budget}",
-               f"--max-ctx={max_ctx}",
-               f"--stream-fd={stream_fd_val}"]
+    cmd = [bin_abs, str(target), str(draft), "--daemon",
+           "--fast-rollback", "--ddtree", f"--ddtree-budget={budget}",
+           f"--max-ctx={max_ctx}",
</file context>

Comment thread dflash/scripts/server.py
if snap_prep:
cmd_line += f" snap={snap_prep[1]}:{snap_prep[0]}"
return cmd_line + _samp_suffix(req) + "\n", snap_prep
return cmd_line + "\n", snap_prep

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1: Sampling parameters are now accepted by the API but never forwarded to the daemon, so requests with non-default temperature/top_p/top_k silently fall back to greedy decoding.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dflash/scripts/server.py, line 426:

<comment>Sampling parameters are now accepted by the API but never forwarded to the daemon, so requests with non-default temperature/top_p/top_k silently fall back to greedy decoding.</comment>

<file context>
@@ -552,7 +423,7 @@ def _build_cmd_line(req, cur_bin, cur_ids, gen_len, prefix_cache,
             if snap_prep:
                 cmd_line += f" snap={snap_prep[1]}:{snap_prep[0]}"
-            return cmd_line + _samp_suffix(req) + "\n", snap_prep
+            return cmd_line + "\n", snap_prep
 
     def _gen_len_for(prompt_len: int, max_tokens: int) -> int:
</file context>

))
DEFAULT_DRAFT_ROOT = ROOT / "models" / "draft"
DEFAULT_BIN = ROOT / "build" / "test_dflash"
DEFAULT_BIN = ROOT / "build" / "Release" / ("test_dflash" + (".exe" if sys.platform == "win32" else ""))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: Hardcoding the default binary under build/Release can break startup for non-Windows single-config builds that produce build/test_dflash instead.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dflash/scripts/server_tools.py, line 60:

<comment>Hardcoding the default binary under `build/Release` can break startup for non-Windows single-config builds that produce `build/test_dflash` instead.</comment>

<file context>
@@ -56,11 +57,23 @@
 ))
 DEFAULT_DRAFT_ROOT = ROOT / "models" / "draft"
-DEFAULT_BIN = ROOT / "build" / "test_dflash"
+DEFAULT_BIN = ROOT / "build" / "Release" / ("test_dflash" + (".exe" if sys.platform == "win32" else ""))
 DEFAULT_BUDGET = 22
 MODEL_NAME = "luce-dflash"
</file context>

Comment thread dflash/scripts/server.py
))
DEFAULT_DRAFT_ROOT = ROOT / "models" / "draft"
DEFAULT_BIN = ROOT / "build" / ("test_dflash" + (".exe" if sys.platform == "win32" else ""))
DEFAULT_BIN = ROOT / "build" / "Release" / ("test_dflash" + (".exe" if sys.platform == "win32" else ""))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: Hardcoding the default binary to build/Release/... breaks the documented default layout (build/test_dflash) and causes startup to exit immediately when users rely on the default --bin path.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dflash/scripts/server.py, line 48:

<comment>Hardcoding the default binary to `build/Release/...` breaks the documented default layout (`build/test_dflash`) and causes startup to exit immediately when users rely on the default `--bin` path.</comment>

<file context>
@@ -46,22 +45,25 @@
 ))
 DEFAULT_DRAFT_ROOT = ROOT / "models" / "draft"
-DEFAULT_BIN = ROOT / "build" / ("test_dflash" + (".exe" if sys.platform == "win32" else ""))
+DEFAULT_BIN = ROOT / "build" / "Release" / ("test_dflash" + (".exe" if sys.platform == "win32" else ""))
 DEFAULT_BUDGET = 22
 MODEL_NAME = "luce-dflash"
</file context>
Suggested change
DEFAULT_BIN = ROOT / "build" / "Release" / ("test_dflash" + (".exe" if sys.platform == "win32" else ""))
DEFAULT_BIN = ROOT / "build" / ("test_dflash" + (".exe" if sys.platform == "win32" else ""))

@javierpazo

Copy link
Copy Markdown
Contributor Author

Closing this and re-cutting smaller PRs against fresh main. Upstream landed ~1500 new lines in server.py / test_server.py since I opened this (PRs #132, #145), and the cubic-dev-ai review flagged real regressions in my version (sampling params not forwarded, default no-draft dispatch removed, _prefill_hook.py tab-vs-spaces). Cleanest path is to drop this branch, take what's now in main as the new baseline, and re-cut only the small surfaces that still apply (e.g. GPU split daemon flags, --prompt-file). Will reopen as separate small PRs once rebased — thanks @cubic-dev-ai for the catches.

@javierpazo javierpazo closed this May 10, 2026
@cubic-dev-ai

cubic-dev-ai Bot commented May 10, 2026

Copy link
Copy Markdown
Contributor

Closing this and re-cutting smaller PRs against fresh main. Upstream landed ~1500 new lines in server.py / test_server.py since I opened this (PRs #132, #145), and the cubic-dev-ai review flagged real regressions in my version (sampling params not forwarded, default no-draft dispatch removed, _prefill_hook.py tab-vs-spaces). Cleanest path is to drop this branch, take what's now in main as the new baseline, and re-cut only the small surfaces that still apply (e.g. GPU split daemon flags, --prompt-file). Will reopen as separate small PRs once rebased — thanks @cubic-dev-ai for the catches.

@javierpazo Sounds like the right call. Main has diverged enough that rebasing this branch would just create more churn than it's worth.

A few things to watch for when you re-cut:

  • Sampling forwarding — the current PR stripped _samp_suffix and marked temperature/top_p/top_k as "accepted, ignored" in the request models. If upstream main now supports sampling in the daemon protocol, make sure the new PRs don't accidentally drop that wiring again.

  • Laguna/no-draft path — main's server.py likely still handles the case where draft can be None (archs without spec-decode). The PR here removed that branch entirely and required a draft always. Your re-cut for the GPU split flags should preserve whatever dispatch logic main has now.

  • _prefill_hook.py tabs — the \t separators in the compress command line were a minor thing but easy to slip back in. Sticking with spaces keeps it consistent with the rest of the file.

  • --prompt-file for Windows is probably the cleanest standalone PR to start with — small surface, no model dispatch concerns, and it sounds like upstream main doesn't have it yet.

If you want, I can take another pass on the individual PRs once they're up — should be quicker this time since each one will be focused.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant