feat(server): add mixed-backend DFlash disk prefix cache for target layer split #352

Open
weicj wants to merge 6 commits into Luce-Org:main from weicj:feat-mixed-backend-disk-prefix-cache

@weicj weicj commented Jun 8, 2026

Collaborator

Summary

This PR is the mixed-backend follow-up to #325. #325 restored disk prefix cache for same-backend target layer split; this PR adds the missing remote-shard snapshot export/import path for CUDA/HIP mixed-backend target split. With this change, placements such as --target-devices cuda:0,hip:0,hip:1 can be used together with --kv-cache-dir, and a restarted server can restore the split target state from disk prefix cache.

Previously, mixed-backend target split could not safely enable disk cache because the target prefix state does not live only in the parent process. The parent can write its local shard into the disk snapshot, but the remote HIP/CUDA shards live inside the backend IPC daemon. Without remote shard snapshot IPC, disk restore could only recover the parent-local part and would leave the split target state incomplete.

This PR makes remote shard snapshots an explicit IPC operation. On save, the parent asks the remote target shard daemon to export its snapshot tensors, then writes them into the same disk snapshot as the local shard tensors, snap_prefill_logits, and the DFlash feature mirror. On load, the parent splits remote shard tensors back out of the disk snapshot and imports them into the remote daemon. DFlash target restore and feature-mirror restore stay in the same snapshot, so speculative decode can continue after disk restore.
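
The save and load flow described above can be sketched as follows. This is a minimal illustration, not the PR's code: `Snapshot`, `compose`, and `extract_shard` are hypothetical names, an in-memory map stands in for the on-disk snapshot, and the only detail taken from the PR is the `ls<shard>_<tensor-name>` naming with remote shards placed after the local shard index range.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a disk snapshot: tensor name -> raw bytes.
using Blob = std::vector<std::uint8_t>;
using Snapshot = std::map<std::string, Blob>;

// Key for one shard tensor, following the ls<shard>_<tensor-name> pattern.
std::string shard_key(int shard, const std::string& tensor) {
    return "ls" + std::to_string(shard) + "_" + tensor;
}

// Save path: local shards take indices [0, n_local), remote shards follow.
Snapshot compose(const std::vector<Snapshot>& local,
                 const std::vector<Snapshot>& remote) {
    Snapshot out;
    int shard = 0;
    auto append = [&](const std::vector<Snapshot>& group) {
        for (const Snapshot& tensors : group) {
            for (const auto& [name, bytes] : tensors) {
                out[shard_key(shard, name)] = bytes;
            }
            ++shard;
        }
    };
    append(local);   // tensors owned by the parent process
    append(remote);  // tensors exported by the remote shard daemon
    return out;
}

// Load path: pull one shard's tensors back out, stripping the key prefix.
Snapshot extract_shard(const Snapshot& disk, int shard) {
    const std::string prefix = "ls" + std::to_string(shard) + "_";
    Snapshot out;
    for (const auto& [key, bytes] : disk) {
        if (key.compare(0, prefix.size(), prefix) == 0) {
            out[key.substr(prefix.size())] = bytes;
        }
    }
    return out;
}
```

In this sketch, shards extracted at indices at or above the local shard count would be the ones handed to the remote daemon on import; keys without an `ls<shard>_` prefix (such as `snap_prefill_logits`) are left untouched in the same snapshot.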

Changes

  • Adds Qwen35 remote target shard snapshot IPC:
    • adds prefix_snapshot_export to export shard-local snapshot tensors and logits from the remote target shard daemon;
    • adds prefix_snapshot_import to rebuild disk-loaded remote tensors into a remote daemon prefix snapshot slot;
    • includes shard id, tensor name, dtype, shape, and payload size in each snapshot tensor header, with import-time validation that shape/type match the payload size.
  • Extends the Qwen35 layer-split disk snapshot:
    • same-backend still stores local ls<shard>_<tensor-name> tensors;
    • mixed-backend additionally stores remote shard tensors after the local shard index range;
    • restore adopts local shards back into the parent adapter and imports remote shard tensors back into the backend IPC daemon;
    • keeps snap_prefill_logits, dflash_feature_meta, and dflash_feature_data, so DFlash disk restore does not degrade into a target-only partial state.
  • Updates server placement validation to allow --kv-cache-dir with mixed-backend --target-devices; remote target shard IPC still has to be provided explicitly.
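
The import-time check that shape and type match the payload size could look roughly like the sketch below. The struct layout, the `DType` codes, and the function names are assumptions made for illustration; the PR only states which fields the header carries. The sketch also assumes plain element types, so block-quantized tensor types would need a block-size-aware byte count instead.

```cpp
#include <cstdint>
#include <limits>
#include <string>
#include <vector>

// Hypothetical dtype codes; the real wire format is not shown in the PR.
enum class DType : std::uint32_t { F32 = 0, F16 = 1 };

// Hypothetical header mirroring the fields listed in the PR description.
struct SnapshotTensorHeader {
    std::uint32_t shard_id;
    std::string name;
    DType dtype;
    std::vector<std::uint64_t> shape;
    std::uint64_t payload_bytes;
};

// Bytes per element for the plain (non-quantized) types in this sketch.
std::uint64_t element_size(DType t) {
    return t == DType::F32 ? 4 : 2;
}

// Returns true only if dtype and shape imply exactly payload_bytes.
bool header_matches_payload(const SnapshotTensorHeader& h) {
    if (h.name.empty() || h.shape.empty()) return false;
    std::uint64_t bytes = element_size(h.dtype);
    for (std::uint64_t dim : h.shape) {
        if (dim == 0) return false;
        // Reject headers whose byte count would overflow 64 bits.
        if (bytes > std::numeric_limits<std::uint64_t>::max() / dim) return false;
        bytes *= dim;
    }
    return bytes == h.payload_bytes;
}
```

Checking the header before reading the payload means a truncated or corrupted snapshot entry is rejected without allocating a buffer sized from untrusted fields.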

Notes

  • Local runtime validation passed on Tesla P4 CUDA + dual Pro VII HIP with Qwen3.6-27B Q4 target, the Qwen3.6 DFlash draft, and --target-devices cuda:0,hip:0,hip:1 --target-layer-split 0.08,0.46,0.46. The first server process saved the disk prefix cache; after a cold restart, the second process hit the disk cache and logged [target-split] adopted disk snapshot slot=63 local_shards=1 remote_shards=2 pos=10, disk_hit=true, restore=true; DFlash speculative decode then continued with accepted draft tokens.
  • Remote runtime validation also passed on RTX 3090 CUDA + Radeon 8060S Strix Halo HIP/gfx1151 with Qwen3.6-27B Q4 target, the Qwen3.6 DFlash draft, and --target-devices cuda:0,hip:0 --target-layer-split 0.5,0.5. The second cold start scanned the disk cache and logged [target-split] adopted disk snapshot slot=63 local_shards=1 remote_shards=1 pos=10, disk_hit=true, restore=true; DFlash continued speculative decode with accepted draft tokens after restore.
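
For readers unfamiliar with the flags used in these runs, the sketch below shows one way a device list could be classified as mixed-backend and paired with its layer-split fractions. It is not the server's validation code: the helper names are hypothetical, and the rules that each device gets exactly one fraction and that the fractions sum to 1 are assumptions inferred from the two command lines above.

```cpp
#include <cmath>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Split a comma-separated flag value such as "cuda:0,hip:0,hip:1".
std::vector<std::string> split_csv(const std::string& s) {
    std::vector<std::string> out;
    std::stringstream ss(s);
    std::string item;
    while (std::getline(ss, item, ',')) out.push_back(item);
    return out;
}

// The backend is the part before ':' ("cuda", "hip"); more than one
// distinct backend means a mixed-backend placement.
bool is_mixed_backend(const std::string& target_devices) {
    std::set<std::string> backends;
    for (const std::string& dev : split_csv(target_devices)) {
        backends.insert(dev.substr(0, dev.find(':')));
    }
    return backends.size() > 1;
}

// Assumed rule: one fraction per device, fractions summing to about 1.
// std::stod throws on non-numeric input; a real parser would handle that.
bool split_matches_devices(const std::string& devices, const std::string& split) {
    const std::vector<std::string> devs = split_csv(devices);
    const std::vector<std::string> parts = split_csv(split);
    if (devs.empty() || devs.size() != parts.size()) return false;
    double sum = 0.0;
    for (const std::string& p : parts) sum += std::stod(p);
    return std::fabs(sum - 1.0) < 1e-6;
}
```

Under these assumptions, both validated placements (`cuda:0,hip:0,hip:1` with `0.08,0.46,0.46` and `cuda:0,hip:0` with `0.5,0.5`) classify as mixed-backend with a consistent split.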

@weicj weicj force-pushed the feat-mixed-backend-disk-prefix-cache branch from 3309ee5 to 6a5a84a Compare June 8, 2026 07:32
@weicj weicj marked this pull request as ready for review June 8, 2026 10:30

@cubic-dev-ai cubic-dev-ai Bot left a comment

Contributor

2 issues found across 5 files

Reply with feedback, questions, or to request a fix.


Comment thread server/src/qwen35/qwen35_target_shard_ipc.cpp
Comment thread server/src/qwen35/qwen35_target_shard_ipc_daemon.cpp

@cubic-dev-ai cubic-dev-ai Bot left a comment

Contributor

1 issue found

Reply with feedback, questions, or to request a fix.


Comment thread server/src/qwen35/qwen35_target_shard_ipc_daemon.cpp Outdated
@weicj weicj force-pushed the feat-mixed-backend-disk-prefix-cache branch from 42c2ff1 to 4e22769 Compare June 8, 2026 16:44
easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 8, 2026
… disk prefix cache for target layer split

@cubic-dev-ai cubic-dev-ai Bot left a comment

Contributor

1 issue found across 1 file (changes from recent commits).


Comment thread server/src/qwen35/qwen35_layer_split_adapter.cpp Outdated