feat(dflash): native multi-request scheduler with batched target step #135
Open
javierpazo wants to merge 1 commit into
Contributor
|
Amazing contribution @javierpazo, thank you! Can you resolve the conflicts?
This change brings concurrent multi-request execution to test_dflash
on a single GPU. It is internally one cohesive unit but can be split
into four conceptual pieces if a smaller review is preferred:
1. Multi TargetCache slots
- CLI: --target-cache-slots=N (alias --cache-slots=N)
- prefix `SLOT <id>` routes commands to a specific slot
- DaemonSlotState + RAII ActiveDaemonSlot for safe switching
- LIST_TARGET_CACHE_SLOTS for introspection
- all slots share target/draft weights; only KV/SSM/scratch is
per-slot
- create_target_cache gains an `n_seqs` parameter so a single
cache can be allocated batched up front
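
As an illustration of the RAII switching idea in piece 1, here is a minimal sketch. It is not the PR's `ActiveDaemonSlot`: the names `SlotState`, `Daemon`, `ActiveSlotGuard` and `commit_token_on_slot` are hypothetical, and the per-slot fields are reduced to `cur_pos` and `last_tok`. The guard selects a slot for the duration of one command and restores the previously active slot when it goes out of scope, even on an early return.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-slot state; the real slot also owns KV/SSM/scratch buffers.
struct SlotState {
    int cur_pos  = 0;   // next position to write for this slot
    int last_tok = -1;  // last committed token, -1 if none yet
};

// Hypothetical daemon: weights are shared, only the slot states differ.
struct Daemon {
    std::vector<SlotState> slots;
    std::size_t active = 0;
    explicit Daemon(std::size_t n_slots) : slots(n_slots) {}
    SlotState & current() { return slots[active]; }
};

// RAII guard: activates `slot` on construction, restores the previous slot on destruction.
class ActiveSlotGuard {
public:
    ActiveSlotGuard(Daemon & d, std::size_t slot) : d_(d), prev_(d.active) {
        assert(slot < d.slots.size() && "slot id out of range");
        d_.active = slot;
    }
    ~ActiveSlotGuard() { d_.active = prev_; }
    ActiveSlotGuard(const ActiveSlotGuard &) = delete;
    ActiveSlotGuard & operator=(const ActiveSlotGuard &) = delete;
private:
    Daemon & d_;
    std::size_t prev_;
};

// Applies one token commit to `slot` and returns the slot that is active afterwards.
inline std::size_t commit_token_on_slot(Daemon & d, std::size_t slot, int tok) {
    {
        ActiveSlotGuard guard(d, slot);  // switch for this command only
        d.current().last_tok = tok;
        d.current().cur_pos += 1;
    }                                    // previous slot restored here
    return d.active;
}
```

The point of the guard is that a `SLOT <id>`-prefixed command cannot leave the daemon pointing at the wrong cache if the handler exits early.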
2. Tagged stream protocol (opt-in)
- --stream-tagged emits frames `[-2, request_id, token]` instead
of bare int32 tokens; sentinels `-4` (CONTINUE), `-1` (DONE)
- parser recognises `REQ <id>` / `REQUEST <id>` headers
- legacy bare-int32 streaming is unchanged when the flag is off
- this lets a client demux multiple concurrent requests over the
same stdout
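
A client-side demultiplexer for the tagged stream could look roughly like the sketch below. Only the frame shape `[-2, request_id, token]` and the sentinel values `-4` (CONTINUE) and `-1` (DONE) come from the description above. How the sentinels are framed is an assumption here: the sketch treats them as the third word of a tagged frame. `RequestStream` and `demux_tagged` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

constexpr std::int32_t kFrameTag = -2;  // first word of every tagged frame
constexpr std::int32_t kContinue = -4;  // sentinel: request can be resumed with CONTINUE
constexpr std::int32_t kDone     = -1;  // sentinel: request finished

struct RequestStream {
    std::vector<std::int32_t> tokens;
    bool awaiting_continue = false;
    bool done = false;
};

// Splits a buffer of whole 3-word frames into per-request token streams.
// ASSUMPTION: sentinels arrive in the token position of a tagged frame.
inline std::map<std::int32_t, RequestStream>
demux_tagged(const std::vector<std::int32_t> & words) {
    assert(words.size() % 3 == 0 && "expects whole 3-word frames");
    std::map<std::int32_t, RequestStream> out;
    for (std::size_t i = 0; i + 3 <= words.size(); i += 3) {
        if (words[i] != kFrameTag) continue;  // not a tagged frame, skip it
        RequestStream & rs = out[words[i + 1]];
        const std::int32_t tok = words[i + 2];
        if (tok == kDone) {
            rs.done = true;
        } else if (tok == kContinue) {
            rs.awaiting_continue = true;
        } else {
            rs.awaiting_continue = false;
            rs.tokens.push_back(tok);
        }
    }
    return out;
}
```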
3. Native quantum scheduler
- dispatch table for REQ/SLOT/START, SCHED_STEP, SCHED_DRAIN,
LIST_REQUESTS
- cursor-based fair round-robin between admitted requests
- non-blocking reader thread admits new requests during a drain
- PendingQuantum{slot, req, epoch, n_gen} carries the unit of work
- CONTINUE / CONT resumes a slot without re-prefilling
- REQ <id> CANCEL invalidates a request and bumps the slot epoch
so a stale CONTINUE is rejected; RESTORE_CHAIN / legacy generate
refuse to overwrite a slot that is owned by an active scheduler
request
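
The cursor-based round-robin in piece 3 can be sketched as follows. This is an illustrative model under stated assumptions, not the PR's scheduler: it requires C++17, `Request`, `RoundRobin` and `drain` are hypothetical names, and a quantum is modelled simply as "generate up to N tokens". It shows why a cursor gives fairness and why a request admitted mid-drain is picked up on the next pass.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

struct Request {
    int id;
    int remaining;  // tokens still to generate
};

class RoundRobin {
public:
    void admit(int id, int n_tokens) {
        assert(n_tokens > 0);
        reqs_.push_back({id, n_tokens});
    }

    // Runs one quantum for the request under the cursor and returns its id,
    // or nullopt when nothing is admitted.
    std::optional<int> step(int quantum) {
        assert(quantum > 0);
        if (reqs_.empty()) return std::nullopt;
        if (cursor_ >= reqs_.size()) cursor_ = 0;  // wrap around
        Request & r = reqs_[cursor_];
        const int id = r.id;
        r.remaining -= std::min(quantum, r.remaining);
        if (r.remaining == 0) {
            // Finished: remove it; the cursor now points at the next request.
            reqs_.erase(reqs_.begin() + static_cast<std::ptrdiff_t>(cursor_));
        } else {
            ++cursor_;  // give the next request its turn
        }
        return id;
    }

    bool idle() const { return reqs_.empty(); }

private:
    std::vector<Request> reqs_;
    std::size_t cursor_ = 0;
};

// Steps until no work is left and returns the order in which requests ran.
inline std::vector<int> drain(RoundRobin & rr, int quantum) {
    std::vector<int> order;
    while (auto id = rr.step(quantum)) order.push_back(*id);
    return order;
}
```

In this model, admitting request 4 with four tokens and request 5 with two tokens, then draining with a quantum of 2, runs the requests in the order 4, 5, 4.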
4. Fused batched target step (CUDA path)
- new commands: SCHED_BATCH_PEEK, SCHED_BATCH_PROBE,
SCHED_BATCH_TARGET_TAIL, SCHED_BATCH_TARGET_STEP,
SCHED_BATCH_DRAIN
- QwenGraphInputs gains `n_seqs`; build_delta_net_block accepts
n_seqs > 1
- target_feat is allocated as [5*hidden, target_feat_cap, n_seqs]
when batched and the chain forwards capture features per-seq
- batch_probe_compare_ok smoke shows mismatches=0 vs the
single-seq path; SCHED_BATCH_TARGET_TAIL commits two completed
pending quanta in 29.26 ms; SCHED_BATCH_TARGET_STEP commits the
next batched step in 29.57 ms; SCHED_BATCH_DRAIN completes
req12/req13 with two batched steps each
- rollback for partially accepted draft tokens, multi-token verify
and parent-id propagation in the batched path are noted as
follow-ups; today the batched step accepts the cleanest case
and falls back to single-seq when needed
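
To make the batched `target_feat` shape `[5*hidden, target_feat_cap, n_seqs]` concrete, the sketch below computes the flat element offset of one feature value. It assumes a contiguous tensor with dimension 0 varying fastest, which is ggml's convention for contiguous tensors; `FeatShape` and `feat_offset` are hypothetical helpers, and element type and byte strides are ignored.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical description of the batched feature buffer.
struct FeatShape {
    std::size_t hidden;  // model hidden size
    std::size_t cap;     // target_feat_cap: positions kept per sequence
    std::size_t n_seqs;  // number of batched sequences

    std::size_t row() const { return 5 * hidden; }            // dimension 0
    std::size_t elems() const { return row() * cap * n_seqs; }
};

// Flat element offset of feature `i` at position `pos` of sequence `seq`,
// ASSUMING a contiguous layout with dimension 0 fastest.
inline std::size_t feat_offset(const FeatShape & s, std::size_t i,
                               std::size_t pos, std::size_t seq) {
    assert(i < s.row() && pos < s.cap && seq < s.n_seqs);
    return i + s.row() * (pos + s.cap * seq);
}
```

With `n_seqs = 1` the offset reduces to `i + row * pos`, which is consistent with the single-sequence layout staying unchanged when batching is off.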
Validation (single GPU1 RTX 6000 Ada sm_89, Heretic Q4_K_M target +
Q8 GGUF or FP16 safetensors drafter, FA_WINDOW=0, KV q4_0/q4_0):
- Two concurrent requests:
REQ 4 START SLOT 0 quantum=2
REQ 5 START SLOT 1 quantum=2
SCHED_DRAIN closes both clean.
slot 0: 18.41 tok/s, slot 1: 22.50 tok/s
- Mid-drain admission of REQ 6 succeeds; CONTINUE on slot 0 resumes
without re-prefill.
- batch_probe_compare_ok mismatches=0 over a 2-seq probe.
- batch_tail_commit count=2 ms=29.26.
- batch_step_commit ms=29.57; the following SCHED_DRAIN reverts cleanly
back to the DFlash single-seq path.
Compatibility:
- All new behaviour is opt-in. Default invocation of test_dflash
with no scheduler flags keeps the legacy single-request path.
- Tagged stream is gated behind --stream-tagged.
- Multi-slot is gated behind --target-cache-slots=N (default N=1).
- Batched target step is reached only via the SCHED_BATCH_* command
family; legacy SCHED_STEP keeps using the single-seq path.
- Hot-loop diagnostic logs (sync_us / step_debug) are now gated
behind DFLASH27B_TIMING_DEBUG / DFLASH27B_STEP_DEBUG so the
default path is unchanged.
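
Gating hot-loop logs behind environment variables can be done by reading the variables once at startup and checking a plain bool in the loop. The sketch below assumes that a variable counts as enabled when it is set, non-empty and not "0"; the PR may interpret the values differently. `env_flag` and `DebugFlags` are hypothetical names, while the two variable names come from the list above.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// True when `name` is set to a non-empty value other than "0".
// ASSUMPTION: this truthiness rule is illustrative, not taken from the PR.
inline bool env_flag(const char * name) {
    assert(name != nullptr);
    const char * v = std::getenv(name);
    return v != nullptr && v[0] != '\0' && std::strcmp(v, "0") != 0;
}

// Read once at startup so the hot loop only tests a bool.
struct DebugFlags {
    bool timing = false;  // DFLASH27B_TIMING_DEBUG
    bool step   = false;  // DFLASH27B_STEP_DEBUG

    static DebugFlags from_env() {
        DebugFlags f;
        f.timing = env_flag("DFLASH27B_TIMING_DEBUG");
        f.step   = env_flag("DFLASH27B_STEP_DEBUG");
        return f;
    }
};
```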
Verification vs existing community PRs:
- No prior art in lucebox-hub for the SCHED_BATCH_* protocol or for
a native C++ quantum scheduler with REQ/SLOT/CONTINUE/CANCEL +
epoch hardening. Checked against PR Luce-Org#39 (CUDA graph reuse) and
PR Luce-Org#62 (split target/draft StepGraphs); both reuse / split graphs
but neither exposes a multi-request slot protocol.
- No upstream collision found for tagged stream framing or
--target-cache-slots.
Happy to split this into four sequential PRs (slots / tagged stream /
quantum scheduler / batched target step) if a smaller-grained review
is preferred — let me know.
Author: Javier Pazo <xabicasa@gmail.com>
a28b609 to
561b0ac
Compare
Contributor
Author
|
@davide221 thanks! Rebased on top of fresh main, conflicts resolved. The two collisions were tiny (a comment in internal.h::DraftLayer::is_swa and the [cfg] log line in test_dflash.cpp); merged the two log lines so both the upstream draft_swa / draft_ctx_max / target_split_dflash flags and this PR's target_cache_slots / stream_tagged show up. The big ~900-line block from upstream's new target-split path is preserved untouched.
easel
pushed a commit
to easel/lucebox-hub
that referenced
this pull request
May 27, 2026
Record the 2026-05-27 18:17 scheduled run, including fresh PR classification and targeted worktree/Codex probes for stale PRs Luce-Org#137 and Luce-Org#135.
easel
pushed a commit
to easel/lucebox-hub
that referenced
this pull request
May 28, 2026
Revalidate PRs Luce-Org#137 and Luce-Org#135 in isolated worktrees, record conflict shape and current-layout recommendations. Upstream main remains unchanged at 4f4d82e.
easel
pushed a commit
to easel/lucebox-hub
that referenced
this pull request
May 28, 2026
Record a fresh PR Luce-Org#135 conflict probe and tmux-driven Codex feasibility report. The stack remains aligned with origin/main; no code changes were integrated.
easel
pushed a commit
to easel/lucebox-hub
that referenced
this pull request
May 28, 2026
Refresh unattended integration manifest after fresh direct merge probes and a Codex feasibility pass for PR Luce-Org#135. No source stack changes.
easel
pushed a commit
to easel/lucebox-hub
that referenced
this pull request
May 29, 2026
Record the 2026-05-28 20:23 cron revalidation: upstream and carried PR heads remain current, draft Luce-Org#304 is excluded, fresh conflicted-PR probes were retained, and a tmux-driven Codex inspection keeps Luce-Org#135 as a designed current-layout port instead of a mechanical merge.
easel
pushed a commit
to easel/lucebox-hub
that referenced
this pull request
May 29, 2026
Record the 2026-05-28 23:28 cron pass, repeated conflict probes, and the Codex salvage assessment for PR Luce-Org#135. No source stack rewrite was needed because origin/main and carried non-draft PR heads were already current.
easel
pushed a commit
to easel/lucebox-hub
that referenced
this pull request
May 29, 2026
Record the 2026-05-29 00:41 cron preflight, open PR classification, repeat worktree merge probes, and fresh delegated PR Luce-Org#135 attempts. No source stack rewrite was needed because origin/main and carried mergeable PR heads were already included.
easel
pushed a commit
to easel/lucebox-hub
that referenced
this pull request
May 29, 2026
Record the 2026-05-29 01:00 EDT unattended integration run, fresh merge probes, and Codex feasibility output for PR Luce-Org#135.
easel
pushed a commit
to easel/lucebox-hub
that referenced
this pull request
May 29, 2026
Record the 2026-05-29 01:51 cron preflight, fresh PR probes, and the new Luce-Org#135 Claude/Codex delegation results. No PR stack code changes were needed because all safe non-draft heads are already ancestors of easel/auto-integration.
easel
pushed a commit
to easel/lucebox-hub
that referenced
this pull request
May 29, 2026
Record the 2026-05-29 02:51 cron refresh, current PR classifications, fresh conflict probes, and tmux delegation outcomes for PRs Luce-Org#237 and Luce-Org#135.
easel
pushed a commit
to easel/lucebox-hub
that referenced
this pull request
May 29, 2026
Record the 2026-05-29 03:22 cron refresh, fresh conflict probes for remaining non-draft PRs, and a tmux-driven Codex feasibility review for PR Luce-Org#135.
easel
pushed a commit
to easel/lucebox-hub
that referenced
this pull request
May 29, 2026
Record the 2026-05-29 06:29 unattended refresh, including fresh conflict probes for the remaining non-draft selective-port candidates and an inconclusive Luce-Org#135 delegated review attempt.
easel
pushed a commit
to easel/lucebox-hub
that referenced
this pull request
May 29, 2026
Record the 2026-05-29 07:22 EDT unattended refresh, including current PR containment, fresh conflict probes, and the Codex feasibility report for PR Luce-Org#135. No contributor code changed in this refresh.
easel
pushed a commit
to easel/lucebox-hub
that referenced
this pull request
May 29, 2026
Record the 2026-05-29 09:42 cron preflight, fresh conflicted-PR probes, and tmux-driven Claude/Codex delegation results for PRs Luce-Org#237 and Luce-Org#135.
easel
pushed a commit
to easel/lucebox-hub
that referenced
this pull request
May 29, 2026
Record the latest cron preflight, direct conflict probes, and the new Codex salvage report for PR Luce-Org#135. No contributor code changed; all cleanly mergeable non-draft PR heads remain contained in the stack.
easel
pushed a commit
to easel/lucebox-hub
that referenced
this pull request
May 29, 2026
Record the 2026-05-29 13:12 EDT reconciliation pass: all mergeable non-draft PR heads remain included, remaining old-layout probes still conflict, Claude Luce-Org#237 again hit its turn limit, and Codex Luce-Org#135 produced a usable selective-port target list.
easel
pushed a commit
to easel/lucebox-hub
that referenced
this pull request
May 29, 2026
Record the 2026-05-29 19:20 unattended probe run, including fresh direct-merge conflict probes for unresolved old-layout PRs and a tmux Codex feasibility report for PR Luce-Org#135.
easel
pushed a commit
to easel/lucebox-hub
that referenced
this pull request
May 30, 2026
Record the 2026-05-30 02:26 cron preflight, fresh direct-merge probes for remaining old-layout PRs, and the tmux Codex Luce-Org#135 feasibility report. No source code changed.
easel
pushed a commit
to easel/lucebox-hub
that referenced
this pull request
May 30, 2026
Record the 2026-05-30 04:18 unattended integration pass, exact PR containment checks, renewed conflict probes, and the Codex Luce-Org#135 selective-port assessment.
easel
pushed a commit
to easel/lucebox-hub
that referenced
this pull request
May 30, 2026
Record the 2026-05-30 05:29 unattended integration run, including refreshed PR containment, conflict probes, the read-only Luce-Org#135 Codex delegation attempt, verification commands, and retained worktree paths.
easel
pushed a commit
to easel/lucebox-hub
that referenced
this pull request
May 30, 2026
Record the 2026-05-30 06:06 unattended run, unchanged PR containment, fresh conflict probes, and the Codex Luce-Org#135 selective-port assessment.
easel
pushed a commit
to easel/lucebox-hub
that referenced
this pull request
May 30, 2026
Record the 2026-05-30 07:24 cron pass, fresh conflict probes, and the Codex Luce-Org#135 selective-port plan. No PR stack code changes were needed.
easel
pushed a commit
to easel/lucebox-hub
that referenced
this pull request
May 30, 2026
Record the 2026-05-30 17:51 cron preflight/probe pass, including fresh conflict probes for the remaining non-integrated PRs and a tmux/Codex feasibility audit for PR Luce-Org#135.
easel
pushed a commit
to easel/lucebox-hub
that referenced
this pull request
May 31, 2026
Record 2026-05-31 04:10 cron reconciliation: no new PR heads, fresh conflict probes for Luce-Org#305/Luce-Org#237/Luce-Org#221/Luce-Org#154/Luce-Org#153/Luce-Org#135, and failed read-only delegation attempts for Luce-Org#237.
easel
pushed a commit
to easel/lucebox-hub
that referenced
this pull request
May 31, 2026
Record the 2026-05-31 05:55 unattended refresh, fresh probes for the remaining non-ancestor PRs, and the completed Codex feasibility review for PR Luce-Org#135.
easel
pushed a commit
to easel/lucebox-hub
that referenced
this pull request
May 31, 2026
Port a minimal PR Luce-Org#135 salvage slice into the current qwen35 target graph. Adds n_seqs-aware prefill-only cache allocation and graph guards while preserving existing n_seqs=1 behavior. Full scheduler/copyback commands remain unported pending build/runtime validation.
easel
pushed a commit
to easel/lucebox-hub
that referenced
this pull request
May 31, 2026
Record the 2026-05-31 06:49 cron reconciliation, Luce-Org#135 delegated selective-port attempt, validation outcomes, retained worktrees, and remaining PR classifications.
easel
pushed a commit
to easel/lucebox-hub
that referenced
this pull request
May 31, 2026
Selectively ports a small PR Luce-Org#135 request-framing slice onto the current server/test_dflash harness. Legacy stream output remains unchanged unless --stream-tagged is enabled.
easel
pushed a commit
to easel/lucebox-hub
that referenced
this pull request
May 31, 2026
Port a narrow PR Luce-Org#135 slice onto the current qwen35 target graph. Batched prefill caches now allocate target_feat with a sequence dimension, the batched graph accepts feature capture when cache dimensions match, and capture copies are emitted per sequence while preserving the existing single-sequence buffer layout.
Validation: git diff --check. Full CMake validation remains blocked locally by missing populated ggml headers under server/deps/llama.cpp and the known CUDA compiler-id sm_52 toolchain issue.
easel
pushed a commit
to easel/lucebox-hub
that referenced
this pull request
May 31, 2026
Record the 2026-05-31 09:41 cron preflight, current PR-head containment, fresh merge-probe conflict counts, and the no-edit Luce-Org#135 Claude/Codex review findings. No source code changed in this refresh.
easel
pushed a commit
to easel/lucebox-hub
that referenced
this pull request
May 31, 2026
Record the 2026-05-31 11:01 cron pass: exact PR-head containment stayed unchanged at 21 included non-draft PRs, six selective-port candidates remain, and a tmux-driven Codex attempt for the next PR Luce-Org#135 multi-cache-slot slice left the conflicted probe unresolved with no source changes promoted.
easel
pushed a commit
to easel/lucebox-hub
that referenced
this pull request
May 31, 2026
Port a narrow slice of PR Luce-Org#135 into the current stack: daemon cache-slot parsing, independent extra TargetCache state, graph/feature-mirror swapping, and cleanup handling. Refresh auto-integration manifest after merging advanced PR Luce-Org#285.
easel
pushed a commit
to easel/lucebox-hub
that referenced
this pull request
May 31, 2026
Port a narrow PR Luce-Org#135 daemon-cache-slot follow-up into the current stack: LIST_TARGET_CACHE_SLOTS / LIST_CACHE_SLOTS now report slot count, active slot, and per-slot readiness/cur_pos/last_tok while respecting the active-slot RAII swap. Refresh the auto-integration manifest with current PR classifications and validation notes.
easel
pushed a commit
to easel/lucebox-hub
that referenced
this pull request
May 31, 2026
Port a narrow Luce-Org#135 control-plane slice into the current test_dflash daemon path. Add bounded request bookkeeping for REQ/REQUEST-prefixed calls plus LIST_REQUESTS and CANCEL command scaffolding that records state without enabling live scheduler mutation. Refresh the auto-integration manifest with current PR classifications, probe results, delegation evidence, and local validation.
easel
pushed a commit
to easel/lucebox-hub
that referenced
this pull request
May 31, 2026
Continue the Luce-Org#135 selective-port stack with diagnostic-only SCHED_STEP and SCHED_DRAIN daemon commands. They report request counts and active/per-slot target-cache state without mutating live scheduler state. Refresh the auto-integration manifest and record the latest Luce-Org#285 head merge.
easel
pushed a commit
to easel/lucebox-hub
that referenced
this pull request
May 31, 2026
Port a diagnostic-only slice from PR Luce-Org#135 into test_dflash: an opt-in aligned scheduler bucket selftest that uses local structs and does not mutate daemon scheduling state.
Refresh the auto-integration manifest with current PR classification, probe results, delegation notes, retained worktrees, and validation outcomes.
easel
pushed a commit
to easel/lucebox-hub
that referenced
this pull request
May 31, 2026
Port a narrow Luce-Org#135 diagnostic-only slice into the integration stack. SCHED_BATCH_PEEK inspects existing daemon request/cache-slot state and applies the already-carried aligned-bucket selector without mutating scheduler or cache state. Refresh the auto-integration manifest with current probe results.
easel
pushed a commit
to easel/lucebox-hub
that referenced
this pull request
Jun 1, 2026
Record the 2026-06-01 unattended PR integration pass, updated PR Luce-Org#285 head containment, current selective-port conflict counts, and delegated review conclusions for the remaining Luce-Org#321/Luce-Org#325/Luce-Org#135 slices.
easel
pushed a commit
to easel/lucebox-hub
that referenced
this pull request
Jun 1, 2026
Port a narrow PR Luce-Org#135 diagnostic-only daemon command that inspects current request/cache-slot state and reports aligned batch readiness without graph execution, cache copyback, or request-state mutation. Refresh auto-integration metadata and validation notes.
easel
pushed a commit
to easel/lucebox-hub
that referenced
this pull request
Jun 1, 2026
Selective-port a narrow PR Luce-Org#135 cache-reset slice identified by the latest conflicted probe. reset_target_cache now clears TargetCache::last_tok with cur_pos so reused caches cannot retain a stale decode seed.
easel
pushed a commit
to easel/lucebox-hub
that referenced
this pull request
Jun 1, 2026
Record the 2026-06-01 10:14 unattended refresh, direct-merge probes, delegated PR Luce-Org#135/Luce-Org#237 attempts, the promoted Luce-Org#135 cache-reset slice, and validation outcomes.
easel
pushed a commit
to easel/lucebox-hub
that referenced
this pull request
Jun 1, 2026
Refresh the unattended auto-integration manifest after the 2026-06-01 10:34 run. No contributor PR head advanced; direct probes still leave Luce-Org#305, Luce-Org#237, Luce-Org#221, Luce-Org#154, Luce-Org#153, and Luce-Org#135 as selective-port/runtime-validation candidates.
easel
pushed a commit
to easel/lucebox-hub
that referenced
this pull request
Jun 1, 2026
Promote a tiny PR Luce-Org#135 selective-port slice: daemon-mode DFlash snapshot bookkeeping now records the committed generation boundary rather than prompt-plus-output vector length.
Refresh the auto-integration manifest with the latest PR classifications, fresh conflict probes, and the tmux-driven Codex feasibility result.
easel
pushed a commit
to easel/lucebox-hub
that referenced
this pull request
Jun 1, 2026
Record the 2026-06-01 16:07 EDT unattended refresh, exact PR-head containment, fresh probe conflict counts, and the Luce-Org#135 Claude feasibility attempt outcome.
easel
pushed a commit
to easel/lucebox-hub
that referenced
this pull request
Jun 1, 2026
Record the 2026-06-01 17:44 cron preflight, exact PR-head containment, fresh direct-merge probes for the six remaining selective-port candidates, and the stopped tmux Codex Luce-Org#135 feasibility attempt.
easel
pushed a commit
to easel/lucebox-hub
that referenced
this pull request
Jun 1, 2026
Promote a narrow PR Luce-Org#135 salvage slice identified by tmux-driven Codex. The harness can now synthesize prompt token vectors with --synthetic-prompt-tokens/--synthetic-token plus --n-gen/--out, which makes CUDA smoke runs easier without requiring a prompt_ids.bin fixture. Refresh auto-integration manifest with current PR classifications and probe/delegation results.
easel
pushed a commit
to easel/lucebox-hub
that referenced
this pull request
Jun 2, 2026
Record the 2026-06-01 20:00 unattended reconciliation pass, fresh conflict probes for the six remaining selective-port candidates, and the latest Luce-Org#135 Codex feasibility result.
easel
pushed a commit
to easel/lucebox-hub
that referenced
this pull request
Jun 2, 2026
Record the 2026-06-02 01:07 unattended run: no new non-draft PR heads advanced, direct-merge probes still conflict for Luce-Org#305/Luce-Org#237/Luce-Org#221/Luce-Org#154/Luce-Org#153/Luce-Org#135, and a tmux Codex Luce-Org#221 pass found only the already-represented gguf_metadata header slice.
easel
pushed a commit
to easel/lucebox-hub
that referenced
this pull request
Jun 2, 2026
Promote a small PR Luce-Org#135 salvage slice in the current Qwen35 target-cache layout: initial rollback-cache SSM intermediates now use F16, matching migrate_prefill_cache and the existing typed rollback readers. Refresh the auto-integration manifest with the probe/delegation evidence.
easel
pushed a commit
to easel/lucebox-hub
that referenced
this pull request
Jun 2, 2026
Record the latest open-PR containment check and direct-merge probes. Document the Luce-Org#135 Codex feasibility pass, which found no remaining safe slice after the prior rollback-cache salvage.
easel
pushed a commit
to easel/lucebox-hub
that referenced
this pull request
Jun 2, 2026
Port a narrow safe cleanup slice identified from PR Luce-Org#135: if the daemon test lazily loaded the pFlash drafter and did not receive an explicit free command, release it before final graph/cache teardown.
Refresh the auto-integration manifest with the current PR classification, direct-merge conflict counts, and the tmux-driven Codex feasibility result.
easel
pushed a commit
to easel/lucebox-hub
that referenced
this pull request
Jun 2, 2026
Record the 2026-06-02 07:27 cron refresh after merging origin/main a81128b into the auto-integration stack. Reconfirm open PR accounting, retained conflicted candidates, direct-merge probe counts, and the Codex no-safe-slice review for PR Luce-Org#135.
easel
pushed a commit
to easel/lucebox-hub
that referenced
this pull request
Jun 2, 2026
Record the 10:03 cron pass: current refs, open PR accounting, fresh conflict probes for the six remaining selective-port candidates, and a tmux-driven Codex NO_SAFE_SLICE review for PR Luce-Org#135. No source changes were promoted.
Summary
Brings concurrent multi-request execution to `test_dflash` on a
single GPU. Internally it is one cohesive unit; I am happy to split it
into four sequential PRs (A / B / C / D below) if you prefer, per
CONTRIBUTING's "one concern per PR". Let me know and I'll
re-open as a chain. I kept it bundled because the four pieces
share the same hunks of `test_dflash.cpp` (~+2130 lines) and
splitting cleanly would require careful hunk surgery; doing it on
request is fine.
Pieces in this PR
A. Multi `TargetCache` slots
- CLI: `--target-cache-slots=N` (alias `--cache-slots=N`)
- prefix `SLOT <id>` routes commands to a specific slot
- `DaemonSlotState` + RAII `ActiveDaemonSlot` for safe switching
- `LIST_TARGET_CACHE_SLOTS` for introspection
- all slots share target/draft weights; only KV/SSM/scratch is per-slot
- `create_target_cache` gains an `n_seqs` parameter so a single cache can be allocated batched up front
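B. Tagged stream protocol (opt-in)
- `--stream-tagged` emits frames `[-2, request_id, token]` instead of bare int32 tokens; sentinels `-4` (CONTINUE), `-1` (DONE)
- parser recognises `REQ <id>` / `REQUEST <id>` headers
- legacy bare-int32 streaming is unchanged when the flag is off
- this lets a client demux multiple concurrent requests over the same stdout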
C. Native quantum scheduler
- dispatch table for `REQ/SLOT/START`, `SCHED_STEP`, `SCHED_DRAIN`, `LIST_REQUESTS`
- cursor-based fair round-robin between admitted requests
- non-blocking reader thread admits new requests during a drain
- `PendingQuantum{slot, req, epoch, n_gen}` carries the unit of work
- `CONTINUE` / `CONT` resumes a slot without re-prefilling
- `REQ <id> CANCEL` invalidates a request and bumps the slot epoch so a stale `CONTINUE` is rejected; `RESTORE_CHAIN` and legacy `generate` refuse to overwrite a slot that is owned by an active scheduler request
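
The epoch hardening described in piece C can be modelled in a few lines. This is a sketch under assumptions, not the PR's code: the `PendingQuantum` field names come from the description, but their types, the `SlotOwner` struct and the helper functions are hypothetical. A quantum remembers the slot epoch it was issued under; `CANCEL` bumps the epoch, so a `CONTINUE` that still carries the old epoch no longer matches.

```cpp
#include <cassert>
#include <cstdint>

// Field names from the PR description; the types are assumptions.
struct PendingQuantum {
    int slot;
    int req;
    std::uint64_t epoch;
    int n_gen;
};

// Hypothetical ownership record kept per slot.
struct SlotOwner {
    int owner_req = -1;       // -1 means no scheduler request owns the slot
    std::uint64_t epoch = 0;  // bumped on every cancel
};

// A request takes ownership of a free slot.
inline void start(SlotOwner & s, int req) {
    assert(s.owner_req < 0 && "slot already owned by a scheduler request");
    s.owner_req = req;
}

// CANCEL: release the slot and invalidate every quantum issued so far.
inline void cancel(SlotOwner & s) {
    s.owner_req = -1;
    ++s.epoch;
}

// CONTINUE is accepted only for the current owner and the current epoch.
inline bool can_continue(const SlotOwner & s, const PendingQuantum & q) {
    return s.owner_req == q.req && s.epoch == q.epoch;
}

// Guard for RESTORE_CHAIN / legacy generate: only unowned slots may be overwritten.
inline bool can_overwrite(const SlotOwner & s) {
    return s.owner_req < 0;
}
```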
D. Fused batched target step (CUDA path)
- new commands: `SCHED_BATCH_PEEK`, `SCHED_BATCH_PROBE`, `SCHED_BATCH_TARGET_TAIL`, `SCHED_BATCH_TARGET_STEP`, `SCHED_BATCH_DRAIN`
- `QwenGraphInputs` gains `n_seqs`; `build_delta_net_block` accepts `n_seqs > 1`
- `target_feat` is allocated as `[5*hidden, target_feat_cap, n_seqs]` when batched and the chain forwards capture features per-seq
- rollback for partially accepted draft tokens, multi-token verify and parent-id propagation in the batched path are noted as follow-ups; today the batched step accepts the cleanest case and falls back to single-seq when needed
Validation
Per CONTRIBUTING ("benchmark before and after on the same hardware,
same warmup"). Single GPU1 RTX 6000 Ada (sm_89), Heretic Q4_K_M
target, Q8 GGUF or FP16 safetensors drafter, `FA_WINDOW=0`, KV `q4_0/q4_0`:
- `REQ 4 START SLOT 0 quantum=2` + `REQ 5 START SLOT 1 quantum=2`, then `SCHED_DRAIN`: both close clean; slot 0: 18.41 tok/s, slot 1: 22.50 tok/s
- mid-drain admission of `REQ 6`: succeeds
- `CONTINUE` on slot 0 resumes without re-prefill
- `batch_probe_compare_ok` over a 2-seq probe: mismatches=0
- `batch_tail_commit` (2 completed pending quanta): 29.26 ms
- `batch_step_commit` followed by `SCHED_DRAIN`: 29.57 ms, then reverts cleanly to the DFlash single-seq path
Methodology: warmup of 1 request before measurement; same `--budget`
and KV-quant settings across runs; nothing else competing on the GPU
during the measurement window.
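Compatibility
- All new behaviour is opt-in. Default invocation of `test_dflash` with no scheduler flags keeps the legacy single-request path byte-identical.
- Tagged stream is gated behind `--stream-tagged`.
- Multi-slot is gated behind `--target-cache-slots=N` (default `N=1`).
- Batched target step is reached only via the `SCHED_BATCH_*` command family; legacy `SCHED_STEP` keeps using the single-seq path.
- Hot-loop diagnostic logs (`sync_us` / `step_debug`) are gated behind `DFLASH27B_TIMING_DEBUG` / `DFLASH27B_STEP_DEBUG` so the default path is unchanged.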
Verification vs existing community PRs
- No prior art in lucebox-hub for the `SCHED_BATCH_*` protocol or for a native C++ quantum scheduler with REQ/SLOT/CONTINUE/CANCEL + epoch hardening. Checked against:
  - PR #39 (CUDA graph reuse) [...] is single-seq.
  - PR #62 (split target/draft StepGraphs [...] per spec-decode step): splits but stays single-seq.
- No upstream collision found for tagged stream framing or `--target-cache-slots`.
Notes
- [...] that drifted from `main`. If a hunk fails to apply on a fresh rebase or you spot anything off, ping me and I'll fix on the spot rather than push through.
- [...] `gguf_draft_loader` fallback, FP16 safetensors drafter, daemon scripts improvements, SWA mask wiring + contract test, PFlash operator notes) are sitting on https://github.com/javierpazo/lucebox-hub. Holding off on opening those until this one is in a known state.
Javier Pazó — @xabicasa — xabicasa@gmail.com