feat(dflash): native multi-request scheduler with batched target step by javierpazo · Pull Request #135 · Luce-Org/lucebox-hub

javierpazo · 2026-05-09T09:50:09Z

Summary

Brings concurrent multi-request execution to test_dflash on a
single GPU. Internally one cohesive unit; happy to split into
four sequential PRs (A / B / C / D below) if you prefer per
CONTRIBUTING's "one concern per PR" — let me know and I'll
re-open as a chain. I kept it bundled because the four pieces
share the same hunks of test_dflash.cpp (~+2130 lines) and
splitting cleanly would require careful hunk surgery; doing it on
request is fine.

Pieces in this PR

A. Multi `TargetCache` slots

- CLI: `--target-cache-slots=N` (alias `--cache-slots=N`)
- prefix `SLOT <id>` routes commands to a specific slot
- DaemonSlotState + RAII ActiveDaemonSlot for safe switching
- LIST_TARGET_CACHE_SLOTS for introspection
- all slots share target/draft weights; only KV / SSM / scratch is
  per-slot
- create_target_cache gains an `n_seqs` parameter so a single
  cache can be allocated batched up front
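
A minimal sketch of the RAII slot switch described above, for readers who want the shape of the idea without opening the diff. Only the names `DaemonSlotState` and `ActiveDaemonSlot` come from this PR; the `Daemon` struct, the fields, and the restore-on-exit behaviour are assumptions made for illustration and may differ from what test_dflash.cpp does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-slot state: only what differs between slots.
struct DaemonSlotState {
    int     cur_pos  = 0;   // tokens committed to this slot's cache
    int32_t last_tok = -1;  // last sampled token, seed for the next step
};

// Hypothetical daemon: shared target/draft weights would live here, once.
struct Daemon {
    std::vector<DaemonSlotState> slots;
    std::size_t active = 0;
    explicit Daemon(std::size_t n_slots) : slots(n_slots) {}
    DaemonSlotState & current() { return slots[active]; }
};

// RAII guard: select a slot for one command, restore the previous one on exit.
class ActiveDaemonSlot {
public:
    ActiveDaemonSlot(Daemon & d, std::size_t slot) : d_(d), prev_(d.active) {
        assert(slot < d.slots.size());
        d_.active = slot;
    }
    ~ActiveDaemonSlot() { d_.active = prev_; }
    ActiveDaemonSlot(const ActiveDaemonSlot &) = delete;
    ActiveDaemonSlot & operator=(const ActiveDaemonSlot &) = delete;
private:
    Daemon &    d_;
    std::size_t prev_;
};
```

The point of the guard is that a `SLOT <id>`-prefixed command cannot leave the daemon pointing at the wrong slot, even if the handler returns early.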

B. Tagged stream protocol (opt-in)

- `--stream-tagged` emits frames `[-2, request_id, token]` instead
  of bare int32 tokens; sentinels `-4` (CONTINUE), `-1` (DONE)
- parser recognises `REQ <id>` / `REQUEST <id>` headers
- legacy bare-int32 streaming is unchanged when the flag is off
- lets a client demux multiple concurrent requests over the same
  stdout
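
To make the framing concrete, here is a rough client-side demux sketch. It assumes the stream has already been read into a vector of int32 words and that token frames are exactly the triple `[-2, request_id, token]` quoted above. How the `-4` / `-1` sentinels are framed in tagged mode is not spelled out here, so the sketch leaves every non-frame word uninterpreted in `other`. `Demuxed` and `demux_tagged` are hypothetical names, not part of the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

constexpr int32_t kTokenTag = -2;  // frame marker quoted in the PR description

struct Demuxed {
    std::map<int32_t, std::vector<int32_t>> tokens;  // request_id -> tokens
    std::vector<int32_t> other;  // words outside token frames, not interpreted
};

// Split a tagged int32 stream into per-request token lists.
inline Demuxed demux_tagged(const std::vector<int32_t> & words) {
    Demuxed out;
    std::size_t i = 0;
    while (i < words.size()) {
        if (words[i] == kTokenTag && i + 2 < words.size()) {
            // Complete frame: [-2, request_id, token]
            out.tokens[words[i + 1]].push_back(words[i + 2]);
            i += 3;
        } else {
            // Sentinels or a truncated trailing frame end up here.
            out.other.push_back(words[i]);
            i += 1;
        }
    }
    return out;
}
```

A real client reading from a pipe would buffer a partial trailing frame instead of dropping it into `other`; that is omitted to keep the sketch short.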

C. Native quantum scheduler

- dispatch table for REQ/SLOT/START, SCHED_STEP,
  SCHED_DRAIN, LIST_REQUESTS
- cursor-based fair round-robin between admitted requests
- non-blocking reader thread admits new requests during a drain
- PendingQuantum{slot, req, epoch, n_gen} carries the unit of
  work
- CONTINUE / CONT resumes a slot without re-prefilling
- `REQ <id> CANCEL` invalidates a request and bumps the slot
  epoch so a stale CONTINUE is rejected; RESTORE_CHAIN and
  legacy generate refuse to overwrite a slot that is owned by
  an active scheduler request
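
The round-robin cursor and the epoch check can be sketched together as below. The `PendingQuantum` field names are taken from the list above; the field types, the `Request` struct, and the `QuantumScheduler` class are assumptions for illustration, and the real scheduler also has to deal with the reader thread and actual decode work.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Field names follow the PR text; the types are assumptions.
struct PendingQuantum {
    int      slot;
    int      req;
    uint64_t epoch;  // slot epoch when the quantum was issued
    int      n_gen;  // tokens to generate in this quantum
};

struct Request {  // hypothetical bookkeeping record
    int  id;
    int  slot;
    int  quantum;
    bool active = true;
};

class QuantumScheduler {  // hypothetical name
public:
    explicit QuantumScheduler(std::size_t n_slots) : epochs_(n_slots, 0) {}

    void admit(int id, int slot, int quantum) {
        assert(slot >= 0 && static_cast<std::size_t>(slot) < epochs_.size());
        reqs_.push_back({id, slot, quantum});
    }

    // Pick the next active request at or after the cursor; false when none remain.
    bool next(PendingQuantum & out) {
        for (std::size_t k = 0; k < reqs_.size(); ++k) {
            std::size_t i = (cursor_ + k) % reqs_.size();
            if (!reqs_[i].active) continue;
            cursor_ = (i + 1) % reqs_.size();  // next call starts after this one
            out = {reqs_[i].slot, reqs_[i].id, epochs_[reqs_[i].slot], reqs_[i].quantum};
            return true;
        }
        return false;
    }

    // Cancel invalidates the request and bumps its slot epoch.
    void cancel(int id) {
        for (Request & r : reqs_) {
            if (r.id == id && r.active) {
                r.active = false;
                ++epochs_[r.slot];
            }
        }
    }

    // A quantum (or a CONTINUE carrying its epoch) is stale after a cancel.
    bool is_current(const PendingQuantum & q) const {
        return q.epoch == epochs_[q.slot];
    }

private:
    std::vector<Request>  reqs_;
    std::vector<uint64_t> epochs_;  // one epoch counter per slot
    std::size_t           cursor_ = 0;
};
```

With requests 4 and 5 admitted on slots 0 and 1, successive `next` calls alternate 4, 5, 4, and after `cancel(4)` any quantum issued earlier for slot 0 fails `is_current`.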

D. Fused batched target step (CUDA path)

- new commands: SCHED_BATCH_PEEK, SCHED_BATCH_PROBE,
  SCHED_BATCH_TARGET_TAIL, SCHED_BATCH_TARGET_STEP,
  SCHED_BATCH_DRAIN
- QwenGraphInputs gains `n_seqs`; build_delta_net_block
  accepts n_seqs > 1
- target_feat is allocated as [5*hidden, target_feat_cap, n_seqs] when
  batched and the chain forwards capture features per-seq
- rollback for partially accepted draft tokens, multi-token verify
  and parent-id propagation in the batched path are noted as
  follow-ups; today the batched step accepts the cleanest case
  and falls back to single-seq when needed
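
As a rough illustration of what the extra `n_seqs` dimension on target_feat means for addressing, the sketch below computes flat element offsets for a contiguous `[5*hidden, target_feat_cap, n_seqs]` buffer. It assumes the ggml convention that the first dimension varies fastest and ignores any ring-buffer wrapping of positions; `TargetFeatShape` and `feat_offset` are hypothetical helpers, not functions from the PR.

```cpp
#include <cassert>
#include <cstddef>

// Shape [5*hidden, target_feat_cap, n_seqs], dimension 0 contiguous (assumed).
struct TargetFeatShape {
    std::size_t hidden;
    std::size_t cap;     // target_feat_cap
    std::size_t n_seqs;
    std::size_t row() const     { return 5 * hidden; }        // elements per position
    std::size_t per_seq() const { return row() * cap; }       // elements per sequence
    std::size_t total() const   { return per_seq() * n_seqs; }
};

// Flat element offset of channel `ch` at position `pos` in sequence `seq`.
inline std::size_t feat_offset(const TargetFeatShape & s,
                               std::size_t seq, std::size_t pos, std::size_t ch) {
    assert(seq < s.n_seqs && pos < s.cap && ch < s.row());
    return ch + s.row() * (pos + s.cap * seq);
}
```

Under these assumptions each sequence owns a disjoint block of `per_seq()` elements, and with `n_seqs = 1` the offsets reduce to the single-sequence layout.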

Validation

Per CONTRIBUTING ("benchmark before and after on the same hardware,
same warmup"). Single GPU1 RTX 6000 Ada (sm_89), Heretic Q4_K_M
target, Q8 GGUF or FP16 safetensors drafter, FA_WINDOW=0, KV
q4_0/q4_0:

| Scenario | Result |
| --- | --- |
| Two concurrent requests, `REQ 4 START SLOT 0 quantum=2` + `REQ 5 START SLOT 1 quantum=2`, then `SCHED_DRAIN` | closes both clean; slot 0 = 18.41 tok/s, slot 1 = 22.50 tok/s |
| Mid-drain admission of `REQ 6` | succeeds; `CONTINUE` on slot 0 resumes without re-prefill |
| `batch_probe_compare_ok` over a 2-seq probe | mismatches = 0 vs the single-seq path |
| `batch_tail_commit` (2 completed pending quanta) | 29.26 ms |
| `batch_step_commit` followed by `SCHED_DRAIN` | 29.57 ms, then reverts cleanly back to the DFlash single-seq path |

Methodology: warmup of 1 request before measurement; same --budget
and KV-quant settings across runs; nothing else competing on the GPU
during the measurement window.

Compatibility

- All new behaviour is opt-in. Default invocation of test_dflash
  with no scheduler flags keeps the legacy single-request path
  byte-identical.
- Tagged stream gated behind --stream-tagged.
- Multi-slot gated behind --target-cache-slots=N (default N=1).
- Batched target step reached only via the SCHED_BATCH_* command
  family; legacy SCHED_STEP keeps using the single-seq path.
- Hot-loop diagnostic logs (sync_us / step_debug) are gated
  behind DFLASH27B_TIMING_DEBUG / DFLASH27B_STEP_DEBUG so the
  default path is unchanged.

Verification vs existing community PRs

No prior art in lucebox-hub for the SCHED_BATCH_* protocol
or for a native C++ quantum scheduler with
REQ/SLOT/CONTINUE/CANCEL + epoch hardening. Checked against:
- PR feat(dflash): MoE 35B-A3B support + DDTree CUDA graph reuse #39 (CUDA graph reuse, MoE 35B-A3B + DDTree) — graph reuse
  is single-seq.
- PR dflash: split target/draft StepGraphs to fix ggml_gallocr realloc per spec-decode step (issue #55) #62 (split target/draft StepGraphs to fix gallocr realloc
  per spec-decode step) — splits but stays single-seq.
No upstream collision found for tagged stream framing or
--target-cache-slots.

Notes

- Diff size warning: this branch was extracted from a working tree
  that drifted from main. If a hunk fails to apply on a fresh
  rebase or you spot anything off, ping me and I'll fix on the
  spot rather than push through.
- Companion branches with smaller follow-ups (CMake sm_89 / BSA,
  gguf_draft_loader fallback, FP16 safetensors drafter, daemon
  scripts improvements, SWA mask wiring + contract test, PFlash
  operator notes) are sitting on
  https://github.com/javierpazo/lucebox-hub. Holding off on
  opening those until this one is in a known state.

Javier Pazó — @xabicasa — xabicasa@gmail.com

davide221 · 2026-05-09T19:22:23Z

Amazing contribution @javierpazo, thank you! Can you resolve the conflicts?

This change brings concurrent multi-request execution to test_dflash on a single GPU. It is internally one cohesive unit but can be split into four conceptual pieces if a smaller review is preferred:

1. Multi TargetCache slots
   - CLI: --target-cache-slots=N (alias --cache-slots=N)
   - prefix `SLOT <id>` routes commands to a specific slot
   - DaemonSlotState + RAII ActiveDaemonSlot for safe switching
   - LIST_TARGET_CACHE_SLOTS for introspection
   - all slots share target/draft weights; only KV/SSM/scratch is per-slot
   - create_target_cache gains an `n_seqs` parameter so a single cache can be allocated batched up front

2. Tagged stream protocol (opt-in)
   - --stream-tagged emits frames `[-2, request_id, token]` instead of bare int32 tokens; sentinels `-4` (CONTINUE), `-1` (DONE)
   - parser recognises `REQ <id>` / `REQUEST <id>` headers
   - legacy bare-int32 streaming is unchanged when the flag is off
   - this lets a client demux multiple concurrent requests over the same stdout

3. Native quantum scheduler
   - dispatch table for REQ/SLOT/START, SCHED_STEP, SCHED_DRAIN, LIST_REQUESTS
   - cursor-based fair round-robin between admitted requests
   - non-blocking reader thread admits new requests during a drain
   - PendingQuantum{slot, req, epoch, n_gen} carries the unit of work
   - CONTINUE / CONT resumes a slot without re-prefilling
   - REQ <id> CANCEL invalidates a request and bumps the slot epoch so a stale CONTINUE is rejected; RESTORE_CHAIN / legacy generate refuse to overwrite a slot that is owned by an active scheduler request

4. Fused batched target step (CUDA path)
   - new commands: SCHED_BATCH_PEEK, SCHED_BATCH_PROBE, SCHED_BATCH_TARGET_TAIL, SCHED_BATCH_TARGET_STEP, SCHED_BATCH_DRAIN
   - QwenGraphInputs gains `n_seqs`; build_delta_net_block accepts n_seqs > 1
   - target_feat is allocated as [5*hidden, target_feat_cap, n_seqs] when batched and the chain forwards capture features per-seq
   - batch_probe_compare_ok smoke shows mismatches=0 vs the single-seq path; SCHED_BATCH_TARGET_TAIL commits two completed pending quanta in 29.26 ms; SCHED_BATCH_TARGET_STEP commits the next batched step in 29.57 ms; SCHED_BATCH_DRAIN completes req12/req13 with two batched steps each
   - rollback for partially accepted draft tokens, multi-token verify and parent-id propagation in the batched path are noted as follow-ups; today the batched step accepts the cleanest case and falls back to single-seq when needed

Validation (single GPU1 RTX 6000 Ada sm_89, Heretic Q4_K_M target + Q8 GGUF or FP16 safetensors drafter, FA_WINDOW=0, KV q4_0/q4_0):
- Two concurrent requests:
    REQ 4 START SLOT 0 quantum=2
    REQ 5 START SLOT 1 quantum=2
    SCHED_DRAIN
  closes both clean. slot 0: 18.41 tok/s, slot 1: 22.50 tok/s
- Mid-drain admission of REQ 6 succeeds; CONTINUE on slot 0 resumes without re-prefill.
- batch_probe_compare_ok mismatches=0 over a 2-seq probe.
- batch_tail_commit count=2 ms=29.26.
- batch_step_commit ms=29.57 followed by SCHED_DRAIN reverts cleanly back to the DFlash single-seq path.

Compatibility:
- All new behaviour is opt-in. Default invocation of test_dflash with no scheduler flags keeps the legacy single-request path.
- Tagged stream is gated behind --stream-tagged.
- Multi-slot is gated behind --target-cache-slots=N (default N=1).
- Batched target step is reached only via the SCHED_BATCH_* command family; legacy SCHED_STEP keeps using the single-seq path.
- Hot-loop diagnostic logs (sync_us / step_debug) are now gated behind DFLASH27B_TIMING_DEBUG / DFLASH27B_STEP_DEBUG so the default path is unchanged.

Verification vs existing community PRs:
- No prior art in lucebox-hub for the SCHED_BATCH_* protocol or for a native C++ quantum scheduler with REQ/SLOT/CONTINUE/CANCEL + epoch hardening. Checked against PR Luce-Org#39 (CUDA graph reuse) and PR Luce-Org#62 (split target/draft StepGraphs); both reuse / split graphs but neither exposes a multi-request slot protocol.
- No upstream collision found for tagged stream framing or --target-cache-slots.

Happy to split this into four sequential PRs (slots / tagged stream / quantum scheduler / batched target step) if a smaller-grained review is preferred; let me know.

Author: Javier Pazo <xabicasa@gmail.com>

javierpazo · 2026-05-10T19:24:06Z

@davide221 thanks! Rebased on top of fresh main, conflicts resolved. The two collisions were tiny (a comment in internal.h::DraftLayer::is_swa and the [cfg] log line in test_dflash.cpp); merged the two log lines so both the upstream draft_swa / draft_ctx_max / target_split_dflash flags and this PR's target_cache_slots / stream_tagged show up. The big ~900-line block from upstream's new target-split path is preserved untouched.

Revalidate PRs Luce-Org#137 and Luce-Org#135 in isolated worktrees, record conflict shape and current-layout recommendations. Upstream main remains unchanged at 4f4d82e.

Record a fresh PR Luce-Org#135 conflict probe and tmux-driven Codex feasibility report. The stack remains aligned with origin/main; no code changes were integrated.

Record the 2026-05-28 20:23 cron revalidation: upstream and carried PR heads remain current, draft Luce-Org#304 is excluded, fresh conflicted-PR probes were retained, and a tmux-driven Codex inspection keeps Luce-Org#135 as a designed current-layout port instead of a mechanical merge.

Record the 2026-05-28 23:28 cron pass, repeated conflict probes, and the Codex salvage assessment for PR Luce-Org#135. No source stack rewrite was needed because origin/main and carried non-draft PR heads were already current.

Record the 2026-05-29 00:41 cron preflight, open PR classification, repeat worktree merge probes, and fresh delegated PR Luce-Org#135 attempts. No source stack rewrite was needed because origin/main and carried mergeable PR heads were already included.

Record the 2026-05-29 01:51 cron preflight, fresh PR probes, and the new Luce-Org#135 Claude/Codex delegation results. No PR stack code changes were needed because all safe non-draft heads are already ancestors of easel/auto-integration.

Record the 2026-05-29 02:51 cron refresh, current PR classifications, fresh conflict probes, and tmux delegation outcomes for PRs Luce-Org#237 and Luce-Org#135.

Record the 2026-05-29 03:22 cron refresh, fresh conflict probes for remaining non-draft PRs, and a tmux-driven Codex feasibility review for PR Luce-Org#135.

Record the 2026-05-29 06:29 unattended refresh, including fresh conflict probes for the remaining non-draft selective-port candidates and an inconclusive Luce-Org#135 delegated review attempt.

Record the 2026-05-29 07:22 EDT unattended refresh, including current PR containment, fresh conflict probes, and the Codex feasibility report for PR Luce-Org#135. No contributor code changed in this refresh.

Record the 2026-05-29 09:42 cron preflight, fresh conflicted-PR probes, and tmux-driven Claude/Codex delegation results for PRs Luce-Org#237 and Luce-Org#135.

Record the latest cron preflight, direct conflict probes, and the new Codex salvage report for PR Luce-Org#135. No contributor code changed; all cleanly mergeable non-draft PR heads remain contained in the stack.

Record the 2026-05-29 13:12 EDT reconciliation pass: all mergeable non-draft PR heads remain included, remaining old-layout probes still conflict, Claude Luce-Org#237 again hit its turn limit, and Codex Luce-Org#135 produced a usable selective-port target list.

Record the 2026-05-29 19:20 unattended probe run, including fresh direct-merge conflict probes for unresolved old-layout PRs and a tmux Codex feasibility report for PR Luce-Org#135.

Record the 2026-05-30 02:26 cron preflight, fresh direct-merge probes for remaining old-layout PRs, and the tmux Codex Luce-Org#135 feasibility report. No source code changed.

Record the 2026-05-30 04:18 unattended integration pass, exact PR containment checks, renewed conflict probes, and the Codex Luce-Org#135 selective-port assessment.

Record the 2026-05-30 05:29 unattended integration run, including refreshed PR containment, conflict probes, the read-only Luce-Org#135 Codex delegation attempt, verification commands, and retained worktree paths.

Record the 2026-05-30 17:51 cron preflight/probe pass, including fresh conflict probes for the remaining non-integrated PRs and a tmux/Codex feasibility audit for PR Luce-Org#135.

Record 2026-05-31 04:10 cron reconciliation: no new PR heads, fresh conflict probes for Luce-Org#305/Luce-Org#237/Luce-Org#221/Luce-Org#154/Luce-Org#153/Luce-Org#135, and failed read-only delegation attempts for Luce-Org#237.

Record the 2026-05-31 05:55 unattended refresh, fresh probes for the remaining non-ancestor PRs, and the completed Codex feasibility review for PR Luce-Org#135.

Port a minimal PR Luce-Org#135 salvage slice into the current qwen35 target graph. Adds n_seqs-aware prefill-only cache allocation and graph guards while preserving existing n_seqs=1 behavior. Full scheduler/copyback commands remain unported pending build/runtime validation.

Record the 2026-05-31 06:49 cron reconciliation, Luce-Org#135 delegated selective-port attempt, validation outcomes, retained worktrees, and remaining PR classifications.

Selectively ports a small PR Luce-Org#135 request-framing slice onto the current server/test_dflash harness. Legacy stream output remains unchanged unless --stream-tagged is enabled.

Port a narrow PR Luce-Org#135 slice onto the current qwen35 target graph. Batched prefill caches now allocate target_feat with a sequence dimension, the batched graph accepts feature capture when cache dimensions match, and capture copies are emitted per sequence while preserving the existing single-sequence buffer layout.

Validation: git diff --check. Full CMake validation remains blocked locally by missing populated ggml headers under server/deps/llama.cpp and the known CUDA compiler-id sm_52 toolchain issue.

Record the 2026-05-31 09:41 cron preflight, current PR-head containment, fresh merge-probe conflict counts, and the no-edit Luce-Org#135 Claude/Codex review findings. No source code changed in this refresh.

Record the 2026-05-31 11:01 cron pass: exact PR-head containment stayed unchanged at 21 included non-draft PRs, six selective-port candidates remain, and a tmux-driven Codex attempt for the next PR Luce-Org#135 multi-cache-slot slice left the conflicted probe unresolved with no source changes promoted.

Port a narrow slice of PR Luce-Org#135 into the current stack: daemon cache-slot parsing, independent extra TargetCache state, graph/feature-mirror swapping, and cleanup handling. Refresh auto-integration manifest after merging advanced PR Luce-Org#285.

Port a narrow PR Luce-Org#135 daemon-cache-slot follow-up into the current stack: LIST_TARGET_CACHE_SLOTS / LIST_CACHE_SLOTS now report slot count, active slot, and per-slot readiness/cur_pos/last_tok while respecting the active-slot RAII swap. Refresh the auto-integration manifest with current PR classifications and validation notes.

Port a narrow Luce-Org#135 control-plane slice into the current test_dflash daemon path. Add bounded request bookkeeping for REQ/REQUEST-prefixed calls plus LIST_REQUESTS and CANCEL command scaffolding that records state without enabling live scheduler mutation. Refresh the auto-integration manifest with current PR classifications, probe results, delegation evidence, and local validation.

Continue the Luce-Org#135 selective-port stack with diagnostic-only SCHED_STEP and SCHED_DRAIN daemon commands. They report request counts and active/per-slot target-cache state without mutating live scheduler state. Refresh the auto-integration manifest and record the latest Luce-Org#285 head merge.

Port a diagnostic-only slice from PR Luce-Org#135 into test_dflash: an opt-in aligned scheduler bucket selftest that uses local structs and does not mutate daemon scheduling state.

Refresh the auto-integration manifest with current PR classification, probe results, delegation notes, retained worktrees, and validation outcomes.

Port a narrow Luce-Org#135 diagnostic-only slice into the integration stack. SCHED_BATCH_PEEK inspects existing daemon request/cache-slot state and applies the already-carried aligned-bucket selector without mutating scheduler or cache state. Refresh the auto-integration manifest with current probe results.

Record the 2026-06-01 unattended PR integration pass, updated PR Luce-Org#285 head containment, current selective-port conflict counts, and delegated review conclusions for the remaining Luce-Org#321/Luce-Org#325/Luce-Org#135 slices.

Port a narrow PR Luce-Org#135 diagnostic-only daemon command that inspects current request/cache-slot state and reports aligned batch readiness without graph execution, cache copyback, or request-state mutation. Refresh auto-integration metadata and validation notes.

Selective-port a narrow PR Luce-Org#135 cache-reset slice identified by the latest conflicted probe. reset_target_cache now clears TargetCache::last_tok with cur_pos so reused caches cannot retain a stale decode seed.

Record the 2026-06-01 10:14 unattended refresh, direct-merge probes, delegated PR Luce-Org#135/Luce-Org#237 attempts, the promoted Luce-Org#135 cache-reset slice, and validation outcomes.

Refresh the unattended auto-integration manifest after the 2026-06-01 10:34 run. No contributor PR head advanced; direct probes still leave Luce-Org#305, Luce-Org#237, Luce-Org#221, Luce-Org#154, Luce-Org#153, and Luce-Org#135 as selective-port/runtime-validation candidates.

Promote a tiny PR Luce-Org#135 selective-port slice: daemon-mode DFlash snapshot bookkeeping now records the committed generation boundary rather than prompt-plus-output vector length.

Refresh the auto-integration manifest with the latest PR classifications, fresh conflict probes, and the tmux-driven Codex feasibility result.

Record the 2026-06-01 16:07 EDT unattended refresh, exact PR-head containment, fresh probe conflict counts, and the Luce-Org#135 Claude feasibility attempt outcome.

Record the 2026-06-01 17:44 cron preflight, exact PR-head containment, fresh direct-merge probes for the six remaining selective-port candidates, and the stopped tmux Codex Luce-Org#135 feasibility attempt.

Promote a narrow PR Luce-Org#135 salvage slice identified by tmux-driven Codex. The harness can now synthesize prompt token vectors with --synthetic-prompt-tokens/--synthetic-token plus --n-gen/--out, which makes CUDA smoke runs easier without requiring a prompt_ids.bin fixture. Refresh auto-integration manifest with current PR classifications and probe/delegation results.

Record the 2026-06-01 20:00 unattended reconciliation pass, fresh conflict probes for the six remaining selective-port candidates, and the latest Luce-Org#135 Codex feasibility result.

Record the 2026-06-02 01:07 unattended run: no new non-draft PR heads advanced, direct-merge probes still conflict for Luce-Org#305/Luce-Org#237/Luce-Org#221/Luce-Org#154/Luce-Org#153/Luce-Org#135, and a tmux Codex Luce-Org#221 pass found only the already-represented gguf_metadata header slice.

Promote a small PR Luce-Org#135 salvage slice in the current Qwen35 target-cache layout: initial rollback-cache SSM intermediates now use F16, matching migrate_prefill_cache and the existing typed rollback readers. Refresh the auto-integration manifest with the probe/delegation evidence.

Record the latest open-PR containment check and direct-merge probes. Document the Luce-Org#135 Codex feasibility pass, which found no remaining safe slice after the prior rollback-cache salvage.

Port a narrow safe cleanup slice identified from PR Luce-Org#135: if the daemon test lazily loaded the pFlash drafter and did not receive an explicit free command, release it before final graph/cache teardown.

Refresh the auto-integration manifest with the current PR classification, direct-merge conflict counts, and the tmux-driven Codex feasibility result.

Record the 2026-06-02 07:27 cron refresh after merging origin/main a81128b into the auto-integration stack. Reconfirm open PR accounting, retained conflicted candidates, direct-merge probe counts, and the Codex no-safe-slice review for PR Luce-Org#135.

Record the 10:03 cron pass: current refs, open PR accounting, fresh conflict probes for the six remaining selective-port candidates, and a tmux-driven Codex NO_SAFE_SLICE review for PR Luce-Org#135. No source changes were promoted.

javierpazo changed the title ~~dflash: native multi-request scheduler with batched target step~~ feat(dflash): native multi-request scheduler with batched target step May 9, 2026

javierpazo mentioned this pull request May 9, 2026

feat(dflash): daemon scripts improvements (GPU split, Windows, defaults) #139

Closed

javierpazo force-pushed the xabicasa/dflash-multi-request-scheduler-batched-target-step branch from a28b609 to 561b0ac on May 10, 2026, 19:23

javierpazo mentioned this pull request May 11, 2026

feat(dflash): linear native MTP integrated decode CLI (stacked on #153) #154

Open

easel pushed a commit to easel/lucebox-hub that referenced this pull request May 27, 2026

docs: refresh auto-integration triage

e656ff0

Record the 2026-05-27 18:17 scheduled run, including fresh PR classification and targeted worktree/Codex probes for stale PRs Luce-Org#137 and Luce-Org#135.

easel pushed a commit to easel/lucebox-hub that referenced this pull request May 28, 2026

docs: refresh auto-integration PR 135 triage

13ae15c

Refresh unattended integration manifest after fresh direct merge probes and a Codex feasibility pass for PR Luce-Org#135. No source stack changes.

easel pushed a commit to easel/lucebox-hub that referenced this pull request May 29, 2026

docs: refresh auto-integration manifest

d318694

Record the 2026-05-29 01:00 EDT unattended integration run, fresh merge probes, and Codex feasibility output for PR Luce-Org#135.

easel pushed a commit to easel/lucebox-hub that referenced this pull request May 30, 2026

docs: refresh auto-integration manifest

00dea67

Record the 2026-05-30 06:06 unattended run, unchanged PR containment, fresh conflict probes, and the Codex Luce-Org#135 selective-port assessment.

easel pushed a commit to easel/lucebox-hub that referenced this pull request May 30, 2026

docs: refresh auto-integration manifest

d9bbc99

Record the 2026-05-30 07:24 cron pass, fresh conflict probes, and the Codex Luce-Org#135 selective-port plan. No PR stack code changes were needed.

javierpazo wants to merge 1 commit into Luce-Org:main from javierpazo:xabicasa/dflash-multi-request-scheduler-batched-target-step
