rocm: sync branch with main by 24601 · Pull Request #219 · antirez/ds4

24601 · 2026-05-21T18:38:48Z

Summary

This brings the rocm branch forward to current main (8d576642) while keeping the ROCm patch scoped to the existing community ROCm backend work.

Current state before this change:

rocm was based at 7a751eb
rocm was 85 commits behind main and 1 commit ahead

What changed:

merge current main into rocm with conflicts resolved
keep the ROCm backend support buildable against the current DS4 GPU code
add a make rocm target that builds the current main binaries through hipcc
keep HIP out of CUDA WMMA kernels and use the generic indexer-score fallback on ROCm
explicitly reject known wave64 GCN/CDNA ROCm targets for now, because DS4 GPU kernels still assume CUDA-style 32-lane warp math

This intentionally does not include the user-facing --rocm naming/MTP mapping work from #156 or the ROCm WMMA/indexer optimization work from #180.

Relation to #16

This is related to #16, but it should not close #16.

The goal here is to make the existing upstream rocm branch current and testable again on current main. It is not the later ROCm optimization work discussed in #16, and the speed numbers below show why this should stay draft until the remaining correctness/performance gaps are understood.

Scope notes

I checked prior ROCm-related PRs before keeping this narrowly scoped:

Add AMD ROCm/HIP Support and Strix Halo Optimizations #118 was broader than a branch refresh and overlapped with the existing rocm branch
rocm: rebased on top of the current main, removed unnecessary (previously introduced) changes to ds4_cuda.cu #133 attempted a prior ROCm branch update and surfaced remaining long-context risk
Fix ROCm build/runtime naming and MTP model mapping #156 is a separate CLI/runtime naming and MTP mapping fix
rocm: enable wmma indexer support #180 is a separate ROCm WMMA/indexer optimization effort

This PR is therefore only a branch sync plus the minimum build fixes needed for current main.

Validation environment

AMD Ryzen AI Max+ 395 / Radeon 8060S
ROCm target: gfx1151
HIP/ROCm toolchain: HIP 7.2.53211-364a905
Backend selected by DS4: cuda API path over ROCm/HIP
Model: official project q2-imatrix GGUF, DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf
Model size reported by DS4: 80.76 GiB, 284.33B logical parameters
PR head tested: e13de731787799310e53bffe98b1f077f7c96b50

Build and unit validation

Commands run:

make clean
make rocm ROCM_ARCH=gfx1151 -j$(nproc)
make ds4_test GPU_BACKEND=rocm ROCM_ARCH=gfx1151 -j$(nproc)
./ds4_test --server
make clean
make cpu -j$(nproc)

Results:

ROCm build: PASS
ROCm-linked ds4_test --server: PASS (server: OK, ds4 tests: ok)
CPU build: PASS

Negative target guard check:

make clean
make ds4_test GPU_BACKEND=rocm ROCM_ARCH=gfx90a -j$(nproc)

Result: expected compile failure with the new wave64 GCN/CDNA unsupported-target error.

Existing warnings observed during the builds:

ds4_server.c: const-discard warning in stop_list_find_from
ds4_agent.c: existing snprintf truncation warnings

Model-backed validation

Model load / inspect:

./ds4 --inspect --cuda -m ds4flash.gguf

Result: PASS. DS4 loaded the official q2-imatrix GGUF and initialized the ROCm/HIP backend.

Short deterministic generation:

./ds4 --cuda -m ds4flash.gguf --ctx 4096 --nothink --temp 0 -n 32 -p 'Reply exactly: ROCm OK'

Result: PASS. Output was ROCm OK.

Reported speed for this tiny prompt:

prefill: 0.46 t/s, generation: 8.96 t/s

Tool-call quality:

DS4_TEST_MODEL=ds4flash.gguf ./ds4_test --tool-call-quality

Result: PASS in both fast and exact paths.

Long-context story recall:

DS4_TEST_MODEL=ds4flash.gguf ./ds4_test --long-context

Result: PASS (long-context: OK, ds4 tests: ok).

Runtime note: this run took about one hour wall time on the validation machine. Progress markers reached 8192/30474, 16384/30474, 24576/30474, and 30474/30474; correctness passed, but throughput is not yet competitive with the best ROCm numbers discussed in #16.

Official logprob vectors:

DS4_TEST_MODEL=ds4flash.gguf ./ds4_test --logprob-vectors

Result: FAIL with one mismatch:

ds4-test: vector short_code_completion step 1 selected token mismatch
logprob-vectors: ERR
ds4 tests: 1 failure(s)

Manual dump for the failing prompt showed ROCm selecting uppercase C after the opening code fence while the fixture expects lowercase c:

step 1 selected 'C'
top candidates: 'C' logprob -0.226568833, 'c' logprob -1.59836829

The exact/quality ROCm path also selected uppercase C for the same prompt:

step 1 selected 'C'
top candidates: 'C' logprob -0.495503277, 'c' logprob -0.941311657

Speed smoke

Command:

./ds4-bench \
  -m ds4flash.gguf \
  --cuda \
  --prompt-file speed-bench/promessi_sposi.txt \
  --ctx-start 2048 \
  --ctx-max 8192 \
  --step-incr 2048 \
  --gen-tokens 128

Result:

ctx_tokens,prefill_tokens,prefill_tps,gen_tokens,gen_tps,kvcache_bytes
2048,2048,57.76,128,7.63,52184460
4096,2048,46.47,128,7.59,80373132
6144,2048,39.49,128,7.37,108561804
8192,2048,34.26,128,7.41,136750476

These are usable but below the better Strix Halo ROCm numbers reported in #16, which is expected for this branch-sync PR because it deliberately does not include the later ROCm optimization work.

Additional review

Read-only adversarial reviews were run with DeepSeek V4, Cursor Composer 2.5, and Gemini 3.5 Flash. They did not identify a branch-sync merge blocker. Their actionable feedback is reflected here:

added the explicit wave64 GCN/CDNA compile guard
documented the HIP shared-memory opt-in semantic difference in the shim
made the model-dependent test evidence and remaining gaps explicit in this PR body

Draft blockers / remaining gaps

./ds4_test --logprob-vectors does not fully pass on ROCm because of the short_code_completion C vs c mismatch.
ROCm performance is below the stronger issue Support for AMD GPU (ROCm/HIP) backend #16 benchmark reports.
CUDA regression testing on NVIDIA hardware was not run.
MTP/speculative decoding was not validated here; that is intentionally separate from this branch sync and overlaps with Fix ROCm build/runtime naming and MTP model mapping #156.
Full 65K benchmark sweep was not run; the speed smoke covers 2K-8K plus the passing 30K-token long-context correctness test.

Implements the Responses API endpoint that Codex CLI (and other modern OpenAI tooling) speaks instead of /v1/chat/completions. The wire format is documented in OpenAI's Responses API; this implementation has been iterated against the Codex CLI binary's SSE parser shape until no remaining schema gaps were found. Request parsing (parse_responses_request, parse_responses_input): - Accepts the typed input array (message, function_call, function_call_output, reasoning, custom_tool_call(_output), local_shell_call(_output), web_search_call(_output), tool_search_call(_output), image_generation_call(_output), compaction, context_compaction). - Maps hosted-tool history to function_call/function_call_output so prior actions survive across turns; rejects unknown item types and non-completed status with 400 to avoid silent context loss. - Strict content-array parsing: only string|null|array of recognized text blocks (input_text/output_text/text/summary_text/ reasoning_text); rejects non-text modalities (input_image/file/ audio) instead of accepting an empty prompt. - Merges adjacent function_call items into the preceding assistant message so text + tool-call turns render as a single assistant block. - Honors reasoning.effort (incl. "minimal"/"none") and gates reasoning summary surface on reasoning.summary opt-in. - Rejects previous_response_id, conversation, and forced tool_choice explicitly (constrained decoding / persisted state not supported). Output (responses_sse_*, responses_final_response): - Emits the full streaming lifecycle: response.created, output_item.added/.done, reasoning_summary_part.added/.done, reasoning_summary_text.delta/.done, content_part.added/.done, output_text.delta/.done, function_call_arguments.delta/.done, response.completed. - Branches the terminal event by finish reason: response.failed for errors and response.incomplete with reason "max_tokens" for length. - Every event carries sequence_number; every output_text part carries annotations:[]; function_call output_item.added ships with an empty arguments string (full args arrive via function_call_arguments.done and output_item.done), and item ids are stable across added/done. - Tracks whether </think> was actually observed so a truncated stream marks the reasoning item incomplete instead of "completed". - Recovers gracefully when the DSML tool parse fails after the model was suppressed at the tool marker: the suppressed tail is flushed as additional output_text deltas so the streamed message matches output_item.done. Tested by 25 rounds of /codex:adversarial-review against the same client this is meant to feed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Broaden the DS4 imatrix prompt dataset with provider-neutral agent/tool traffic, multi-language programming prompts, algorithm recall, Bash scripting, and multilingual translation tasks. Remove duplicate rendered prompts and avoid provider-specific client references in the generated calibration corpus. This improves calibration coverage without claiming to fix a distributed GGUF bug.

Fold the successful CUDA selector/top-k/indexed-attention changes into one clean commit. This excludes rejected experiment commits and the local prefill-slope work log.\n\nMeasured on GB10 with speed-bench/promessi_sposi.txt, 2048-token append chunks: 32K prefill improved from 255.61 tok/s on origin/main to 346.49 tok/s. Full-curve average improved from 316.39 tok/s to 369.76 tok/s. 32K full prompt + 128-token generation prefill improved from 312.87 tok/s to 368.43 tok/s, while generation stayed neutral at 12.49 -> 12.48 tok/s.\n\nCorrectness: make cuda-regression; ./ds4_test --logprob-vectors --tool-call-quality; ./ds4_test --server --metal-kernels.

Build score_official against the CUDA runtime on Linux and select the CUDA backend there, while keeping the existing Metal path on macOS.\n\nCorrectness: make -C gguf-tools quality-score; gguf-tools/quality-testing/score_official ds4flash.gguf /tmp/ds4_quality_smoke/manifest.tsv /tmp/ds4_quality_smoke/scores.tsv 16384.

Replace the default long-context continuation check with a deterministic prose-story retrieval test. The fixture embeds spelled-out person-number assignments in a long rendered prompt, and ds4_test now validates the generated Name=number list instead of brittle sampled prose.

Preserve Responses namespace metadata and tool_search calls while rendering DSML-safe internal tool names. Replay function_call, hosted tool, and tool_search_output items into the shared chat/tool path so Codex and Pi can round-trip tool calls without losing KV-cache prefix reuse. Document the /v1/responses endpoint and add server unit coverage for namespace, tool_search, and replay output shapes.

This reverts commit 2a7a5f3. There was no ack from the user. Don't want to take a fix that is astronautically produced from an unclear error trace.

Project sampled DSML tool calls to Anthropic SSE tool_use blocks while keeping raw DSML as the parser/cache source of truth. Reuse streamed tool ids for final parsed calls so tool_result continuation still matches live state.

Keep normal CUDA context buffers on device allocations, but route very large KV-cache tensors through managed memory so million-token contexts do not starve unified-memory systems during graph/session allocation. The fallback is scoped to the long-lived KV/cache tensors and logs when it is used because it may reduce performance. Tested on 0.180 with: - make cpu - make -B cuda-spark - make cuda-regression - ./ds4_test --server --metal-kernels - ./ds4_test --logprob-vectors --tool-call-quality - ds4-bench ctx-alloc 32768, 250000, and 1000000 - ds4-server --ctx 1000000 startup smoke (cherry picked from commit 0b248a65c07d21f2fc8ff4815bd8b75af26719f9)

Parse Anthropic tool_use blocks by their own type field instead of relying on the enclosing message role being parsed first. Some clients serialize messages as content-before-role, which made full-history tool_result replays look like unknown live-only continuations. Fixes antirez#127.

Return a 400 error with error type "context_exceeded" when prompt tokens exceed context size. The response includes both n_prompt_tokens and n_ctx fields so clients can determine exactly why the request failed and how far over the limit they went. Error response format: { "error": { "message": "Prompt tokens (N) exceeds context size (M)", "type": "context_exceeded", "n_prompt_tokens": N, "n_ctx": M } }

dwarfstar is typoed to drawfstar

fix typo in readme

Squashes Ivan Fioravanti's Metal4/M5 Neural Acceleration scaffold, benchmark tooling, drift diagnostics, eval trace regrading, and initial Tensor/MPP kernel work.

Squashes Salvatore Sanfilippo's Metal4/NAX speedups and cleanup work: direct-RHS dense Tensor kernels, routed MoE Tensor coverage, full-512 indexer preservation, NAX prefill indexer scores, indexed-attention half shadow cache, and removal of rejected experimental paths.

# Conflicts: # tests/ds4_test.c

Full routed-MoE TensorOps enabled the gate, up, and down projections. The regression was isolated to the gate projection: enabling TensorOps for gate is sufficient to send a sensitive AIME continuation into a repeated wrong answer, while TensorOps for up+down remains stable. The kernel-side cause is small but real arithmetic drift in mpp::tensor_ops::matmul2d relative to the legacy simdgroup MMA contraction. A same-input routed-MoE probe showed no address/layout corruption: TensorOps gate was close to legacy, but not bit-identical. An isolated same-tile primitive probe confirmed the source outside DS4 routing and quantization: legacy simdgroup_multiply_accumulate matched a CPU FP32 serial dot-product reference on the tested tile, while TensorOps produced close nonzero FP32 differences. MTLMathModeSafe and the tested TensorOps descriptor variant did not remove the drift. That normally tiny drift matters here because MoE routing has discontinuous top-k expert selection. In the failing path the first observed safe-vs-full routing change was layer 3, token row 11: the selected sixth expert changed from 96 to 50 across a margin of only about 8e-4. Once an expert changes, the transformer state is no longer a smooth local perturbation, and autoregressive decoding can fall into a bad repetition basin. Attempts that preserved the full gate TensorOps speed did not produce a zero-drift or stable fix: forcing the routed intermediate to F32, using the older generic TensorOps routed matmul instead of the expert-major fast layout, changing the TensorOps descriptor mode, and compiling with strict Metal math all left the gate drift or the bad continuation in place. Retaining TensorOps for up and down keeps most of the MoE speedup, but gate stays on the legacy path because it feeds the nonlinear silu(gate) * up branch and is the projection that can flip later router decisions.

Remove the routed-MoE TensorOps/NAX path completely instead of leaving it as a gated-off mode. Semantic evals showed that gate, up, and down TensorOps routed-MoE variants can each move the model into bad continuations, while the full-tile-only expert-major experiment was correctness-interesting but slower than the legacy simdgroup path. Keeping the dead kernels around risks accidental re-enablement without a trustworthy correctness story. The routed MoE grouped matmul now always uses the legacy 32-token expert-major simdgroup kernel. Other Metal4 TensorOps paths, such as the attention-output projection, remain enabled independently.

gundemirbas · 2026-05-24T03:05:18Z

Unfortunately I have still no way to test it. If everything is fine, I'll just merge it.

i can provide full access to a beelink gtr9 pro 128GB 8060

24601 · 2026-05-24T03:18:08Z

I am going to test this a bit further after I verify some power management settings on the GPU on my device, it's an ASUS Flow Z13 128GB (CachyOS), thermals are constrained on it and TDP is in the 70s and you can push it into the 80s, and I think there's some interesting issues with that maybe artifacts on throughput.

Verifying tonight/tomorrow AM (Sunday)

…20260521-pr

WebReflection · 2026-05-24T08:28:39Z

Unfortunately I have still no way to test it. If everything is fine, I'll just merge it.

it definitively builds and work, it looks like a solid base to eventually improve later.

mitsuhiko and others added 30 commits May 11, 2026 12:30

feat(server): report KV cache usage

0ca2e28

feat(server): report Anthropic cache usage

38800bf

README: separate motivations.

c5ef7ac

Merge branch 'pr-91-responses' into responses-api

2174611

Tighten Responses tool_search replay

6396966

Fix Responses tool checkpoint cache reuse

a01bf1d

Fix Responses API live continuation

acb40bf

metal: cover q4 expert tensors in model views

2a7a5f3

Skip tool checkpoint canonicalization for exact DSML replay

b4c5f7c

Merge responses-api

e88a71e

Use visible live checkpoints for toolless thinking

5453ad0

Clarify server progress logs

646798f

Add Anthropic live tool continuation

43535e1

Revert "metal: cover q4 expert tensors in model views"

67e6146

This reverts commit 2a7a5f3. There was no ack from the user. Don't want to take a fix that is astronautically produced from an unclear error trace.

Tag Responses API server logs

0083475

Recover Responses replays without hidden reasoning

0610591

Stream Anthropic tool calls live

94c1f38

Project sampled DSML tool calls to Anthropic SSE tool_use blocks while keeping raw DSML as the parser/cache source of truth. Reuse streamed tool ids for final parsed calls so tool_result continuation still matches live state.

fix typo in readme

741d0cc

dwarfstar is typoed to drawfstar

Merge pull request antirez#155 from kernelzeroday/main

98593ec

fix typo in readme

Fix typos in README.md

f6fa52b

Merge branch 'pr-150-context-error' into merge-pr-150-standard-context

157873b

user and others added 25 commits May 23, 2026 11:22

Handle malformed DSML with model retry

c956c83

Metal Neural Acceleration initial implementation

63ceed6

Squashes Ivan Fioravanti's Metal4/M5 Neural Acceleration scaffold, benchmark tooling, drift diagnostics, eval trace regrading, and initial Tensor/MPP kernel work.

Merge changes from dgx182 main

145d0f2

Merge changes from 192.168.1.57

8d45bc4

# Conflicts: # tests/ds4_test.c

Store Metal attention compressed KV cache in F16

23e1ea5

Preserve CUDA compressed KV write path

2d5527f

Fix compiler warnings in agent and TUI code

7cea333

Fix CUDA batched MoE prefill signature

54be8db

Fix agent session restore history

bb54379

Merge PR antirez#192 DSML repair fixes

fb2eff7

Merge remote-tracking branch 'mac57/main'

e34a48a

Add fine-grained prefill progress callback

fc1450d

Fix agent prefill progress bar sizing

9ae1eeb

Highlight agent code output

e792c55

Fix F16 routed MoE graph dumps

a05e69e

Add GPU power throttling

444afce

Simplify agent edit tools

df22245

Add runtime power commands

f398aa3

Guard attention output TensorOps full tiles

57ae485

Apply agent power changes while busy

4e867d6

Improve agent edit tooling

d447bdb

Fix anchored edit tail matching

a1f30a6

Merge remote-tracking branch 'origin/main' into codex/rocm-sync-main-…

8ebd79f

…20260521-pr

Polish ROCm sync after main merge

fc0d048

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

rocm: sync branch with main#219

rocm: sync branch with main#219
24601 wants to merge 123 commits into
antirez:rocmfrom
24601:codex/rocm-sync-main-20260521

24601 commented May 21, 2026 •

edited

Loading

Uh oh!

gundemirbas commented May 24, 2026

Uh oh!

24601 commented May 24, 2026 •

edited

Loading

Uh oh!

WebReflection commented May 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

13 participants

Conversation

24601 commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Relation to #16

Scope notes

Validation environment

Build and unit validation

Model-backed validation

Speed smoke

Additional review

Draft blockers / remaining gaps

Uh oh!

gundemirbas commented May 24, 2026

Uh oh!

24601 commented May 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

WebReflection commented May 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

13 participants

24601 commented May 21, 2026 •

edited

Loading

24601 commented May 24, 2026 •

edited

Loading