Skip to content

rocm: sync branch with main#219

Draft
24601 wants to merge 123 commits into
antirez:rocmfrom
24601:codex/rocm-sync-main-20260521
Draft

rocm: sync branch with main#219
24601 wants to merge 123 commits into
antirez:rocmfrom
24601:codex/rocm-sync-main-20260521

Conversation

@24601
Copy link
Copy Markdown

@24601 24601 commented May 21, 2026

Summary

This brings the rocm branch forward to current main (8d576642) while keeping the ROCm patch scoped to the existing community ROCm backend work.

Current state before this change:

  • rocm was based at 7a751eb
  • rocm was 85 commits behind main and 1 commit ahead

What changed:

  • merge current main into rocm with conflicts resolved
  • keep the ROCm backend support buildable against the current DS4 GPU code
  • add a make rocm target that builds the current main binaries through hipcc
  • keep HIP out of CUDA WMMA kernels and use the generic indexer-score fallback on ROCm
  • explicitly reject known wave64 GCN/CDNA ROCm targets for now, because DS4 GPU kernels still assume CUDA-style 32-lane warp math

This intentionally does not include the user-facing --rocm naming/MTP mapping work from #156 or the ROCm WMMA/indexer optimization work from #180.

Relation to #16

This is related to #16, but it should not close #16.

The goal here is to make the existing upstream rocm branch current and testable again on current main. It is not the later ROCm optimization work discussed in #16, and the speed numbers below show why this should stay draft until the remaining correctness/performance gaps are understood.

Scope notes

I checked prior ROCm-related PRs before keeping this narrowly scoped:

This PR is therefore only a branch sync plus the minimum build fixes needed for current main.

Validation environment

  • AMD Ryzen AI Max+ 395 / Radeon 8060S
  • ROCm target: gfx1151
  • HIP/ROCm toolchain: HIP 7.2.53211-364a905
  • Backend selected by DS4: cuda API path over ROCm/HIP
  • Model: official project q2-imatrix GGUF, DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf
  • Model size reported by DS4: 80.76 GiB, 284.33B logical parameters
  • PR head tested: e13de731787799310e53bffe98b1f077f7c96b50

Build and unit validation

Commands run:

make clean
make rocm ROCM_ARCH=gfx1151 -j$(nproc)
make ds4_test GPU_BACKEND=rocm ROCM_ARCH=gfx1151 -j$(nproc)
./ds4_test --server
make clean
make cpu -j$(nproc)

Results:

  • ROCm build: PASS
  • ROCm-linked ds4_test --server: PASS (server: OK, ds4 tests: ok)
  • CPU build: PASS

Negative target guard check:

make clean
make ds4_test GPU_BACKEND=rocm ROCM_ARCH=gfx90a -j$(nproc)

Result: expected compile failure with the new wave64 GCN/CDNA unsupported-target error.

Existing warnings observed during the builds:

  • ds4_server.c: const-discard warning in stop_list_find_from
  • ds4_agent.c: existing snprintf truncation warnings

Model-backed validation

Model load / inspect:

./ds4 --inspect --cuda -m ds4flash.gguf

Result: PASS. DS4 loaded the official q2-imatrix GGUF and initialized the ROCm/HIP backend.

Short deterministic generation:

./ds4 --cuda -m ds4flash.gguf --ctx 4096 --nothink --temp 0 -n 32 -p 'Reply exactly: ROCm OK'

Result: PASS. Output was ROCm OK.

Reported speed for this tiny prompt:

prefill: 0.46 t/s, generation: 8.96 t/s

Tool-call quality:

DS4_TEST_MODEL=ds4flash.gguf ./ds4_test --tool-call-quality

Result: PASS in both fast and exact paths.

Long-context story recall:

DS4_TEST_MODEL=ds4flash.gguf ./ds4_test --long-context

Result: PASS (long-context: OK, ds4 tests: ok).

Runtime note: this run took about one hour wall time on the validation machine. Progress markers reached 8192/30474, 16384/30474, 24576/30474, and 30474/30474; correctness passed, but throughput is not yet competitive with the best ROCm numbers discussed in #16.

Official logprob vectors:

DS4_TEST_MODEL=ds4flash.gguf ./ds4_test --logprob-vectors

Result: FAIL with one mismatch:

ds4-test: vector short_code_completion step 1 selected token mismatch
logprob-vectors: ERR
ds4 tests: 1 failure(s)

Manual dump for the failing prompt showed ROCm selecting uppercase C after the opening code fence while the fixture expects lowercase c:

step 1 selected 'C'
top candidates: 'C' logprob -0.226568833, 'c' logprob -1.59836829

The exact/quality ROCm path also selected uppercase C for the same prompt:

step 1 selected 'C'
top candidates: 'C' logprob -0.495503277, 'c' logprob -0.941311657

Speed smoke

Command:

./ds4-bench \
  -m ds4flash.gguf \
  --cuda \
  --prompt-file speed-bench/promessi_sposi.txt \
  --ctx-start 2048 \
  --ctx-max 8192 \
  --step-incr 2048 \
  --gen-tokens 128

Result:

ctx_tokens,prefill_tokens,prefill_tps,gen_tokens,gen_tps,kvcache_bytes
2048,2048,57.76,128,7.63,52184460
4096,2048,46.47,128,7.59,80373132
6144,2048,39.49,128,7.37,108561804
8192,2048,34.26,128,7.41,136750476

These are usable but below the better Strix Halo ROCm numbers reported in #16, which is expected for this branch-sync PR because it deliberately does not include the later ROCm optimization work.

Additional review

Read-only adversarial reviews were run with DeepSeek V4, Cursor Composer 2.5, and Gemini 3.5 Flash. They did not identify a branch-sync merge blocker. Their actionable feedback is reflected here:

  • added the explicit wave64 GCN/CDNA compile guard
  • documented the HIP shared-memory opt-in semantic difference in the shim
  • made the model-dependent test evidence and remaining gaps explicit in this PR body

Draft blockers / remaining gaps

  • ./ds4_test --logprob-vectors does not fully pass on ROCm because of the short_code_completion C vs c mismatch.
  • ROCm performance is below the stronger issue Support for AMD GPU (ROCm/HIP) backend #16 benchmark reports.
  • CUDA regression testing on NVIDIA hardware was not run.
  • MTP/speculative decoding was not validated here; that is intentionally separate from this branch sync and overlaps with Fix ROCm build/runtime naming and MTP model mapping #156.
  • Full 65K benchmark sweep was not run; the speed smoke covers 2K-8K plus the passing 30K-token long-context correctness test.

mitsuhiko and others added 30 commits May 11, 2026 12:30
Implements the Responses API endpoint that Codex CLI (and other modern
OpenAI tooling) speaks instead of /v1/chat/completions. The wire format
is documented in OpenAI's Responses API; this implementation has been
iterated against the Codex CLI binary's SSE parser shape until no
remaining schema gaps were found.

Request parsing (parse_responses_request, parse_responses_input):
- Accepts the typed input array (message, function_call,
  function_call_output, reasoning, custom_tool_call(_output),
  local_shell_call(_output), web_search_call(_output),
  tool_search_call(_output), image_generation_call(_output),
  compaction, context_compaction).
- Maps hosted-tool history to function_call/function_call_output so
  prior actions survive across turns; rejects unknown item types and
  non-completed status with 400 to avoid silent context loss.
- Strict content-array parsing: only string|null|array of recognized
  text blocks (input_text/output_text/text/summary_text/
  reasoning_text); rejects non-text modalities (input_image/file/
  audio) instead of accepting an empty prompt.
- Merges adjacent function_call items into the preceding assistant
  message so text + tool-call turns render as a single assistant
  block.
- Honors reasoning.effort (incl. "minimal"/"none") and gates
  reasoning summary surface on reasoning.summary opt-in.
- Rejects previous_response_id, conversation, and forced tool_choice
  explicitly (constrained decoding / persisted state not supported).

Output (responses_sse_*, responses_final_response):
- Emits the full streaming lifecycle: response.created,
  output_item.added/.done, reasoning_summary_part.added/.done,
  reasoning_summary_text.delta/.done, content_part.added/.done,
  output_text.delta/.done, function_call_arguments.delta/.done,
  response.completed.
- Branches the terminal event by finish reason: response.failed for
  errors and response.incomplete with reason "max_tokens" for length.
- Every event carries sequence_number; every output_text part carries
  annotations:[]; function_call output_item.added ships with an empty
  arguments string (full args arrive via function_call_arguments.done
  and output_item.done), and item ids are stable across added/done.
- Tracks whether </think> was actually observed so a truncated stream
  marks the reasoning item incomplete instead of "completed".
- Recovers gracefully when the DSML tool parse fails after the model
  was suppressed at the tool marker: the suppressed tail is flushed
  as additional output_text deltas so the streamed message matches
  output_item.done.

Tested by 25 rounds of /codex:adversarial-review against the same
client this is meant to feed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Broaden the DS4 imatrix prompt dataset with provider-neutral agent/tool traffic, multi-language programming prompts, algorithm recall, Bash scripting, and multilingual translation tasks.

Remove duplicate rendered prompts and avoid provider-specific client references in the generated calibration corpus. This improves calibration coverage without claiming to fix a distributed GGUF bug.
Fold the successful CUDA selector/top-k/indexed-attention changes into one clean commit. This excludes rejected experiment commits and the local prefill-slope work log.\n\nMeasured on GB10 with speed-bench/promessi_sposi.txt, 2048-token append chunks: 32K prefill improved from 255.61 tok/s on origin/main to 346.49 tok/s. Full-curve average improved from 316.39 tok/s to 369.76 tok/s. 32K full prompt + 128-token generation prefill improved from 312.87 tok/s to 368.43 tok/s, while generation stayed neutral at 12.49 -> 12.48 tok/s.\n\nCorrectness: make cuda-regression; ./ds4_test --logprob-vectors --tool-call-quality; ./ds4_test --server --metal-kernels.
Build score_official against the CUDA runtime on Linux and select the CUDA backend there, while keeping the existing Metal path on macOS.\n\nCorrectness: make -C gguf-tools quality-score; gguf-tools/quality-testing/score_official ds4flash.gguf /tmp/ds4_quality_smoke/manifest.tsv /tmp/ds4_quality_smoke/scores.tsv 16384.
Replace the default long-context continuation check with a deterministic prose-story retrieval test. The fixture embeds spelled-out person-number assignments in a long rendered prompt, and ds4_test now validates the generated Name=number list instead of brittle sampled prose.
Preserve Responses namespace metadata and tool_search calls while rendering DSML-safe internal tool names. Replay function_call, hosted tool, and tool_search_output items into the shared chat/tool path so Codex and Pi can round-trip tool calls without losing KV-cache prefix reuse. Document the /v1/responses endpoint and add server unit coverage for namespace, tool_search, and replay output shapes.
This reverts commit 2a7a5f3.

There was no ack from the user. Don't want to take a fix
that is astronautically produced from an unclear error
trace.
Project sampled DSML tool calls to Anthropic SSE tool_use blocks while keeping raw DSML as the parser/cache source of truth.

Reuse streamed tool ids for final parsed calls so tool_result continuation still matches live state.
Keep normal CUDA context buffers on device allocations, but route very large KV-cache tensors through managed memory so million-token contexts do not starve unified-memory systems during graph/session allocation.

The fallback is scoped to the long-lived KV/cache tensors and logs when it is used because it may reduce performance.

Tested on 0.180 with:
- make cpu
- make -B cuda-spark
- make cuda-regression
- ./ds4_test --server --metal-kernels
- ./ds4_test --logprob-vectors --tool-call-quality
- ds4-bench ctx-alloc 32768, 250000, and 1000000
- ds4-server --ctx 1000000 startup smoke

(cherry picked from commit 0b248a65c07d21f2fc8ff4815bd8b75af26719f9)
Parse Anthropic tool_use blocks by their own type field instead of relying on the enclosing message role being parsed first.

Some clients serialize messages as content-before-role, which made full-history tool_result replays look like unknown live-only continuations.

Fixes antirez#127.
Return a 400 error with error type "context_exceeded" when prompt tokens exceed
context size. The response includes both n_prompt_tokens and n_ctx fields so
clients can determine exactly why the request failed and how far over the limit
they went.

Error response format:
  {
    "error": {
      "message": "Prompt tokens (N) exceeds context size (M)",
      "type": "context_exceeded",
      "n_prompt_tokens": N,
      "n_ctx": M
    }
  }
dwarfstar is typoed to drawfstar
user and others added 25 commits May 23, 2026 11:22
Squashes Ivan Fioravanti's Metal4/M5 Neural Acceleration scaffold, benchmark tooling, drift diagnostics, eval trace regrading, and initial Tensor/MPP kernel work.
Squashes Salvatore Sanfilippo's Metal4/NAX speedups and cleanup work: direct-RHS dense Tensor kernels, routed MoE Tensor coverage, full-512 indexer preservation, NAX prefill indexer scores, indexed-attention half shadow cache, and removal of rejected experimental paths.
# Conflicts:
#	tests/ds4_test.c
Full routed-MoE TensorOps enabled the gate, up, and down projections. The
regression was isolated to the gate projection: enabling TensorOps for gate is
sufficient to send a sensitive AIME continuation into a repeated wrong answer,
while TensorOps for up+down remains stable.

The kernel-side cause is small but real arithmetic drift in
mpp::tensor_ops::matmul2d relative to the legacy simdgroup MMA contraction. A
same-input routed-MoE probe showed no address/layout corruption: TensorOps gate
was close to legacy, but not bit-identical. An isolated same-tile primitive
probe confirmed the source outside DS4 routing and quantization: legacy
simdgroup_multiply_accumulate matched a CPU FP32 serial dot-product reference on
the tested tile, while TensorOps produced close nonzero FP32 differences.
MTLMathModeSafe and the tested TensorOps descriptor variant did not remove the
drift.

That normally tiny drift matters here because MoE routing has discontinuous
top-k expert selection. In the failing path the first observed safe-vs-full
routing change was layer 3, token row 11: the selected sixth expert changed from
96 to 50 across a margin of only about 8e-4. Once an expert changes, the
transformer state is no longer a smooth local perturbation, and autoregressive
decoding can fall into a bad repetition basin.

Attempts that preserved the full gate TensorOps speed did not produce a
zero-drift or stable fix: forcing the routed intermediate to F32, using the
older generic TensorOps routed matmul instead of the expert-major fast layout,
changing the TensorOps descriptor mode, and compiling with strict Metal math all
left the gate drift or the bad continuation in place. Retaining TensorOps for
up and down keeps most of the MoE speedup, but gate stays on the legacy path
because it feeds the nonlinear silu(gate) * up branch and is the projection that
can flip later router decisions.
Remove the routed-MoE TensorOps/NAX path completely instead of leaving it as a gated-off mode. Semantic evals showed that gate, up, and down TensorOps routed-MoE variants can each move the model into bad continuations, while the full-tile-only expert-major experiment was correctness-interesting but slower than the legacy simdgroup path. Keeping the dead kernels around risks accidental re-enablement without a trustworthy correctness story.

The routed MoE grouped matmul now always uses the legacy 32-token expert-major simdgroup kernel. Other Metal4 TensorOps paths, such as the attention-output projection, remain enabled independently.
@gundemirbas
Copy link
Copy Markdown

Unfortunately I have still no way to test it. If everything is fine, I'll just merge it.

i can provide full access to a beelink gtr9 pro 128GB 8060

@24601
Copy link
Copy Markdown
Author

24601 commented May 24, 2026

I am going to test this a bit further after I verify some power management settings on the GPU on my device, it's an ASUS Flow Z13 128GB (CachyOS), thermals are constrained on it and TDP is in the 70s and you can push it into the 80s, and I think there's some interesting issues with that maybe artifacts on throughput.

Verifying tonight/tomorrow AM (Sunday)

@WebReflection
Copy link
Copy Markdown

Unfortunately I have still no way to test it. If everything is fine, I'll just merge it.

it definitively builds and work, it looks like a solid base to eventually improve later.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.