Releases: Maxritz/ollama-ROCM
Ollama — RDNA4 gfx1201 + DARS v2.0 Fork
A clean, traceable, single-branch fork of Ollama with native AMD Radeon RX 9070 XT (gfx1201 / RDNA4) optimizations and the DARS v2.0 Dynamic Attractor Routing System built directly into the source.
No patch layering. No v3/v4 mess. No mega patches.
Note: Vulkan backend integration is complete.
The Vulkan backend for the AMD RX 9070 XT has been optimized, benchmarked, and merged. It adds Wave32 optimizations for RDNA4, Flash Attention support, and fixes for dynamic library loading. Side-by-side benchmark results are posted in the Performance Benchmarks section below.
Target: AMD RX 9070 XT (gfx1201 / RDNA4, 16 GB VRAM) · Windows 11 · ROCm 7.1 / Vulkan
OLLAMA ROCM rdna4-gfx1201 v0.1
This build of Ollama is tuned for the AMD RDNA4 architecture, specifically the RX 9070 XT. It includes 20 optimizations, among them a paged KV cache, Split-K matmul, MoE top-k routing, a RoPE cache, and TurboQuant.
Benchmarks (8B Q8_0 model on RX 9070 XT)
- Prefill rate: ~3463 tokens/sec
- Generate rate: ~78 tokens/sec
- Time to first token (TTFT): ~130 ms
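As a rough illustration of how these figures combine, the sketch below estimates request latency as prompt tokens divided by the prefill rate, plus output tokens divided by the generate rate. The function name and the example token counts are hypothetical. The model also ignores fixed overheads and assumes both rates stay constant, so treat the result as a back-of-envelope estimate rather than a measurement.

```python
# Approximate rates taken from the benchmark figures above.
PREFILL_TOKENS_PER_SEC = 3463.0
GENERATE_TOKENS_PER_SEC = 78.0


def estimate_latency_seconds(prompt_tokens, output_tokens):
    """Back-of-envelope latency: prefill time plus decode time.

    Assumption: both rates are constant regardless of context length.
    """
    prefill = prompt_tokens / PREFILL_TOKENS_PER_SEC
    decode = output_tokens / GENERATE_TOKENS_PER_SEC
    return prefill + decode


if __name__ == "__main__":
    # Hypothetical request: 1000-token prompt, 200-token reply.
    print(f"{estimate_latency_seconds(1000, 200):.2f} s")
```

For this hypothetical request the estimate is about 2.85 s, almost all of it decode time, which suggests that the generate rate matters more than the prefill rate for long replies.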
Full Changelog: https://github.com/Maxritz/ollama-ROCM/commits/v0.1
Complete optimization suite for AMD Radeon RX 9070 XT (gfx1201) targeting ROCm 7.1.
HIP Backend Extensions (ggml_hip_ext.cu)
- Paged KV cache with LRU eviction and device memory pools
- MoE top-k routing kernel (fused softmax + selection)
- Split-K matrix multiply with rocBLAS auto-tuning
- RoPE/YaRN cache (32K precomputed cos/sin in fp16)
- Q8_0 quantized KV cache + fused attention kernel
- Persistent batch buffers (zero-allocation decode)
- Speculative N-gram decoding predictor
- Async H2D upload ring buffer with HIP events
- Q4_K fused dequantization kernel
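To make the MoE routing item above concrete, here is a minimal CPU reference sketch in Python of softmax followed by top-k selection. It is not the fork's HIP kernel. The function name `route_top_k` is hypothetical, and renormalizing the selected weights so that they sum to 1 is an assumption, since the notes do not say how the kernel scales them.

```python
import math


def route_top_k(logits, k):
    """Return [(expert_index, weight)] for the k highest-scoring experts.

    Steps: numerically stable softmax over all experts, pick the k
    largest probabilities, then renormalize those k weights
    (the renormalization step is an assumption).
    """
    if not 0 < k <= len(logits):
        raise ValueError("k must be between 1 and the number of experts")

    # Subtract the max before exponentiating to avoid overflow.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Indices of the k most probable experts, highest first.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = order[:k]

    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


if __name__ == "__main__":
    # Four hypothetical experts, route to the best two.
    for expert, weight in route_top_k([1.0, 3.0, 2.0, 0.0], k=2):
        print(expert, round(weight, 4))
```

A fused GPU kernel would do the same work in one pass per token without writing the full probability vector to memory. The sketch only fixes the expected result against which such a kernel could be checked.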
TurboQuant KV (ggml_turboquant.cu)
- Walsh-Hadamard Transform (WHT) decorrelation
- Lloyd-Max optimal quantization (2/3/4-bit)
- QJL (Quantized Johnson-Lindenstrauss) projection for K-cache
- Fused WHT+quantize and dequant+IWHT kernels
- Formats: TBQ2_0, TBQ3_0/1/2, TBQ4_0/1/2, TBQP3_0/1/2
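The idea behind transforming before quantizing can be sketched on the CPU in Python. The code below applies an orthonormal fast Walsh-Hadamard transform, quantizes the result, and inverts both steps. It is an illustration only. All function names are hypothetical, a plain symmetric uniform quantizer stands in for the Lloyd-Max quantizer, the QJL projection is omitted, and nothing here reflects the actual TBQ storage formats.

```python
import math


def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform.

    With the 1/sqrt(n) scaling the transform is its own inverse.
    The input length must be a power of two.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    while h < n:
        # Butterfly step: combine elements that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]


def quantize_uniform(values, bits):
    """Symmetric uniform quantizer (stand-in for Lloyd-Max).

    Returns (integer codes, scale).
    """
    qmax = (1 << (bits - 1)) - 1
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize_uniform(codes, scale):
    """Map integer codes back to floating-point values."""
    return [c * scale for c in codes]


def roundtrip(values, bits):
    """WHT, quantize, dequantize, inverse WHT."""
    codes, scale = quantize_uniform(fwht(values), bits)
    return fwht(dequantize_uniform(codes, scale))


if __name__ == "__main__":
    # A single outlier is spread evenly across all coefficients.
    spike = [8.0] + [0.0] * 7
    print(fwht(spike))
    print(roundtrip(spike, bits=4))
```

The transform turns one large outlier into eight equal coefficients, so a low-bit quantizer no longer has to spend its whole range on a single value. That is the usual motivation for decorrelating before quantization.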
rocWMMA Fixes
- WMMA warp mask corrections for gfx1201 Wave32 mode
- fattn-mma-f16 cooperative matrix tile alignment fixes
Tested on: AMD RX 9070 XT, ROCm 7.1, Windows 11, LLVM 23.0
Build: cmake -DAMDGPU_TARGETS=gfx1201 -DGGML_HIP_ROCWMMA_FATTN=ON -DGGML_HIP_TURBOQUANT=ON