Releases: Maxritz/ollama-ROCM

Ollama — RDNA4 gfx1201 + DARS v2.0 Fork

13 Jun 14:51

A clean, traceable, single-branch fork of Ollama with native AMD Radeon RX 9070 XT (gfx1201 / RDNA4) optimizations and the DARS v2.0 Dynamic Attractor Routing System built directly into the source.
No patch layering. No v3/v4 mess. No mega patches.

Note

Vulkan Backend Integration Complete!
The Vulkan backend for the AMD RX 9070 XT has been optimized, benchmarked, and merged. The implementation adds Wave32 optimizations for RDNA4, Flash Attention support, and fixes to dynamic library loader resolution. Side-by-side benchmark results are posted in the Performance Benchmarks section below.

Target: AMD RX 9070 XT (gfx1201 / RDNA4, 16 GB VRAM) · Windows 11 · ROCm 7.1 / Vulkan

OLLAMA ROCM rdna4-gfx1201 v0.1

23 May 14:41

This is a highly optimized build of Ollama tailored to the AMD RDNA4 architecture (specifically the RX 9070 XT). It includes 20 specific optimizations, among them Paged KV Cache, Split-K Matmul, MoE Top-K routing, RoPE Cache, and TurboQuant.

Benchmarks (8B Q8_0 model on RX 9070 XT)

Prefill Rate: ~3463 tokens/sec
Generate Rate: ~78 tokens/sec
Time to First Token (TTFT): ~130ms
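
To put these figures in context, here is a rough back-of-the-envelope sketch of how the two rates combine into end-to-end latency. It assumes total time is simply prefill time plus decode time and ignores model load, sampling, and scheduling overhead. The function names are illustrative and not part of this fork.

```python
# Approximate rates quoted in the benchmark list above.
PREFILL_TOK_PER_S = 3463.0
GENERATE_TOK_PER_S = 78.0


def estimate_latency_s(prompt_tokens: int, output_tokens: int) -> float:
    """Rough end-to-end time: prefill the whole prompt, then decode token by token.

    Assumption: no overhead beyond the two measured rates.
    """
    prefill_s = prompt_tokens / PREFILL_TOK_PER_S
    decode_s = output_tokens / GENERATE_TOK_PER_S
    return prefill_s + decode_s


def prompt_tokens_for_ttft(ttft_s: float) -> float:
    """Prompt length that would produce a given TTFT if TTFT were pure prefill time."""
    return ttft_s * PREFILL_TOK_PER_S


if __name__ == "__main__":
    print(f"1024-token prompt, 256-token reply: {estimate_latency_s(1024, 256):.2f} s")
    print(f"Prompt length implied by 130 ms TTFT: {prompt_tokens_for_ttft(0.130):.0f} tokens")
```

Under these assumptions a 1024-token prompt with a 256-token reply takes about 3.6 s, dominated by decoding, and a 130 ms TTFT would correspond to a prompt of roughly 450 tokens. The release does not state the prompt length used for the measurement, so treat the second number as an inference only.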

Full Changelog: https://github.com/Maxritz/ollama-ROCM/commits/v0.1

Complete optimization suite for AMD Radeon RX 9070 XT (gfx1201) targeting ROCm 7.1.

HIP Backend Extensions (ggml_hip_ext.cu)

  • Paged KV cache with LRU eviction and device memory pools
  • MoE top-k routing kernel (fused softmax + selection)
  • Split-K matrix multiply with rocBLAS auto-tuning
  • RoPE/YaRN cache (32K precomputed cos/sin in fp16)
  • Q8_0 quantized KV cache + fused attention kernel
  • Persistent batch buffers (zero-allocation decode)
  • Speculative N-gram decoding predictor
  • Async H2D upload ring buffer with HIP events
  • Q4_K fused dequantization kernel
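
As an illustration of the MoE top-k routing item above, the following is a minimal CPU reference sketch in Python of what "softmax + selection" computes for one token. It is not the HIP kernel from ggml_hip_ext.cu; the function name `route_top_k` and the `renormalize` option are assumptions made for this example, and whether the fork renormalizes the selected weights is not stated in these notes.

```python
import math


def route_top_k(logits, k, renormalize=True):
    """Pick the k highest-probability experts for one token.

    logits: one router score per expert.
    Returns a list of (expert_index, weight) pairs, best expert first.
    """
    if not 0 < k <= len(logits):
        raise ValueError("k must be between 1 and the number of experts")

    # Numerically stable softmax over the expert scores.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Selection: indices of the k largest probabilities.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    weights = [probs[i] for i in top]

    # Assumption: rescale the kept weights so they sum to 1.
    if renormalize:
        kept = sum(weights)
        weights = [w / kept for w in weights]
    return list(zip(top, weights))


if __name__ == "__main__":
    for expert, weight in route_top_k([1.0, 3.0, 2.0, 0.0], k=2):
        print(expert, round(weight, 4))
```

A fused GPU kernel would do the softmax and the selection in a single pass per token instead of materializing the full probability vector, which is the point of fusing the two steps.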

TurboQuant KV (ggml_turboquant.cu)

  • Walsh-Hadamard Transform (WHT) decorrelation
  • Lloyd-Max optimal quantization (2/3/4-bit)
  • QJL (Quantized Johnson-Lindenstrauss) projection for K-cache
  • Fused WHT+quantize and dequant+IWHT kernels
  • Formats: TBQ2_0, TBQ3_0/1/2, TBQ4_0/1/2, TBQP3_0/1/2
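
The WHT and Lloyd-Max steps listed above can be sketched on the CPU as follows. This is a minimal Python illustration under stated assumptions, not the code in ggml_turboquant.cu: the helper names are hypothetical, the codebook is trained on the vector itself purely for demonstration, and the QJL projection and the actual TBQ block layouts are not shown.

```python
import math


def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform (self-inverse with this scaling)."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    while h < n:
        # Butterfly stage: combine elements that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]


def nearest(codebook, x):
    """Index of the codebook level closest to x."""
    return min(range(len(codebook)), key=lambda i: abs(codebook[i] - x))


def lloyd_max(samples, bits, iterations=20):
    """1-D Lloyd-Max: alternate nearest-level assignment and centroid update."""
    levels = 1 << bits
    lo, hi = min(samples), max(samples)
    # Start from uniformly spaced levels across the sample range.
    codebook = [lo + (hi - lo) * (i + 0.5) / levels for i in range(levels)]
    for _ in range(iterations):
        buckets = [[] for _ in range(levels)]
        for s in samples:
            buckets[nearest(codebook, s)].append(s)
        # Empty buckets keep their previous level.
        codebook = [sum(b) / len(b) if b else c for b, c in zip(buckets, codebook)]
    return codebook


def quantize(vec, codebook):
    """WHT, then map each coefficient to its nearest codebook index."""
    return [nearest(codebook, v) for v in fwht(vec)]


def dequantize(codes, codebook):
    """Look up the levels, then apply the inverse WHT."""
    return fwht([codebook[c] for c in codes])


if __name__ == "__main__":
    x = [1.0, 0.0, -2.0, 3.0, 0.5, -0.5, 4.0, 1.5]
    book = lloyd_max(fwht(x), bits=2)
    print(dequantize(quantize(x, book), book))
```

The transform spreads each input value across all coefficients, which tends to flatten outliers before the low-bit quantizer sees them. Because the scaled transform is orthonormal, the reconstruction error in the original domain equals the quantization error in the transformed domain.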

rocWMMA Fixes

  • WMMA warp mask corrections for gfx1201 Wave32 mode
  • fattn-mma-f16 cooperative matrix tile alignment fixes

Tested on: AMD RX 9070 XT, ROCm 7.1, Windows 11, LLVM 23.0
Build: cmake -DAMDGPU_TARGETS=gfx1201 -DGGML_HIP_ROCWMMA_FATTN=ON -DGGML_HIP_TURBOQUANT=ON