A clean, traceable, single-branch fork of Ollama with native AMD Radeon RX 9070 XT (gfx1201 / RDNA4) optimizations and the DARS v2.0 Dynamic Attractor Routing System built directly into the source.
No patch layering. No v3/v4 mess. No mega patches.
Note
Vulkan Backend Integration Complete!
We have successfully optimized, benchmarked, and merged the Vulkan backend for the AMD RX 9070 XT. The implementation introduces Wave32 RDNA4 optimizations, Flash Attention support, and dynamic library loader resolutions. Side-by-side benchmark results are posted in the Performance Benchmarks section below.
Target: AMD RX 9070 XT (gfx1201 / RDNA4, 16 GB VRAM) · Windows 11 · ROCm 7.1 / Vulkan
- What Makes This Fork Different
- Usage & Commands
- V6 Release — Quick Start
- RDNA4 Core Fixes
- DARS v2.0 — Multi-Track AI System
- Built-in Dashboard
- Performance Benchmarks
- Build Instructions
- Environment Variables
- Troubleshooting
- Architecture & File Inventory
# Option A: Start Ollama in ROCm Mode (Default)
$env:HSA_OVERRIDE_GFX_VERSION = "12.0.1"
$env:OLLAMA_LLM_LIBRARY = "rocm" # CRITICAL — prevents CPU fallback
$env:OLLAMA_FLASH_ATTENTION = "1"
$env:GGML_HIP_GRAPHS = "1"
$env:GGML_CUDA_NO_VMM = "1" # Windows requirement
$env:AMD_DIRECT_DISPATCH = "1"
$env:GPU_MAX_ALLOC_PERCENT = "85"
$env:GPU_SINGLE_ALLOC_PERCENT = "70"
.\ollama.exe serve
# Option B: Start Ollama in Vulkan Mode
Remove-Item Env:\HSA_OVERRIDE_GFX_VERSION -ErrorAction SilentlyContinue
$env:OLLAMA_LLM_LIBRARY = "vulkan" # Force Vulkan backend
$env:GGML_VK_VISIBLE_DEVICES = "0"
$env:OLLAMA_FLASH_ATTENTION = "1"
$env:GGML_VK_PIPELINE_CACHE = "1"
$env:GGML_VK_SUBGROUP_SIZE = "32" # Optimizes for RDNA4 wave32
$env:GPU_MAX_ALLOC_PERCENT = "85"
$env:GPU_SINGLE_ALLOC_PERCENT = "70"
$env:GGML_VK_NO_VMM = "1"
$env:GGML_VK_COOP_MATRIX = "0" # Keep 0 to avoid driver resets
.\ollama.exe serve
ollama run gemma4:e4b # Recommended: scores 100% on code quality
ollama run qwen2.5-coder:7b # ~94.75 tok/s on RX 9070 XT
ollama run granite4.1:8b-q4 # ~80.74 tok/s on RX 9070 XT
ollama run llama3:8b # ~63.20 tok/s on RX 9070 XT
# Look for this in the server log:
# level=INFO ... msg="inference compute" library=rocm compute=gfx1201
# (or library=vulkan compute=gfx1201 if running in Vulkan mode)
# If you see library=cpu — GPU is NOT being used. Check OLLAMA_LLM_LIBRARY.
ollama ps # Shows loaded models and which backend (GPU/CPU)
http://localhost:11434/dashboard/
Shows live VRAM, temperature, tok/s, and DARS framework states while Ollama is running.
Note: this is currently being fixed. The switches were left out of the current build.
$env:OLLAMA_DARS_ENABLE = "1" # Enable all 17 physics-based optimization frameworks
$env:OLLAMA_DARS_MOE = "1" # Enable MoE-specific: Hysteresis, Percolation, Resonance
.\ollama.exe serve
# Step 1: Record which neurons fire during coding tasks
$env:OLLAMA_DARS_ENABLE = "1"
$env:OLLAMA_DARS_HEBBIAN = "1"
$env:OLLAMA_DARS_HEBBIAN_TASK = "programming"
.\ollama.exe serve
ollama run codellama:7b
# ... run 1,000+ coding prompts, trace auto-saves on shutdown ...
# Step 2: Prune model — keep top 30% of neurons (95% coding ability retained)
ollama prune codellama:7b `
--trace codellama-7b_programming.hebbian_trace `
--keep 0.3 `
--output C:\Models\CodeLlama-Pruned-2B.gguf
# SLERP: best for similar fine-tunes of the same base model
ollama merge `
--model-a C:\Models\Phi-2-Q4.gguf `
--model-b C:\Models\CodeLlama-7B-Q4.gguf `
--method SLERP --t 0.5 `
--output C:\Models\CodeReasoner-7B.gguf
# TIES — best for conflicting fine-tunes (resolves sign conflicts)
ollama merge `
--model-a C:\Models\ModelA-Q4.gguf `
--model-b C:\Models\ModelB-Q4.gguf `
--method TIES --trim-rate 0.2 `
--output C:\Models\Resolved-7B.gguf
# DARE — best for sparse/drop-heavy models
ollama merge `
--model-a C:\Models\ModelA-Q4.gguf `
--model-b C:\Models\ModelB-Q4.gguf `
--method DARE --drop-rate 0.5 `
--output C:\Models\Sparse-7B.gguf
# Run a fast 2B reasoner permanently + load a 7B coder only for code tasks
$env:OLLAMA_DARS_ENABLE = "1"
$env:OLLAMA_DARS_DUAL = "1"
$env:OLLAMA_DARS_MODEL_A = "C:\Models\Phi-2-Q4.gguf" # Always resident
$env:OLLAMA_DARS_MODEL_B = "C:\Models\CodeLlama-7B-Q4.gguf" # Loaded on demand
$env:OLLAMA_DARS_HYSTERESIS = "5" # Keep Model B warm for 5 more tokens after switch
$env:OLLAMA_DARS_SWITCH_THRESHOLD = "0.6" # Confidence needed to switch domains
.\ollama.exe serve
# DARS auto-detects intent: chat → Model A (fast), code → Model B (specialist)
# Step 1: (Optional) record Hebbian trace first for smarter clustering
$env:OLLAMA_DARS_ENABLE = "1"
$env:OLLAMA_DARS_HEBBIAN = "1"
$env:OLLAMA_DARS_HEBBIAN_TASK = "programming"
ollama run llama3:8b # run 100+ programming queries
# Step 2: Upcycle dense → MoE (14,336 FFN neurons → 16 experts, only 2 active per token)
ollama upcycle `
--input "C:\Models\Llama-3-8B-Dense.gguf" `
--output "C:\Models\Llama-3-8B-MoE-16x.gguf" `
--experts 16 `
--top-k 2 `
--method hebbian `
--hebbian-trace "llama-3-8b_programming.hebbian_trace"
# Step 3: Run the MoE model (DARS MoE optimizations now apply!)
ollama run ./Llama-3-8B-MoE-16x.gguf
# Step 4: (Optional) extract the top 4 task-specialist experts → tiny model
ollama extract `
--input "Llama-3-8B-MoE-16x.gguf" `
--trace "moe-programming.hebbian_trace" `
--top-experts 4 `
--output "Llama-3-8B-CodeExpert-4x.gguf"
# Result: 1.2 GB specialist model with ~90% of the full coding quality
# Compile the 16×16 FP16 cooperative-matrix shader for RDNA4
glslangValidator --target-env vulkan1.3 -V `
llama\mul_mm_coopmat_fp16.comp `
-o llama\mul_mm_coopmat_fp16.spv
# Verify VK_KHR_cooperative_matrix is exposed by your GPU
vulkaninfo | findstr VK_KHR_cooperative_matrix
.\Run_All_Benchmarks.ps1 # All 13 models: tok/s + VRAM
.\Run_Codegen_Benchmarks.ps1 # Code generation quality + % fixes needed
.\Run_Vulkan_Benchmarks.ps1 # Vulkan backend throughput vs ROCm
.\multimodel_bench.ps1 # Multi-model concurrent throughput
.\Granite_Benchmark.ps1 # Granite 3B / 8B multi-layer latency
| Aspect | Upstream Ollama | This Fork |
|---|---|---|
| gfx1201 detection | Generic ROCm path | Native RDNA4 detection + automatic tuning |
| Wave32 support | Undetected / defaults | Explicit Wave32 with safety validation |
| rocWMMA | Available (dangerous) | Explicitly disabled (73% regression) |
| Flash Attention | Manual enable | Auto-enabled on gfx1201 detection |
| Build flags | Generic HIP flags | gfx1201-optimized, safety-first flags |
| HIP Graphs | Default | Auto-enabled for reduced launch overhead |
| Runtime config | All manual | Smart defaults applied automatically |
| CPU fallback | Silent | Warns + forces OLLAMA_LLM_LIBRARY=rocm |
| Multi-model | ❌ | ✅ Dual-model cascade (2B reasoner + 7B coder) |
| Model surgery | ❌ | ✅ Hebbian pruning — keep 95% ability at 30% size |
| Model creation | ❌ | ✅ SLERP / TIES / DARE merge without retraining |
| MoE upcycling | ❌ | ✅ Dense → MoE conversion via k-means clustering |
| Vulkan coopmat | ❌ | ✅ 16×16 FP16 tiles via VK_KHR_cooperative_matrix |
| Live dashboard | ❌ | ✅ Browser-based GPU metrics at /dashboard/ |
Pre-built binaries are in lib/ollama/. No compilation needed for Windows users.
# Set environment (CRITICAL — prevents silent CPU fallback)
$env:HSA_OVERRIDE_GFX_VERSION = "12.0.1"
$env:OLLAMA_LLM_LIBRARY = "rocm"
$env:OLLAMA_FLASH_ATTENTION = "1"
$env:GGML_HIP_GRAPHS = "1"
$env:GGML_CUDA_NO_VMM = "1" # Required on Windows
$env:AMD_DIRECT_DISPATCH = "1"
$env:GPU_MAX_ALLOC_PERCENT = "85"
# Start Ollama
.\ollama.exe serve
# Run benchmarks
.\Run_All_Benchmarks.ps1 # All 13 models
.\Run_Codegen_Benchmarks.ps1 # Code generation quality
.\multimodel_bench.ps1 # Multi-model throughput
Verify that the GPU is active by looking for this line in the output:
level=INFO source=types.go:50 msg="inference compute" id=GPU-xxx library=rocm compute=gfx1201 name="AMD Radeon RX 9070 XT"
If you see library=cpu instead, the GPU is not being used — see Troubleshooting.
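The log check described above can also be scripted. The following is a minimal Python sketch, assuming the server log contains an `inference compute` line in the format shown; the function names `detect_backend` and `gpu_is_active` are hypothetical helpers and not part of this fork.

```python
import re

# Matches the backend name on an "inference compute" log line.
# The pattern is based only on the sample log line in this README.
_LIBRARY_RE = re.compile(r'msg="inference compute".*?\blibrary=(\w+)')


def detect_backend(log_text):
    """Return the backend named on the first "inference compute" line, or None."""
    for line in log_text.splitlines():
        match = _LIBRARY_RE.search(line)
        if match:
            return match.group(1)
    return None


def gpu_is_active(log_text):
    """True when the log reports rocm or vulkan rather than cpu."""
    return detect_backend(log_text) in {"rocm", "vulkan"}


sample = ('level=INFO source=types.go:50 msg="inference compute" id=GPU-xxx '
          'library=rocm compute=gfx1201 name="AMD Radeon RX 9070 XT"')
print(detect_backend(sample), gpu_is_active(sample))
```

Redirecting the server output to a file and passing its contents to `gpu_is_active` gives a quick pass/fail check before running benchmarks.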
The system's default amdhip64.dll shipped with Windows drivers does not match ROCm 7.x SDK requirements, causing silent hangs during context creation.
Solution: We ship amdhip64_7.dll (from the ROCm 7.x toolkit) renamed to amdhip64.dll in lib/ollama/rocm/. This forces ggml-hip.dll to link against the correct driver interface.
Hardware matrix cores (rocWMMA) on early RDNA4 drivers cause instability and severe performance regressions (up to 73%).
Solution: rocWMMA is explicitly disabled at compile time via -DGGML_HIP_ROCWMMA=OFF.
DARS transforms Ollama from a single-model inference engine into a multi-model AI operating system with five tracks:
| Track | Capability | Headline Benefit |
|---|---|---|
| 1 | Scientific Inference Optimization | 17 physics-inspired algorithms — faster, cooler, more stable |
| 2 | Hebbian Model Surgery | Prune 70% of weights while keeping 95% of coding ability |
| 3 | Model Merge & Creation | Combine two models into one without any retraining |
| 4 | Dual-Model Cascade | 2B reasoner + 7B coder running simultaneously |
| 5 | Dense-to-MoE Upcycling | Convert any dense model to MoE for 4× less VRAM at inference |
┌──────────────────────────────────────────────────────────────────────┐
│ OLLAMA + DARS v2.0 │
├──────────────────────────────────────────────────────────────────────┤
│ LAYER 1: INFERENCE (Track 1) │
│ ├─ DARS System Controller (PID, Kalman, Arrhenius, Little's Law) │
│ ├─ MoE Router (Hysteresis, Percolation, Resonance, Coandă) │
│ ├─ Vulkan Cooperative Matrix (16×16 FP16 WMMA tiles) │
│ └─ ROCm Async DMA (overlap expert loading with compute) │
├──────────────────────────────────────────────────────────────────────┤
│ LAYER 2: MODEL SURGERY (Track 2) │
│ ├─ Hebbian Activation Profiler (forward-pass hooks) │
│ ├─ Trace Persistence (binary format, cross-session) │
│ ├─ Pruning Engine (magnitude, structured, hybrid) │
│ └─ Expert Extractor (pull out "coding" experts from MoE) │
├──────────────────────────────────────────────────────────────────────┤
│ LAYER 3: MODEL CREATION (Track 3) │
│ ├─ SLERP Merge (spherical interpolation) │
│ ├─ TIES Merge (trim, elect, sign — conflict resolution) │
│ ├─ DARE Merge (drop & rescale — sparsity preserving) │
│ └─ GGUF I/O Wrapper (read/write quantized models) │
├──────────────────────────────────────────────────────────────────────┤
│ LAYER 4: DUAL-MODEL CASCADE (Track 4) │
│ ├─ Model A: Reasoner (1–2 GB, always resident) │
│ ├─ Model B: Coder (4–6 GB, loaded on demand) │
│ ├─ Attractor Domain Detection (classifies intent) │
│ ├─ Phase Transition Trigger (detects abrupt topic shifts) │
│ ├─ RAG Diffusion (retrieved docs influence layer activation) │
│ └─ Hysteresis Residency (keeps coder loaded during sessions) │
├──────────────────────────────────────────────────────────────────────┤
│ LAYER 5: DENSE-TO-MOE UPCYCLING (Track 5) │
│ ├─ K-means / Hebbian-guided clustering │
│ ├─ Expert Tensor Builder │
│ ├─ Router Initialization from Centroids │
│ └─ GGUF MoE Writer │
└──────────────────────────────────────────────────────────────────────┘
| # | Framework | Algorithm | What It Does |
|---|---|---|---|
| 1 | Hysteresis | Sticky cache with deadband | Keeps recently-used experts resident |
| 2 | Percolation | Hard VRAM capacity limit | Blocks unsafe allocations before OOM |
| 3 | Resonance | EMA/IIR on router logits | Smooths routing decisions over time |
| 4 | Coandă | Temporal locality bias | Biases router toward hot experts |
| 5 | Fermi-Dirac | Sigmoid threshold at μ | Soft on/off gate for expert loading |
| 6 | Hawking | Occupancy-weighted eviction | Evicts cold experts first |
| 7 | Euler Disk | Priority boost near deadline | Prevents generation timeouts |
| 8 | ER=EPR | Co-activation matrix prefetch | Pre-loads experts that co-fire |
| 9 | PID Controller | Thermal/workload throttle | Keeps GPU temp at target (80 °C) |
| 10 | Kalman Filter | VRAM reading smoother | Removes sensor noise from readings |
| 11 | Little's Law | Queue depth monitor | Detects backpressure early |
| 12 | Arrhenius | Exponential backoff | Slows batch sizes as load spikes |
| 13 | Darcy | Memory pressure → batch mod | Dynamically reduces batch under pressure |
| 14 | Binary Inspiral | Swap acceleration chirp | Detects thrashing before it stalls |
| 15 | Schwarzschild | Safety margin check | Enforces minimum free VRAM headroom |
| 16 | White Hole | Emergency evacuation | Force-evicts when margin is breached |
| 17 | Knapsack | Greedy value/weight sort | Optimal expert eviction ordering |
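To make framework 9 concrete, here is a textbook PID loop in Python. It is a sketch only, not the code in `llama/ggml-dars.c`: the gains and setpoint reuse the defaults documented for `OLLAMA_DARS_PID_KP`, `OLLAMA_DARS_PID_KI`, `OLLAMA_DARS_PID_KD`, and `OLLAMA_DARS_PID_SETPOINT`, while the `throttle_factor` mapping and its `max_output` parameter are assumptions made for illustration.

```python
class PIDController:
    """Textbook PID controller; error is positive when the GPU is too hot."""

    def __init__(self, kp=0.50, ki=0.10, kd=0.05, setpoint=80.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint
        self.integral = 0.0
        self.prev_error = None

    def update(self, measured_temp, dt=1.0):
        error = measured_temp - self.setpoint
        self.integral += error * dt
        # No derivative term on the first sample (nothing to compare against).
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def throttle_factor(pid_output, max_output=10.0):
    """Hypothetical mapping from PID output to a workload scale in [0.1, 1.0]."""
    clamped = min(max(pid_output, 0.0), max_output)
    return 1.0 - 0.9 * clamped / max_output


pid = PIDController()
for temp in (78.0, 82.0, 85.0):
    print(temp, round(throttle_factor(pid.update(temp)), 3))
```

The sketch omits integral anti-windup, which a production controller would normally need.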
Vulkan Cooperative Matrix (RDNA4 native):
The RX 9070 XT exposes VK_KHR_cooperative_matrix. The bundled GLSL shader mul_mm_coopmat_fp16.comp uses 16×16×16 FP16 WMMA tiles with 8 wave32 subgroups for maximum throughput.
Record which neurons fire during your target tasks, then prune the rest:
Input: CodeLlama-7B-Q4_K_M.gguf (4.5 GB)
Trace: code.hebbian_trace (from 1,000 coding queries)
Output: CodeLlama-Pruned-2B-Q4_K_M.gguf (1.5 GB, 95% coding ability retained)
How it works:
- Record: `llama_dars_hook_ffn_output()` captures activation magnitudes during inference
- Accumulate: EMA running average per neuron: `trace[i] = α * |act[i]| + (1-α) * trace[i]`
- Normalize: Divide by max to get [0, 1] scores
- Prune: Keep top 30% neurons, zero out bottom 70%
- Export: Write new GGUF with pruned weights
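The accumulate, normalize, and prune steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the fork's C++ profiler: the value `alpha=0.1` is a placeholder (the README does not give α), weights are modelled as one row per neuron, and all function names are hypothetical.

```python
def update_trace(trace, activations, alpha=0.1):
    """One EMA step per neuron: trace[i] = alpha*|act[i]| + (1-alpha)*trace[i]."""
    return [alpha * abs(a) + (1.0 - alpha) * t for t, a in zip(trace, activations)]


def normalize(trace):
    """Divide by the maximum so scores fall in [0, 1]."""
    peak = max(trace)
    if peak == 0.0:
        return [0.0] * len(trace)
    return [t / peak for t in trace]


def keep_mask(scores, keep=0.3):
    """True for the top `keep` fraction of neurons by score."""
    n_keep = max(1, round(len(scores) * keep))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = set(ranked[:n_keep])
    return [i in kept for i in range(len(scores))]


def prune(weight_rows, mask):
    """Zero out the weight row of every neuron that is not kept."""
    return [row if keep else [0.0] * len(row) for row, keep in zip(weight_rows, mask)]


trace = [0.0] * 10
for activations in ([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [0, -1, -2, -3, -4, -5, -6, -7, -8, -9]):
    trace = update_trace(trace, activations)
mask = keep_mask(normalize(trace), keep=0.3)
print(mask)
```

Taking the absolute value before averaging means strongly negative activations count as much as strongly positive ones, which matches the `|act[i]|` term in the formula.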
# Enable Hebbian profiling
$env:OLLAMA_DARS_ENABLE = "1"
$env:OLLAMA_DARS_HEBBIAN = "1"
ollama run codellama:7b # Run 1,000+ coding queries
# Prune the model (trace auto-saved on shutdown)
ollama prune codellama:7b --trace code.hebbian_trace --keep 0.3 --output CodeLlama-Pruned.gguf
Combine two models into a new specialist model, with no training required:
Model A: Phi-2 (2.7B, reasoning specialist) weight = 0.3
Model B: CodeLlama-7B (coding specialist) weight = 0.7
Method: SLERP
Output: CodeReasoner-7B.gguf
| Method | Best For | Time (7B) | Quality |
|---|---|---|---|
| SLERP | Similar fine-tunes of same base | 2 min | High |
| TIES | Conflicting fine-tunes | 3 min | High |
| DARE | Sparse / drop-heavy models | 2 min | Medium |
| Linear | Quick weighted average | 1 min | Medium |
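For readers unfamiliar with SLERP, the sketch below shows the standard spherical interpolation formula applied to two flattened weight vectors. It is plain Python for illustration and assumes both tensors have the same shape; it is not the kernel in `llama/ggml-dars-merge.cpp`, and the fallback to linear interpolation for near-parallel vectors is a common convention rather than something the README specifies.

```python
import math


def slerp(a, b, t):
    """Spherical linear interpolation between vectors a and b at fraction t."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Angle is undefined for a zero vector: use plain linear interpolation.
        return [(1.0 - t) * x + t * y for x, y in zip(a, b)]
    cos_omega = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if sin_omega < 1e-6:
        # Nearly parallel (or anti-parallel) vectors: linear interpolation is stable.
        return [(1.0 - t) * x + t * y for x, y in zip(a, b)]
    weight_a = math.sin((1.0 - t) * omega) / sin_omega
    weight_b = math.sin(t * omega) / sin_omega
    return [weight_a * x + weight_b * y for x, y in zip(a, b)]


print(slerp([1.0, 0.0], [0.0, 1.0], 0.5))
```

With `t = 0` the result is the first vector and with `t = 1` it is the second, which corresponds to the `--t` option in the merge commands.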
# SLERP merge
ollama merge --model-a Phi-2-Q4.gguf --model-b CodeLlama-7B-Q4.gguf `
--method SLERP --t 0.5 --output CodeReasoner-7B.gguf
# TIES merge (conflict resolution)
ollama merge --model-a ModelA-Q4.gguf --model-b ModelB-Q4.gguf `
--method TIES --trim-rate 0.2 --output Resolved-7B.gguf
Keep a lightweight reasoner always resident and load the specialist only when needed:
User: "Write a CUDA kernel for bitonic sort"
│
▼
Model A: Phi-2 (1.5 GB, always hot)
Output: "Task: bitonic sort. Domain: GPU programming. Language: CUDA…"
Time: ~50ms
│
▼
DARS Attractor: detects CODE_WRITE intent → loads Model B
│
▼
Model B: CodeLlama-7B (4.5 GB, on demand)
Output: "__global__ void bitonic_sort(…) { … }"
Time: 800ms first load, ~200ms/token resident
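The switching behaviour in this flow can be sketched as a small state machine. The Python below is an assumption-laden illustration: it takes a per-token "code confidence" score as input (how DARS actually computes intent is not described here), uses the documented defaults for `OLLAMA_DARS_SWITCH_THRESHOLD` and `OLLAMA_DARS_HYSTERESIS`, and the class name `CascadeRouter` is hypothetical.

```python
class CascadeRouter:
    """Choose Model A or B per token, keeping B warm for a few tokens after a switch."""

    def __init__(self, switch_threshold=0.6, hysteresis=5):
        self.switch_threshold = switch_threshold
        self.hysteresis = hysteresis
        self.warm_tokens = 0  # tokens remaining before Model B may be released

    def route(self, code_confidence):
        if code_confidence >= self.switch_threshold:
            # Confident code intent: use Model B and reset the warm window.
            self.warm_tokens = self.hysteresis
            return "B"
        if self.warm_tokens > 0:
            # Below threshold, but still inside the hysteresis window.
            self.warm_tokens -= 1
            return "B"
        return "A"


router = CascadeRouter()
scores = [0.2, 0.9, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
print([router.route(s) for s in scores])
```

The warm window is what prevents rapid A/B flapping when the confidence score hovers near the threshold.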
VRAM layout on 16 GB:
| Allocation | Size |
|---|---|
| Model A weights (reasoner, always resident) | 1.5 GB |
| Model B weights (coder, on demand) | 4.5 GB |
| KV Cache — Model A | 2.0 GB |
| KV Cache — Model B | 4.0 GB |
| RAG / Transient scratch | 2.0 GB |
| DARS safety headroom | 2.0 GB |
| Total | 16.0 GB |
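As a quick arithmetic check of the layout above, the Python snippet below sums the table rows and also computes the footprint while Model B is unloaded. The dictionary keys are shorthand labels introduced here, and the assumption that Model B's KV cache is released together with its weights is mine, not the README's.

```python
# Sizes in GB, copied from the VRAM layout table.
VRAM_BUDGET_GB = {
    "model_a_weights": 1.5,
    "model_b_weights": 4.5,
    "kv_cache_a": 2.0,
    "kv_cache_b": 4.0,
    "rag_scratch": 2.0,
    "safety_headroom": 2.0,
}


def total_gb(budget):
    """Sum of all allocations in the budget."""
    return sum(budget.values())


def fits(budget, capacity_gb=16.0):
    """True when the budget does not exceed the card's VRAM."""
    return total_gb(budget) <= capacity_gb


def footprint_without_model_b(budget):
    """Footprint while Model B is unloaded (assumes its KV cache is freed too)."""
    return total_gb(budget) - budget["model_b_weights"] - budget["kv_cache_b"]


print(total_gb(VRAM_BUDGET_GB), fits(VRAM_BUDGET_GB), footprint_without_model_b(VRAM_BUDGET_GB))
```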
$env:OLLAMA_DARS_ENABLE = "1"
$env:OLLAMA_DARS_DUAL = "1"
$env:OLLAMA_DARS_MODEL_A = "C:\Models\Phi-2-Q4.gguf"
$env:OLLAMA_DARS_MODEL_B = "C:\Models\CodeLlama-7B-Q4.gguf"
$env:OLLAMA_DARS_HYSTERESIS = "5" # tokens to keep Model B warm
ollama.exe serve
Convert any dense model to a sparse MoE model, which unlocks all DARS MoE optimizations:
Input: Llama-3-8B-Dense.gguf (4.5 GB — every token uses ALL FFN neurons)
↓ [cluster 14,336 FFN neurons into 16 experts via k-means]
Middle: Llama-3-8B-MoE-16x.gguf (4.5 GB disk, only 2/16 experts active per token)
↓ [Hebbian trace → extract top 4 task-specific experts]
Final: Llama-3-8B-CodeExpert-4x.gguf (1.2 GB — 4 experts only)
| Model | VRAM (active) | Disk |
|---|---|---|
| Llama-3-8B Dense | 4.5 GB | 4.5 GB |
| Llama-3-8B MoE-16x (2/16 active) | 1.2 GB | 4.5 GB |
| Llama-3-8B CodeExpert-4x | 1.1 GB | 1.2 GB |
| Clustering Method | Quality Loss | Use Case |
|---|---|---|
| Hebbian-guided + 16 experts, top-2 | 5–8% | Task-specific (best) |
| K-means + 16 experts, top-2 | 8–12% | General |
| Naive split + 16 experts, top-2 | 15–20% | Quick test only |
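The k-means variant in the table can be illustrated with a minimal Lloyd's algorithm that groups FFN neurons into experts. This Python sketch makes several assumptions: each neuron is represented by a small feature vector (for example a weight row or an activation profile, neither of which the README specifies), initial centroids are simply the first `k` points, and no attempt is made to balance expert sizes, which a real upcycler would likely need. It does not reproduce `llama/ggml-dars-upcycle.cpp`.

```python
def kmeans(points, k, iterations=20):
    """Lloyd's algorithm with deterministic init (the first k points)."""
    centroids = [list(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        for i, p in enumerate(points):
            assign[i] = min(
                range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])),
            )
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [points[i] for i in range(len(points)) if assign[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign, centroids


def group_into_experts(assign, k):
    """Turn per-neuron cluster ids into one list of neuron indices per expert."""
    experts = [[] for _ in range(k)]
    for neuron, cluster in enumerate(assign):
        experts[cluster].append(neuron)
    return experts


# Six toy "neurons" with 2-D feature vectors, clustered into two experts.
features = [[0.0, 0.0], [10.0, 10.0], [0.1, 0.0], [10.0, 10.1], [0.0, 0.2], [9.9, 10.0]]
assignment, centers = kmeans(features, k=2)
print(group_into_experts(assignment, 2))
```

The resulting centroids are the natural starting point for the router weights, which is the idea behind "Router Initialization from Centroids" in the architecture diagram.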
# 1. Record Hebbian trace
$env:OLLAMA_DARS_ENABLE = "1"; $env:OLLAMA_DARS_HEBBIAN = "1"
ollama.exe run llama3:8b # run 100+ programming queries
# 2. Upcycle dense → MoE
ollama.exe upcycle `
--input "C:\Models\Llama-3-8B-Dense.gguf" `
--output "C:\Models\Llama-3-8B-MoE-16x.gguf" `
--experts 16 --top-k 2 --method hebbian `
--hebbian-trace "llama-3-8b_programming.hebbian_trace"
# 3. Extract top 4 experts for a specialist
ollama.exe extract `
--input "Llama-3-8B-MoE-16x.gguf" `
--trace "moe-programming.hebbian_trace" `
--top-experts 4 `
--output "Llama-3-8B-CodeExpert-4x.gguf"
A live browser dashboard is served automatically at http://localhost:11434/dashboard/ while Ollama is running.
The dashboard shows:
- Live Temperature, VRAM usage, and GPU utilization
- Real-time generation speed (tok/s) and memory bandwidth
- Active optimization status (Flash Attention, HIP Graphs, Wave32)
- DARS framework states and routing decisions
- Configuration warnings (TdrDelay, CPU fallback, etc.)
No extra setup needed — dashboard.html is embedded in the binary.
The following table summarizes side-by-side performance metrics of the optimized Vulkan backend (Wave32, Flash Attention, Pipeline Caching) compared to the ROCm base reference:
| Model | Quantization | Vulkan Rate | ROCm Ref Rate | Delta (Vulkan vs ROCm) | Status |
|---|---|---|---|---|---|
| q1-3b-prime | IQ1_S | 281.38 tok/s | ~212.00 tok/s | +32.7% | Fully Functional |
| gigabateman-7b | Q2_K | 169.90 tok/s | ~120.77 tok/s | +40.7% | Fully Functional |
| qwen-2.5-7b | Q4_K_M | 115.57 tok/s | ~94.75 tok/s | +22.0% | Fully Functional |
| gemma-4-e4b | Q8_0 | 88.00 tok/s | ~74.27 tok/s | +18.5% | Fully Functional |
| granite-4.1-3b-q8 | Q8_0 | 135.71 tok/s | ~109.33 tok/s | +24.1% | Fully Functional |
| glm-5.1-9b | Q4_K_S | 94.56 tok/s | ~80.07 tok/s | +18.1% | Fully Functional |
| granite-4.1-8b-q4 | Q4_K_M | 98.65 tok/s | ~80.74 tok/s | +22.2% | Fully Functional |
| granite-4.1-8b-q6 | Q6_K | 77.28 tok/s | ~66.54 tok/s | +16.1% | Fully Functional |
| llama-3-8b | Q8_0 | 69.73 tok/s | ~63.20 tok/s | +10.3% | Fully Functional |
| gemma-4-12b | IQ3_XXS | 60.50 tok/s | ~51.00 tok/s | +18.6% | Fully Functional |
| devstral-2.5b | IQ4_XS | 36.11 tok/s | ~42.34 tok/s | -14.7% | Fully Functional |
| rocmforge-7b | Q6_K | 91.20 tok/s | Early Stop | Fixed & Stable | No Early Stop |
| starcoder2-15b | Q4_K_M | 57.45 tok/s | Early Stop | Fixed & Stable | No Early Stop |
| qwen-3-4b | Q4_K_M | 171.07 tok/s | N/A (COT) | Fully Functional | Stable |
| deepseek-coder-v2 | IQ4_XS | 223.41 tok/s | N/A (Frag) | Fully Functional | Stable |
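The Delta column is the relative change of the Vulkan rate against the ROCm reference. A short Python sketch of that calculation, using three rows copied from the table (the function name `delta_percent` is introduced here for illustration):

```python
def delta_percent(vulkan_tps, rocm_tps):
    """Relative throughput change of Vulkan versus ROCm, in percent."""
    return (vulkan_tps / rocm_tps - 1.0) * 100.0


# (Vulkan tok/s, ROCm reference tok/s) taken from the benchmark table.
rows = {
    "q1-3b-prime": (281.38, 212.00),
    "llama-3-8b": (69.73, 63.20),
    "devstral-2.5b": (36.11, 42.34),
}
for name, (vulkan, rocm) in rows.items():
    print(f"{name}: {delta_percent(vulkan, rocm):+.1f}%")
```

A negative value, as for devstral-2.5b, means the ROCm backend was faster for that model.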
| Model | Layer 25 | Layer 29 | Layer 33 | Full GPU |
|---|---|---|---|---|
| Granite 4.1 8B Q4 | 79.53 tok/s | 81.04 tok/s | 79.59 tok/s | 80.74 tok/s |
| Granite 4.1 8B Q6 | 65.22 tok/s | 66.81 tok/s | 66.61 tok/s | 66.54 tok/s |
| Granite 4.1 3B Q8 | 108.76 tok/s | 107.57 tok/s | 109.11 tok/s | 109.33 tok/s |
All Granite models: ~5–7 GB VRAM (safe under 15.8 GB available).
| Scenario | Baseline | With DARS | Gain |
|---|---|---|---|
| Dense model tok/s | 45 t/s | 58–65 t/s | +30–45% |
| Prompt processing | 850 t/s | 1,100–1,200 t/s | +30–40% |
| Model load time | 30 s | 12–15 s | 2× faster |
| MoE expert swaps/token | 1.5–2.0 | 0.3–0.5 | −70% |
| PCIe bandwidth waste | 40% | 15% | 2.5× efficient |
| Dual-model general chat | 7B @ 45 t/s | 2B @ 80 t/s | +78% |
| Dual-model mixed session | constant 7B | 2B↔7B auto | 2× avg speed |
Evaluation of generated Notepad code (C# WinForms and Python Tkinter) — RX 9070 XT:
| Model | Language | % of Fixes Needed | Score / 100 | Key Observations |
|---|---|---|---|---|
| Gemma 4-E4B | Python | 0% | 100 | Completely functional. Flawless menus, shortcuts, title bar, file loading. |
| Gemma 4-E4B | C# | <1% | 98 | Only needs a 2-line SaveFile() proxy method. |
| Rocmforge 7B | Python | 1% | 92 | Structurally sound. Only needs self.root.config(menu=self.menu_bar). |
| Devstral 2.5B | Python | 3% | 90 | Stripped markdown fences. Missed word-wrap toggle. |
| Devstral 2.5B | C# | 5% | 75 | Generated partial class calling InitializeComponent(); delete that call. |
| Qwen 2.5-7B | Python | 5% | 85 | Word-wrap menu overrode main menu bar; needs View cascades configured. |
| Granite 4.1-8B-Q4 | Python | 6% | 80 | Return key toggled word wrap. Saves basename instead of absolute path. |
| DeepSeek-Coder-V2 | Python | 60% | 40 | Excess prose and fragmented/incomplete scripts. |
| Llama-3-8B | Python/C# | 100% | 0 | Python outputted single word "assistant". C# outputted Arabic characters. |
| Qwen-3-4B | Python/C# | 100% | 0 | Spent all 3,000 tokens on chain-of-thought; generated no code. |
| Starcoder2-15B | Python/C# | 100% | 0 | Hallucinated a date string and exited without code. |
See CODEGEN_BENCHMARK.md for full raw output details.
vs. Stock Ollama: Resolves the severe CPU fallback bug. The apparent "~4× speedup" seen in older reports is purely from successfully executing on the GPU instead of silently falling back to the CPU.
vs. v4 patches: ~15% faster thanks to conditional safety and disabled rocWMMA matrix cores.
The lib/ollama/ directory contains pre-built binaries for gfx1201. Just set environment variables and run.
Prerequisites: ROCm 7.x for Windows · Visual Studio 2022 with "Desktop development with C++" · Go 1.23+ · CMake 3.25+ · Vulkan SDK 1.3+
In a PowerShell session started from the "x64 Native Tools Command Prompt for VS 2022" (the commands below use PowerShell syntax):
# Set ROCm environment
$env:ROCM_PATH = "C:\Program Files\AMD\ROCm\7.1"
$env:HIP_PATH = $env:ROCM_PATH
$env:PATH = "$env:ROCM_PATH\bin;$env:PATH"
# One-shot build script (builds ROCm + Vulkan backends)
.\build_gfx1201.ps1
Or manually:
# Build llama-server (GPU backend)
cmake -B build\ls-rocm llama/server `
-DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1201 `
-DCMAKE_BUILD_TYPE=Release `
-DCMAKE_PREFIX_PATH=$env:ROCM_PATH
cmake --build build\ls-rocm --parallel 8
# Build Vulkan backend
cmake -B build\ls-vk llama/server `
-DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build\ls-vk --parallel 8
# Compile Vulkan cooperative-matrix shader
glslangValidator --target-env vulkan1.3 -V `
llama\mul_mm_coopmat_fp16.comp -o llama\mul_mm_coopmat_fp16.spv
# Build Go binary
go build -o ollama.exe ./cmd/ollama
cmake -B build `
-DOLLAMA_DARS=ON `
-DOLLAMA_DARS_DUAL=ON `
-DOLLAMA_DARS_HEBBIAN=ON `
-DOLLAMA_DARS_MERGE=ON `
-DAMDGPU_TARGETS=gfx1201
cmake --build build --config Release
# Step 1: Build llama-server (C++ backend)
cmake -S llama/server -B build/llama \
-DGGML_HIP=ON \
-DAMDGPU_TARGETS=gfx1201 \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_PREFIX_PATH=/opt/rocm
cmake --build build/llama --parallel $(nproc)
# Step 2: Source environment + build Go binary
source ./scripts/env_gfx1201.sh
go build -o ollama ./cmd/ollama
# Step 3: Run
./ollama serve
| Variable | Default | Purpose |
|---|---|---|
| `OLLAMA_LLM_LIBRARY` | `rocm` | Forces backend (`rocm` or `vulkan`), prevents CPU fallback |
| `HSA_OVERRIDE_GFX_VERSION` | `12.0.1` | ROCm: Maps gfx1201 to correct rocBLAS target |
| `OLLAMA_FLASH_ATTENTION` | `1` | Optimized attention kernels (ROCm & Vulkan) |
| `GGML_HIP_GRAPHS` | `1` | ROCm: Reduces kernel launch overhead |
| `GGML_CUDA_NO_VMM` | `1` | ROCm: Required on Windows (prevents VMM issues) |
| `AMD_DIRECT_DISPATCH` | `1` | ROCm: Reduces dispatch latency |
| `GPU_MAX_ALLOC_PERCENT` | `85` | Max VRAM allocation |
| Variable | Default | Purpose |
|---|---|---|
| `GGML_VK_SUBGROUP_SIZE` | `32` | Forces wave32 subgroup execution for RDNA4/gfx1201 optimal occupancy |
| `GGML_VK_PIPELINE_CACHE` | `1` | Enables Vulkan pipeline caching for faster initialization |
| `GGML_VK_NO_VMM` | `1` | Disables virtual memory management (required for compatibility on Windows AMD drivers) |
| `GGML_VK_COOP_MATRIX` | `0` | Keeps cooperative matrix multiplication disabled to prevent RDNA4 driver resets |
| `GGML_VK_VISIBLE_DEVICES` | `0` | Selects which Vulkan physical device index to target |
| Variable | Default | Description |
|---|---|---|
| `OLLAMA_DARS_ENABLE` | `0` | Master switch for all DARS |
| `OLLAMA_DARS_MOE` | `0` | MoE-specific frameworks (Tracks 1+) |
| `OLLAMA_DARS_DUAL` | `0` | Dual-model cascade (Track 4) |
| `OLLAMA_DARS_HEBBIAN` | `0` | Activation profiling + pruning (Track 2) |
| `OLLAMA_DARS_MERGE` | `0` | Model merge toolkit (Track 3) |
| `OLLAMA_DARS_UPCYCLE` | `0` | Dense-to-MoE upcycling (Track 5) |
| Variable | Default | Description |
|---|---|---|
| `OLLAMA_DARS_MODEL_A` | (unset) | Path to Reasoner GGUF (required) |
| `OLLAMA_DARS_MODEL_B` | (unset) | Path to Coder GGUF (required) |
| `OLLAMA_DARS_HYSTERESIS` | `5` | Tokens to keep Model B warm |
| `OLLAMA_DARS_SWITCH_THRESHOLD` | `0.6` | Confidence needed to switch domain |
| Variable | Default | Description |
|---|---|---|
| `OLLAMA_DARS_VRAM_MB` | `16384` | Override VRAM size |
| `OLLAMA_DARS_COANDA` | `0.30` | Logit bias for loaded experts |
| `OLLAMA_DARS_RESONANCE` | `0.70` | EMA alpha for routing confidence |
| `OLLAMA_DARS_FERMI_MU` | `0.15` | Fermi-Dirac chemical potential |
| `OLLAMA_DARS_FERMI_TEMP` | `0.05` | Fermi-Dirac temperature |
| `OLLAMA_DARS_PID_KP` | `0.50` | PID proportional gain |
| `OLLAMA_DARS_PID_KI` | `0.10` | PID integral gain |
| `OLLAMA_DARS_PID_KD` | `0.05` | PID derivative gain |
| `OLLAMA_DARS_PID_SETPOINT` | `80` | Target GPU temperature (°C) |
| `OLLAMA_DARS_SCHWARZ_MARGIN` | `2.0` | OOM safety margin multiplier |
| `OLLAMA_DARS_WORMHOLE_THRESH` | `0.2` | Co-activation prefetch threshold |
| Variable | Why It's Dangerous |
|---|---|
| `GGML_HIP_ROCWMMA=1` | 73% performance regression on RDNA4 |
| `HIPCC_COMPILE_FLAGS_APPEND` with `-ffast-math` | Breaks NaN detection in attention |
| `GPU_MAX_ALLOC_PERCENT` > 95 | OOM crashes |
| `GGML_VK_COOP_MATRIX=1` | Causes GPU hangs and driver resets on RDNA4 Vulkan |
| Missing `OLLAMA_LLM_LIBRARY` | Silent CPU fallback (ensure it is set to `rocm` or `vulkan`) |
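The dangerous settings listed in this table can be checked before launching the server. The Python sketch below encodes only the rules from the table; `check_settings` is a hypothetical helper and not a command shipped with this fork.

```python
def check_settings(env):
    """Return a list of warnings for the dangerous settings listed in the table."""
    warnings = []
    if env.get("GGML_HIP_ROCWMMA") == "1":
        warnings.append("GGML_HIP_ROCWMMA=1: 73% performance regression on RDNA4")
    if "-ffast-math" in env.get("HIPCC_COMPILE_FLAGS_APPEND", ""):
        warnings.append("-ffast-math breaks NaN detection in attention")
    try:
        max_alloc = float(env.get("GPU_MAX_ALLOC_PERCENT", "0"))
    except ValueError:
        max_alloc = 0.0  # unparsable value: treat as not set
    if max_alloc > 95:
        warnings.append("GPU_MAX_ALLOC_PERCENT > 95: OOM crashes")
    if env.get("GGML_VK_COOP_MATRIX") == "1":
        warnings.append("GGML_VK_COOP_MATRIX=1: GPU hangs and driver resets")
    if env.get("OLLAMA_LLM_LIBRARY") not in ("rocm", "vulkan"):
        warnings.append("OLLAMA_LLM_LIBRARY not set to rocm or vulkan: silent CPU fallback")
    return warnings


example_env = {"OLLAMA_LLM_LIBRARY": "rocm", "GPU_MAX_ALLOC_PERCENT": "99"}
for warning in check_settings(example_env):
    print("WARNING:", warning)
```

Passing `dict(os.environ)` instead of the example dictionary checks the current shell session.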
| Symptom | Cause | Fix |
|---|---|---|
| `library=cpu` in logs | `OLLAMA_LLM_LIBRARY` not set | `$env:OLLAMA_LLM_LIBRARY="rocm"` |
| Silent hang during context creation | DLL mismatch | Ensure amdhip64.dll in lib/ollama/rocm/ is from ROCm 7.x |
| GPU not detected / gfx1201 not found | Wrong arch string | Run `rocminfo \| findstr gfx12` |
| "llama-server binary not found" | Skipped C++ build | Run build_gfx1201.ps1 or cmake steps |
| GPU hangs / TDR resets | TDR timeout too short | Set TdrDelay=30 DWORD in HKLM\SYSTEM\CurrentControlSet\Control\GraphicsDrivers |
| Performance lower than expected | rocWMMA active | Ensure GGML_HIP_ROCWMMA is NOT set |
| "Coopmat not available" | Old Vulkan driver | Update to Adrenalin 25.6.1+ |
| "DARS not initialized" | Master switch off | $env:OLLAMA_DARS_ENABLE="1" |
| "Model B insufficient VRAM" | Safety margin too high | Lower OLLAMA_DARS_SCHWARZ_MARGIN to 1.5 |
| Domain oscillates between A/B | Hysteresis too low | Increase OLLAMA_DARS_HYSTERESIS to 10 |
| Hebbian trace empty | Sample rate too low | $env:OLLAMA_DARS_HEBBIAN_SAMPLE_RATE="1.0" |
| "Merge failed" | GGUF vtable missing | Call dars_extract_set_gguf_vtable() in integration |
| File | Lines | Purpose |
|---|---|---|
| `llama/ggml-dars.h` | ~320 | Portable C header: 17 frameworks, all tunables |
| `llama/ggml-dars.c` | ~850 | Core: PID, Kalman, Hysteresis, Percolation, etc. |
| `llama/ggml-dars-rocm.cpp` | ~200 | ROCm 7.1: hipMemGetInfo, hipMemcpyAsync, gfx1201 detection |
| `llama/ggml-dars-vulkan.cpp` | ~300 | Vulkan cooperative matrix: extension detection + dispatch |
| `llama/mul_mm_coopmat_fp16.comp` | ~80 | GLSL shader: 16×16×16 FP16 WMMA tiles for RDNA4 |
| `llama/ggml-dars-hebbian.h` | ~250 | Profiler header: trace format, layer stats, pruning config |
| `llama/ggml-dars-hebbian.cpp` | ~600 | Profiler: EMA recording, normalization, top-K, overlap |
| `llama/ggml-dars-extract.cpp` | ~500 | GGUF I/O: prune, extract, merge operations |
| `llama/ggml-dars-merge.h` | ~180 | Merge header: SLERP, TIES, DARE, linear methods |
| `llama/ggml-dars-merge.cpp` | ~450 | Merge kernels: all mathematical implementations |
| `llama/ggml-dars-dual.h` | ~280 | Dual-model header: cascade, attractor, phase detector, RAG |
| `llama/ggml-dars-dual.cpp` | ~700 | Dual-model: 5-step inference pipeline |
| `llama/ggml-dars-upcycle.h` | ~200 | Upcycle header: clustering methods, MoE config |
| `llama/ggml-dars-upcycle.cpp` | ~480 | Upcycle: k-means, hebbian clustering, GGUF MoE writer |
| `llama/llama-dars-integration-v2.cpp` | ~450 | 14 hook functions with exact placement instructions |
| `llama/CMakeLists-dars-snippet.txt` | ~20 | Drop-in CMake block |
Total: ~5,800 lines of C/C++ across 15 files.
| File | Purpose |
|---|---|
| `discover/amd_gfx1201.go` | GPU detection + automatic tuning |
| `cmake/gfx1201.cmake` | Build system integration |
| `runner/gfx1201_config.go` | Runtime config (forces OLLAMA_LLM_LIBRARY=rocm) |
| `llama/ggml-cuda/gfx1201_kernels.cuh` | GPU kernels with conditional safety |
| `scripts/build_gfx1201.sh` | Linux clean build script |
| `scripts/env_gfx1201.sh` | Linux environment setup |
| `build_gfx1201.ps1` | Windows build script (ROCm + Vulkan) |
| File | Purpose |
|---|---|
| `server/dashboard.go` | REST API + static file server |
| `server/dashboard.html` | Browser dashboard UI |
| `server/dashboard_metrics_windows.go` | Windows GPU metrics (WMI) |
| `server/dashboard_metrics_fallback.go` | Fallback for non-Windows |
| Script | Purpose |
|---|---|
| `Run_All_Benchmarks.ps1` | Full 13-model benchmark suite |
| `Run_Codegen_Benchmarks.ps1` | Code generation quality benchmarks |
| `Run_Vulkan_Benchmarks.ps1` | Vulkan backend throughput benchmarks |
| `multimodel_bench.ps1` | Multi-model concurrent throughput |
| `Granite_Benchmark.ps1` | Granite multi-layer latency tests |
Same as upstream Ollama — MIT License.
DARS v2.0 files are additive patches and do not modify Ollama's core license.