From eca3e2c3a2e31e9f334fe7601c779bd0c21de2b1 Mon Sep 17 00:00:00 2001 From: weicj Date: Mon, 18 May 2026 21:17:20 +0800 Subject: [PATCH 1/3] docs: describe SM70 SM75 fork scope --- README.md | 75 +++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 75 insertions(+) diff --git a/README.md b/README.md index 94dd127..59d8716 100644 --- a/README.md +++ b/README.md @@ -1,3 +1,78 @@ +> [!IMPORTANT] +> This repository is an experimental SM70/SM75 fork of [QwenLM/FlashQLA](https://github.com/QwenLM/FlashQLA). +> +> It is not an official FlashQLA release and does not replace the upstream Hopper/SM90 implementation. + +# FlashQLA-SM70-SM75 + +Experimental forward-inference support for Qwen-style Gated DeltaNet on SM70/SM75-class NVIDIA GPUs. + +This fork keeps the upstream Hopper/SM90 TileLang path intact and is intended to add a guarded legacy backend for Volta/Turing inference devices. The current runtime validation target is RTX 2080 Ti / SM75. SM70 currently has compile coverage, but V100-class runtime validation is still required before making performance claims. + +## Intended Changes + +- Add an experimental forward-only Gated DeltaNet backend for SM70/SM75-class devices. +- Add guarded dispatch so the legacy backend is used only when the device, shape, and explicit opt-in flag match the supported path. +- Fall back to upstream behavior for unsupported devices, unsupported shapes, and flag-disabled runs. +- Add correctness and negative-dispatch coverage for the supported legacy path. +- Document the supported scope, validation status, and benchmark caveats separately from upstream Hopper results. + +## Supported Scope + +Supported: + +- forward inference only +- SM70/SM75-class CUDA devices as the intended legacy target family +- scalar-gate Gated DeltaNet +- Qwen-style grouped-query head mapping +- primary optimized shape: `D=128` +- explicit opt-in guard + +Not supported: + +- backward kernels or training +- generic support for all pre-Hopper NVIDIA GPUs +- runtime performance claims for SM70 before V100-class validation +- SM80/SM86/SM89 support claims +- automatic default dispatch for non-Hopper devices + +## Current Validation + +Runtime validation was performed on RTX 2080 Ti / SM75. + +Standalone kernel timing for a Qwen-like shape: + +- `B=1, T=512, Hq=16, Hv=32, D=128` +- control recurrent path: about `1.126 ms` +- optimized legacy path on SM75: about `0.520-0.533 ms` +- GDN-stage speedup: about `2.1x` + +GGUF runtime profiling on SM75: + +- default fused GDN: `406.656 ms` +- legacy fast path: `195.105 ms` +- GDN-stage speedup: about `2.08x` + +Whole-request impact under the same server parameters: + +- prefill: `+7.17%` +- decode: `+0.61%` +- wall time: `-3.49%` + +SM70 status: + +- compile check passes +- runtime validation is pending +- V100-class benchmarking is needed before claiming SM70 performance + +## Positioning + +This fork is meant to make the SM70/SM75 experiment reproducible and reviewable. It should be treated as an upstreamable experimental branch, not as a separate long-term replacement for FlashQLA. + +--- + +The original upstream README follows below. +

From 3ab27d77d8ca01d7a4718903b726add1a8886c0e Mon Sep 17 00:00:00 2001 From: weicj Date: Mon, 18 May 2026 21:30:08 +0800 Subject: [PATCH 2/3] ops: add SM70 SM75 legacy GDN forward backend --- README.md | 24 +- .../ops/gated_delta_rule/legacy/__init__.py | 6 + .../legacy/csrc/gdn_forward.cu | 348 ++++++++++++++++++ .../ops/gated_delta_rule/legacy/sm_legacy.py | 104 ++++++ tests/test_legacy_sm_gdn.py | 57 +++ 5 files changed, 531 insertions(+), 8 deletions(-) create mode 100644 flash_qla/ops/gated_delta_rule/legacy/__init__.py create mode 100644 flash_qla/ops/gated_delta_rule/legacy/csrc/gdn_forward.cu create mode 100644 flash_qla/ops/gated_delta_rule/legacy/sm_legacy.py create mode 100644 tests/test_legacy_sm_gdn.py diff --git a/README.md b/README.md index 59d8716..5a4dbd1 100644 --- a/README.md +++ b/README.md @@ -7,15 +7,16 @@ Experimental forward-inference support for Qwen-style Gated DeltaNet on SM70/SM75-class NVIDIA GPUs. -This fork keeps the upstream Hopper/SM90 TileLang path intact and is intended to add a guarded legacy backend for Volta/Turing inference devices. The current runtime validation target is RTX 2080 Ti / SM75. SM70 currently has compile coverage, but V100-class runtime validation is still required before making performance claims. +This fork keeps the upstream Hopper/SM90 TileLang path intact and adds an explicit legacy backend entry point for Volta/Turing inference devices. The current runtime validation target is RTX 2080 Ti / SM75. SM70 currently has compile coverage, but V100-class runtime validation is still required before making performance claims. -## Intended Changes +## Changes in This Fork -- Add an experimental forward-only Gated DeltaNet backend for SM70/SM75-class devices. -- Add guarded dispatch so the legacy backend is used only when the device, shape, and explicit opt-in flag match the supported path. -- Fall back to upstream behavior for unsupported devices, unsupported shapes, and flag-disabled runs. -- Add correctness and negative-dispatch coverage for the supported legacy path. -- Document the supported scope, validation status, and benchmark caveats separately from upstream Hopper results. +- Adds `flash_qla.ops.gated_delta_rule.legacy.chunk_gated_delta_rule_fwd_legacy`. +- Adds a lazy-built CUDA extension for a forward-only SM70/SM75-class Gated DeltaNet backend. +- Keeps the upstream Hopper/SM90 TileLang path unchanged. +- Keeps the legacy path explicit instead of silently replacing the upstream high-level API. +- Adds CUDA correctness tests for the supported legacy path. +- Documents the supported scope, validation status, and benchmark caveats separately from upstream Hopper results. ## Supported Scope @@ -26,11 +27,12 @@ Supported: - scalar-gate Gated DeltaNet - Qwen-style grouped-query head mapping - primary optimized shape: `D=128` -- explicit opt-in guard +- explicit legacy API entry point Not supported: - backward kernels or training +- automatic dispatch from the upstream high-level API - generic support for all pre-Hopper NVIDIA GPUs - runtime performance claims for SM70 before V100-class validation - SM80/SM86/SM89 support claims @@ -65,6 +67,12 @@ SM70 status: - runtime validation is pending - V100-class benchmarking is needed before claiming SM70 performance +Fork wrapper status: + +- Python syntax check passes +- CUDA tests are included under `tests/test_legacy_sm_gdn.py` +- CUDA PyTorch runtime validation still requires a CUDA-enabled PyTorch environment + ## Positioning This fork is meant to make the SM70/SM75 experiment reproducible and reviewable. It should be treated as an upstreamable experimental branch, not as a separate long-term replacement for FlashQLA. diff --git a/flash_qla/ops/gated_delta_rule/legacy/__init__.py b/flash_qla/ops/gated_delta_rule/legacy/__init__.py new file mode 100644 index 0000000..fa7921f --- /dev/null +++ b/flash_qla/ops/gated_delta_rule/legacy/__init__.py @@ -0,0 +1,6 @@ +# Copyright (c) 2026 The Qwen team, Alibaba Group. +# Licensed under The MIT License [see LICENSE for details] + +from .sm_legacy import chunk_gated_delta_rule_fwd_legacy + +__all__ = ["chunk_gated_delta_rule_fwd_legacy"] diff --git a/flash_qla/ops/gated_delta_rule/legacy/csrc/gdn_forward.cu b/flash_qla/ops/gated_delta_rule/legacy/csrc/gdn_forward.cu new file mode 100644 index 0000000..088f517 --- /dev/null +++ b/flash_qla/ops/gated_delta_rule/legacy/csrc/gdn_forward.cu @@ -0,0 +1,348 @@ +#include + +#include +#include + +#include +#include +#include +#include + +namespace { + +void check_cuda(cudaError_t status, const char* context) { + if (status != cudaSuccess) { + throw std::runtime_error(std::string(context) + ": " + + cudaGetErrorString(status)); + } +} + +__device__ __forceinline__ float subgroup_sum_lane0(float value, + int width) { + constexpr unsigned mask = 0xffffffffU; + for (int offset = width / 2; offset > 0; offset >>= 1) { + value += __shfl_down_sync(mask, value, offset, width); + } + return value; +} + +__device__ __forceinline__ float subgroup_broadcast_lane0(float value, + int width) { + return __shfl_sync(0xffffffffU, value, 0, width); +} + +template +__global__ void gdn_forward_kernel(const float* __restrict__ q, + const float* __restrict__ k, + const float* __restrict__ v, + const float* __restrict__ gate, + const float* __restrict__ beta, + const float* __restrict__ initial_state, + float* __restrict__ output, + float* __restrict__ final_state, + int batch, + int tokens, + int q_heads, + int v_heads, + float scale) { + static_assert(D % (COLS * (32 / WIDTH)) == 0); + constexpr int subgroups_per_warp = 32 / WIDTH; + constexpr int rows_per_lane = (D + WIDTH - 1) / WIDTH; + + const int hv = blockIdx.x; + const int b = blockIdx.y; + const int subgroup = threadIdx.x / WIDTH; + const int lane = threadIdx.x % WIDTH; + const int group_base = + (blockIdx.z * blockDim.y + threadIdx.y) * subgroups_per_warp + subgroup; + const int col_base = group_base * COLS; + const int hq = hv / (v_heads / q_heads); + + float state_shard[COLS][rows_per_lane]; + +#pragma unroll + for (int c = 0; c < COLS; ++c) { + const int col = col_base + c; +#pragma unroll + for (int r = 0; r < rows_per_lane; ++r) { + const int row = r * WIDTH + lane; + float value = 0.0F; + if (row < D) { + const auto state_index = + (((static_cast(b) * v_heads + hv) * D + col) * D) + row; + value = initial_state == nullptr ? 0.0F : initial_state[state_index]; + } + state_shard[c][r] = value; + } + } + + for (int t = 0; t < tokens; ++t) { + const auto gate_index = + ((static_cast(b) * tokens + t) * v_heads + hv); + float gate_value = 0.0F; + float beta_value = 0.0F; + if (threadIdx.x == 0) { + gate_value = __expf(gate[gate_index]); + beta_value = beta[gate_index]; + } + gate_value = __shfl_sync(0xffffffffU, gate_value, 0); + beta_value = __shfl_sync(0xffffffffU, beta_value, 0); + + float k_reg[rows_per_lane]; + float q_reg[rows_per_lane]; + float kv_partial[COLS]; +#pragma unroll + for (int c = 0; c < COLS; ++c) { + kv_partial[c] = 0.0F; + } + +#pragma unroll + for (int r = 0; r < rows_per_lane; ++r) { + const int row = r * WIDTH + lane; + float q_value = 0.0F; + float k_value = 0.0F; + if (row < D) { + const auto qk_index = + (((static_cast(b) * tokens + t) * q_heads + hq) * D) + row; + q_value = q[qk_index]; + k_value = k[qk_index]; + } + q_reg[r] = q_value; + k_reg[r] = k_value; +#pragma unroll + for (int c = 0; c < COLS; ++c) { + kv_partial[c] += state_shard[c][r] * k_value; + } + } + + float delta[COLS]; +#pragma unroll + for (int c = 0; c < COLS; ++c) { + const float kv_col = subgroup_sum_lane0(kv_partial[c], WIDTH); + float delta_value = 0.0F; + if (lane == 0) { + const auto v_index = + (((static_cast(b) * tokens + t) * v_heads + hv) * D) + + col_base + c; + delta_value = (v[v_index] - gate_value * kv_col) * beta_value; + } + delta[c] = subgroup_broadcast_lane0(delta_value, WIDTH); + } + + float attn_partial[COLS]; +#pragma unroll + for (int c = 0; c < COLS; ++c) { + attn_partial[c] = 0.0F; + } + +#pragma unroll + for (int r = 0; r < rows_per_lane; ++r) { +#pragma unroll + for (int c = 0; c < COLS; ++c) { + const float new_state = + fmaf(k_reg[r], delta[c], gate_value * state_shard[c][r]); + state_shard[c][r] = new_state; + attn_partial[c] += new_state * q_reg[r]; + } + } + +#pragma unroll + for (int c = 0; c < COLS; ++c) { + attn_partial[c] = subgroup_sum_lane0(attn_partial[c], WIDTH); + } + + if (lane == 0) { + const auto out_base = + (((static_cast(b) * tokens + t) * v_heads + hv) * D); +#pragma unroll + for (int c = 0; c < COLS; ++c) { + output[out_base + col_base + c] = attn_partial[c] * scale; + } + } + } + +#pragma unroll + for (int c = 0; c < COLS; ++c) { + const int col = col_base + c; +#pragma unroll + for (int r = 0; r < rows_per_lane; ++r) { + const int row = r * WIDTH + lane; + if (row < D) { + const auto state_index = + (((static_cast(b) * v_heads + hv) * D + col) * D) + row; + final_state[state_index] = state_shard[c][r]; + } + } + } +} + +template +void launch_gdn_forward(const float* q, + const float* k, + const float* v, + const float* gate, + const float* beta, + const float* initial_state, + float* output, + float* final_state, + int batch, + int tokens, + int q_heads, + int v_heads, + float scale, + cudaStream_t stream) { + constexpr int cols = D == 128 ? 4 : 1; + constexpr int width = D == 128 ? 16 : 32; + constexpr int groups_per_warp = 32 / width; + constexpr int column_groups_per_block = 8; + const dim3 block(32, column_groups_per_block); + const int groups = D / cols; + const int z = (groups + column_groups_per_block * groups_per_warp - 1) / + (column_groups_per_block * groups_per_warp); + const dim3 grid(v_heads, batch, z); + gdn_forward_kernel + <<>>(q, + k, + v, + gate, + beta, + initial_state, + output, + final_state, + batch, + tokens, + q_heads, + v_heads, + scale); +} + +void validate_tensor(const torch::Tensor& tensor, + const char* name, + int64_t dims) { + TORCH_CHECK(tensor.is_cuda(), name, " must be a CUDA tensor"); + TORCH_CHECK(tensor.scalar_type() == torch::kFloat32, + name, + " must be float32"); + TORCH_CHECK(tensor.is_contiguous(), name, " must be contiguous"); + TORCH_CHECK(tensor.dim() == dims, name, " has wrong rank"); +} + +} // namespace + +std::vector gdn_forward(torch::Tensor q, + torch::Tensor k, + torch::Tensor v, + torch::Tensor gate, + torch::Tensor beta, + c10::optional initial_state, + double scale) { + validate_tensor(q, "q", 4); + validate_tensor(k, "k", 4); + validate_tensor(v, "v", 4); + validate_tensor(gate, "gate", 3); + validate_tensor(beta, "beta", 3); + + TORCH_CHECK(q.sizes() == k.sizes(), "q and k must have the same shape"); + const int batch = static_cast(q.size(0)); + const int tokens = static_cast(q.size(1)); + const int q_heads = static_cast(q.size(2)); + const int dim = static_cast(q.size(3)); + const int v_heads = static_cast(v.size(2)); + TORCH_CHECK(v.size(0) == batch && v.size(1) == tokens && v.size(3) == dim, + "v must have shape [B, T, Hv, D] matching q/k"); + TORCH_CHECK(gate.size(0) == batch && gate.size(1) == tokens && + gate.size(2) == v_heads, + "gate must have shape [B, T, Hv]"); + TORCH_CHECK(beta.sizes() == gate.sizes(), + "beta must have the same shape as gate"); + TORCH_CHECK(v_heads % q_heads == 0, "Hv must be divisible by Hq"); + TORCH_CHECK(dim == 16 || dim == 32 || dim == 64 || dim == 128, + "D must be one of 16, 32, 64, or 128"); + + const float* initial_ptr = nullptr; + if (initial_state.has_value() && initial_state.value().defined()) { + const auto& h0 = initial_state.value(); + validate_tensor(h0, "initial_state", 4); + TORCH_CHECK(h0.size(0) == batch && h0.size(1) == v_heads && + h0.size(2) == dim && h0.size(3) == dim, + "initial_state must have shape [B, Hv, D, D]"); + initial_ptr = h0.data_ptr(); + } + + auto output = torch::empty_like(v); + auto final_state = torch::empty({batch, v_heads, dim, dim}, q.options()); + + const auto stream = at::cuda::getCurrentCUDAStream(q.device().index()).stream(); + switch (dim) { + case 16: + launch_gdn_forward<16>(q.data_ptr(), + k.data_ptr(), + v.data_ptr(), + gate.data_ptr(), + beta.data_ptr(), + initial_ptr, + output.data_ptr(), + final_state.data_ptr(), + batch, + tokens, + q_heads, + v_heads, + static_cast(scale), + stream); + break; + case 32: + launch_gdn_forward<32>(q.data_ptr(), + k.data_ptr(), + v.data_ptr(), + gate.data_ptr(), + beta.data_ptr(), + initial_ptr, + output.data_ptr(), + final_state.data_ptr(), + batch, + tokens, + q_heads, + v_heads, + static_cast(scale), + stream); + break; + case 64: + launch_gdn_forward<64>(q.data_ptr(), + k.data_ptr(), + v.data_ptr(), + gate.data_ptr(), + beta.data_ptr(), + initial_ptr, + output.data_ptr(), + final_state.data_ptr(), + batch, + tokens, + q_heads, + v_heads, + static_cast(scale), + stream); + break; + case 128: + launch_gdn_forward<128>(q.data_ptr(), + k.data_ptr(), + v.data_ptr(), + gate.data_ptr(), + beta.data_ptr(), + initial_ptr, + output.data_ptr(), + final_state.data_ptr(), + batch, + tokens, + q_heads, + v_heads, + static_cast(scale), + stream); + break; + } + check_cuda(cudaGetLastError(), "gdn_forward launch"); + return {output, final_state}; +} + +PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) { + m.def("gdn_forward", &gdn_forward, "SM70/SM75 legacy GDN forward"); +} diff --git a/flash_qla/ops/gated_delta_rule/legacy/sm_legacy.py b/flash_qla/ops/gated_delta_rule/legacy/sm_legacy.py new file mode 100644 index 0000000..c0bbe45 --- /dev/null +++ b/flash_qla/ops/gated_delta_rule/legacy/sm_legacy.py @@ -0,0 +1,104 @@ +# Copyright (c) 2026 The Qwen team, Alibaba Group. +# Licensed under The MIT License [see LICENSE for details] + +from __future__ import annotations + +import os +from pathlib import Path + +import torch +from torch.utils.cpp_extension import load + +_EXT = None + + +def _load_ext(): + global _EXT + if _EXT is not None: + return _EXT + + if not torch.cuda.is_available(): + raise RuntimeError("SM70/SM75 legacy GDN backend requires CUDA") + + os.environ.setdefault("TORCH_CUDA_ARCH_LIST", "7.0;7.5") + src = Path(__file__).with_name("csrc") / "gdn_forward.cu" + _EXT = load( + name="flash_qla_legacy_gdn", + sources=[str(src)], + extra_cuda_cflags=["-O3"], + extra_cflags=["-O3"], + verbose=bool(int(os.environ.get("FLASH_QLA_LEGACY_VERBOSE_BUILD", "0"))), + ) + return _EXT + + +def _check_inputs( + q: torch.Tensor, + k: torch.Tensor, + v: torch.Tensor, + g: torch.Tensor, + beta: torch.Tensor, + initial_state: torch.Tensor | None, +) -> None: + tensors = [q, k, v, g, beta] + if initial_state is not None: + tensors.append(initial_state) + + if any(not tensor.is_cuda for tensor in tensors): + raise ValueError("legacy GDN tensors must be CUDA tensors") + if any(tensor.dtype != torch.float32 for tensor in tensors): + raise ValueError("legacy GDN backend currently supports float32 tensors only") + if any(not tensor.is_contiguous() for tensor in tensors): + raise ValueError("legacy GDN tensors must be contiguous") + if q.ndim != 4 or k.ndim != 4 or v.ndim != 4: + raise ValueError("q, k, and v must have shape [B, T, H, D]") + if g.ndim != 3 or beta.ndim != 3: + raise ValueError("g and beta must have shape [B, T, Hv]") + if q.shape != k.shape: + raise ValueError("q and k must have the same shape") + + batch, tokens, q_heads, dim = q.shape + if v.shape[0] != batch or v.shape[1] != tokens or v.shape[3] != dim: + raise ValueError("v must have shape [B, T, Hv, D] matching q/k") + if g.shape != beta.shape or g.shape != v.shape[:3]: + raise ValueError("g and beta must have shape [B, T, Hv]") + if v.shape[2] % q_heads != 0: + raise ValueError("Hv must be divisible by Hq") + if dim not in (16, 32, 64, 128): + raise ValueError("legacy GDN backend supports D in {16, 32, 64, 128}") + if initial_state is not None and initial_state.shape != (batch, v.shape[2], dim, dim): + raise ValueError("initial_state must have shape [B, Hv, D, D]") + + +def chunk_gated_delta_rule_fwd_legacy( + q: torch.Tensor, + k: torch.Tensor, + v: torch.Tensor, + g: torch.Tensor, + beta: torch.Tensor, + scale: float | None = None, + initial_state: torch.Tensor | None = None, +) -> tuple[torch.Tensor, torch.Tensor]: + """Run the experimental SM70/SM75 forward-only GDN backend. + + This legacy backend is intentionally explicit. It does not replace the + Hopper/SM90 TileLang path and currently supports only contiguous float32 + tensors for inference-oriented forward execution. + + Shapes: + q, k: [B, T, Hq, D] + v: [B, T, Hv, D] + g, beta: [B, T, Hv] + initial_state: optional [B, Hv, D, D] + + Returns: + output: [B, T, Hv, D] + final_state: [B, Hv, D, D] + """ + + _check_inputs(q, k, v, g, beta, initial_state) + if scale is None: + scale = q.shape[-1] ** -0.5 + + ext = _load_ext() + return ext.gdn_forward(q, k, v, g, beta, initial_state, float(scale)) diff --git a/tests/test_legacy_sm_gdn.py b/tests/test_legacy_sm_gdn.py new file mode 100644 index 0000000..a730b18 --- /dev/null +++ b/tests/test_legacy_sm_gdn.py @@ -0,0 +1,57 @@ +# Copyright (c) 2026 The Qwen team, Alibaba Group. +# Licensed under The MIT License [see LICENSE for details] + +import math + +import pytest +import torch + +from flash_qla.ops.gated_delta_rule.legacy import chunk_gated_delta_rule_fwd_legacy + + +def _reference(q, k, v, g, beta, scale=None, initial_state=None): + batch, tokens, q_heads, dim = q.shape + v_heads = v.shape[2] + scale = scale if scale is not None else dim**-0.5 + state = ( + initial_state.clone() + if initial_state is not None + else torch.zeros(batch, v_heads, dim, dim, device=q.device, dtype=q.dtype) + ) + output = torch.empty_like(v) + for b in range(batch): + for hv in range(v_heads): + hq = hv // (v_heads // q_heads) + for t in range(tokens): + gate = torch.exp(g[b, t, hv]) + delta = (v[b, t, hv] - gate * (state[b, hv].transpose(0, 1) @ k[b, t, hq])) * beta[b, t, hv] + state[b, hv] = gate * state[b, hv] + torch.outer(k[b, t, hq], delta) + output[b, t, hv] = scale * (state[b, hv].transpose(0, 1) @ q[b, t, hq]) + return output, state + + +@pytest.mark.skipif(not torch.cuda.is_available(), reason="CUDA is required") +@pytest.mark.parametrize("dim", [16, 32, 64, 128]) +def test_legacy_sm_gdn_matches_reference(dim): + torch.manual_seed(1000 + dim) + q = torch.randn(1, 5, 2, dim, device="cuda", dtype=torch.float32).contiguous() * 0.05 + k = torch.randn_like(q).contiguous() * 0.05 + v = torch.randn(1, 5, 4, dim, device="cuda", dtype=torch.float32).contiguous() * 0.1 + g = torch.randn(1, 5, 4, device="cuda", dtype=torch.float32).contiguous() * 0.02 - 0.04 + beta = torch.rand(1, 5, 4, device="cuda", dtype=torch.float32).contiguous() + h0 = torch.randn(1, 4, dim, dim, device="cuda", dtype=torch.float32).contiguous() * 0.01 + scale = 1.0 / math.sqrt(dim) + + out_ref, state_ref = _reference(q, k, v, g, beta, scale, h0) + out, state = chunk_gated_delta_rule_fwd_legacy(q, k, v, g, beta, scale, h0) + torch.cuda.synchronize() + + torch.testing.assert_close(out, out_ref, atol=2e-4, rtol=2e-4) + torch.testing.assert_close(state, state_ref, atol=1e-3, rtol=1e-3) + + +@pytest.mark.skipif(not torch.cuda.is_available(), reason="CUDA is required") +def test_legacy_sm_gdn_rejects_unsupported_dtype(): + q = torch.randn(1, 1, 1, 16, device="cuda", dtype=torch.float16) + with pytest.raises(ValueError, match="float32"): + chunk_gated_delta_rule_fwd_legacy(q, q, q, torch.randn(1, 1, 1, device="cuda"), torch.randn(1, 1, 1, device="cuda")) From b1c0e461d9a38164934c63536ef748ef02c18cd5 Mon Sep 17 00:00:00 2001 From: weicj Date: Mon, 8 Jun 2026 14:48:03 +0800 Subject: [PATCH 3/3] Add fork credit guidance --- AGENTS.md | 52 ++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 52 insertions(+) create mode 100644 AGENTS.md diff --git a/AGENTS.md b/AGENTS.md new file mode 100644 index 0000000..3f8b2b1 --- /dev/null +++ b/AGENTS.md @@ -0,0 +1,52 @@ +# AGENTS.md + +This file governs the whole `FlashQLA-SM70-SM75` repository. + +## Project Identity And Credit + +This repository is an experimental SM70/SM75 fork of upstream +`QwenLM/FlashQLA`. It exists to preserve and review the legacy Gated DeltaNet +forward-inference path used by the 2080 Ti / SM75 runtime work. + +If you publish, redistribute, repackage, benchmark, or build a derivative from +this repository, keep clear credit to: + +- Upstream `QwenLM/FlashQLA` and its original license. +- `FlashQLA-SM70-SM75`. +- The repository author: `github.com/weicj`. +- The related `vLLM 2080 Ti Definitive Edition` / 2080 Ti SM75 runtime work + when using this fork as part of that stack. + +Do not remove existing attribution, license notices, benchmark provenance, or +project identity text. Public derivatives should state that they are based on +this fork unless the relevant material has been independently replaced. + +## Upstream Compatibility + +- Preserve upstream FlashQLA license and copyright notices. +- Keep the Hopper/SM90 upstream path intact unless a change is explicitly meant + for upstream compatibility. +- Keep SM70/SM75 behavior explicit. Do not silently replace upstream high-level + APIs with legacy-device behavior. +- Do not present SM70/SM75 fork behavior or benchmark numbers as official + upstream FlashQLA behavior. +- Follow upstream instructions and contribution rules for files inherited from + `QwenLM/FlashQLA`. + +## Evidence And Benchmark Rules + +- Do not claim SM70 or SM75 support without compile/runtime evidence. +- Keep SM70 compile coverage, SM70 runtime validation, and SM75 runtime + validation separate. +- Report benchmark scope exactly: device, shape, dtype, API entry point, and + whether the result is standalone-kernel, engine-profile, or whole-request. +- Mark unverified paths as experimental or pending validation. + +## Repository Hygiene + +- Do not commit local caches, model weights, logs, temporary workspace state, + run outputs, or generated native build artifacts. +- Prefer small, reviewable patches that keep the legacy backend isolated. +- Before publishing changes, run the relevant syntax, import, CUDA build, and + test checks for the files you touched. +