QwenLM · weicj · May 18, 2026 · May 18, 2026 · Jun 8, 2026
diff --git a/AGENTS.md b/AGENTS.md
@@ -0,0 +1,52 @@
+# AGENTS.md
+
+This file governs the whole `FlashQLA-SM70-SM75` repository.
+
+## Project Identity And Credit
+
+This repository is an experimental SM70/SM75 fork of upstream
+`QwenLM/FlashQLA`. It exists to preserve and review the legacy Gated DeltaNet
+forward-inference path used by the 2080 Ti / SM75 runtime work.
+
+If you publish, redistribute, repackage, benchmark, or build a derivative from
+this repository, keep clear credit to:
+
+- Upstream `QwenLM/FlashQLA` and its original license.
+- `FlashQLA-SM70-SM75`.
+- The repository author: `github.com/weicj`.
+- The related `vLLM 2080 Ti Definitive Edition` / 2080 Ti SM75 runtime work
+  when using this fork as part of that stack.
+
+Do not remove existing attribution, license notices, benchmark provenance, or
+project identity text. Public derivatives should state that they are based on
+this fork unless the relevant material has been independently replaced.
+
+## Upstream Compatibility
+
+- Preserve upstream FlashQLA license and copyright notices.
+- Keep the Hopper/SM90 upstream path intact unless a change is explicitly meant
+  for upstream compatibility.
+- Keep SM70/SM75 behavior explicit. Do not silently replace upstream high-level
+  APIs with legacy-device behavior.
+- Do not present SM70/SM75 fork behavior or benchmark numbers as official
+  upstream FlashQLA behavior.
+- Follow upstream instructions and contribution rules for files inherited from
+  `QwenLM/FlashQLA`.
+
+## Evidence And Benchmark Rules
+
+- Do not claim SM70 or SM75 support without compile/runtime evidence.
+- Keep SM70 compile coverage, SM70 runtime validation, and SM75 runtime
+  validation separate.
+- Report benchmark scope exactly: device, shape, dtype, API entry point, and
+  whether the result is standalone-kernel, engine-profile, or whole-request.
+- Mark unverified paths as experimental or pending validation.
+
+## Repository Hygiene
+
+- Do not commit local caches, model weights, logs, temporary workspace state,
+  run outputs, or generated native build artifacts.
+- Prefer small, reviewable patches that keep the legacy backend isolated.
+- Before publishing changes, run the relevant syntax, import, CUDA build, and
+  test checks for the files you touched.
+
diff --git a/README.md b/README.md
@@ -1,3 +1,86 @@
+> [!IMPORTANT]
+> This repository is an experimental SM70/SM75 fork of [QwenLM/FlashQLA](https://github.com/QwenLM/FlashQLA).
+>
+> It is not an official FlashQLA release and does not replace the upstream Hopper/SM90 implementation.
+
+# FlashQLA-SM70-SM75
+
+Experimental forward-inference support for Qwen-style Gated DeltaNet on SM70/SM75-class NVIDIA GPUs.
+
+This fork keeps the upstream Hopper/SM90 TileLang path intact and adds an explicit legacy backend entry point for Volta/Turing inference devices. The current runtime validation target is RTX 2080 Ti / SM75. SM70 currently has compile coverage, but V100-class runtime validation is still required before making performance claims.
+
+## Changes in This Fork
+
+- Adds `flash_qla.ops.gated_delta_rule.legacy.chunk_gated_delta_rule_fwd_legacy`.
+- Adds a lazily built CUDA extension for a forward-only SM70/SM75-class Gated DeltaNet backend.
+- Keeps the upstream Hopper/SM90 TileLang path unchanged.
+- Keeps the legacy path explicit instead of silently replacing the upstream high-level API.
+- Adds CUDA correctness tests for the supported legacy path.
+- Documents the supported scope, validation status, and benchmark caveats separately from upstream Hopper results.
+
+## Supported Scope
+
+Supported:
+
+- forward inference only
+- SM70/SM75-class CUDA devices as the intended legacy target family
+- scalar-gate Gated DeltaNet
+- Qwen-style grouped-query head mapping
+- primary optimized shape: `D=128`
+- explicit legacy API entry point
+
+Not supported:
+
+- backward kernels or training
+- automatic dispatch from the upstream high-level API
+- generic support for all pre-Hopper NVIDIA GPUs
+- runtime performance claims for SM70 before V100-class validation
+- SM80/SM86/SM89 support claims
+- automatic default dispatch for non-Hopper devices
+
+## Current Validation
+
+Runtime validation was performed on RTX 2080 Ti / SM75.
+
+Standalone kernel timing for a Qwen-like shape:
+
+- `B=1, T=512, Hq=16, Hv=32, D=128`
+- control recurrent path: about `1.126 ms`
+- optimized legacy path on SM75: about `0.520-0.533 ms`
+- GDN-stage speedup: about `2.1-2.2x`
+
+GGUF runtime profiling on SM75:
+
+- default fused GDN: `406.656 ms`
+- legacy fast path: `195.105 ms`
+- GDN-stage speedup: about `2.08x`
+
+Whole-request impact under the same server parameters:
+
+- prefill: `+7.17%`
+- decode: `+0.61%`
+- wall time: `-3.49%`
+
+SM70 status:
+
+- compile check passes
+- runtime validation is pending
+- V100-class benchmarking is needed before claiming SM70 performance
+
+Fork wrapper status:
+
+- Python syntax check passes
+- CUDA tests are included in `tests/test_legacy_sm_gdn.py`
+- runtime validation of the wrapper still requires a CUDA-enabled PyTorch environment
+
+## Positioning
+
+This fork is meant to make the SM70/SM75 experiment reproducible and reviewable. It should be treated as an upstreamable experimental branch, not as a separate long-term replacement for FlashQLA.
+
+---
+
+The original upstream README follows below.
+
 <p align="center">
     <img src="https://qianwen-res.oss-cn-beijing.aliyuncs.com/flashqla/flashqla.png" width="1000"/>
 <p>

diff --git a/flash_qla/ops/gated_delta_rule/legacy/__init__.py b/flash_qla/ops/gated_delta_rule/legacy/__init__.py
@@ -0,0 +1,6 @@
+# Copyright (c) 2026 The Qwen team, Alibaba Group.
+# Licensed under The MIT License [see LICENSE for details]
+
+from .sm_legacy import chunk_gated_delta_rule_fwd_legacy
+
+__all__ = ["chunk_gated_delta_rule_fwd_legacy"]
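
Review note: the "keep SM70/SM75 behavior explicit" rule in AGENTS.md and the validation status in the README could be enforced with a small guard like the one below. This is a minimal sketch, not code from this PR. `LEGACY_STATUS`, `legacy_backend_status`, and `require_legacy_backend` are hypothetical names, and the status labels are transcribed from the README text above rather than taken from the extension itself.

```python
from typing import Dict, Tuple

# Hypothetical table: validation status per CUDA compute capability,
# transcribed from the README in this PR (not read from the extension).
LEGACY_STATUS: Dict[Tuple[int, int], str] = {
    (7, 0): "compile-only",       # SM70: compile check passes, runtime pending
    (7, 5): "runtime-validated",  # SM75: validated on RTX 2080 Ti
}


def legacy_backend_status(capability: Tuple[int, int]) -> str:
    """Return the documented status for a (major, minor) compute capability."""
    return LEGACY_STATUS.get(tuple(capability), "unsupported")


def require_legacy_backend(capability: Tuple[int, int],
                           allow_experimental: bool = False) -> str:
    """Raise unless the legacy path is documented as usable on this device.

    Compile-only targets (SM70 today) are accepted only when the caller
    opts in explicitly, so unverified paths are never selected silently.
    """
    status = legacy_backend_status(capability)
    if status == "runtime-validated":
        return status
    if status == "compile-only" and allow_experimental:
        return status
    raise RuntimeError(
        f"legacy GDN backend is {status} for compute capability "
        f"{capability[0]}.{capability[1]}"
    )
```

With PyTorch installed, the capability tuple could come from `torch.cuda.get_device_capability()`, which returns `(major, minor)`. Raising for SM80/SM86/SM89 and SM90 matches the README's "not supported" list and leaves the upstream Hopper path to the upstream API.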