52 changes: 52 additions & 0 deletions AGENTS.md
@@ -0,0 +1,52 @@
# AGENTS.md

This file governs the whole `FlashQLA-SM70-SM75` repository.

## Project Identity And Credit

This repository is an experimental SM70/SM75 fork of upstream
`QwenLM/FlashQLA`. It exists to preserve and review the legacy Gated DeltaNet
forward-inference path used by the 2080 Ti / SM75 runtime work.

If you publish, redistribute, repackage, benchmark, or build a derivative from
this repository, keep clear credit to:

- Upstream `QwenLM/FlashQLA` and its original license.
- `FlashQLA-SM70-SM75`.
- The repository author: `github.com/weicj`.
- The related `vLLM 2080 Ti Definitive Edition` / 2080 Ti SM75 runtime work
when using this fork as part of that stack.

Do not remove existing attribution, license notices, benchmark provenance, or
project identity text. Public derivatives should state that they are based on
this fork unless the relevant material has been independently replaced.

## Upstream Compatibility

- Preserve upstream FlashQLA license and copyright notices.
- Keep the Hopper/SM90 upstream path intact unless a change is explicitly meant
for upstream compatibility.
- Keep SM70/SM75 behavior explicit. Do not silently replace upstream high-level
APIs with legacy-device behavior.
- Do not present SM70/SM75 fork behavior or benchmark numbers as official
upstream FlashQLA behavior.
- Follow upstream instructions and contribution rules for files inherited from
`QwenLM/FlashQLA`.

## Evidence And Benchmark Rules

- Do not claim SM70 or SM75 support without compile/runtime evidence.
- Keep SM70 compile coverage, SM70 runtime validation, and SM75 runtime
validation separate.
- Report benchmark scope exactly: device, shape, dtype, API entry point, and
whether the result is standalone-kernel, engine-profile, or whole-request.
- Mark unverified paths as experimental or pending validation.
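
One way to make the scope rule above hard to skip is to record every result in a small structure that refuses incomplete entries. The following is a minimal sketch, not part of this repository: the `BenchmarkRecord` class, its field names, and the example values (including the dtype) are hypothetical and only illustrate the five scope items listed above.

```python
from dataclasses import dataclass, fields

# The three result levels named in the rule above.
LEVELS = ("standalone-kernel", "engine-profile", "whole-request")


@dataclass(frozen=True)
class BenchmarkRecord:
    """Hypothetical container for one benchmark result and its scope."""

    device: str           # e.g. "RTX 2080 Ti / SM75"
    shape: str            # e.g. "B=1, T=512, Hq=16, Hv=32, D=128"
    dtype: str            # dtype used for the run
    api_entry_point: str  # function that was timed
    level: str            # one of LEVELS
    result: str           # the measured value, with units

    def __post_init__(self) -> None:
        # Reject records that leave any scope item blank.
        for f in fields(self):
            if not str(getattr(self, f.name)).strip():
                raise ValueError(f"missing benchmark scope field: {f.name}")
        if self.level not in LEVELS:
            raise ValueError(f"level must be one of {LEVELS}, got {self.level!r}")

    def describe(self) -> str:
        """Render the result together with its full scope."""
        return (
            f"{self.result} [{self.level}] on {self.device}, "
            f"shape {self.shape}, dtype {self.dtype}, via {self.api_entry_point}"
        )


# Illustrative values only; the dtype here is an assumption, not a reported fact.
record = BenchmarkRecord(
    device="RTX 2080 Ti / SM75",
    shape="B=1, T=512, Hq=16, Hv=32, D=128",
    dtype="float16",
    api_entry_point="chunk_gated_delta_rule_fwd_legacy",
    level="standalone-kernel",
    result="about 0.52 ms",
)
print(record.describe())
```

Keeping the level as a closed set means a standalone-kernel number cannot be reported as a whole-request result by accident.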

## Repository Hygiene

- Do not commit local caches, model weights, logs, temporary workspace state,
run outputs, or generated native build artifacts.
- Prefer small, reviewable patches that keep the legacy backend isolated.
- Before publishing changes, run the relevant syntax, import, CUDA build, and
test checks for the files you touched.

83 changes: 83 additions & 0 deletions README.md
@@ -1,3 +1,86 @@
> [!IMPORTANT]
> This repository is an experimental SM70/SM75 fork of [QwenLM/FlashQLA](https://github.com/QwenLM/FlashQLA).
>
> It is not an official FlashQLA release and does not replace the upstream Hopper/SM90 implementation.

# FlashQLA-SM70-SM75

Experimental forward-inference support for Qwen-style Gated DeltaNet on SM70/SM75-class NVIDIA GPUs.

This fork keeps the upstream Hopper/SM90 TileLang path intact and adds an explicit legacy backend entry point for Volta/Turing inference devices. The current runtime validation target is RTX 2080 Ti / SM75. SM70 currently has compile coverage, but V100-class runtime validation is still required before making performance claims.

## Changes in This Fork

- Adds `flash_qla.ops.gated_delta_rule.legacy.chunk_gated_delta_rule_fwd_legacy`.
- Adds a lazy-built CUDA extension for a forward-only SM70/SM75-class Gated DeltaNet backend.
- Keeps the upstream Hopper/SM90 TileLang path unchanged.
- Keeps the legacy path explicit instead of silently replacing the upstream high-level API.
- Adds CUDA correctness tests for the supported legacy path.
- Documents the supported scope, validation status, and benchmark caveats separately from upstream Hopper results.

## Supported Scope

Supported:

- forward inference only
- SM70/SM75-class CUDA devices as the intended legacy target family
- scalar-gate Gated DeltaNet
- Qwen-style grouped-query head mapping
- primary optimized shape: `D=128`
- explicit legacy API entry point

Not supported:

- backward kernels or training
- automatic dispatch from the upstream high-level API
- generic support for all pre-Hopper NVIDIA GPUs
- runtime performance claims for SM70 before V100-class validation
- SM80/SM86/SM89 support claims
- automatic default dispatch for non-Hopper devices
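
Because the legacy path is never selected automatically, a caller has to decide explicitly when to use it. The sketch below shows one possible caller-side gate. It assumes the compute capability is available as a `(major, minor)` tuple, which is what `torch.cuda.get_device_capability()` returns; the helper name `is_legacy_target` is hypothetical and is not provided by this fork.

```python
from typing import Tuple

# Compute capabilities this fork names as its legacy target family.
# Anything else (including SM80/SM86/SM89 and Hopper) is deliberately excluded.
LEGACY_TARGETS = {(7, 0), (7, 5)}


def is_legacy_target(capability: Tuple[int, int]) -> bool:
    """Return True only for SM70 or SM75 (hypothetical helper, not a fork API)."""
    return tuple(capability) in LEGACY_TARGETS


# In real code the tuple would come from torch.cuda.get_device_capability().
for cap in [(7, 0), (7, 5), (8, 6), (9, 0)]:
    path = "explicit legacy entry point" if is_legacy_target(cap) else "not a legacy target"
    print(f"SM{cap[0]}{cap[1]}: {path}")
```

A check like this only selects the code path. It says nothing about validation status, so an SM70 device passing the gate is still on a path whose runtime validation is pending.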

## Current Validation

Runtime validation was performed on RTX 2080 Ti / SM75.

Standalone kernel timing for a Qwen-like shape:

- `B=1, T=512, Hq=16, Hv=32, D=128`
- control recurrent path: about `1.126 ms`
- optimized legacy path on SM75: about `0.520-0.533 ms`
- GDN-stage speedup: about `2.1x`

GGUF runtime profiling on SM75:

- default fused GDN: `406.656 ms`
- legacy fast path: `195.105 ms`
- GDN-stage speedup: about `2.08x`
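
The speedup figures above are simple ratios of the reported timings. As a quick cross-check, this short sketch recomputes them from the numbers listed in this section; the variable names are illustrative only.

```python
# Standalone kernel timings reported above (milliseconds).
control_ms = 1.126
legacy_ms_fast, legacy_ms_slow = 0.520, 0.533

# Speedup = control time / optimized time, at both ends of the reported range.
standalone_low = control_ms / legacy_ms_slow
standalone_high = control_ms / legacy_ms_fast

# GGUF runtime profiling timings reported above (milliseconds).
default_fused_ms = 406.656
legacy_fast_ms = 195.105
profile_speedup = default_fused_ms / legacy_fast_ms

print(f"standalone kernel: {standalone_low:.2f}x to {standalone_high:.2f}x")  # 2.11x to 2.17x
print(f"engine profile:    {profile_speedup:.2f}x")                           # 2.08x
```

Both ratios apply to the GDN stage only, which is why the whole-request figures below are much smaller.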

Whole-request impact under the same server parameters:

- prefill: `+7.17%`
- decode: `+0.61%`
- wall time: `-3.49%`

SM70 status:

- compile check passes
- runtime validation is pending
- V100-class benchmarking is needed before claiming SM70 performance

Fork wrapper status:

- Python syntax check passes
- CUDA tests are included under `tests/test_legacy_sm_gdn.py`
- runtime validation of the wrapper still requires a CUDA-enabled PyTorch environment

## Positioning

This fork is meant to make the SM70/SM75 experiment reproducible and reviewable. It should be treated as an upstreamable experimental branch, not as a separate long-term replacement for FlashQLA.

---

The original upstream README follows below.

<p align="center">
<img src="https://qianwen-res.oss-cn-beijing.aliyuncs.com/flashqla/flashqla.png" width="1000"/>
<p>
6 changes: 6 additions & 0 deletions flash_qla/ops/gated_delta_rule/legacy/__init__.py
@@ -0,0 +1,6 @@
# Copyright (c) 2026 The Qwen team, Alibaba Group.
# Licensed under The MIT License [see LICENSE for details]

from .sm_legacy import chunk_gated_delta_rule_fwd_legacy

__all__ = ["chunk_gated_delta_rule_fwd_legacy"]