
[Feat] Add implement for cpu_kda_chunk_fwd #193

Draft
lingebeng wants to merge 3 commits into main from feat/kda-cpu-chunk_fwd

Conversation

@lingebeng
Collaborator

Description

Add an implementation of kda_chunk_fwd.

Related Issue

Closes #

Change Type

  • feat — New feature
  • fix — Bug fix
  • refactor — Code refactoring
  • docs — Documentation
  • ci — CI/CD changes
  • test — Tests
  • perf — Performance improvement

Checklist

  • Code passes uv run ruff check src/ tests/ and uv run ruff format src/ tests/
  • New/modified public APIs have complete docstrings (tensor shapes, dimension semantics, business logic)
  • Public functions have input assertions (assert or assert_shape_or_none)
  • Tests added at the appropriate layer (tests/ops/, tests/modules/, tests/layers/, or tests/ref/)
  • If tops/cpu/ is modified, core developers have been notified and PR is labeled cpu-ref

Test Results

Paste relevant test output here.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@lingebeng added the cpu-ref (Modifies tops/cpu/ reference implementations) label on Apr 13, 2026
@coderabbitai

coderabbitai bot commented Apr 13, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3a8614db-18f7-4614-9ad2-353354f95b67

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@beaver-infiscale

Warning

Size limit exceeded

This PR changes 695 lines of core code, exceeding the limit of 200 (tests and documentation excluded).

Suggested action: split this PR into several smaller PRs, each with at most 200 lines of core-code changes.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request implements the JAX CPU reference for the KDA chunk-wise forward operation, mirroring the FLA Triton kernels. It includes sub-functions for WY representation, intra-chunk computation, inter-chunk state propagation, and output generation, with full support for variable-length sequences. A comprehensive test suite is also added to verify correctness against naive recurrent implementations and FLA Triton. The review feedback identifies several opportunities to improve performance and reduce JAX compilation overhead by vectorizing Python loops using native operations like jnp.where and jnp.einsum.

Comment on lines +143 to +149

```python
if gk is not None:
    gk_g = gathered['gk']  # [total_NT, BT, H, K]
    for ci in range(total_NT):
        vl = int(valid[ci])
        if 0 < vl < BT:
            gk_g = gk_g.at[ci, vl:].set(gk_g[ci, vl - 1])
    gk_c = gk_g[None].astype(acc_dt)
```
Contributor


medium

This Python loop over chunks can be vectorized using jnp.where and broadcasting. This is more efficient in JAX and avoids potential performance bottlenecks during compilation, especially for large numbers of chunks.

Suggested change

```diff
 if gk is not None:
-    gk_g = gathered['gk']  # [total_NT, BT, H, K]
-    for ci in range(total_NT):
-        vl = int(valid[ci])
-        if 0 < vl < BT:
-            gk_g = gk_g.at[ci, vl:].set(gk_g[ci, vl - 1])
-    gk_c = gk_g[None].astype(acc_dt)
+    gk_g = gathered['gk']
+    mask = jnp.arange(BT) < valid[:, None]
+    last_valid_idx = jnp.maximum(0, valid - 1)
+    last_val = gk_g[jnp.arange(total_NT), last_valid_idx]
+    gk_c = jnp.where(mask[:, :, None, None], gk_g, last_val[:, None, :, :])[None].astype(acc_dt)
```
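As a sanity check on this suggestion, the loop and the masked formulation can be compared on small random inputs. The sketch below uses NumPy, whose where/arange/maximum semantics match the jnp calls above; the shapes and the `valid` values are made-up toy data, not taken from the PR:

```python
import numpy as np

# Hypothetical small shapes standing in for [total_NT, BT, H, K].
total_NT, BT, H, K = 3, 4, 2, 2
rng = np.random.default_rng(0)
gk_g = rng.standard_normal((total_NT, BT, H, K))
valid = np.array([4, 2, 1])  # toy per-chunk valid lengths

# Loop version, mirroring the original .at[ci, vl:].set(...) logic.
loop = gk_g.copy()
for ci in range(total_NT):
    vl = int(valid[ci])
    if 0 < vl < BT:
        loop[ci, vl:] = loop[ci, vl - 1]

# Vectorized version: mask out padded rows, filling them with the
# last valid row of each chunk, exactly as the suggestion does.
mask = np.arange(BT) < valid[:, None]                            # [total_NT, BT]
last_val = gk_g[np.arange(total_NT), np.maximum(0, valid - 1)]   # [total_NT, H, K]
vec = np.where(mask[:, :, None, None], gk_g, last_val[:, None, :, :])

assert np.allclose(loop, vec)
```

With `jnp` substituted for `np`, the vectorized form traces as a single fused computation instead of `total_NT` indexed updates.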

Comment on lines +244 to +250

```python
if gk is not None:
    gk_g = gathered['gk']
    for ci in range(total_NT):
        vl = int(valid[ci])
        if 0 < vl < BT:
            gk_g = gk_g.at[ci, vl:].set(gk_g[ci, vl - 1])
    gk_c = gk_g[None].astype(acc_dt)
```
Contributor


medium

This Python loop over chunks can be vectorized using jnp.where and broadcasting, similar to the logic in recompute_w_u_fwd.

Suggested change

```diff
 if gk is not None:
     gk_g = gathered['gk']
-    for ci in range(total_NT):
-        vl = int(valid[ci])
-        if 0 < vl < BT:
-            gk_g = gk_g.at[ci, vl:].set(gk_g[ci, vl - 1])
-    gk_c = gk_g[None].astype(acc_dt)
+    mask = jnp.arange(BT) < valid[:, None]
+    last_valid_idx = jnp.maximum(0, valid - 1)
+    last_val = gk_g[jnp.arange(total_NT), last_valid_idx]
+    gk_c = jnp.where(mask[:, :, None, None], gk_g, last_val[:, None, :, :])[None].astype(acc_dt)
```

Comment on lines +265 to +275

```python
for i in range(BT):
    k_i = k_c[:, :, i:i+1, :, :]
    g_i = gk_c[:, :, i:i+1, :, :]
    if safe_gate:
        sc = i // BC
        mid = sc * BC + min(BC // 2, min((sc + 1) * BC, BT) - sc * BC - 1)
        g_mid = gk_c[:, :, mid:mid+1, :, :]
        col_i = (k_c * _gate_exp(gk_c - g_mid, use_exp2) * _gate_exp(g_mid - g_i, use_exp2) * k_i).sum(axis=-1)
    else:
        col_i = (k_c * _gate_exp(gk_c - g_i, use_exp2) * k_i).sum(axis=-1)
    Akk = Akk.at[:, :, :, :, i].set(col_i)
```
Contributor


medium

The loop that builds the Akk matrix can be vectorized with `jnp.einsum`. Python loops with `at[...].set(...)` are inefficient in JAX as they create many intermediate arrays and increase compilation time. For the safe_gate=False case, this can be simplified to a single batched matrix multiplication using `jnp.einsum('bnlhk,bnmhk->bnlhm', k_c * _gate_exp(gk_c, use_exp2), k_c * _gate_exp(-gk_c, use_exp2))`. The safe_gate=True case can also be vectorized by precomputing the middle gate values.
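To see why the einsum factorization is equivalent for safe_gate=False, note that exp(g_l − g_i) = exp(g_l) · exp(−g_i), so the gate difference separates into a left and a right factor. A minimal NumPy sketch with made-up toy shapes (batch and chunk axes dropped, `_gate_exp` replaced by plain exp, so this is an illustration of the identity rather than the PR's actual function):

```python
import numpy as np

# Toy shapes standing in for the per-chunk [L, H, K] slices.
L, H, K = 4, 2, 3
rng = np.random.default_rng(1)
k_c = rng.standard_normal((L, H, K))
gk_c = rng.standard_normal((L, H, K))

# Loop version: column i is sum_k k[l] * exp(g[l] - g[i]) * k[i].
Akk_loop = np.zeros((L, H, L))
for i in range(L):
    col_i = (k_c * np.exp(gk_c - gk_c[i:i+1]) * k_c[i:i+1]).sum(axis=-1)  # [L, H]
    Akk_loop[:, :, i] = col_i

# Vectorized: factor exp(g_l - g_i) = exp(g_l) * exp(-g_i),
# then contract over k in one einsum.
Akk_vec = np.einsum('lhk,mhk->lhm', k_c * np.exp(gk_c), k_c * np.exp(-gk_c))

assert np.allclose(Akk_loop, Akk_vec)
```

One caveat: the factored form exponentiates exp(g) and exp(−g) separately and can overflow for large gate magnitudes, which appears to be exactly the situation the safe_gate path guards against by recentering around a mid-chunk gate value.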

Comment on lines +547 to +560

```python
o_list = []
for i in range(NT):
    qg = q_c[:, i] * _gate_exp(g_c[:, i], use_exp2)
    h_i = h[:, i]

    # Handle transpose_state_layout: h_i may be [H, V, K], convert to [H, K, V]
    if transpose_state_layout:
        h_i = jnp.swapaxes(h_i, -1, -2)

    o_inter = scale * jnp.einsum('bchk,bhkv->bchv', qg, h_i)
    o_intra = jnp.einsum('bchl,blhv->bchv', A_c[:, i], v_c[:, i])
    o_list.append(o_inter + o_intra)

o_stacked = jnp.stack(o_list, axis=1)
```
Contributor


medium

This loop over chunks (NT) is fully vectorizable using jnp.einsum across the chunk dimension. This will significantly improve performance for long sequences by leveraging JAX's ability to perform batched operations and avoiding the overhead of Python loops and jnp.stack.

```python
h_eff = jnp.swapaxes(h, -1, -2) if transpose_state_layout else h
qg = q_c * _gate_exp(g_c, use_exp2)
o_inter = scale * jnp.einsum('bnchk,bnhkv->bnchv', qg, h_eff)
o_intra = jnp.einsum('bnchl,bnlhv->bnchv', A_c, v_c)
o_stacked = o_inter + o_intra
```
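The batched form can be checked against the per-chunk loop on toy data. The NumPy sketch below uses made-up shapes and plain exp in place of `_gate_exp`, and omits the transpose_state_layout branch by fixing h to the [H, K, V] layout; it demonstrates only that folding the chunk axis n into the einsum subscripts preserves the result:

```python
import numpy as np

# Toy shapes: B batch, NT chunks, C chunk length, H heads, K key dim, V value dim.
B, NT, C, H, K, V = 2, 3, 4, 2, 3, 2
rng = np.random.default_rng(2)
q_c = rng.standard_normal((B, NT, C, H, K))
g_c = rng.standard_normal((B, NT, C, H, K))
h = rng.standard_normal((B, NT, H, K, V))
A_c = rng.standard_normal((B, NT, C, H, C))
v_c = rng.standard_normal((B, NT, C, H, V))
scale = 0.5

# Loop version, mirroring the original per-chunk einsums.
o_list = []
for i in range(NT):
    qg = q_c[:, i] * np.exp(g_c[:, i])
    o_inter = scale * np.einsum('bchk,bhkv->bchv', qg, h[:, i])
    o_intra = np.einsum('bchl,blhv->bchv', A_c[:, i], v_c[:, i])
    o_list.append(o_inter + o_intra)
o_loop = np.stack(o_list, axis=1)

# Batched version: add the chunk axis n to every operand's subscripts.
qg_all = q_c * np.exp(g_c)
o_vec = (scale * np.einsum('bnchk,bnhkv->bnchv', qg_all, h)
         + np.einsum('bnchl,bnlhv->bnchv', A_c, v_c))

assert np.allclose(o_loop, o_vec)
```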

…bf16 xfail)

- Add JAX_PLATFORMS=cpu to all 3 KDA test files to prevent JAX from
  grabbing GPU memory and causing PyTorch CUBLAS allocation failures
- Fix import: use `chunk_kda` function instead of `chunk` module
- Mark bf16 chunk test as xfail: FLA Triton stores Akk inverse in
  bfloat16 while CPU ref accumulates in float32, causing large divergence

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@lingebeng changed the title from "[Feat] Add implement for kda_chunk_fwd" to "[Feat] Add implement for cpu_kda_chunk_fwd" on Apr 14, 2026


Labels

  • beaver/needs-split
  • cpu-ref (Modifies tops/cpu/ reference implementations)
