[tx] Implement cutlass kernel for ragged_dot with group_offset #896

pcmoritz · 2026-01-19T07:01:43Z

This brings down the step time of

uv run --with wandb --with tinker==0.3.0 sl_loop.py     base_url=http://localhost:8000     model_name=Qwen/Qwen3-30B-A3B lora_rank=1 max_length=512

with

uv run --extra gpu --extra tinker -m tx.tinker.api     --base-model Qwen/Qwen3-30B-A3B     --backend-config '{"max_lora_adapters": 2, "max_lora_rank": 8, "expert_parallel_size": 8, "train_micro_batch_size": 1, "shard_attention_heads": false}'

from 40s to 20s. I spent some time tuning the tile sizes and also tried different tile sizes and configurations for different settings (e.g. the different projections, or the low-k setting for LoRA), but that made only a very small difference and isn't worth the added complexity for now.
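For readers unfamiliar with the operation, here is a minimal pure-Python reference for what a ragged (grouped) matmul with a group offset computes. It is a sketch under stated assumptions, not the kernel in this PR: the function name `ragged_dot_reference` is hypothetical, and the semantics assumed here are that `group_sizes` describes every global group (expert), `rhs` holds only the local groups starting at `group_offset`, and rows belonging to non-local groups are left as zeros.

```python
def ragged_dot_reference(lhs, rhs, group_sizes, group_offset=0):
    """Reference grouped matmul (hypothetical helper, assumed semantics).

    lhs:          m rows of length k; rows of the same group are contiguous
    rhs:          list of local weight matrices, each k x n
    group_sizes:  number of lhs rows for each *global* group
    group_offset: global index of the first group stored in rhs (assumption)
    """
    num_local = len(rhs)
    n = len(rhs[0][0])
    out = [[0.0] * n for _ in lhs]  # rows of non-local groups stay zero
    row = 0
    for g, size in enumerate(group_sizes):
        local = g - group_offset
        if 0 <= local < num_local:
            w = rhs[local]
            for r in range(row, row + size):
                # Plain row-vector times matrix product for this row.
                out[r] = [
                    sum(lhs[r][i] * w[i][j] for i in range(len(w)))
                    for j in range(n)
                ]
        row += size  # advance past this group's rows, local or not
    return out


# Three global groups with 1, 2 and 1 rows; this shard holds groups 1 and 2.
lhs = [[1, 2], [3, 4], [5, 6], [7, 8]]
rhs = [
    [[1, 0], [0, 1]],  # weights for global group 1 (identity)
    [[2, 0], [0, 2]],  # weights for global group 2 (scale by 2)
]
result = ragged_dot_reference(lhs, rhs, group_sizes=[1, 2, 1], group_offset=1)
```

In this example the first row belongs to global group 0, which is not stored locally, so its output row stays zero. Under expert parallelism, each shard would compute such a partial result for its own experts.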


pcmoritz added 30 commits January 17, 2026 18:04

add cutlass ragged dot

12cc124

update

70dce5f

update

d70e010

update

ac85d10

update

106e4ae

add backward

94eb625

update

cf80c97

update

7f1fe1d

use grouped gemm

1d9df09

update

527c1a0

update

656756f

update

aee36c7

Merge branch 'tx-ragged-dot-cutlass' of github.com:pcmoritz/SkyRL int…

cfb3404

…o tx-ragged-dot-cutlass

update

6e9ead9

update

63914e7

update

f92d00d

fix

ad0bfee

update

70cba86

optimize

3f4dd25

fixes

3f6669d

optimize

7b22f86

try to use clusters

accff8e

update schedule

f1fb36c

try tile size

b1c48f4

update

046a033

optimize

5b14a8a

optimize

23a74e5

simplify

4c86409

simplify

2dcce20

add lto

e731efe

pcmoritz and others added 30 commits January 19, 2026 14:10

fix

5cf3521

update tile size

2fb40bc

fine grained sweeping (works before)

c23c9e7

optimize

2b1c3d6

Revert to state at 2fb40bc (update tile size)

2282958

Reverts changes from commits c23c9e7 and 2b1c3d6. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

use kernel for everything

f4cfed2

update

3feab96

update

f03a1e0

update

89b452f

revert

2b997fa

update

ddf1ab7

fix

fc8c75b

update

b0e14f3

update

79b0d44

add tests for lora

1a7485f

update

8fa5c2e

update

f6a6a92

update

2db8415

update

ae19ae9

update tiles

988cc06

update

0167744

update

3dfcc14

update

50357b4

update

58956a4

update

447305d

update

c1f1ab1

update

fa814f1

update

f72b285

update

2ef0d7e

update

59be86f
