[WIP] MQA Logits Gluon Path Activation and New Flag by cagrikymk · Pull Request #794 · ROCm/ATOM

cagrikymk · 2026-05-15T03:37:59Z

This PR relies on Aiter commit 09e21f3ee9c6243b15e5995353dd793092fb6f05, which adds a skip mechanism that leaves the logits outside the start and end indices untouched. It also introduces a Gluon kernel for fp8 MQA logits, which is used on gfx950.
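To make the skip semantics concrete, here is a minimal pure-Python reference sketch. It assumes the logits take the weighted-ReLU form common to DeepSeek-style indexers (`sum_h w[m][h] * relu(q[m][h] · k[n])`); the function name, argument layout, and that formula are illustrative assumptions, not the Aiter kernel's actual interface. The point is only that entries outside `[starts[m], ends[m])` are never written.

```python
def mqa_logits_reference(q, k, weights, starts, ends, out):
    """Hypothetical reference: write out[m][n] only for starts[m] <= n < ends[m].

    q[m][h] is the query vector for token m and head h, k[n] is the single
    shared key vector (multi-query attention), weights[m][h] is a per-head
    weight. Entries of `out` outside the window are left untouched.
    """
    for m, heads in enumerate(q):
        lo = max(starts[m], 0)
        hi = min(ends[m], len(k))
        for n in range(lo, hi):
            acc = 0.0
            for h, q_vec in enumerate(heads):
                dot = sum(a * b for a, b in zip(q_vec, k[n]))
                acc += weights[m][h] * max(dot, 0.0)  # ReLU, then weight
            out[m][n] = acc
    return out


# Two query tokens, one head, head dim 2, three keys.
q = [[[1.0, 0.0]], [[0.0, 1.0]]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = [[1.0], [2.0]]
starts, ends = [0, 1], [2, 3]

SENTINEL = float("-inf")  # pre-filled value that must survive outside the window
out = [[SENTINEL] * len(k) for _ in q]
mqa_logits_reference(q, k, weights, starts, ends, out)
```

Pre-filling the output with a sentinel such as `-inf` is one way a caller can rely on the untouched region, for example before a top-k selection.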

The Gluon kernel requires Triton >= 3.6. Best performance requires top-of-tree (ToT) Triton, but 3.6 is the minimum version.
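A version gate along these lines could guard the Gluon path. This is a sketch only: `gluon_path_supported` and `MIN_GLUON_TRITON` are hypothetical names, and the flag this PR actually adds may work differently.

```python
import re

# Minimum Triton (major, minor) stated in this PR for the Gluon kernel.
MIN_GLUON_TRITON = (3, 6)


def parse_major_minor(version):
    """Extract (major, minor) from strings like '3.6.0' or '3.6.0+git1234'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def gluon_path_supported(triton_version):
    """Return True when the given Triton version meets the minimum."""
    return parse_major_minor(triton_version) >= MIN_GLUON_TRITON


# In a real environment the argument would be triton.__version__.
print(gluon_path_supported("3.5.1"))  # False
print(gluon_path_supported("3.6.0"))  # True
```

Comparing tuples of integers avoids the string-comparison trap where "3.10" would sort before "3.6".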

I tested DeepSeek-V4-Flash on MI355 with this PR, using the official releases of Triton 3.5.1 and Triton 3.6. The data is below.

TLDR:

With triton_kernels pinned to 3.5.1, just upgrading Triton to 3.6 gives roughly a 6-9% throughput improvement. This PR improves prefill performance (8K/1) by about another 2%.

I am not sure whether other models or architectures impose hard constraints that force us to stay on 3.5.1.

|Metric|Baseline #791 (Triton 3.5.1)|Baseline #791 (Triton 3.6)|This PR + #791 (Triton 3.6)|Δ vs Baseline 3.6|Δ vs Baseline 3.5.1|
|------|---------------------------:|-------------------------:|--------------------------:|----------------:|-------------------:|
|gsm8k accuracy (flexible/strict)|0.94|0.97|0.95|−0.02|+0.01|
|C=64, 1K/1K: Output tok/s|3900.08|4153.60|4160.90|+0.18%|+6.69%|
|C=64, 1K/1K: Total tok/s|7800.16|8307.21|8321.79|+0.18%|+6.69%|
|C=64, 8K/1: Output tok/s|5.25|5.72|5.83|+1.92%|+11.05%|
|C=64, 8K/1: Total tok/s|43025.42|46866.63|47779.22|+1.95%|+11.05%|
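The Δ columns are plain relative changes in throughput. A small sketch that reproduces them from the reported numbers (the helper name `pct_delta` is just for illustration):

```python
def pct_delta(new, base):
    """Relative change of `new` over `base`, in percent."""
    return (new - base) / base * 100.0


# (Triton 3.5.1 baseline, Triton 3.6 baseline, this PR) from the summary table.
rows = {
    "1K/1K output tok/s": (3900.08, 4153.60, 4160.90),
    "8K/1 output tok/s": (5.25, 5.72, 5.83),
    "8K/1 total tok/s": (43025.42, 46866.63, 47779.22),
}

for name, (base_351, base_36, this_pr) in rows.items():
    print(
        f"{name}: {pct_delta(this_pr, base_36):+.2f}% vs 3.6, "
        f"{pct_delta(this_pr, base_351):+.2f}% vs 3.5.1"
    )
```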

Baseline (only #791) with Triton/triton_kernels 3.5.1:

Accuracy:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     3|exact_match|↑  | 0.94|±  |0.0239|
|     |       |strict-match    |     3|exact_match|↑  | 0.94|±  |0.0239|
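For reading the accuracy numbers, it helps to compare the gaps between runs against the reported Stderr. The sketch below assumes each run evaluated about 100 gsm8k questions; that sample size is not stated in this PR and is only inferred because it reproduces the reported Stderr values.

```python
import math


def binomial_stderr(p, n):
    """Standard error of a proportion using the sample (n - 1) variance."""
    return math.sqrt(p * (1.0 - p) / (n - 1))


def gap_in_stderrs(p_a, se_a, p_b, se_b):
    """Difference between two accuracies in units of their combined stderr."""
    return abs(p_a - p_b) / math.sqrt(se_a**2 + se_b**2)


N_ASSUMED = 100  # assumption: inferred from the Stderr column, not stated

for acc in (0.94, 0.97, 0.95):
    print(acc, round(binomial_stderr(acc, N_ASSUMED), 4))

# Largest gap in the data: 0.94 (Triton 3.5.1) vs 0.97 (Triton 3.6 baseline).
print(round(gap_in_stderrs(0.94, 0.0239, 0.97, 0.0171), 2))
```

Under that assumption, the largest accuracy gap is about one combined standard error, so the accuracy differences between the three configurations are plausibly noise rather than a real change.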

C=64, 1K/1K, TP=8:

============ Serving Benchmark Result ============
Successful requests:                     512       
Benchmark duration (s):                  134.43    
Total input tokens:                      524288    
Total generated tokens:                  524288    
Request throughput (req/s):              3.81      
Output token throughput (tok/s):         3900.08   
Total Token throughput (tok/s):          7800.16   
---------------Time to First Token----------------
Mean TTFT (ms):                          1023.25   
Median TTFT (ms):                        925.26    
P99 TTFT (ms):                           1458.16   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          15.42     
Median TPOT (ms):                        15.37     
P99 TPOT (ms):                           16.09     
---------------Inter-token Latency----------------
Mean ITL (ms):                           15.40     
Median ITL (ms):                         14.98     
P99 ITL (ms):                            16.59     
==================================================
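As a sanity check, the throughput lines in this report follow from the token counts and the benchmark duration. The sketch below recomputes them for this run; small differences are expected because the duration is printed with only two decimals.

```python
def throughput(count, duration_s):
    """Items per second over the whole benchmark duration."""
    return count / duration_s


# Values copied from the report above (Triton/triton_kernels 3.5.1, 1K/1K).
requests = 512
input_tokens = 524288
output_tokens = 524288
duration_s = 134.43

req_per_s = throughput(requests, duration_s)
output_tok_per_s = throughput(output_tokens, duration_s)
total_tok_per_s = throughput(input_tokens + output_tokens, duration_s)

print(f"req/s:        {req_per_s:.2f}")
print(f"output tok/s: {output_tok_per_s:.2f}")
print(f"total tok/s:  {total_tok_per_s:.2f}")
```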

C=64, 8K/1, TP=8:

============ Serving Benchmark Result ============
Successful requests:                     512       
Benchmark duration (s):                  97.50     
Total input tokens:                      4194304   
Total generated tokens:                  512       
Request throughput (req/s):              5.25      
Output token throughput (tok/s):         5.25      
Total Token throughput (tok/s):          43025.42  
---------------Time to First Token----------------
Mean TTFT (ms):                          11458.85  
Median TTFT (ms):                        12159.25  
P99 TTFT (ms):                           12193.56  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          0.00      
Median TPOT (ms):                        0.00      
P99 TPOT (ms):                           0.00      
---------------Inter-token Latency----------------
Mean ITL (ms):                           0.17      
Median ITL (ms):                         0.07      
P99 ITL (ms):                            0.63      
==================================================

Baseline (only #791) with Triton 3.6.0 and triton_kernels 3.5.1:

Accuracy:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     3|exact_match|↑  | 0.97|±  |0.0171|
|     |       |strict-match    |     3|exact_match|↑  | 0.97|±  |0.0171|

C=64, 1K/1K, TP=8:

============ Serving Benchmark Result ============
Successful requests:                     512       
Benchmark duration (s):                  126.22    
Total input tokens:                      524288    
Total generated tokens:                  524288    
Request throughput (req/s):              4.06      
Output token throughput (tok/s):         4153.60   
Total Token throughput (tok/s):          8307.21   
---------------Time to First Token----------------
Mean TTFT (ms):                          1000.78   
Median TTFT (ms):                        1153.46   
P99 TTFT (ms):                           1620.05   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          14.44     
Median TPOT (ms):                        14.39     
P99 TPOT (ms):                           15.16     
---------------Inter-token Latency----------------
Mean ITL (ms):                           14.43     
Median ITL (ms):                         14.00     
P99 ITL (ms):                            20.39     
==================================================

C=64, 8K/1, TP=8:

============ Serving Benchmark Result ============
Successful requests:                     512       
Benchmark duration (s):                  89.51     
Total input tokens:                      4194304   
Total generated tokens:                  512       
Request throughput (req/s):              5.72      
Output token throughput (tok/s):         5.72      
Total Token throughput (tok/s):          46866.63  
---------------Time to First Token----------------
Mean TTFT (ms):                          10519.73  
Median TTFT (ms):                        11167.73  
P99 TTFT (ms):                           11178.21  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          0.00      
Median TPOT (ms):                        0.00      
P99 TPOT (ms):                           0.00      
---------------Inter-token Latency----------------
Mean ITL (ms):                           0.20      
Median ITL (ms):                         0.07      
P99 ITL (ms):                            1.07      
==================================================

This PR on top of #791 with Triton 3.6.0 and triton_kernels 3.5.1:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     3|exact_match|↑  | 0.95|±  |0.0219|
|     |       |strict-match    |     3|exact_match|↑  | 0.95|±  |0.0219|

C=64, 1K/1K, TP=8:

============ Serving Benchmark Result ============
Successful requests:                     512       
Benchmark duration (s):                  126.00    
Total input tokens:                      524288    
Total generated tokens:                  524288    
Request throughput (req/s):              4.06      
Output token throughput (tok/s):         4160.90   
Total Token throughput (tok/s):          8321.79   
---------------Time to First Token----------------
Mean TTFT (ms):                          971.13    
Median TTFT (ms):                        937.85    
P99 TTFT (ms):                           1432.78   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          14.44     
Median TPOT (ms):                        14.44     
P99 TPOT (ms):                           15.13     
---------------Inter-token Latency----------------
Mean ITL (ms):                           14.43     
Median ITL (ms):                         14.01     
P99 ITL (ms):                            15.82     
==================================================

C=64, 8K/1, TP=8:

============ Serving Benchmark Result ============
Successful requests:                     512       
Benchmark duration (s):                  87.80     
Total input tokens:                      4194304   
Total generated tokens:                  512       
Request throughput (req/s):              5.83      
Output token throughput (tok/s):         5.83      
Total Token throughput (tok/s):          47779.22  
---------------Time to First Token----------------
Mean TTFT (ms):                          10314.71  
Median TTFT (ms):                        10938.01  
P99 TTFT (ms):                           11014.60  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          0.00      
Median TPOT (ms):                        0.00      
P99 TPOT (ms):                           0.00      
---------------Inter-token Latency----------------
Mean ITL (ms):                           0.13      
Median ITL (ms):                         0.05      
P99 ITL (ms):                            0.77      
==================================================

add new flag

76c4441

cagrikymk changed the title ~~MQA Logits Gluon Path Activation and New Flag~~ [WIP] MQA Logits Gluon Path Activation and New Flag May 15, 2026

cagrikymk marked this pull request as draft May 15, 2026 15:45

cagrikymk wants to merge 1 commit into main from cagri/mqa_logits_change
