[WIP] MQA Logits Gluon Path Activation and New Flag #794

Draft
cagrikymk wants to merge 1 commit into main from cagri/mqa_logits_change

Conversation

@cagrikymk

This PR relies on Aiter commit 09e21f3ee9c6243b15e5995353dd793092fb6f05, which adds a skip mechanism that leaves the logits beyond the start and end indices untouched. It also introduces a Gluon kernel for fp8 MQA logits, which is used on gfx950.
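To make the "leave logits beyond start and end untouched" behavior concrete, here is a minimal pure-Python reference sketch. It is not the Aiter kernel: the function name, argument layout, and the scoring formula (a per-head weighted sum of ReLU'd dot products) are assumptions chosen for illustration. The only point is that entries outside `[starts[m], ends[m])` are never written.

```python
def mqa_logits_reference(q, k, weights, starts, ends, out):
    """Hypothetical reference: write out[m][n] only for starts[m] <= n < ends[m].

    q:       [num_queries][num_heads][dim]
    k:       [num_keys][dim]  (one shared key head, as in MQA)
    weights: [num_queries][num_heads]
    out:     preallocated [num_queries][num_keys]; entries outside the
             per-query window are left exactly as the caller set them.
    """
    for m, q_heads in enumerate(q):
        for n in range(starts[m], ends[m]):
            score = 0.0
            for h, q_vec in enumerate(q_heads):
                dot = sum(a * b for a, b in zip(q_vec, k[n]))
                # Assumed scoring rule: ReLU on the dot product, then head weighting.
                score += weights[m][h] * max(dot, 0.0)
            out[m][n] = score
    return out


SENTINEL = 123.0  # arbitrary prefill value so untouched entries are visible

q = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [2.0, 0.0]]]
k = [[1.0, 2.0], [3.0, -1.0], [0.5, 0.5], [-1.0, -1.0]]
weights = [[1.0, 0.5], [0.25, 1.0]]
starts, ends = [0, 1], [2, 3]

out = [[SENTINEL] * len(k) for _ in q]
mqa_logits_reference(q, k, weights, starts, ends, out)
print(out)
```

With a skip mechanism like this, the caller decides what the out-of-window entries contain (for example by pre-filling the buffer once), rather than the kernel rewriting them on every call.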

The Gluon kernel requires Triton >= 3.6. Best performance requires top-of-tree (ToT) Triton, but 3.6 is the minimum version.
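A version gate for this requirement could look roughly like the sketch below. The helper names `parse_major_minor` and `gluon_path_supported` are hypothetical and are not taken from this PR or from Aiter; the only fact used from the description is the 3.6 minimum.

```python
import re

MIN_GLUON_TRITON = (3, 6)  # minimum version stated in this PR description


def parse_major_minor(version: str) -> tuple:
    """Extract (major, minor) from strings such as '3.6.0' or '3.6.0+gitabc'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def gluon_path_supported(triton_version: str) -> bool:
    """Hypothetical gate: True when the given Triton version is at least 3.6."""
    return parse_major_minor(triton_version) >= MIN_GLUON_TRITON


for candidate in ("3.5.1", "3.6.0", "3.7.0+gitabc"):
    print(candidate, gluon_path_supported(candidate))
```

In a real environment the argument would typically be `triton.__version__`; the string is passed in explicitly here so the sketch runs without Triton installed.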

I ran several tests with DeepSeek-V4-Flash on MI355 using this PR with Triton 3.5.1 and Triton 3.6 (official release versions). The data is below.

TLDR:

With triton_kernels pinned to 3.5.1, upgrading Triton alone to 3.6 improves total throughput by roughly 6-9% (about 6.5% at 1K/1K and 8.9% at 8K/1). On top of that, this PR improves prefill performance (8K/1) by about another 2%.

I am not sure whether other hard constraints, related to other models or architectures, force us to stay on 3.5.1.

| Metric | Baseline #791 (Triton 3.5.1) | Baseline #791 (Triton 3.6) | This PR + #791 (Triton 3.6) | Δ vs Baseline 3.6 | Δ vs Baseline 3.5.1 |
|---|---:|---:|---:|---:|---:|
| gsm8k accuracy (flexible/strict) | 0.94 | 0.97 | 0.95 | −0.02 | +0.01 |
| C=64, 1K/1K: Output tok/s | 3900.08 | 4153.60 | 4160.90 | +0.18% | +6.69% |
| C=64, 1K/1K: Total tok/s | 7800.16 | 8307.21 | 8321.79 | +0.18% | +6.69% |
| C=64, 8K/1: Output tok/s | 5.25 | 5.72 | 5.83 | +1.92% | +11.05% |
| C=64, 8K/1: Total tok/s | 43025.42 | 46866.63 | 47779.22 | +1.95% | +11.05% |
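The Δ columns above are plain relative changes, and they can be recomputed from the throughput figures as a sanity check. The sketch below uses only numbers from the table; the dictionary layout and the `pct_change` helper are illustrative choices.

```python
def pct_change(new: float, base: float) -> float:
    """Relative change of `new` over `base`, in percent."""
    return (new - base) / base * 100.0


# (baseline Triton 3.5.1, baseline Triton 3.6, this PR on Triton 3.6)
results = {
    "C=64, 1K/1K output tok/s": (3900.08, 4153.60, 4160.90),
    "C=64, 1K/1K total tok/s": (7800.16, 8307.21, 8321.79),
    "C=64, 8K/1 output tok/s": (5.25, 5.72, 5.83),
    "C=64, 8K/1 total tok/s": (43025.42, 46866.63, 47779.22),
}

for name, (base_351, base_36, this_pr) in results.items():
    print(
        f"{name}: "
        f"Triton upgrade alone {pct_change(base_36, base_351):+.2f}%, "
        f"PR vs 3.6 {pct_change(this_pr, base_36):+.2f}%, "
        f"PR vs 3.5.1 {pct_change(this_pr, base_351):+.2f}%"
    )
```

The "Triton upgrade alone" figure is not a column in the table, but it is where the 6-9% range in the TLDR comes from.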

Baseline (only #791) with Triton/triton_kernels 3.5.1:

Accuracy:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     3|exact_match|↑  | 0.94|±  |0.0239|
|     |       |strict-match    |     3|exact_match|↑  | 0.94|±  |0.0239|

C=64, 1K/1K, TP=8:

============ Serving Benchmark Result ============
Successful requests:                     512       
Benchmark duration (s):                  134.43    
Total input tokens:                      524288    
Total generated tokens:                  524288    
Request throughput (req/s):              3.81      
Output token throughput (tok/s):         3900.08   
Total Token throughput (tok/s):          7800.16   
---------------Time to First Token----------------
Mean TTFT (ms):                          1023.25   
Median TTFT (ms):                        925.26    
P99 TTFT (ms):                           1458.16   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          15.42     
Median TPOT (ms):                        15.37     
P99 TPOT (ms):                           16.09     
---------------Inter-token Latency----------------
Mean ITL (ms):                           15.40     
Median ITL (ms):                         14.98     
P99 ITL (ms):                            16.59     
==================================================
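The throughput lines in these reports appear to be token and request counts divided by the benchmark duration. The sketch below checks that reading against the 1K/1K run above; the `throughput` helper is hypothetical, and small mismatches are expected because the reported duration is rounded to two decimals.

```python
def throughput(input_tokens: int, output_tokens: int, requests: int, duration_s: float) -> dict:
    """Derive request and token throughput from counts and wall-clock duration."""
    return {
        "req_per_s": requests / duration_s,
        "output_tok_per_s": output_tokens / duration_s,
        "total_tok_per_s": (input_tokens + output_tokens) / duration_s,
    }


# Values copied from the C=64, 1K/1K, Triton 3.5.1 baseline report.
derived = throughput(
    input_tokens=524288,
    output_tokens=524288,
    requests=512,
    duration_s=134.43,
)
for key, value in derived.items():
    print(f"{key}: {value:.2f}")
```

The same relationship explains why the 8K/1 runs report identical request and output token throughput: each request generates exactly one token.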

C=64, 8K/1, TP=8:

============ Serving Benchmark Result ============
Successful requests:                     512       
Benchmark duration (s):                  97.50     
Total input tokens:                      4194304   
Total generated tokens:                  512       
Request throughput (req/s):              5.25      
Output token throughput (tok/s):         5.25      
Total Token throughput (tok/s):          43025.42  
---------------Time to First Token----------------
Mean TTFT (ms):                          11458.85  
Median TTFT (ms):                        12159.25  
P99 TTFT (ms):                           12193.56  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          0.00      
Median TPOT (ms):                        0.00      
P99 TPOT (ms):                           0.00      
---------------Inter-token Latency----------------
Mean ITL (ms):                           0.17      
Median ITL (ms):                         0.07      
P99 ITL (ms):                            0.63      
==================================================

Baseline (only #791) with Triton 3.6.0 and triton_kernels 3.5.1:

Accuracy:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     3|exact_match|↑  | 0.97|±  |0.0171|
|     |       |strict-match    |     3|exact_match|↑  | 0.97|±  |0.0171|

C=64, 1K/1K, TP=8:

============ Serving Benchmark Result ============
Successful requests:                     512       
Benchmark duration (s):                  126.22    
Total input tokens:                      524288    
Total generated tokens:                  524288    
Request throughput (req/s):              4.06      
Output token throughput (tok/s):         4153.60   
Total Token throughput (tok/s):          8307.21   
---------------Time to First Token----------------
Mean TTFT (ms):                          1000.78   
Median TTFT (ms):                        1153.46   
P99 TTFT (ms):                           1620.05   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          14.44     
Median TPOT (ms):                        14.39     
P99 TPOT (ms):                           15.16     
---------------Inter-token Latency----------------
Mean ITL (ms):                           14.43     
Median ITL (ms):                         14.00     
P99 ITL (ms):                            20.39     
==================================================

C=64, 8K/1, TP=8:

============ Serving Benchmark Result ============
Successful requests:                     512       
Benchmark duration (s):                  89.51     
Total input tokens:                      4194304   
Total generated tokens:                  512       
Request throughput (req/s):              5.72      
Output token throughput (tok/s):         5.72      
Total Token throughput (tok/s):          46866.63  
---------------Time to First Token----------------
Mean TTFT (ms):                          10519.73  
Median TTFT (ms):                        11167.73  
P99 TTFT (ms):                           11178.21  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          0.00      
Median TPOT (ms):                        0.00      
P99 TPOT (ms):                           0.00      
---------------Inter-token Latency----------------
Mean ITL (ms):                           0.20      
Median ITL (ms):                         0.07      
P99 ITL (ms):                            1.07      
==================================================

This PR on top of #791 with Triton 3.6.0 and triton_kernels 3.5.1:

Accuracy:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     3|exact_match|↑  | 0.95|±  |0.0219|
|     |       |strict-match    |     3|exact_match|↑  | 0.95|±  |0.0219|
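The three gsm8k scores (0.94, 0.97, 0.95) differ by less than their reported standard errors would let us distinguish. A rough check, assuming the runs are independent and a normal approximation is acceptable (both are assumptions, not something stated in this PR), is sketched below with a hypothetical `within_noise` helper.

```python
import math
from itertools import combinations


def within_noise(a: float, se_a: float, b: float, se_b: float, z: float = 1.96) -> bool:
    """True when |a - b| is within z combined standard errors (about 95% for z=1.96)."""
    combined_se = math.hypot(se_a, se_b)
    return abs(a - b) <= z * combined_se


# (accuracy, stderr) copied from the three gsm8k tables in this description.
runs = {
    "baseline, Triton 3.5.1": (0.94, 0.0239),
    "baseline, Triton 3.6": (0.97, 0.0171),
    "this PR, Triton 3.6": (0.95, 0.0219),
}

for (name_a, (acc_a, se_a)), (name_b, (acc_b, se_b)) in combinations(runs.items(), 2):
    verdict = "within noise" if within_noise(acc_a, se_a, acc_b, se_b) else "differs"
    print(f"{name_a} vs {name_b}: {verdict}")
```

Under these assumptions, none of the pairwise accuracy differences is distinguishable from sampling noise.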

C=64, 1K/1K, TP=8:

============ Serving Benchmark Result ============
Successful requests:                     512       
Benchmark duration (s):                  126.00    
Total input tokens:                      524288    
Total generated tokens:                  524288    
Request throughput (req/s):              4.06      
Output token throughput (tok/s):         4160.90   
Total Token throughput (tok/s):          8321.79   
---------------Time to First Token----------------
Mean TTFT (ms):                          971.13    
Median TTFT (ms):                        937.85    
P99 TTFT (ms):                           1432.78   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          14.44     
Median TPOT (ms):                        14.44     
P99 TPOT (ms):                           15.13     
---------------Inter-token Latency----------------
Mean ITL (ms):                           14.43     
Median ITL (ms):                         14.01     
P99 ITL (ms):                            15.82     
==================================================

C=64, 8K/1, TP=8:

============ Serving Benchmark Result ============
Successful requests:                     512       
Benchmark duration (s):                  87.80     
Total input tokens:                      4194304   
Total generated tokens:                  512       
Request throughput (req/s):              5.83      
Output token throughput (tok/s):         5.83      
Total Token throughput (tok/s):          47779.22  
---------------Time to First Token----------------
Mean TTFT (ms):                          10314.71  
Median TTFT (ms):                        10938.01  
P99 TTFT (ms):                           11014.60  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          0.00      
Median TPOT (ms):                        0.00      
P99 TPOT (ms):                           0.00      
---------------Inter-token Latency----------------
Mean ITL (ms):                           0.13      
Median ITL (ms):                         0.05      
P99 ITL (ms):                            0.77      
==================================================

@cagrikymk cagrikymk changed the title MQA Logits Gluon Path Activation and New Flag [WIP] MQA Logits Gluon Path Activation and New Flag May 15, 2026
@cagrikymk cagrikymk marked this pull request as draft May 15, 2026 15:45