GroupedBlockQuantizeOp PR1: Adding codegen support #5776
1. Refactor the existing block_layout op and block_quantization_kernel to reuse existing runtime functions; 2. Add the runtime function for GroupedBlockQuantizeOp.
Review updated until commit 3051db0
Description
Relevant files
| Category | Files |
|---|---|
| Enhancement | 25 files |
| Tests | 1 file |
PR Reviewer Guide
Here are some key observations to aid the review process:
- 🧪 PR contains tests
- ⚡ Recommended focus areas for review: Runtime Function Call Validation
Test failures
(Medium, 2) nvFuser float8 kernel requires Blackwell (SM 10.0) – LayoutOpTest.GroupedBlockQuantizeOp failing on A100 & H100
| Test Name | A100 | H100 | Source |
|---|---|---|---|
| LayoutOpTest.GroupedBlockQuantizeOp | ❌ | ❌ | Link |
Greptile Summary
Important Files Changed
Confidence score: 4/5
26 files reviewed, 6 comments
## Context
The series of PRs is trying to enable a single kernel for quantization and layout handling of the block scaling factor on grouped tensors.
The existing solution for nvfp4 quantization of the activation tensor for grouped_mm relies on two operations:
i. BlockQuantizationOp produces scaled_tv and block_scaling_factor.
ii. block_scaling_factor needs to be processed by PreprocessGroupedMatmulInputSf in order to satisfy the swizzle layout required by grouped_mm kernels.
The series of PRs tries to merge the two operations into a single one.
### Stacked PRs
#5775 GroupedBlockQuantizationOp PR0: Adding runtime function
#5776 GroupedBlockQuantizationOp PR1: Adding codegen support
#5777 GroupedBlockQuantizationOp PR2: Adding python API and updating llama4 benchmark
## What's in this PR
1. Refactor the existing runtime function for reuse by the new op.
2. Add the runtime function for GroupedBlockQuantizeOp.
Greptile Overview
Greptile Summary
This PR adds GroupedBlockQuantizationOp, a new IR node that merges the functionality of BlockQuantizationOp and PreprocessGroupedMatmulInputSf into a single operation. This optimization enables single-kernel quantization and layout handling for grouped matrix multiplication operations.
Key Implementation Details
Core Operation: The new op takes a high-precision input tensor and produces:
- A quantized output tensor (same shape as input)
- Block scaling factors with swizzled layout directly suitable for grouped_mm
Architecture: The implementation follows the standard pattern for composite operations:
- IR node definition in composite_nodes.h/cpp
- User-facing API in ops/arith.cpp
- Index lowering in device_lower/pass/index.cpp adds logical indices
- Codegen in codegen.cpp generates runtime function call
- Comprehensive validation in device_lower/validation.cpp
- Scheduler integration for pointwise scheduling
- Test coverage validates correctness against reference implementation
Key Technical Points:
- Supports Float4_e2m1fn (nvfp4) output with Block128x4 layout
- Requires specific parallelization: TIDx, BIDx, and Group parallel types
- Group dimension must be 2/4 for FP32 or 2/4/8 for BF16/FP16 inputs
- Block scales output has allocation domain with padding for swizzled layout
- Vectorization capped at 4 when this op is present
The implementation is thorough and integrates well across all compiler passes including dispatch registration, logical domain mapping, broadcast domain analysis, scheduler topology checks, and non-divisible split handling.
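The group-size rule in the list above can be sketched as a small predicate. This is an illustration only: `InputType`, `isSupportedGroupSize`, and `capVectorization` are hypothetical names, not nvFuser APIs, and the real checks live in the validation and codegen passes. The limits (2/4 for FP32, 2/4/8 for BF16/FP16) are consistent with a 16-byte vector access, although the PR does not state that as the reason.

```cpp
#include <cassert>

// Hypothetical stand-in for the input data types named in the review.
enum class InputType { Float32, BFloat16, Half };

// Returns true when `group_size` is one of the values the review lists as
// supported: 2/4 for FP32 inputs, 2/4/8 for BF16/FP16 inputs.
constexpr bool isSupportedGroupSize(InputType type, int group_size) {
  if (group_size == 2 || group_size == 4) {
    return true;
  }
  // A group of 8 is only listed for the 16-bit input types.
  return group_size == 8 && type != InputType::Float32;
}

// The review also notes that vectorization is capped at 4 when the op is
// present; this helper clamps a requested factor accordingly.
constexpr int capVectorization(int requested) {
  return requested < 4 ? requested : 4;
}
```

For example, `isSupportedGroupSize(InputType::Half, 8)` holds while `isSupportedGroupSize(InputType::Float32, 8)` does not, and `capVectorization(8)` yields 4.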
Confidence Score: 4/5
- This PR is safe to merge with minor considerations around attribute access patterns
- The implementation is comprehensive and follows existing patterns well. All necessary integration points are covered including dispatch, validation, index lowering, codegen, scheduler, and domain mapping. The test validates correctness. Score of 4 (not 5) reflects the complexity of the attribute access pattern where row_idx/col_idx are added during index lowering but accessed in codegen - while this works correctly, it creates a subtle dependency that could be error-prone if not well understood
- Pay close attention to csrc/codegen.cpp and csrc/ops/arith.cpp - ensure the attribute indexing pattern (attributeVal(2) and attributeVal(3)) remains valid if constructor signature changes
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| csrc/ir/composite_nodes.h | 5/5 | Added GroupedBlockQuantizationOp class definition with proper accessor methods, constructor signature, and evaluate method stub |
| csrc/ir/composite_nodes.cpp | 5/5 | Implemented GroupedBlockQuantizationOp constructor, toString, toInlineString, and evaluate methods following existing patterns |
| csrc/ops/arith.cpp | 4/5 | Implemented groupedBlockQuantize with proper validation, domain setup, and layout allocation - minor concern about row_idx/col_idx not being passed to initial op construction |
| csrc/codegen.cpp | 4/5 | Added codegen handler for GroupedBlockQuantizationOp with template args, validation, and runtime function call - assumes row_idx/col_idx attributes always present at indices 2 and 3 |
| csrc/device_lower/pass/index.cpp | 5/5 | Added index lowering for GroupedBlockQuantizationOp with proper logical index computation and runtime validation |
| csrc/device_lower/validation.cpp | 5/5 | Added comprehensive validation for GroupedBlockQuantizationOp including memory type checks, parallelization requirements, and scheduling constraints |
| tests/cpp/test_layout_op.cpp | 5/5 | Added test for GroupedBlockQuantizeOp validating quantized output and block scaling factor layout against reference implementation |
Sequence Diagram
sequenceDiagram
participant User
participant OpsAPI as ops/arith.cpp
participant IRNode as GroupedBlockQuantizationOp
participant IndexLower as device_lower/pass/index
participant Validation as device_lower/validation
participant Codegen as codegen.cpp
participant Runtime as Runtime Function
User->>OpsAPI: groupedBlockQuantize(input, offsets, layout)
OpsAPI->>OpsAPI: Validate inputs & data types
OpsAPI->>OpsAPI: Create logical & allocation domains
OpsAPI->>IRNode: Create GroupedBlockQuantizationOp<br/>(without row_idx/col_idx)
IRNode->>OpsAPI: Return quantized_tensor & block_scales
Note over IndexLower: Device Lowering Phase
IndexLower->>IndexLower: Compute logical indices
IndexLower->>IndexLower: Validate inner dim divisibility
IndexLower->>IRNode: Create lowered op<br/>(WITH row_idx/col_idx)
Validation->>IRNode: Validate memory types
Validation->>IRNode: Check parallelization (TIDx, BIDx, Group)
Validation->>IRNode: Verify group dimension & contiguity
Note over Codegen: Code Generation Phase
Codegen->>IRNode: Extract group size from loop domain
Codegen->>IRNode: Validate group size (2/4 or 2/4/8)
Codegen->>IRNode: Access row_idx/col_idx via attributeVal(2,3)
Codegen->>Runtime: Generate call to<br/>bq::grouped_block_quantize_to_nvfp4
Runtime-->>User: Execute quantization kernel
      .slice(0, 0, m_g)
-     .slice(1, 0, k);
+     .slice(1, 0, k)
Good addition of .to(ref.dtype()) to ensure dtype matching in the validation. This handles the case where the reference and output might have different dtypes due to the layout transformation.
Greptile Overview
Greptile Summary
This PR adds comprehensive codegen support for GroupedBlockQuantizationOp, a new IR node that merges BlockQuantizationOp and PreprocessGroupedMatmulInputSf into a single operation for improved performance in grouped matrix multiplication quantization scenarios.
What Changed
Core IR Implementation:
- Added GroupedBlockQuantizationOp class in composite_nodes.h/cpp with full constructor, accessors, and evaluation methods
- The operation takes input tensor, input/output offsets, layout specification (Block128x4), k/g dimensions, optional global scale, and block_size parameter
- Produces quantized output and block scaling factors with swizzled layout
Codegen & Lowering:
- Implemented code generation handler that validates group sizes (2/4/8 for half-precision, 2/4 for float), builds template arguments for layout parameters (32, 4, 4), and generates calls to the bq::grouped_block_quantize_to_nvfp4 runtime function
- Added index lowering that creates TensorIndex nodes, validates input divisibility by block size, and computes logical indices
- Comprehensive validation checks memory types, parallelization requirements (TIDx, BIDx, Group ID), and schedule ordering
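To make the layout parameters (32, 4, 4) more concrete, here is a minimal sketch of how a scale factor's (row, col) position could map to an offset inside one 128x4 tile. The formula is an assumption modeled on the commonly documented 128x4 block-scale swizzle, and `swizzledOffsetInTile` is a hypothetical helper; the actual indexing is done by the runtime function added in #5775 and may differ.

```cpp
#include <cassert>
#include <cstdint>

// Template parameters mirror the (32, 4, 4) values quoted in the review for
// Block128x4. The index formula itself is an assumption for illustration.
template <int BlockRowOuter, int BlockRowInner, int BlockCol>
constexpr int64_t swizzledOffsetInTile(int64_t row, int64_t col) {
  // A tile covers BlockRowOuter * BlockRowInner rows and BlockCol columns.
  const int64_t row_in_tile = row % (BlockRowOuter * BlockRowInner);
  const int64_t col_in_tile = col % BlockCol;
  // Split the row into a fast-varying outer part and a slow-varying inner part.
  const int64_t outer = row_in_tile % BlockRowOuter;
  const int64_t inner = row_in_tile / BlockRowOuter;
  // Mixed-radix combination: every (row, col) in the tile gets a unique slot.
  return outer * (BlockRowInner * BlockCol) + inner * BlockCol + col_in_tile;
}
```

With the Block128x4 parameters, row 1 lands 16 slots after row 0, while row 32 lands only 4 slots after row 0, so the 512 slots of a tile are filled without gaps or collisions.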
Compiler Integration:
- Registered in dispatch system for proper IR traversal
- Updated all device lowering passes (sync analysis, trivial broadcast, non-divisible split)
- Integrated with scheduler (pointwise, pointwise_non_tma) for special handling of quantization ops
- Modified fusion segmenter to erase allocation domain (Transform Replay cannot handle padding transformations)
- Updated logical domain mapping, tensor metadata, and kernel handling
Testing:
- Added GroupedBlockQuantizeOp test case that validates against BlockQuantizationOp reference and verifies grouped layout with proper padding/swizzling
Issue Found
Critical Bug in Index Lowering (csrc/device_lower/pass/index.cpp:488):
The block_size parameter is hardcoded to 16 when creating the lowered GroupedBlockQuantizationOp, but it should use grouped_bqop->blockSize() to respect the original operation's block_size parameter. This means any non-default block size will be ignored during compilation.
Confidence Score: 3/5
- This PR has one critical logic bug that needs to be fixed before merging
- The implementation is comprehensive and well-structured with proper validation, dispatch registration, and test coverage. However, there is a confirmed logic bug in the index lowering pass where block_size is hardcoded to 16 instead of using the operation's actual block_size parameter. This will cause incorrect behavior for any non-default block sizes. The rest of the implementation appears solid with thorough integration across all compiler passes.
- Pay close attention to csrc/device_lower/pass/index.cpp - the hardcoded block_size needs to be fixed to use grouped_bqop->blockSize()
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| csrc/ir/composite_nodes.h | 5/5 | Adds GroupedBlockQuantizationOp class declaration with proper constructor, accessors (blockScales, quantizedOutput, in, blockSize, hasGlobalScale, globalScale, inputOffsets, outputOffsets, k, g, layout), and evaluation methods |
| csrc/ir/composite_nodes.cpp | 5/5 | Implements GroupedBlockQuantizationOp constructor, toString/toInlineString methods, and evaluate placeholder (throws as fallback kernel not yet implemented) |
| csrc/codegen.cpp | 4/5 | Adds codegen handler for GroupedBlockQuantizationOp that validates group size, builds template/function arguments including layout parameters (32, 4, 4 for Block128x4), and generates call to bq::grouped_block_quantize_to_nvfp4 runtime function |
| csrc/device_lower/pass/index.cpp | 3/5 | Implements index lowering for GroupedBlockQuantizationOp, creates TensorIndex nodes, validates input divisibility by block size, computes logical indices - contains bug where block_size is hardcoded to 16 instead of using grouped_bqop->blockSize() |
| csrc/device_lower/validation.cpp | 5/5 | Adds comprehensive validation for GroupedBlockQuantizationOp including memory type checks, parallelization requirements (TIDx, BIDx, Group ID), extent validation, and schedule ordering constraints |
| csrc/dispatch.h | 5/5 | Registers GroupedBlockQuantizationOp in DISPATCH_FOR_ALL_EXPRS macro for IR dispatch system integration |
| csrc/ops/arith.cpp | 5/5 | Implements groupedBlockQuantize API function that creates quantized tensor and block scales outputs with proper allocation domains, then instantiates GroupedBlockQuantizationOp |
| csrc/ops/arith.h | 5/5 | Declares groupedBlockQuantize API function with parameters for input, offsets, layout, global scaling factor, block size, and output dtype |
| tests/cpp/test_layout_op.cpp | 5/5 | Adds GroupedBlockQuantizeOp test case that validates quantized output and block scales against BlockQuantizationOp reference, verifies grouped layout transformation with proper padding/swizzling |
| csrc/fusion_segmenter.cpp | 5/5 | Updates fusion segmentation to erase allocation domain for GroupedBlockQuantizationOp's blockScales output (similar to PreprocessGroupedMatmulInputSf) since Transform Replay cannot handle allocation domain transformations with padding |
| csrc/scheduler/pointwise.cpp | 5/5 | Updates scheduler to detect GroupedBlockQuantizationOp (alongside BlockQuantizationOp) for special handling of block quantization operations in pointwise scheduling |
Sequence Diagram
sequenceDiagram
participant User as User Code
participant API as groupedBlockQuantize API
participant IR as GroupedBlockQuantizationOp
participant Scheduler as Scheduler/Validation
participant IndexLower as Index Lowering
participant Codegen as Code Generator
participant Runtime as Runtime Function
User->>API: Call groupedBlockQuantize(input, offsets, layout, ...)
API->>API: Create quantized_tensor output
API->>API: Create block_scales output with allocation domain
API->>IR: Create GroupedBlockQuantizationOp
IR->>IR: Store inputs, offsets, layout, k, g, block_size
Note over Scheduler: Compilation Phase
Scheduler->>IR: Validate operation
Scheduler->>Scheduler: Check memory types (Local)
Scheduler->>Scheduler: Validate parallelization (TIDx, BIDx, Group)
Scheduler->>Scheduler: Check group size (2/4/8 for half, 2/4 for float)
IndexLower->>IR: Lower indices
IndexLower->>IndexLower: Compute logical indices for row/col
IndexLower->>IndexLower: Validate input divisible by block_size
IndexLower->>IR: Create lowered GroupedBlockQuantizationOp
Note over IndexLower: BUG: Hardcodes block_size=16
Codegen->>IR: Generate code
Codegen->>Codegen: Extract group_size from loop domain
Codegen->>Codegen: Build template args (has_global_scale, 32, 4, 4, group_size)
Codegen->>Codegen: Build function args (input, output, scales, indices, offsets, k, g)
Codegen->>Runtime: Call bq::grouped_block_quantize_to_nvfp4<...>(...)
Runtime->>Runtime: Perform quantization with layout transformation
Runtime-->>User: Return quantized_tensor and block_scales
csrc/device_lower/pass/index.cpp (Outdated)
      grouped_bqop->k(),
      grouped_bqop->g(),
      grouped_bqop->globalScale(),
      16,
The block_size parameter is hardcoded to 16, but it should use grouped_bqop->blockSize() to respect the original operation's block_size parameter.
Suggested change:
-     16,
+     grouped_bqop->blockSize(),
The GroupedBlockQuantizationOp constructor accepts a block_size parameter (line 1063 in composite_nodes.h), and the operation stores this value as an attribute accessible via blockSize() method (line 1081-1083). However, during index lowering, this value is being replaced with a hardcoded 16, which means any non-default block size specified by the user will be ignored.
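The effect described in this comment can be shown with a stripped-down model. `QuantOpSketch` and both lowering helpers are hypothetical and only model the block size; they are not the nvFuser classes, whose constructor takes many more arguments.

```cpp
#include <cassert>

// Hypothetical, heavily simplified stand-in for the op: only the block size
// attribute is modeled.
struct QuantOpSketch {
  int block_size;
  int blockSize() const { return block_size; }
};

// Mirrors the flagged pattern: the lowered op is rebuilt with a literal 16,
// so the original op's block size is silently dropped.
inline QuantOpSketch lowerWithHardcodedBlockSize(const QuantOpSketch& /*op*/) {
  return QuantOpSketch{16};
}

// Mirrors the suggested fix: forward the original op's block size.
inline QuantOpSketch lowerWithForwardedBlockSize(const QuantOpSketch& op) {
  return QuantOpSketch{op.blockSize()};
}
```

Lowering an op created with a block size of 32 yields 16 under the first helper and 32 under the second, which is the difference the suggested change is meant to close.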
Greptile Overview
Greptile Summary
This PR adds comprehensive codegen support for GroupedBlockQuantizationOp, a new IR node that merges BlockQuantizationOp and PreprocessGroupedMatmulInputSf into a single operation for improved performance in grouped matrix multiplication scenarios.
Key Changes:
- IR Node Implementation (csrc/ir/composite_nodes.h/cpp): New GroupedBlockQuantizationOp class with inputs (input tensor, input/output offsets, k, g, optional global_scale) and outputs (quantized tensor, block scales). The operation stores block size and layout as attributes, with row/col indices added during lowering.
- Code Generation (csrc/codegen.cpp): Handler generates runtime function call to bq::grouped_block_quantize_to_nvfp4 with template parameters for layout configuration (block_row_outer=32, block_row_inner=4, block_col=4 for Block128x4) and proper function arguments including offset tensors and dimension scalars.
- Index Lowering (csrc/device_lower/pass/index.cpp): Computes logical indices for the 2D matrix and validates that the inner dimension is divisible by the block size before creating the lowered operation.
- API Function (csrc/ops/arith.cpp/.h): groupedBlockQuantize() function with validation for supported output types (Float4_e2m1fn with block_size=16, Float8_e4m3fn with block_size=32) and proper tensor domain construction with layout-specific allocation domains.
- Compiler Integration: Updates to dispatch macros, logical domain mapping, broadcast domain tracking, and scheduler to properly handle the new operation throughout the compilation pipeline.
- Testing (tests/cpp/test_layout_op.cpp): Comprehensive test validating both quantized output correctness and proper block scaling factor layout with grouped operations.
Critical Issue Found: Index lowering hardcodes block_size to 16 (line 488) instead of using grouped_bqop->blockSize(), which will break Float8_e4m3fn quantization that requires block_size=32.
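The pairing of output type and block size mentioned in this summary (16 for Float4_e2m1fn, 32 for Float8_e4m3fn) can be sketched as a lookup plus a check. The enum and function names below are hypothetical and only illustrate the rule; the real validation is in groupedBlockQuantize().

```cpp
#include <cassert>

// Hypothetical stand-in for the output data types named in the review.
enum class OutputType { Float4_e2m1fn, Float8_e4m3fn, Other };

// Block size the review associates with each output type; 0 marks a type
// that this sketch treats as unsupported.
constexpr int requiredBlockSize(OutputType type) {
  switch (type) {
    case OutputType::Float4_e2m1fn:
      return 16;
    case OutputType::Float8_e4m3fn:
      return 32;
    default:
      return 0;
  }
}

// A configuration is accepted only when the block size matches the output type.
constexpr bool isSupportedQuantConfig(OutputType type, int block_size) {
  return requiredBlockSize(type) != 0 && requiredBlockSize(type) == block_size;
}
```

Under this rule a lowering pass that always passes 16 can never produce a valid Float8_e4m3fn configuration, which is why the hardcoded value matters.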
Confidence Score: 4/5
- This PR is generally safe to merge after fixing the hardcoded block_size bug, as it follows established patterns and has good test coverage
- Score reflects one critical logic bug (hardcoded block_size=16 in index lowering) that breaks Float8_e4m3fn support. Otherwise, the implementation is thorough and well-integrated across the codebase with consistent patterns matching PreprocessGroupedMatmulInputSf and BlockQuantizationOp. The changes are localized and the test validates the main use case.
- Pay close attention to csrc/device_lower/pass/index.cpp line 488 - the hardcoded block_size must be changed to grouped_bqop->blockSize() before merging
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| csrc/ir/composite_nodes.h | 5/5 | Adds GroupedBlockQuantizationOp class declaration with proper accessors for inputs, outputs, and attributes including layout, block size, and offset tensors |
| csrc/ir/composite_nodes.cpp | 5/5 | Implements GroupedBlockQuantizationOp constructor, toString methods, and evaluation placeholder - correctly manages inputs/outputs and attributes |
| csrc/codegen.cpp | 5/5 | Adds codegen handler for GroupedBlockQuantizationOp that generates runtime function call with template parameters for layout and correct function arguments |
| csrc/device_lower/pass/index.cpp | 3/5 | Implements index lowering for GroupedBlockQuantizationOp with logical index computation - contains hardcoded block_size bug at line 488 |
| csrc/ops/arith.cpp | 5/5 | Implements groupedBlockQuantize API function with proper validation for block sizes (16 for nvfp4, 32 for mxfp8) and tensor domain construction with layout allocation |
| tests/cpp/test_layout_op.cpp | 5/5 | Adds comprehensive test for GroupedBlockQuantizationOp validating quantized output and block scaling factor layout correctness with grouped operations |
Sequence Diagram
sequenceDiagram
participant User as User/Python API
participant API as ops/arith.cpp
participant IR as IR Builder
participant Fusion as Fusion Graph
participant Scheduler as Scheduler
participant IndexLower as Index Lowering
participant Codegen as Code Generator
participant Runtime as Runtime Kernel
User->>API: groupedBlockQuantize(input, offsets, layout)
API->>API: Validate inputs & block size
API->>IR: Create tensor domains for output & scales
IR->>IR: Build allocation domain with layout
API->>Fusion: IrBuilder::create<GroupedBlockQuantizationOp>
Fusion->>Fusion: Add GroupedBlockQuantizationOp to graph
Note over Scheduler: Compilation Phase
Scheduler->>Scheduler: Detect GroupedBlockQuantizationOp
Scheduler->>Scheduler: Cap vectorization factor (≤4)
Scheduler->>Scheduler: Apply pointwise scheduling
Note over IndexLower: Lowering Phase
IndexLower->>IndexLower: handle(GroupedBlockQuantizationOp)
IndexLower->>IndexLower: Compute logical indices [row, col]
IndexLower->>IndexLower: Validate block size divisibility
IndexLower->>Fusion: Create lowered op with indices
Note over Codegen: Code Generation
Codegen->>Codegen: handle(GroupedBlockQuantizationOp)
Codegen->>Codegen: Extract group_size from loop domain
Codegen->>Codegen: Build template args (layout params)
Codegen->>Codegen: Build function args (tensors, offsets, k, g)
Codegen->>Runtime: Generate call to bq::grouped_block_quantize_to_nvfp4
Note over Runtime: Execution
Runtime->>Runtime: Perform block quantization per group
Runtime->>Runtime: Apply layout transformation (Block128x4)
Runtime->>User: Return quantized tensor & block scales
!test
!test
Greptile Overview
Greptile Summary
Overview
This PR adds codegen support for GroupedBlockQuantizationOp, a new IR node that merges BlockQuantizationOp and PreprocessGroupedMatmulInputSf into a single operation. This optimization enables single-kernel quantization and layout handling for grouped tensors in nvfp4 quantization workflows.
Key Changes
Core IR Implementation
- New IR Node: GroupedBlockQuantizationOp class added to composite_nodes.{h,cpp} with proper constructor, accessors, and attribute management
- Input/Output Structure: Takes input tensor, input_offsets, output_offsets, k, g scalars, optional global_scale, plus row_idx/col_idx for indexing
- Attributes: Stores block_size (index 0) and layout (index 1) as data attributes, plus optional row_idx/col_idx as regular attributes
Codegen and Lowering
- Code Generation: Comprehensive handler in codegen.cpp that generates calls to bq::grouped_block_quantize_to_nvfp4 with proper template arguments and validation
- Index Lowering: device_lower/pass/index.cpp implements index lowering with runtime validation that inner dimension is divisible by block_size
- Validation: Extensive validation in device_lower/validation.cpp checks memory types, parallelization requirements (TIDx, BIDx, Group), and prevents z-axis parallelization
Integration Points
- Dispatch: Registered in DISPATCH_FOR_ALL_EXPRS macro in dispatch.h
- API: groupedBlockQuantize() function in ops/arith.{h,cpp} with input validation for data types, tensor dimensions, and block size requirements
- Scheduler: Updates to scheduler/utils.cpp to handle offset tensors and block scales in caching logic
- Other Files: Propagated through logical_domain_map, tensor_metadata, fusion_segmenter, and various analysis passes
Testing
- Test Coverage: Single test case GroupedBlockQuantizeOp in test_layout_op.cpp validates correctness against reference implementation using BlockQuantizationOp
- Test Scenario: 3 groups with [100, 150, 262] tokens, verifies both quantized output and block scaling factor layout
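For the test scenario above, the per-group offsets can be worked out by hand. The sketch below assumes that each group's rows are padded up to a multiple of 128 in the block-scale output (suggested by the Block128x4 layout name, but not spelled out in this PR); `computeGroupOffsets` and `GroupOffsets` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Rounds `value` up to the next multiple of `multiple`.
constexpr int64_t roundUp(int64_t value, int64_t multiple) {
  return (value + multiple - 1) / multiple * multiple;
}

// Hypothetical container for the two offset arrays the op consumes.
struct GroupOffsets {
  std::vector<int64_t> input;   // start row of each group in the input
  std::vector<int64_t> output;  // start row of each group in the padded output
  int64_t padded_rows = 0;      // total rows of the padded output
};

// Input offsets are plain prefix sums; output offsets are prefix sums of the
// group sizes after padding each group to `row_multiple` rows (an assumption).
inline GroupOffsets computeGroupOffsets(
    const std::vector<int64_t>& tokens_per_group,
    int64_t row_multiple = 128) {
  GroupOffsets result;
  int64_t input_rows = 0;
  for (int64_t tokens : tokens_per_group) {
    result.input.push_back(input_rows);
    result.output.push_back(result.padded_rows);
    input_rows += tokens;
    result.padded_rows += roundUp(tokens, row_multiple);
  }
  return result;
}
```

For groups of [100, 150, 262] tokens this gives input offsets [0, 100, 250], output offsets [0, 128, 384], and 768 padded rows in total, compared with 512 unpadded rows.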
Architecture Consistency
The implementation follows established patterns from BlockQuantizationOp and PreprocessGroupedMatmulInputSf:
- Similar attribute indexing patterns (adjusted for different parameter counts)
- Consistent validation approach across lowering passes
- Proper integration with scheduler and domain mapping infrastructure
Code Quality
✅ Strengths:
- Comprehensive validation at multiple stages (API, index lowering, device validation)
- Proper error messages with context
- Consistent with existing codebase patterns
- Well-structured separation of concerns
Limitations:
- Currently only supports Float4_e2m1fn output (enforced in codegen.cpp:2003-2005)
- Evaluation method not implemented (placeholder throws, which is acceptable for ops with runtime kernel implementations)
- Single test case covers basic functionality but could benefit from additional edge case testing
Recommendation
The implementation is solid and ready to merge. The code is well-structured, properly validated, and consistently integrated across the codebase. The limitation to nvfp4 output is documented and intentional for this PR.
Confidence Score: 5/5
- This PR is safe to merge with minimal risk - comprehensive validation, consistent patterns, and proper testing
- Score of 5 reflects thorough implementation with validation at every stage, consistent integration patterns matching existing ops, comprehensive error handling, and working test coverage. No critical issues found.
- No files require special attention - implementation is consistent and well-validated across all changed files
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| csrc/ir/composite_nodes.h | 5/5 | Adds GroupedBlockQuantizationOp class definition with proper accessor methods and attributes for merging quantization and layout operations |
| csrc/ir/composite_nodes.cpp | 5/5 | Implements GroupedBlockQuantizationOp constructor, toString methods, and placeholder evaluate; follows existing patterns from BlockQuantizationOp |
| csrc/codegen.cpp | 5/5 | Adds codegen handler for GroupedBlockQuantizationOp that generates runtime function calls with proper template args and validation |
| csrc/device_lower/pass/index.cpp | 5/5 | Implements index lowering for GroupedBlockQuantizationOp with validation that inner dimension is divisible by block size |
| csrc/device_lower/validation.cpp | 5/5 | Adds comprehensive validation for GroupedBlockQuantizationOp including memory type checks, parallelization requirements, and group dimension verification |
| csrc/ops/arith.cpp | 5/5 | Implements groupedBlockQuantize with comprehensive input validation, output tensor creation, and proper attribute setup |
| csrc/dispatch.h | 5/5 | Registers GroupedBlockQuantizationOp in dispatcher macro for proper IR node handling across the codebase |
| csrc/scheduler/utils.cpp | 5/5 | Updates scheduler utilities to handle GroupedBlockQuantizationOp offset tensors and block scales correctly in caching logic |
| tests/cpp/test_layout_op.cpp | 5/5 | Adds test case for GroupedBlockQuantizeOp verifying correct quantization and grouped layout transformation against reference |
Sequence Diagram
sequenceDiagram
participant User as Python/C++ API
participant API as groupedBlockQuantize()
participant IR as GroupedBlockQuantizationOp
participant Scheduler as Scheduler
participant IndexLower as Index Lowering
participant Validation as Device Validation
participant Codegen as Code Generator
participant Runtime as CUDA Runtime
User->>API: Call groupedBlockQuantize(input, offsets, layout)
API->>API: Validate input dtype (Float/BF16/Half)
API->>API: Validate 2D tensor
API->>API: Check block_size (16 for nvfp4, 32 for mxfp8)
API->>IR: Create GroupedBlockQuantizationOp node
IR->>IR: Store inputs: input, offsets, k, g
IR->>IR: Store attributes: block_size, layout
IR->>Scheduler: Schedule fusion
Scheduler->>Scheduler: Apply pointwise scheduler
Scheduler->>Scheduler: Handle offset tensors in caching
Scheduler->>IndexLower: Lower to kernel IR
IndexLower->>IndexLower: Compute logical indices
IndexLower->>IndexLower: Validate inner_dim % block_size == 0
IndexLower->>IR: Create lowered GroupedBlockQuantizationOp
IR->>Validation: Validate lowered op
Validation->>Validation: Check MemoryType::Local
Validation->>Validation: Verify TIDx and BIDx present
Validation->>Validation: Check Group dimension exists
Validation->>Validation: Ensure no z-axis parallelization
Validation->>Codegen: Pass validated op
Codegen->>Codegen: Extract group_size from loop domain
Codegen->>Codegen: Validate group_size (2/4/8 for half, 2/4 for float)
Codegen->>Codegen: Build template args (layout params, group_size)
Codegen->>Codegen: Build function args (tensors, indices, offsets)
Codegen->>Runtime: Generate call to bq::grouped_block_quantize_to_nvfp4
Runtime->>Runtime: Execute quantization kernel
Runtime-->>User: Return quantized_tensor + block_scales
Context
The series of PRs is trying to enable a single kernel for quantization and layout handling of the block scaling factor on grouped tensors.
The existing solution for nvfp4 quantization of the activation tensor for grouped_mm relies on two operations:
i. BlockQuantizationOp produces scaled_tv and block_scaling_factor.
ii. block_scaling_factor needs to be processed by PreprocessGroupedMatmulInputSf in order to satisfy the swizzle layout required by grouped_mm kernels.
The series of PRs tries to merge the two operations into a single one.
Stacked PRs
#5775 GroupedBlockQuantizationOp PR0: Adding runtime function
#5776 GroupedBlockQuantizationOp PR1: Adding codegen support
#5777 GroupedBlockQuantizationOp PR2: Adding python API and updating llama4 benchmark
What's in this PR
Adding Fusion IR node GroupedBlockQuantizationOp. The operation is a combination of BlockQuantizationOp and PreprocessGroupedMatmulInputSf, where it inherits all the validation / checks from the two operations.
The operation is similar to BlockQuantizationOp, with the exception that:
i. The block scaling factor output doesn't have the swizzle logic represented as allocation domain transformations;
ii. It takes additional inputs (input_offsets and output_offsets) to facilitate group indexing, similar to PreprocessGroupedMatmulInputSf.
Adding cpp test case for GroupedBlockQuantizationOp.