
Conversation

@jjsjann123 (Collaborator) commented Jan 8, 2026

Context

This series of PRs enables a single kernel that performs both quantization and layout handling of the block scaling factor on grouped tensors.

The existing solution for nvfp4 quantization of an activation tensor for grouped_mm relies on two operations:
i. BlockQuantizationOp produces scaled_tv and block_scaling_factor.
ii. block_scaling_factor is then processed by PreprocessGroupedMatmulInputSf to satisfy the swizzle layout required by grouped_mm kernels.

This series merges the two operations into one.
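
For illustration, the current two-kernel flow looks roughly like the sketch below. blockQuantize is the API used by the reference path in the test further down; the preprocess call name and signature are assumptions for illustration only.

// Kernel 1: quantize, producing the scaled tensor and block scaling factor.
auto q = blockQuantize(inp, /*global_scale=*/nullptr, /*block_size=*/16, false);
// Kernel 2: swizzle the scaling factor into the layout grouped_mm expects
// (hypothetical API name for the PreprocessGroupedMatmulInputSf op).
auto sf = preprocessGroupedMatmulInputSf(
    q.block_scales, input_offsets, output_offsets,
    BlockScalingFactorLayout::Block128x4);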

Stacked PRs

#5775 GroupedBlockQuantizationOp PR0: Adding runtime function
#5776 GroupedBlockQuantizationOp PR1: Adding codegen support
#5777 GroupedBlockQuantizationOp PR2: Adding python API and updating llama4 benchmark

What's in this PR

  1. Adding the Fusion IR node GroupedBlockQuantizationOp. The operation combines BlockQuantizationOp and PreprocessGroupedMatmulInputSf, inheriting all validation and checks from both. It is similar to BlockQuantizationOp, except that:
    i. the block scaling factor output does not represent the swizzle logic as allocation domain transformations;
    ii. it takes additional inputs (input_offsets and output_offsets) to facilitate group indexing, similar to PreprocessGroupedMatmulInputSf (see the usage sketch after this list).

  2. Adding a C++ test case for GroupedBlockQuantizationOp.
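
With the new op, both steps collapse into a single call; the signature below is taken from the test added in this PR:

auto outs = groupedBlockQuantize(
    inp, input_offsets, output_offsets, BlockScalingFactorLayout::Block128x4);
// outs.quantized_tensor : nvfp4 data, same shape as inp
// outs.block_scales     : block scaling factors, already in the swizzled layout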

1. Refactored existing block_layout op and block_quantization_kernel to re-use existing runtime functions;
2. Added runtime function for GroupedBlockQuantizeOp
@jjsjann123 jjsjann123 changed the base branch from main to jj/grouped_block_quantize_op_0 January 8, 2026 00:36
@jjsjann123 jjsjann123 changed the title Jj/grouped block quantize op 1 PR1: adding codegen support for GroupedBlockQuantizationOp Jan 8, 2026
github-actions bot commented Jan 8, 2026

Review updated until commit 3051db0

Description

  • Add GroupedBlockQuantizationOp IR node combining BlockQuantizationOp and PreprocessGroupedMatmulInputSf

  • Implement codegen support with runtime function call generation for grouped quantization

  • Integrate new operation across device lowering, validation, and scheduler components

  • Add test case validating grouped block quantization with offset-based group indexing

Changes walkthrough

Relevant files: Enhancement (25 files)

| File | Description | Diff |
|---|---|---|
| codegen.cpp | Add GroupedBlockQuantizationOp codegen handler with runtime function calls | +114/-0 |
| composite_nodes.h | Define GroupedBlockQuantizationOp class interface and methods | +92/-0 |
| composite_nodes.cpp | Implement GroupedBlockQuantizationOp constructor and string methods | +58/-0 |
| arith.h | Add groupedBlockQuantize API declaration | +9/-0 |
| arith.cpp | Implement groupedBlockQuantize function with validation and tensor creation | +141/-0 |
| validation.cpp | Add GroupedBlockQuantizationOp validation with scheduling requirements | +196/-1 |
| index.cpp | Implement IndexLowering handler for GroupedBlockQuantizationOp | +54/-0 |
| trivial_broadcast.cpp | Add GroupedBlockQuantizationOp broadcast domain handling | +11/-0 |
| sync_information.cpp | Update SyncMap to handle GroupedBlockQuantizationOp block scales | +10/-5 |
| non_divisible_split.cpp | Update NonDivisiblePredicateInfo for GroupedBlockQuantizationOp | +6/-1 |
| pointwise.cpp | Integrate GroupedBlockQuantizationOp detection in pointwise scheduler | +23/-1 |
| pointwise_non_tma.cpp | Update vectorization logic for GroupedBlockQuantizationOp outputs | +8/-1 |
| utils.cpp | Update cache utilities to handle GroupedBlockQuantizationOp offsets | +12/-7 |
| registry_utils.cpp | Add GroupedBlockQuantizationOp checks for scheduler topology | +20/-0 |
| logical_domain_map.cpp | Update domain mapping for GroupedBlockQuantizationOp inputs | +29/-9 |
| domain_map.cpp | Update domain map validation for GroupedBlockQuantizationOp | +13/-0 |
| tensor_metadata.cpp | Add allocation validation skip for GroupedBlockQuantizationOp block scales | +6/-0 |
| kernel.cpp | Update KernelIrScanner to detect GroupedBlockQuantizationOp | +4/-0 |
| utils.cpp | Update isTvOp to include GroupedBlockQuantizationOp | +1/-0 |
| utils.cpp | Update hasUniformSiblings to include GroupedBlockQuantizationOp | +5/-1 |
| fusion_segmenter.cpp | Update fusion segmentation for GroupedBlockQuantizationOp block scales | +6/-1 |
| trivial_broadcast.h | Add GroupedBlockQuantizationOp handler declaration | +2/-0 |
| index.h | Add GroupedBlockQuantizationOp handler declaration | +1/-0 |
| dispatch.h | Add GroupedBlockQuantizationOp to dispatch macro | +1/-0 |
| logical_domain_map.h | Add GroupedBlockQuantizationOp handler declaration | +4/-0 |

Tests (1 file)

| File | Description | Diff |
|---|---|---|
| test_layout_op.cpp | Add GroupedBlockQuantizeOp test case with validation | +69/-1 |

PR Reviewer Guide

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review
Runtime Function Call Validation

The GroupedBlockQuantizationOp handler calls runtime function 'bq::grouped_block_quantize_to_nvfp4' but there's no validation that this runtime function actually exists or is properly linked. Consider adding a check or fallback mechanism.

indent() << genCall(
                "bq::grouped_block_quantize_to_nvfp4",
                template_args,
                func_args)
         << ";\n";
Complex Validation Logic

The validation for GroupedBlockQuantizationOp is quite extensive (lines 868-1058) and includes complex scheduling requirements. This could be brittle and may need more comprehensive testing with different tensor shapes and layouts.

void handle(GroupedBlockQuantizationOp* bqop) final {
  auto inp_tv = bqop->input(0)->as<TensorView>();
  auto quantized_output = bqop->quantizedOutput()->as<TensorView>();
  auto block_scaling_factor = bqop->blockScales()->as<TensorView>();
  auto output_dtype = quantized_output->dtype();

  NVF_ERROR_EQ(
      inp_tv->getMemoryType(),
      MemoryType::Local,
      "Input must be a local memory tensor. Found: ",
      inp_tv->getMemoryType());

  NVF_ERROR_EQ(
      quantized_output->getMemoryType(),
      MemoryType::Local,
      "Quantized output must be a local memory tensor. Found: ",
      quantized_output->getMemoryType());

  NVF_ERROR_EQ(
      block_scaling_factor->getMemoryType(),
      MemoryType::Global,
      "Block scaling factor must be a global memory tensor. Found: ",
      block_scaling_factor->getMemoryType());

  NVF_ERROR(
      output_dtype != DataType::Float8_e4m3fn,
      "output of Float8_e4m3fn is not yet implemented");

  if (bqop->hasGlobalScale()) {
    auto global_scale = bqop->globalScale()->as<TensorView>();

    NVF_ERROR_EQ(
        global_scale->getMemoryType(),
        MemoryType::Global,
        "Global scaling factor must be a global memory tensor. Found: ",
        global_scale->getMemoryType());

    NVF_ERROR_EQ(
        global_scale->dtype(),
        DataType::Float,
        "Global scaling factor must be of type float. Found: ",
        global_scale->dtype());
  }

  // Outputs have the same allocation domain
  // as the logical domain - no allocation domain.
  NVF_ERROR(
      !quantized_output->hasAllocation(),
      "Quantized output must not have an allocation domain.");

  IterDomain* grouped_id = nullptr;
  IterDomain* thread_x = nullptr;
  IterDomain* block_x = nullptr;
  IterDomain* thread_z = nullptr;
  IterDomain* block_z = nullptr;

  for (const auto& loop_id : quantized_output->getLoopDomain()) {
    if (loop_id->getParallelType() == ParallelType::Group) {
      grouped_id = loop_id;
    } else if (loop_id->getParallelType() == ParallelType::TIDx) {
      thread_x = loop_id;
    } else if (loop_id->getParallelType() == ParallelType::BIDx) {
      block_x = loop_id;
    } else if (loop_id->getParallelType() == ParallelType::TIDz) {
      thread_z = loop_id;
    } else if (loop_id->getParallelType() == ParallelType::BIDz) {
      block_z = loop_id;
    } else if (
        loop_id->getParallelType() == ParallelType::Serial ||
        loop_id->getParallelType() == ParallelType::Unswitch ||
        loop_id->getParallelType() == ParallelType::Unroll) {
      // Check this ID has a constant extent equal to 1
      NVF_ERROR(
          loop_id->extent()->isConstInt(),
          "Expected constant extent for Serial/Unswitch/Unroll ID in "
          "GroupedBlockQuantizationOp");
      NVF_ERROR_EQ(
          loop_id->extent()->evaluate().as<int64_t>(),
          1,
          "Expected non-TID/BID/Group ID to have extent of 1 for "
          "GroupedBlockQuantizationOp: ",
          bqop->toString());
    }
  }

  NVF_ERROR(
      grouped_id != nullptr,
      "One of the output IDs must be grouped for "
      "GroupedBlockQuantizationOp: ",
      bqop->toString());

  NVF_ERROR(
      thread_x != nullptr && block_x != nullptr,
      "Need to have both TIDx and BIDx when using "
      "GroupedBlockQuantizationOp: ",
      bqop->toString());

  NVF_ERROR(
      !thread_z && !block_z,
      "Parallelization along z axis is not supported for "
      "GroupedBlockQuantizationOp: ",
      bqop->toString());

  auto inner_extent = grouped_id->extent()->evaluate().as<int64_t>();
  auto input_dtype = inp_tv->dtype();

  NVF_ERROR(
      ((inner_extent == 4 || inner_extent == 2) &&
       input_dtype == DataType::Float) ||
          ((inner_extent == 8 || inner_extent == 4 || inner_extent == 2) &&
           (input_dtype == DataType::BFloat16 ||
            input_dtype == DataType::Half)),
      "The group dimension must be 2/4 (FP32) or 2/4/8 "
      "(BF16/FP16). Found: ",
      inner_extent,
      ". Expr: ",
      bqop->toString());

  // see [ NOTE: check scheduling requirements for block quantization ]
  auto transform_exprs = DependencyCheck::getAllExprsBetween(
      {quantized_output->getLogicalDomain().begin(),
       quantized_output->getLogicalDomain().end()},
      {quantized_output->getLoopDomain().begin(),
       quantized_output->getLoopDomain().end()});

  std::vector<IterDomain*> ids_to_transform =
      quantized_output->getLogicalDomain();

  std::deque<IterDomain*> frontier(
      quantized_output->getLogicalDomain().begin(),
      quantized_output->getLogicalDomain().end());

  // This will get the xforms from logical to loop and apply them on the
  // logical domain. We will get a loop domain minus the reordering.
  // This pass also removes all IDs from frontier that were derived using
  // non-contiguous merges.
  scheduler_utils::applyTransforms(
      ids_to_transform, transform_exprs, [&frontier](Expr* expr) {
        traverseFrontierWithContiguityCheck(frontier, expr);
      });

  // The grouped ID must correspond to the innermost loop-like domain
  NVF_ERROR(
      ids_to_transform.back() == grouped_id,
      "The grouped ID must correspond to the innermost of all splits "
      "from logical domains to loop domains for GroupedBlockQuantizationOp. "
      "TV: ",
      quantized_output->toString());

  // Iterate from the back to find TIDx, skipping group_id (last element)
  // Ensure all IDs between group_id and TIDx have extent 1
  bool found_tidx = false;
  for (auto it = ids_to_transform.rbegin() + 1; it != ids_to_transform.rend();
       ++it) {
    if (*it == thread_x) {
      found_tidx = true;
      break;
    }
    // All non-TIDx IDs between Group ID and TIDx must have extent of 1
    NVF_ERROR(
        (*it)->extent()->isConstInt() &&
            (*it)->extent()->evaluate().as<int64_t>() == 1,
        "Expected IDs between Group ID and TIDx to have extent of 1 for "
        "GroupedBlockQuantizationOp: ",
        quantized_output->toString());
  }

  NVF_ERROR(
      found_tidx,
      "TIDx must follow the Group ID in the schedule for "
      "GroupedBlockQuantizationOp: ",
      quantized_output->toString());

  // Check if grouped_id in frontier
  auto grouped_it = std::ranges::find(frontier, grouped_id);
  NVF_ERROR(
      grouped_it != frontier.end(),
      "All merge operations deriving the grouped ID must combine "
      "contiguous IDs from the logical domain for "
      "GroupedBlockQuantizationOp: ",
      quantized_output->toString());
  // Do the same for thread_x
  auto threadx_it =
      std::ranges::find(frontier.begin(), frontier.end(), thread_x);
  NVF_ERROR(
      threadx_it != frontier.end(),
      "All merge operations deriving the TIDx ID must combine "
      "contiguous IDs from the logical domain for "
      "GroupedBlockQuantizationOp: ",
      quantized_output->toString());
}
Limited Test Coverage

While a test is added for GroupedBlockQuantizationOp, it only tests one specific configuration (Block128x4 layout). Consider adding tests for different layouts, data types, and edge cases to ensure robustness.

TEST_F(LayoutOpTest, GroupedBlockQuantizeOp) {
  auto fusion_ptr = std::make_unique<Fusion>();
  Fusion& fusion = *fusion_ptr.get();
  FusionGuard fg(&fusion);

  auto inp = makeSymbolicTensor(2);
  auto offsets = makeSymbolicTensor(1, DataType::Int32);
  auto rounded_offsets = makeSymbolicTensor(1, DataType::Int32);
  fusion.addInput(inp);
  fusion.addInput(offsets);
  fusion.addInput(rounded_offsets);

  auto outs = groupedBlockQuantize(
      inp, offsets, rounded_offsets, BlockScalingFactorLayout::Block128x4);
  fusion.addOutput(castOp(DataType::Float, outs.quantized_tensor));
  fusion.addOutput(outs.block_scales);

  auto options = at::TensorOptions().dtype(at::kFloat).device(at::kCUDA, 0);
  int m = 512;
  int k = 9 * 16; // note: padded column size needs to be a multiple of 16
  auto t0 = at::randn({m, k}, options);

  // tokens per group are [100, 150, 262] respectively, so each group is
  // padded to a multiple of 128 rows. Hence the total output row span covers
  // a length of 128 + 256 + 384 = 768.
  auto t1 = at::tensor({0, 100, 250}, options.dtype(at::kInt));
  auto t2 = at::tensor({0, 128, 384}, options.dtype(at::kInt));

  // automatic scheduling.
  FusionExecutorCache executor_cache(std::move(fusion_ptr));
  auto outputs = executor_cache.runFusionWithInputs({t0, t1, t2});

  at::Tensor ref_block_sf;
  at::Tensor ref_scaled_out;
  // producing reference
  {
    std::unique_ptr<Fusion> fusion_new_op = std::make_unique<Fusion>();
    FusionGuard fg2(fusion_new_op.get());
    auto tv_in = makeContigTensor(2);
    fusion_new_op->addInput(tv_in);
    auto quantization_results =
        blockQuantize(tv_in, nullptr, /*block_size=*/16, false);

    fusion_new_op->addOutput(quantization_results.block_scales);
    fusion_new_op->addOutput(
        castOp(DataType::Float, quantization_results.quantized_tensor));
    FusionExecutorCache executor_cache(std::move(fusion_new_op));
    auto outputs_new_op = executor_cache.runFusionWithInputs({t0});
    ref_block_sf = outputs_new_op[0].as<at::Tensor>().to(at::kFloat);
    ref_scaled_out = outputs_new_op[1].as<at::Tensor>();
  }

  // check scaled output
  EXPECT_TRUE(at::allclose(ref_scaled_out, outputs[0].as<at::Tensor>()));
  // check block scaling factor
  ASSERT_TRUE(validateGroupedLayout(
      BlockScalingFactorLayout::Block128x4,
      outputs[1].as<at::Tensor>(),
      ref_block_sf,
      t1,
      t2));

  EXPECT_THAT(
      executor_cache.getMostRecentKernelRuntime()->fusionSegments()->groups(),
      UnorderedElementsAre(HeuristicIs(SchedulerType::PointWise)));
}

Test failures

  • (Medium, 2) nvFuser float8 kernel requires Blackwell (SM 10.0) – LayoutOpTest.GroupedBlockQuantizeOp failing on A100 & H100

| Test Name | A100 | H100 | Source |
|---|---|---|---|
| LayoutOpTest.GroupedBlockQuantizeOp | failing | failing | Link |

@jjsjann123 jjsjann123 changed the title PR1: adding codegen support for GroupedBlockQuantizationOp GroupedBlockQuantizeOp PR1: Adding codegen support Jan 8, 2026
@jjsjann123 jjsjann123 marked this pull request as ready for review January 8, 2026 02:17
greptile-apps bot (Contributor) commented Jan 8, 2026

Greptile Summary

  • Adds GroupedBlockQuantizationOp IR node that merges BlockQuantizationOp and PreprocessGroupedMatmulInputSf into a single operation for performance optimization in grouped matrix multiplication scenarios
  • Implements comprehensive codegen support including dispatch registration, kernel handling, scheduling integration, and validation across all compiler passes
  • Includes test case validation ensuring the new grouped operation maintains correctness while enabling single-kernel quantization and layout handling

Important Files Changed

| Filename | Overview |
|---|---|
| csrc/ir/composite_nodes.h and csrc/ir/composite_nodes.cpp | New GroupedBlockQuantizationOp class implementation with constructor, accessors, and evaluation methods combining quantization and layout-handling functionality |
| csrc/codegen.cpp | Added code generation handler for GroupedBlockQuantizationOp that generates runtime function calls with template arguments for block scaling layouts and group sizes |
| csrc/device_lower/pass/index.cpp | Implemented index lowering handler with validation for 2D matrices and runtime divisibility checks for block-size compatibility |
| csrc/device_lower/validation.cpp | Added extensive validation logic duplicating BlockQuantizationOp constraints while supporting grouped indexing with ParallelType::Group |
| csrc/ops/arith.cpp and csrc/ops/arith.h | New groupedBlockQuantize function implementation and declaration that creates IR nodes with proper domain setup and allocation handling |

Confidence score: 4/5

  • This PR is generally safe to merge but requires careful review due to the complexity of the new composite operation
  • Score reflects the extensive changes across critical code generation and validation systems, though the implementation follows established patterns consistently
  • Pay close attention to the codegen handler in csrc/codegen.cpp and validation logic in csrc/device_lower/validation.cpp for correctness of the complex template argument and runtime function call generation

greptile-apps bot (Contributor) left a comment

26 files reviewed, 6 comments

jjsjann123 added a commit that referenced this pull request Jan 9, 2026
## Context

This series of PRs enables a single kernel that performs both quantization
and layout handling of the block scaling factor on grouped tensors.

The existing solution for nvfp4 quantization of an activation tensor for
grouped_mm relies on two operations:
i. BlockQuantizationOp produces scaled_tv and block_scaling_factor.
ii. block_scaling_factor is then processed by
PreprocessGroupedMatmulInputSf to satisfy the swizzle layout
required by grouped_mm kernels.

This series merges the two operations into one.

### Stacked PRs

#5775 GroupedBlockQuantizationOp PR0: Adding runtime function
#5776 GroupedBlockQuantizationOp PR1: Adding codegen support
#5777 GroupedBlockQuantizationOp PR2: Adding python API and updating
llama4 benchmark

## What's in this PR

1. Refactored the existing runtime function for re-use by the new op;
2. Added the runtime function for GroupedBlockQuantizeOp.
Base automatically changed from jj/grouped_block_quantize_op_0 to main January 9, 2026 19:53
greptile-apps bot (Contributor) left a comment

Greptile Summary

This PR adds GroupedBlockQuantizationOp, a new IR node that merges the functionality of BlockQuantizationOp and PreprocessGroupedMatmulInputSf into a single operation. This optimization enables single-kernel quantization and layout handling for grouped matrix multiplication operations.

Key Implementation Details

Core Operation: The new op takes a high-precision input tensor and produces:

  1. A quantized output tensor (same shape as input)
  2. Block scaling factors with swizzled layout directly suitable for grouped_mm

Architecture: The implementation follows the standard pattern for composite operations:

  • IR node definition in composite_nodes.h/cpp
  • User-facing API in ops/arith.cpp
  • Index lowering in device_lower/pass/index.cpp adds logical indices
  • Codegen in codegen.cpp generates runtime function call
  • Comprehensive validation in device_lower/validation.cpp
  • Scheduler integration for pointwise scheduling
  • Test coverage validates correctness against reference implementation

Key Technical Points:

  • Supports Float4_e2m1fn (nvfp4) output with Block128x4 layout
  • Requires specific parallelization: TIDx, BIDx, and Group parallel types
  • Group dimension must be 2/4 for FP32 or 2/4/8 for BF16/FP16 inputs
  • Block scales output has allocation domain with padding for swizzled layout
  • Vectorization capped at 4 when this op is present
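
A minimal manual-schedule sketch satisfying these parallelization requirements, assuming nvFuser's TensorView scheduling API (flatten/split/parallelize); in practice the pointwise scheduler produces the schedule, and the split factors here are assumptions:

// Group must be innermost, TIDx immediately outside it, BIDx present,
// and no z-axis parallelization.
TensorView* out = outs.quantized_tensor; // from groupedBlockQuantize(...)
out->flatten();     // collapse contiguous logical IDs into one
out->split(0, 4);   // group factor: 2/4 for FP32, 2/4/8 for BF16/FP16
out->split(0, 128); // threads per CTA (assumed)
out->axis(0)->parallelize(ParallelType::BIDx);
out->axis(1)->parallelize(ParallelType::TIDx);
out->axis(2)->parallelize(ParallelType::Group); // innermost grouped ID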

The implementation is thorough and integrates well across all compiler passes including dispatch registration, logical domain mapping, broadcast domain analysis, scheduler topology checks, and non-divisible split handling.

Confidence Score: 4/5

  • This PR is safe to merge with minor considerations around attribute access patterns
  • The implementation is comprehensive and follows existing patterns well. All necessary integration points are covered, including dispatch, validation, index lowering, codegen, scheduler, and domain mapping. The test validates correctness. The score of 4 (not 5) reflects the complexity of the attribute access pattern, where row_idx/col_idx are added during index lowering but accessed in codegen; while this works correctly, it creates a subtle dependency that could be error-prone if not well understood.
  • Pay close attention to csrc/codegen.cpp and csrc/ops/arith.cpp - ensure the attribute indexing pattern (attributeVal(2) and attributeVal(3)) remains valid if constructor signature changes

Important Files Changed

File Analysis

| Filename | Score | Overview |
|---|---|---|
| csrc/ir/composite_nodes.h | 5/5 | Added GroupedBlockQuantizationOp class definition with proper accessor methods, constructor signature, and evaluate method stub |
| csrc/ir/composite_nodes.cpp | 5/5 | Implemented GroupedBlockQuantizationOp constructor, toString, toInlineString, and evaluate methods following existing patterns |
| csrc/ops/arith.cpp | 4/5 | Implemented groupedBlockQuantize with proper validation, domain setup, and layout allocation; minor concern about row_idx/col_idx not being passed to initial op construction |
| csrc/codegen.cpp | 4/5 | Added codegen handler for GroupedBlockQuantizationOp with template args, validation, and runtime function call; assumes row_idx/col_idx attributes are always present at indices 2 and 3 |
| csrc/device_lower/pass/index.cpp | 5/5 | Added index lowering for GroupedBlockQuantizationOp with proper logical index computation and runtime validation |
| csrc/device_lower/validation.cpp | 5/5 | Added comprehensive validation for GroupedBlockQuantizationOp including memory type checks, parallelization requirements, and scheduling constraints |
| tests/cpp/test_layout_op.cpp | 5/5 | Added test for GroupedBlockQuantizeOp validating quantized output and block scaling factor layout against a reference implementation |

Sequence Diagram

sequenceDiagram
    participant User
    participant OpsAPI as ops/arith.cpp
    participant IRNode as GroupedBlockQuantizationOp
    participant IndexLower as device_lower/pass/index
    participant Validation as device_lower/validation
    participant Codegen as codegen.cpp
    participant Runtime as Runtime Function

    User->>OpsAPI: groupedBlockQuantize(input, offsets, layout)
    OpsAPI->>OpsAPI: Validate inputs & data types
    OpsAPI->>OpsAPI: Create logical & allocation domains
    OpsAPI->>IRNode: Create GroupedBlockQuantizationOp<br/>(without row_idx/col_idx)
    IRNode->>OpsAPI: Return quantized_tensor & block_scales
    
    Note over IndexLower: Device Lowering Phase
    IndexLower->>IndexLower: Compute logical indices
    IndexLower->>IndexLower: Validate inner dim divisibility
    IndexLower->>IRNode: Create lowered op<br/>(WITH row_idx/col_idx)
    
    Validation->>IRNode: Validate memory types
    Validation->>IRNode: Check parallelization (TIDx, BIDx, Group)
    Validation->>IRNode: Verify group dimension & contiguity
    
    Note over Codegen: Code Generation Phase
    Codegen->>IRNode: Extract group size from loop domain
    Codegen->>IRNode: Validate group size (2/4 or 2/4/8)
    Codegen->>IRNode: Access row_idx/col_idx via attributeVal(2,3)
    Codegen->>Runtime: Generate call to<br/>bq::grouped_block_quantize_to_nvfp4
    Runtime-->>User: Execute quantization kernel

Comment on lines 64 to +65
.slice(0, 0, m_g)
.slice(1, 0, k);
.slice(1, 0, k)
greptile-apps bot (Contributor):

Good addition of .to(ref.dtype()) to ensure dtype matching in the validation. This handles the case where the reference and output might have different dtypes due to the layout transformation.

greptile-apps bot (Contributor) left a comment

Greptile Summary

This PR adds comprehensive codegen support for GroupedBlockQuantizationOp, a new IR node that merges BlockQuantizationOp and PreprocessGroupedMatmulInputSf into a single operation for improved performance in grouped matrix multiplication quantization scenarios.

What Changed

Core IR Implementation:

  • Added GroupedBlockQuantizationOp class in composite_nodes.h/cpp with full constructor, accessors, and evaluation methods
  • The operation takes input tensor, input/output offsets, layout specification (Block128x4), k/g dimensions, optional global scale, and block_size parameter
  • Produces quantized output and block scaling factors with swizzled layout

Codegen & Lowering:

  • Implemented code generation handler that validates group sizes (2/4/8 for half-precision, 2/4 for float), builds template arguments for layout parameters (32, 4, 4), and generates calls to bq::grouped_block_quantize_to_nvfp4 runtime function
  • Added index lowering that creates TensorIndex nodes, validates input divisibility by block size, and computes logical indices
  • Comprehensive validation checks memory types, parallelization requirements (TIDx, BIDx, Group ID), and schedule ordering
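
Putting those pieces together, the emitted call plausibly has the shape sketched below; this is a reconstruction from the summaries in this thread (parameter names and order are inferred, not verified against the diff):

// Template args: has_global_scale, layout constants for Block128x4
// (block_row_outer=32, block_row_inner=4, block_col=4), then group size.
bq::grouped_block_quantize_to_nvfp4<
    /*has_global_scale=*/false,
    /*block_row_outer=*/32,
    /*block_row_inner=*/4,
    /*block_col=*/4,
    /*group_size=*/4>(
    input, quantized_output, block_scales, // tensors
    row_idx, col_idx,                      // logical indices from lowering
    input_offsets, output_offsets,         // per-group offset tensors
    k, g);                                 // dimension scalars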

Compiler Integration:

  • Registered in dispatch system for proper IR traversal
  • Updated all device lowering passes (sync analysis, trivial broadcast, non-divisible split)
  • Integrated with scheduler (pointwise, pointwise_non_tma) for special handling of quantization ops
  • Modified fusion segmenter to erase allocation domain (Transform Replay cannot handle padding transformations)
  • Updated logical domain mapping, tensor metadata, and kernel handling

Testing:

  • Added GroupedBlockQuantizeOp test case that validates against BlockQuantizationOp reference and verifies grouped layout with proper padding/swizzling

Issue Found

Critical Bug in Index Lowering (csrc/device_lower/pass/index.cpp:488):
The block_size parameter is hardcoded to 16 when creating the lowered GroupedBlockQuantizationOp, but it should use grouped_bqop->blockSize() to respect the original operation's block_size parameter. This means any non-default block size will be ignored during compilation.
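
The fix suggested later in this thread amounts to replacing the literal with the accessor; roughly (surrounding arguments reconstructed from the inline comment below):

grouped_bqop->k(),
grouped_bqop->g(),
grouped_bqop->globalScale(),
grouped_bqop->blockSize(), // instead of the hardcoded 16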

Confidence Score: 3/5

  • This PR has one critical logic bug that needs to be fixed before merging
  • The implementation is comprehensive and well-structured with proper validation, dispatch registration, and test coverage. However, there is a confirmed logic bug in the index lowering pass where block_size is hardcoded to 16 instead of using the operation's actual block_size parameter. This will cause incorrect behavior for any non-default block sizes. The rest of the implementation appears solid with thorough integration across all compiler passes.
  • Pay close attention to csrc/device_lower/pass/index.cpp - the hardcoded block_size needs to be fixed to use grouped_bqop->blockSize()

Important Files Changed

File Analysis

| Filename | Score | Overview |
|---|---|---|
| csrc/ir/composite_nodes.h | 5/5 | Adds GroupedBlockQuantizationOp class declaration with proper constructor, accessors (blockScales, quantizedOutput, in, blockSize, hasGlobalScale, globalScale, inputOffsets, outputOffsets, k, g, layout), and evaluation methods |
| csrc/ir/composite_nodes.cpp | 5/5 | Implements GroupedBlockQuantizationOp constructor, toString/toInlineString methods, and evaluate placeholder (throws, as the fallback kernel is not yet implemented) |
| csrc/codegen.cpp | 4/5 | Adds codegen handler that validates group size, builds template/function arguments including layout parameters (32, 4, 4 for Block128x4), and generates the call to the bq::grouped_block_quantize_to_nvfp4 runtime function |
| csrc/device_lower/pass/index.cpp | 3/5 | Implements index lowering, creates TensorIndex nodes, validates input divisibility by block size, computes logical indices; contains a bug where block_size is hardcoded to 16 instead of using grouped_bqop->blockSize() |
| csrc/device_lower/validation.cpp | 5/5 | Adds comprehensive validation including memory type checks, parallelization requirements (TIDx, BIDx, Group ID), extent validation, and schedule ordering constraints |
| csrc/dispatch.h | 5/5 | Registers GroupedBlockQuantizationOp in the DISPATCH_FOR_ALL_EXPRS macro for IR dispatch system integration |
| csrc/ops/arith.cpp | 5/5 | Implements the groupedBlockQuantize API function that creates quantized tensor and block scales outputs with proper allocation domains, then instantiates GroupedBlockQuantizationOp |
| csrc/ops/arith.h | 5/5 | Declares the groupedBlockQuantize API function with parameters for input, offsets, layout, global scaling factor, block size, and output dtype |
| tests/cpp/test_layout_op.cpp | 5/5 | Adds the GroupedBlockQuantizeOp test case that validates quantized output and block scales against a BlockQuantizationOp reference and verifies the grouped layout transformation with proper padding/swizzling |
| csrc/fusion_segmenter.cpp | 5/5 | Updates fusion segmentation to erase the allocation domain of GroupedBlockQuantizationOp's blockScales output (as for PreprocessGroupedMatmulInputSf), since Transform Replay cannot handle allocation domain transformations with padding |
| csrc/scheduler/pointwise.cpp | 5/5 | Updates the scheduler to detect GroupedBlockQuantizationOp (alongside BlockQuantizationOp) for special handling of block quantization operations in pointwise scheduling |

Sequence Diagram

sequenceDiagram
    participant User as User Code
    participant API as groupedBlockQuantize API
    participant IR as GroupedBlockQuantizationOp
    participant Scheduler as Scheduler/Validation
    participant IndexLower as Index Lowering
    participant Codegen as Code Generator
    participant Runtime as Runtime Function

    User->>API: Call groupedBlockQuantize(input, offsets, layout, ...)
    API->>API: Create quantized_tensor output
    API->>API: Create block_scales output with allocation domain
    API->>IR: Create GroupedBlockQuantizationOp
    IR->>IR: Store inputs, offsets, layout, k, g, block_size
    
    Note over Scheduler: Compilation Phase
    Scheduler->>IR: Validate operation
    Scheduler->>Scheduler: Check memory types (Local)
    Scheduler->>Scheduler: Validate parallelization (TIDx, BIDx, Group)
    Scheduler->>Scheduler: Check group size (2/4/8 for half, 2/4 for float)
    
    IndexLower->>IR: Lower indices
    IndexLower->>IndexLower: Compute logical indices for row/col
    IndexLower->>IndexLower: Validate input divisible by block_size
    IndexLower->>IR: Create lowered GroupedBlockQuantizationOp
    Note over IndexLower: BUG: Hardcodes block_size=16
    
    Codegen->>IR: Generate code
    Codegen->>Codegen: Extract group_size from loop domain
    Codegen->>Codegen: Build template args (has_global_scale, 32, 4, 4, group_size)
    Codegen->>Codegen: Build function args (input, output, scales, indices, offsets, k, g)
    Codegen->>Runtime: Call bq::grouped_block_quantize_to_nvfp4<...>(...)
    
    Runtime->>Runtime: Perform quantization with layout transformation
    Runtime-->>User: Return quantized_tensor and block_scales

grouped_bqop->k(),
grouped_bqop->g(),
grouped_bqop->globalScale(),
16,
greptile-apps bot (Contributor):

the block_size parameter is hardcoded to 16, but it should use grouped_bqop->blockSize() to respect the original operation's block_size parameter

Suggested change
16,
grouped_bqop->blockSize(),

The GroupedBlockQuantizationOp constructor accepts a block_size parameter (line 1063 in composite_nodes.h), and the operation stores this value as an attribute accessible via the blockSize() method (lines 1081-1083). However, during index lowering, this value is replaced with a hardcoded 16, which means any non-default block size specified by the user will be ignored.

greptile-apps bot (Contributor) left a comment

Greptile Summary

This PR adds comprehensive codegen support for GroupedBlockQuantizationOp, a new IR node that merges BlockQuantizationOp and PreprocessGroupedMatmulInputSf into a single operation for improved performance in grouped matrix multiplication scenarios.

Key Changes:

  1. IR Node Implementation (csrc/ir/composite_nodes.h/cpp): New GroupedBlockQuantizationOp class with inputs (input tensor, input/output offsets, k, g, optional global_scale) and outputs (quantized tensor, block scales). The operation stores block size and layout as attributes, with row/col indices added during lowering.

  2. Code Generation (csrc/codegen.cpp): Handler generates runtime function call to bq::grouped_block_quantize_to_nvfp4 with template parameters for layout configuration (block_row_outer=32, block_row_inner=4, block_col=4 for Block128x4) and proper function arguments including offset tensors and dimension scalars.

  3. Index Lowering (csrc/device_lower/pass/index.cpp): Computes logical indices for the 2D matrix and validates that the inner dimension is divisible by the block size before creating the lowered operation.

  4. API Function (csrc/ops/arith.cpp/.h): groupedBlockQuantize() function with validation for supported output types (Float4_e2m1fn with block_size=16, Float8_e4m3fn with block_size=32) and proper tensor domain construction with layout-specific allocation domains.

  5. Compiler Integration: Updates to dispatch macros, logical domain mapping, broadcast domain tracking, and scheduler to properly handle the new operation throughout the compilation pipeline.

  6. Testing (tests/cpp/test_layout_op.cpp): Comprehensive test validating both quantized output correctness and proper block scaling factor layout with grouped operations.
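
For item 4, a minimal sketch of the dtype/block-size check described there; the NVF_ERROR style is borrowed from the validation code quoted earlier, while the variable names and exact condition are assumptions:

NVF_ERROR(
    (out_dtype == DataType::Float4_e2m1fn && block_size == 16) ||
        (out_dtype == DataType::Float8_e4m3fn && block_size == 32),
    "groupedBlockQuantize: unsupported output dtype / block size pair");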

Critical Issue Found: Index lowering hardcodes block_size to 16 (line 488) instead of using grouped_bqop->blockSize(), which will break Float8_e4m3fn quantization that requires block_size=32.

Confidence Score: 4/5

  • This PR is generally safe to merge after fixing the hardcoded block_size bug, as it follows established patterns and has good test coverage
  • Score reflects one critical logic bug (hardcoded block_size=16 in index lowering) that breaks Float8_e4m3fn support. Otherwise, the implementation is thorough and well-integrated across the codebase with consistent patterns matching PreprocessGroupedMatmulInputSf and BlockQuantizationOp. The changes are localized and the test validates the main use case.
  • Pay close attention to csrc/device_lower/pass/index.cpp line 488 - the hardcoded block_size must be changed to grouped_bqop->blockSize() before merging

Important Files Changed

File Analysis

| Filename | Score | Overview |
|---|---|---|
| csrc/ir/composite_nodes.h | 5/5 | Adds GroupedBlockQuantizationOp class declaration with proper accessors for inputs, outputs, and attributes including layout, block size, and offset tensors |
| csrc/ir/composite_nodes.cpp | 5/5 | Implements GroupedBlockQuantizationOp constructor, toString methods, and evaluation placeholder; correctly manages inputs/outputs and attributes |
| csrc/codegen.cpp | 5/5 | Adds codegen handler that generates the runtime function call with template parameters for layout and correct function arguments |
| csrc/device_lower/pass/index.cpp | 3/5 | Implements index lowering with logical index computation; contains the hardcoded block_size bug at line 488 |
| csrc/ops/arith.cpp | 5/5 | Implements the groupedBlockQuantize API function with proper validation for block sizes (16 for nvfp4, 32 for mxfp8) and tensor domain construction with layout-specific allocation domains |
| tests/cpp/test_layout_op.cpp | 5/5 | Adds a comprehensive test validating quantized output and block scaling factor layout correctness with grouped operations |

Sequence Diagram

sequenceDiagram
    participant User as User/Python API
    participant API as ops/arith.cpp
    participant IR as IR Builder
    participant Fusion as Fusion Graph
    participant Scheduler as Scheduler
    participant IndexLower as Index Lowering
    participant Codegen as Code Generator
    participant Runtime as Runtime Kernel
    
    User->>API: groupedBlockQuantize(input, offsets, layout)
    API->>API: Validate inputs & block size
    API->>IR: Create tensor domains for output & scales
    IR->>IR: Build allocation domain with layout
    API->>Fusion: IrBuilder::create<GroupedBlockQuantizationOp>
    Fusion->>Fusion: Add GroupedBlockQuantizationOp to graph
    
    Note over Scheduler: Compilation Phase
    Scheduler->>Scheduler: Detect GroupedBlockQuantizationOp
    Scheduler->>Scheduler: Cap vectorization factor (≤4)
    Scheduler->>Scheduler: Apply pointwise scheduling
    
    Note over IndexLower: Lowering Phase
    IndexLower->>IndexLower: handle(GroupedBlockQuantizationOp)
    IndexLower->>IndexLower: Compute logical indices [row, col]
    IndexLower->>IndexLower: Validate block size divisibility
    IndexLower->>Fusion: Create lowered op with indices
    
    Note over Codegen: Code Generation
    Codegen->>Codegen: handle(GroupedBlockQuantizationOp)
    Codegen->>Codegen: Extract group_size from loop domain
    Codegen->>Codegen: Build template args (layout params)
    Codegen->>Codegen: Build function args (tensors, offsets, k, g)
    Codegen->>Runtime: Generate call to bq::grouped_block_quantize_to_nvfp4
    
    Note over Runtime: Execution
    Runtime->>Runtime: Perform block quantization per group
    Runtime->>Runtime: Apply layout transformation (Block128x4)
    Runtime->>User: Return quantized tensor & block scales

@jjsjann123 (Collaborator, Author):

!test

@jjsjann123 (Collaborator, Author):

!test

@jjsjann123 jjsjann123 requested a review from protonu January 9, 2026 23:26
greptile-apps bot (Contributor) left a comment

Greptile Summary

Overview

This PR adds codegen support for GroupedBlockQuantizationOp, a new IR node that merges BlockQuantizationOp and PreprocessGroupedMatmulInputSf into a single operation. This optimization enables single-kernel quantization and layout handling for grouped tensors in nvfp4 quantization workflows.

Key Changes

Core IR Implementation

  • New IR Node: GroupedBlockQuantizationOp class added to composite_nodes.{h,cpp} with proper constructor, accessors, and attribute management
  • Input/Output Structure: Takes input tensor, input_offsets, output_offsets, k, g scalars, optional global_scale, plus row_idx/col_idx for indexing
  • Attributes: Stores block_size (index 0) and layout (index 1) as data attributes, plus optional row_idx/col_idx as regular attributes

Codegen and Lowering

  • Code Generation: Comprehensive handler in codegen.cpp that generates calls to bq::grouped_block_quantize_to_nvfp4 with proper template arguments and validation
  • Index Lowering: device_lower/pass/index.cpp implements index lowering with runtime validation that inner dimension is divisible by block_size
  • Validation: Extensive validation in device_lower/validation.cpp checks memory types, parallelization requirements (TIDx, BIDx, Group), and prevents z-axis parallelization

Integration Points

  • Dispatch: Registered in DISPATCH_FOR_ALL_EXPRS macro in dispatch.h
  • API: groupedBlockQuantize() function in ops/arith.{h,cpp} with input validation for data types, tensor dimensions, and block size requirements
  • Scheduler: Updates to scheduler/utils.cpp to handle offset tensors and block scales in caching logic
  • Other Files: Propagated through logical_domain_map, tensor_metadata, fusion_segmenter, and various analysis passes

Testing

  • Test Coverage: Single test case GroupedBlockQuantizeOp in test_layout_op.cpp validates correctness against reference implementation using BlockQuantizationOp
  • Test Scenario: 3 groups with [100, 150, 262] tokens, verifies both quantized output and block scaling factor layout
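
The two offset tensors in that scenario are related by per-group padding. A hypothetical standalone helper (not from the PR) reproduces the test's values by rounding each group's row count up to a multiple of 128 before accumulating:

#include <cstdint>
#include <vector>

// roundedOffsets({0, 100, 250}, 512) -> {0, 128, 384, 768}; the test's
// rounded_offsets tensor corresponds to the first three entries.
std::vector<int32_t> roundedOffsets(
    const std::vector<int32_t>& starts, int32_t total_rows) {
  std::vector<int32_t> out = {0};
  for (size_t i = 0; i < starts.size(); ++i) {
    const int32_t end = (i + 1 < starts.size()) ? starts[i + 1] : total_rows;
    const int32_t padded = (end - starts[i] + 127) / 128 * 128; // pad to 128
    out.push_back(out.back() + padded);
  }
  return out;
}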

Architecture Consistency

The implementation follows established patterns from BlockQuantizationOp and PreprocessGroupedMatmulInputSf:

  • Similar attribute indexing patterns (adjusted for different parameter counts)
  • Consistent validation approach across lowering passes
  • Proper integration with scheduler and domain mapping infrastructure

Code Quality

Strengths:

  • Comprehensive validation at multiple stages (API, index lowering, device validation)
  • Proper error messages with context
  • Consistent with existing codebase patterns
  • Well-structured separation of concerns

⚠️ Minor Observations:

  • Currently only supports Float4_e2m1fn output (enforced in codegen.cpp:2003-2005)
  • Evaluation method not implemented (placeholder throws, which is acceptable for ops with runtime kernel implementations)
  • Single test case covers basic functionality but could benefit from additional edge case testing

Recommendation

The implementation is solid and ready to merge. The code is well-structured, properly validated, and consistently integrated across the codebase. The limitation to nvfp4 output is documented and intentional for this PR.

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk - comprehensive validation, consistent patterns, and proper testing
  • Score of 5 reflects thorough implementation with validation at every stage, consistent integration patterns matching existing ops, comprehensive error handling, and working test coverage. No critical issues found.
  • No files require special attention - implementation is consistent and well-validated across all changed files

Important Files Changed

File Analysis

| Filename | Score | Overview |
|---|---|---|
| csrc/ir/composite_nodes.h | 5/5 | Adds GroupedBlockQuantizationOp class definition with proper accessor methods and attributes for merging quantization and layout operations |
| csrc/ir/composite_nodes.cpp | 5/5 | Implements GroupedBlockQuantizationOp constructor, toString methods, and placeholder evaluate; follows existing patterns from BlockQuantizationOp |
| csrc/codegen.cpp | 5/5 | Adds codegen handler that generates runtime function calls with proper template args and validation |
| csrc/device_lower/pass/index.cpp | 5/5 | Implements index lowering with validation that the inner dimension is divisible by the block size |
| csrc/device_lower/validation.cpp | 5/5 | Adds comprehensive validation including memory type checks, parallelization requirements, and group dimension verification |
| csrc/ops/arith.cpp | 5/5 | Implements groupedBlockQuantize with comprehensive input validation, output tensor creation, and proper attribute setup |
| csrc/dispatch.h | 5/5 | Registers GroupedBlockQuantizationOp in the dispatcher macro for proper IR node handling across the codebase |
| csrc/scheduler/utils.cpp | 5/5 | Updates scheduler utilities to handle GroupedBlockQuantizationOp offset tensors and block scales correctly in caching logic |
| tests/cpp/test_layout_op.cpp | 5/5 | Adds a test case verifying correct quantization and grouped layout transformation against a reference |

Sequence Diagram

sequenceDiagram
    participant User as Python/C++ API
    participant API as groupedBlockQuantize()
    participant IR as GroupedBlockQuantizationOp
    participant Scheduler as Scheduler
    participant IndexLower as Index Lowering
    participant Validation as Device Validation
    participant Codegen as Code Generator
    participant Runtime as CUDA Runtime

    User->>API: Call groupedBlockQuantize(input, offsets, layout)
    API->>API: Validate input dtype (Float/BF16/Half)
    API->>API: Validate 2D tensor
    API->>API: Check block_size (16 for nvfp4, 32 for mxfp8)
    API->>IR: Create GroupedBlockQuantizationOp node
    IR->>IR: Store inputs: input, offsets, k, g
    IR->>IR: Store attributes: block_size, layout
    IR->>Scheduler: Schedule fusion
    Scheduler->>Scheduler: Apply pointwise scheduler
    Scheduler->>Scheduler: Handle offset tensors in caching
    Scheduler->>IndexLower: Lower to kernel IR
    IndexLower->>IndexLower: Compute logical indices
    IndexLower->>IndexLower: Validate inner_dim % block_size == 0
    IndexLower->>IR: Create lowered GroupedBlockQuantizationOp
    IR->>Validation: Validate lowered op
    Validation->>Validation: Check MemoryType::Local
    Validation->>Validation: Verify TIDx and BIDx present
    Validation->>Validation: Check Group dimension exists
    Validation->>Validation: Ensure no z-axis parallelization
    Validation->>Codegen: Pass validated op
    Codegen->>Codegen: Extract group_size from loop domain
    Codegen->>Codegen: Validate group_size (2/4/8 for half, 2/4 for float)
    Codegen->>Codegen: Build template args (layout params, group_size)
    Codegen->>Codegen: Build function args (tensors, indices, offsets)
    Codegen->>Runtime: Generate call to bq::grouped_block_quantize_to_nvfp4
    Runtime->>Runtime: Execute quantization kernel
    Runtime-->>User: Return quantized_tensor + block_scales
