Add fast validation for fp4#24
Open
b-shi wants to merge 34 commits into
* Add sample subtile impl * Move allocOffsetRegisters before setupNewTile * Start adding GR offset calculation * Rest of logic (no swizzling) * refacto * spgr offsets * Add newserial code * Add script to debug offsets * Add unit test for GR offset calculation * Grid display * Fix both code and ref test function * Add DPP quad perm to rocisa * Apply swizzling (no rotation yet) * Function swizzling + rotation + test * Refactor test to have a single output array + add test for SGPRs * Add debug mode to test + add dynamic wavegroup calculation based on MT * Fix test runtime issue and check all vgpr offsets * Add ref test code for 1x4 & 4x1 * Fix tests * Fixed SGPR offset calculation for 2x2 * Fix more tests * Add more tests * Refactor tests * simplify tests * Remove unused script * cleanup * fix camelCase in ref test code * cleanup * Fix typo --------- Co-authored-by: brianshi <brianshi@amd.com>
* Add tests * as is * Add permlane16_swap instruction to rocisa * Ongoing progress * Draft for partition A0/A1 * Wave partitioning * Draft ref code in tests * Handle 1x4 wavesplit param * 2x2 test passing * Draft 1x4 LR wave partitioning * Fix alginement issue * Integration testing * Update integration test * Fix swizzling pattern on GRA. Only swizzling on even LDS rows * Subtile based test * testing A * Test both A and B * Remove graonly mode * Fix 1x4 case * Move global offset for B after rest of the logic * cleanup * cleanup * Fix ref test code for 4x1 * Fix spgr alloc issue * Remove tmp test file * Remove debug prints * Add test case
* Emit ds_reads * Add waits for LR and GR * Init Acc VGPR to Zero * Add missing bit_length on VLShiftLeftB32 * Insert SNop between VLShiftLeftB32 & VReadfirstlaneB32 for correctness * Fix gra test ref code for 1x4 * Remove some debug prints
…nfigs (#7) * 64x64 * Fix MFMA emit code * Remove label * cleanup * cleanup * Cleanup * Update tests * New GR offset calculation (no swizzling yet) * Refacto * cleanup * Re-enable swizzling * Fix SPGR alloc * Update M0 * Tensile passing no swizzling * Fix swizzling * LDS padding * as is * Multiple bugfixes * Fix 128x64 * Refactor pre-swizzling change * Add wave specific rotation to swizzling * Fix gra Test * Fix LRA test * Fix roundtrip test * LdsNumBytes as int * Use float type for bpe * Cleanup * Cleanup * More cleanup * cleanup * Simplify _grSwizzleColIds * Remove debug label * Fix typo LDS size calculation
* Add fp4 mfma support * Allow using Zeros, Ones and Identity for MX types * Display scales for MX types * Fix non subtileImpl path bug * Fix display issue on MX types (PrintTensor option)
…ernel (#6) * Add sample subtile impl * Fix issues when disabling subtile impl * GR Offset calculation (#1) * Add sample subtile impl * Move allocOffsetRegisters before setupNewTile * Start adding GR offset calculation * Rest of logic (no swizzling) * refacto * spgr offsets * Add newserial code * Add script to debug offsets * Add unit test for GR offset calculation * Grid display * Fix both code and ref test function * Add DPP quad perm to rocisa * Apply swizzling (no rotation yet) * Function swizzling + rotation + test * Refactor test to have a single output array + add test for SGPRs * Add debug mode to test + add dynamic wavegroup calculation based on MT * Fix test runtime issue and check all vgpr offsets * Add ref test code for 1x4 & 4x1 * Fix tests * Fixed SGPR offset calculation for 2x2 * Fix more tests * Add more tests * Refactor tests * simplify tests * Remove unused script * cleanup * fix camelCase in ref test code * cleanup * Fix typo --------- Co-authored-by: brianshi <brianshi@amd.com> * Enable post-loop code generation, and add some subroutines * LR offset calculation (#2) * Add tests * as is * Add permlane16_swap instruction to rocisa * Ongoing progress * Draft for partition A0/A1 * Wave partitioning * Draft ref code in tests * Handle 1x4 wavesplit param * 2x2 test passing * Draft 1x4 LR wave partitioning * Fix alginement issue * Integration testing * Update integration test * Fix swizzling pattern on GRA. 
Only swizzling on even LDS rows * Subtile based test * testing A * Test both A and B * Remove graonly mode * Fix 1x4 case * Move global offset for B after rest of the logic * cleanup * cleanup * Fix ref test code for 4x1 * Fix spgr alloc issue * Remove tmp test file * Remove debug prints * Add test case * Add GR load emit logic, and misc fixes (#3) * gr emit fix * Emit LR + init ACCVGPR (#4) * Emit ds_reads * Add waits for LR and GR * Init Acc VGPR to Zero * Add missing bit_length on VLShiftLeftB32 * Insert SNop between VLShiftLeftB32 & VReadfirstlaneB32 for correctness * Fix gra test ref code for 1x4 * Remove some debug prints * Add loop and ptr update code * Update scale offset * Add tests * Address review * Add scale roundtrip e2e test and constraint assertions Add GR->LDS->LR roundtrip GPU test verifying scale offset consistency across 4 tile configs x 2 matrices. Add power-of-2 assertion for scaleBlockSize and matching scaleBlockSize assertions for A/B in shared GR/LR offset computation. Pass kernel dict to compute_lds_sizes instead of re-deriving MIWaveGroup from tile dimensions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Update fixes * Fix scale being skipped * Add flag to print layout * Fix missed merge conflicts * Fix missed merge conflicts * Refactor scale rountrip test with gpu helper fns * Fix extra spaces * Fix tests --------- Co-authored-by: brianshi <brianshi@amd.com> Co-authored-by: sebvince <115461989+sebvince@users.noreply.github.com> Co-authored-by: b-shi <bbbrianme@gmail.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* Add optimized storeD code Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Address comments in PR, add some misc fixes Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Add pk_f16 cvt support Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Enable subtile impl only for gfx950 --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…MXInput (#11) Problem: When the tensilelite-client is configured with init-mxScaleA=One and init-mxScaleB=One, the MX scale tensors should contain the value 1.0 (E8M0 byte 127). Instead, they contained 2.0 (E8M0 byte 128), and the FP4 data tensors contained 0.5 instead of 1.0. Root cause: The mxDataGenerator library's DataGeneratorOptions has a forceDenorm flag that defaults to true. When forceDenorm is true, the generator's setOne<ocp_e2m1_mxfp4>() function uses a subnormal decomposition of 1.0: it sets the FP4 data to the subnormal value 0.5 (dataSubNormalOneMask) and the E8M0 scale to 2.0 (Constants::E8M0_2 = 128), so that the product 0.5 * 2.0 = 1.0 is still correct. When forceDenorm is false, it uses the normal decomposition: FP4 data = 1.0 (oneMask) and scale = 1.0 (Constants::E8M0_1 = 127). The generateMXInput() function in mxDataGen.cpp never set this option, inheriting the default forceDenorm=true. This caused init modes like "Ones" to produce unexpected data/scale values even though the float product was mathematically correct. Fix: Set opt.forceDenorm = false in generateMXInput() so that deterministic init modes (Ones, Identity, Sequential, etc.) produce the intuitive normal-form data and scale values. Impact: No existing callers are affected: - The hipblaslt client (testing_matmul.hpp) only allows hpl, trig_float, or uniform_01 init methods for MX data, all of which use the Bounded or TrigonometricFromFloat code paths that do not call setOne. - The MXDataGen unit tests all use "Bounded" init method. - Only the tensilelite client passes "Ones" (via initModeToMXMethod), which is the path this fixes. Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
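The two decompositions of 1.0 described in this commit message can be checked by decoding the bytes by hand. The sketch below is illustrative only: `decodeE8M0` and `decodeE2M1` are hypothetical helpers, not part of mxDataGenerator, and the FP4 bit patterns `0x1` (subnormal 0.5) and `0x2` (normal 1.0) are assumed from the E2M1 layout (1 sign bit, 2 exponent bits with bias 1, 1 mantissa bit) rather than taken from the library's mask constants.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical helper: decode an E8M0 scale byte (exponent only, bias 127).
// The NaN encoding 0xFF is not handled in this sketch.
inline float decodeE8M0(std::uint8_t byte)
{
    return std::ldexp(1.0f, static_cast<int>(byte) - 127);
}

// Hypothetical helper: decode a 4-bit E2M1 value held in the low nibble.
inline float decodeE2M1(std::uint8_t nibble)
{
    const float sign = (nibble & 0x8) ? -1.0f : 1.0f;
    const int   exp  = (nibble >> 1) & 0x3;
    const int   man  = nibble & 0x1;
    // exp == 0 is the subnormal range: the value is 0 or 0.5.
    // Otherwise the value is (1 + man/2) * 2^(exp - 1).
    const float mag = (exp == 0) ? 0.5f * static_cast<float>(man)
                                 : std::ldexp(1.0f + 0.5f * static_cast<float>(man), exp - 1);
    return sign * mag;
}
```

Under these assumptions, `decodeE2M1(0x1) * decodeE8M0(128)` (the forceDenorm=true form) and `decodeE2M1(0x2) * decodeE8M0(127)` (the forceDenorm=false form) both evaluate to 1.0, which is why the product was correct even though the stored data and scale values looked wrong.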
…kernel (#10) Enable the MX FP4 scale emit code in the subtile-based kernel --------- Co-authored-by: Koji Nakajima <Koji.Nakajima@amd.com> Co-authored-by: Archana Ramalingam <Archana.Ramalingam@amd.com> Co-authored-by: Brian Shi <brianshi@amd.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* Remove duplicate scale loads * Bugfix scheduler (#15) * Fix test * Single subtile size attempt * Unrolling * bug fix * cleanup * refactor allocator * Simplify allocator * Dont use allocator for scales * Use unrolling only when number of partition is odd * Remove duplicated buffer_load * Fix duplicate scale load after rebase * Fix merge conflicts * Fix beta=0, address comments from PR --------- Co-authored-by: sebvince <115461989+sebvince@users.noreply.github.com>
* Enable DU > 256, and reduce sgpr allocation * Address comments from PR
* custom Scale init * Add env-var fallback * Fix build issue * Fix an issue * Support swizzle case * Move scale init to mxdatagenerator * Add tests * Fixes * Make mxDataGenerator visible to all tensilelite targets * Update MXScaleBlockI/J comment
* Fix tensilelite test failures * minor clean-up * Add subtile test yaml
This reverts commit 09a46d9.
* Enable FixSrd2 for A/B Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Address comments from PR --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Claude Sonnet 4 <noreply@anthropic.com>
nakajee reviewed on Apr 8, 2026
| case InitMode::TrigIndAbsCos: | ||
| case InitMode::Count: | ||
| throw std::runtime_error("Invalid InitMode."); | ||
| case InitMode::Fast1: |
Collaborator
Maybe better to put this before line 522?
| return MXScale(getValueWithUpperLowerBoundFP<float>()); | ||
| } | ||
| // Fast1: random choice from {-1, 0, 1} — only MX FP4 (Float4x2) is supported. |
Collaborator
The comment is a bit confusing.
Can this work with MXBlockA/B=0?
Owner
Author
Yeah, this is only supported for MXFP4 for now. Will update the comment 😅
Fast1 Data Initialization Mode (DataInitTypeA/B: 27)

Fast1 is a new client-side init mode for MX FP4 GEMMs that enables fast, closed-form correctness validation without a full CPU GEMM reference.

How it works:
- … {-1, 0, 1} values; all active columns of B share a separate K-element pattern.
- … (m, n) pair is a single integer dot product alpha * dot(patA, patB), gated by the scale block pattern. Inactive positions produce 0.

Why it's faster:
Validation requires only one O(K) dot product instead of an O(M·N·K) CPU GEMM, making it practical for large problem sizes. The closed-form reference is exact for float output and tolerance-bounded (K * epsilon) for BFloat16.

Restrictions: MX FP4 A/B only; bias, activation, and E-output are not supported.
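
The closed-form check described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `patternDot`, `fast1Reference`, the boolean `rowActive`/`colActive` masks (standing in for the scale block pattern gating), and the row-major output layout are all assumptions made for the example. It also assumes every active row of A uses the same pattern `patA`, as the `dot(patA, patB)` expression suggests.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: integer dot product of two {-1, 0, 1} patterns.
// Only the common prefix is used if the lengths differ.
inline std::int64_t patternDot(const std::vector<int>& patA, const std::vector<int>& patB)
{
    std::int64_t acc = 0;
    for(std::size_t k = 0; k < patA.size() && k < patB.size(); ++k)
        acc += static_cast<std::int64_t>(patA[k]) * patB[k];
    return acc;
}

// Hypothetical closed-form reference for an M x N output (row-major).
// rowActive/colActive are assumed stand-ins for the scale block gating.
inline std::vector<float> fast1Reference(float                    alpha,
                                         const std::vector<int>&  patA,
                                         const std::vector<int>&  patB,
                                         const std::vector<bool>& rowActive,
                                         const std::vector<bool>& colActive)
{
    const std::size_t M = rowActive.size();
    const std::size_t N = colActive.size();

    // One O(K) pass yields the value shared by every active (m, n) pair.
    const float activeValue = alpha * static_cast<float>(patternDot(patA, patB));

    std::vector<float> d(M * N, 0.0f); // inactive positions stay 0
    for(std::size_t m = 0; m < M; ++m)
        for(std::size_t n = 0; n < N; ++n)
            if(rowActive[m] && colActive[n])
                d[m * N + n] = activeValue;
    return d;
}
```

In this sketch the expensive part of the reference is a single O(K) pass, and filling the expected output is O(M·N) with no inner loop over K. For BFloat16 output, a comparison against this reference would presumably allow a tolerance on the order of K * epsilon, as the description states, rather than requiring exact equality.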