Conversation
Greptile SummaryThis PR performs a large-scale refactoring of SIMD math kernel files, splitting monolithic per-type files (e.g. Key issues found:
Confidence Score: 2/5
Last reviewed commit: a353f42 |
|
@greptile |
src/ailego/math_batch/inner_product_distance_batch_impl_fp16_avx512.cc
Outdated
Show resolved
Hide resolved
|
@greptile |
|
@greptile |
| #define FMA_INT4_ITER_AVX(ymm_lhs, ymm_rhs, ymm_sum_0, ymm_sum1, \ | ||
| ymm_sum_norm1, ymm_sum_norm2) \ | ||
| { \ | ||
| __m256i ymm_lhs_0 = _mm256_shuffle_epi8( \ | ||
| INT4_LOOKUP_AVX, _mm256_and_si256((ymm_lhs), MASK_INT4_AVX)); \ | ||
| __m256i ymm_rhs_0 = _mm256_shuffle_epi8( \ | ||
| INT4_LOOKUP_AVX, _mm256_and_si256((ymm_rhs), MASK_INT4_AVX)); \ | ||
| __m256i ymm_lhs_1 = _mm256_shuffle_epi8( \ | ||
| INT4_LOOKUP_AVX, \ | ||
| _mm256_and_si256(_mm256_srli_epi32((ymm_lhs), 4), MASK_INT4_AVX)); \ | ||
| __m256i ymm_rhs_1 = _mm256_shuffle_epi8( \ | ||
| INT4_LOOKUP_AVX, \ | ||
| _mm256_and_si256(_mm256_srli_epi32((ymm_rhs), 4), MASK_INT4_AVX)); \ | ||
| FMA_INT8_AVX(ymm_lhs_0, ymm_rhs_0, ymm_sum_0); \ | ||
| FMA_INT8_AVX(ymm_lhs_1, ymm_rhs_1, ymm_sum_1); \ | ||
| FMA_INT8_AVX(ymm_lhs_0, ymm_lhs_0, ymm_sum_norm1); \ | ||
| FMA_INT8_AVX(ymm_lhs_1, ymm_lhs_1, ymm_sum_norm1); \ | ||
| FMA_INT8_AVX(ymm_rhs_0, ymm_rhs_0, ymm_sum_norm2); \ | ||
| FMA_INT8_AVX(ymm_rhs_1, ymm_rhs_1, ymm_sum_norm2); \ | ||
| } | ||
|
|
There was a problem hiding this comment.
FMA_INT4_ITER_AVX parameter name mismatch causes compile error
The macro's 4th parameter is named ymm_sum1 (no underscore before 1), but the body at line 154 references ymm_sum_1 (with an underscore). These are different preprocessor tokens, so the parameter is never substituted.
At the call site in mips_euclidean_distance_matrix_int4_avx2.cc lines 39 and 62, the code passes ymm_sum1 as the 4th argument — but the local variable in that function is declared as ymm_sum_1 (with underscore), so ymm_sum1 is undefined at the call site. This will fail to compile with "use of undeclared identifier 'ymm_sum1'".
The fix is to align the parameter name with what the body uses:
| #define FMA_INT4_ITER_AVX(ymm_lhs, ymm_rhs, ymm_sum_0, ymm_sum1, \ | |
| ymm_sum_norm1, ymm_sum_norm2) \ | |
| { \ | |
| __m256i ymm_lhs_0 = _mm256_shuffle_epi8( \ | |
| INT4_LOOKUP_AVX, _mm256_and_si256((ymm_lhs), MASK_INT4_AVX)); \ | |
| __m256i ymm_rhs_0 = _mm256_shuffle_epi8( \ | |
| INT4_LOOKUP_AVX, _mm256_and_si256((ymm_rhs), MASK_INT4_AVX)); \ | |
| __m256i ymm_lhs_1 = _mm256_shuffle_epi8( \ | |
| INT4_LOOKUP_AVX, \ | |
| _mm256_and_si256(_mm256_srli_epi32((ymm_lhs), 4), MASK_INT4_AVX)); \ | |
| __m256i ymm_rhs_1 = _mm256_shuffle_epi8( \ | |
| INT4_LOOKUP_AVX, \ | |
| _mm256_and_si256(_mm256_srli_epi32((ymm_rhs), 4), MASK_INT4_AVX)); \ | |
| FMA_INT8_AVX(ymm_lhs_0, ymm_rhs_0, ymm_sum_0); \ | |
| FMA_INT8_AVX(ymm_lhs_1, ymm_rhs_1, ymm_sum_1); \ | |
| FMA_INT8_AVX(ymm_lhs_0, ymm_lhs_0, ymm_sum_norm1); \ | |
| FMA_INT8_AVX(ymm_lhs_1, ymm_lhs_1, ymm_sum_norm1); \ | |
| FMA_INT8_AVX(ymm_rhs_0, ymm_rhs_0, ymm_sum_norm2); \ | |
| FMA_INT8_AVX(ymm_rhs_1, ymm_rhs_1, ymm_sum_norm2); \ | |
| } | |
| #define FMA_INT4_ITER_AVX(ymm_lhs, ymm_rhs, ymm_sum_0, ymm_sum_1, \ | |
| ymm_sum_norm1, ymm_sum_norm2) \ |
|
@greptile |
| xmm_sum_0 = _mm_fmadd_ps(xmm_d_0, xmm_d_0, xmm_sum_0); | ||
| xmm_sum_1 = _mm_fmadd_ps(xmm_d_1, xmm_d_1, xmm_sum_1); |
There was a problem hiding this comment.
FMA intrinsic used in SSE-only translation unit
_mm_fmadd_ps is a Fused Multiply-Add (FMA) intrinsic that requires the FMA ISA extension, which is introduced by Haswell (-march=core-avx2). The per-file compile flag MATH_MARCH_FLAG_SSE is set to "-march=corei7" (Intel Nehalem/Core i7, SSE4.2 only), which does not define __FMA__. Compiling this file without FMA support will fail at the preprocessor/compiler stage because _mm_fmadd_ps is gated behind #ifdef __FMA__ in standard intrinsic headers.
The same problem exists throughout this file (lines 35, 36, 41, 49, 50, 55) and also in the other _sse.cc files: inner_product_matrix_fp32_sse.cc, mips_euclidean_distance_matrix_fp32_sse.cc, etc.
The fix is either to:
- Change
MATH_MARCH_FLAG_SSEfrom"-march=corei7"to a march that includes FMA (e.g."-march=haswell"or"-march=core-avx2"), OR - Replace
_mm_fmadd_ps(a, b, c)with an explicit_mm_add_ps(_mm_mul_ps(a, b), c)in the SSE code paths so no FMA instruction is required.
| xmm_sum_0 = _mm_fmadd_ps(xmm_d_0, xmm_d_0, xmm_sum_0); | |
| xmm_sum_1 = _mm_fmadd_ps(xmm_d_1, xmm_d_1, xmm_sum_1); | |
| xmm_sum_0 = _mm_add_ps(_mm_mul_ps(xmm_d_0, xmm_d_0), xmm_sum_0); | |
| xmm_sum_1 = _mm_add_ps(_mm_mul_ps(xmm_d_1, xmm_d_1), xmm_sum_1); |
|
|
||
| function(setup_compiler_march_for_x86 VAR_NAME_SSE VAR_NAME_AVX2 VAR_NAME_AVX512) | ||
| #sse | ||
| set(${VAR_NAME_SSE} "-march=corei7" PARENT_SCOPE) | ||
|
|
||
| #avx 2 | ||
| set(${VAR_NAME_AVX2} "-march=core-avx2" PARENT_SCOPE) | ||
|
|
There was a problem hiding this comment.
AVX512 fallback silently upgrades SSE-only builds to AVX2
When no AVX512 architecture flag is accepted by the compiler (i.e. the probing loop for graniterapids, emeraldrapids, sapphirerapids, skylake-avx512 all fail), VAR_NAME_AVX512 is set to "-march=core-avx2". This means that on a machine whose compiler only supports up to AVX2, the "AVX512 group" files (including all _dispatch.cc and _avx512.cc files) will be compiled with core-avx2 flags.
While this avoids using unavailable AVX512 instructions, it silently changes the expected ABI for the dispatch files — the dispatch files contain runtime CPU detection and call the _avx512.cc functions only when AVX512 is confirmed at runtime. If the _avx512.cc files were compiled with core-avx2 instead, then calling them on a non-AVX512 CPU at runtime would work, but the AVX512 intrinsics inside those files still require AVX512 flags to compile in the first place. The warning message should make clear that the _avx512.cc files may not be fully compiled:
| function(setup_compiler_march_for_x86 VAR_NAME_SSE VAR_NAME_AVX2 VAR_NAME_AVX512) | |
| #sse | |
| set(${VAR_NAME_SSE} "-march=corei7" PARENT_SCOPE) | |
| #avx 2 | |
| set(${VAR_NAME_AVX2} "-march=core-avx2" PARENT_SCOPE) | |
| set(${VAR_NAME_AVX512} "" PARENT_SCOPE) | |
| message(WARNING "No known avx512 microarchitecture flag found. AVX512 implementations will not be built.") |
| #define FMA_INT4_ITER_AVX(ymm_lhs, ymm_rhs, ymm_sum_0, ymm_sum1, \ | ||
| ymm_sum_norm1, ymm_sum_norm2) \ | ||
| { \ | ||
| __m256i ymm_lhs_0 = _mm256_shuffle_epi8( \ | ||
| INT4_LOOKUP_AVX, _mm256_and_si256((ymm_lhs), MASK_INT4_AVX)); \ | ||
| __m256i ymm_rhs_0 = _mm256_shuffle_epi8( \ | ||
| INT4_LOOKUP_AVX, _mm256_and_si256((ymm_rhs), MASK_INT4_AVX)); \ | ||
| __m256i ymm_lhs_1 = _mm256_shuffle_epi8( \ | ||
| INT4_LOOKUP_AVX, \ | ||
| _mm256_and_si256(_mm256_srli_epi32((ymm_lhs), 4), MASK_INT4_AVX)); \ | ||
| __m256i ymm_rhs_1 = _mm256_shuffle_epi8( \ | ||
| INT4_LOOKUP_AVX, \ | ||
| _mm256_and_si256(_mm256_srli_epi32((ymm_rhs), 4), MASK_INT4_AVX)); \ | ||
| FMA_INT8_AVX(ymm_lhs_0, ymm_rhs_0, ymm_sum_0); \ | ||
| FMA_INT8_AVX(ymm_lhs_1, ymm_rhs_1, ymm_sum_1); \ | ||
| FMA_INT8_AVX(ymm_lhs_0, ymm_lhs_0, ymm_sum_norm1); \ | ||
| FMA_INT8_AVX(ymm_lhs_1, ymm_lhs_1, ymm_sum_norm1); \ | ||
| FMA_INT8_AVX(ymm_rhs_0, ymm_rhs_0, ymm_sum_norm2); \ | ||
| FMA_INT8_AVX(ymm_rhs_1, ymm_rhs_1, ymm_sum_norm2); \ | ||
| } |
There was a problem hiding this comment.
FMA_INT4_ITER_AVX 4th parameter name mismatch — compile error
The macro's 4th parameter is declared as ymm_sum1 (no underscore before 1) on the definition line, but the body at line 154 references ymm_sum_1 (with underscore). These are different preprocessor tokens — the parameter will never be substituted, and the macro body will reference an undeclared identifier at every call site.
| #define FMA_INT4_ITER_AVX(ymm_lhs, ymm_rhs, ymm_sum_0, ymm_sum1, \ | |
| ymm_sum_norm1, ymm_sum_norm2) \ | |
| { \ | |
| __m256i ymm_lhs_0 = _mm256_shuffle_epi8( \ | |
| INT4_LOOKUP_AVX, _mm256_and_si256((ymm_lhs), MASK_INT4_AVX)); \ | |
| __m256i ymm_rhs_0 = _mm256_shuffle_epi8( \ | |
| INT4_LOOKUP_AVX, _mm256_and_si256((ymm_rhs), MASK_INT4_AVX)); \ | |
| __m256i ymm_lhs_1 = _mm256_shuffle_epi8( \ | |
| INT4_LOOKUP_AVX, \ | |
| _mm256_and_si256(_mm256_srli_epi32((ymm_lhs), 4), MASK_INT4_AVX)); \ | |
| __m256i ymm_rhs_1 = _mm256_shuffle_epi8( \ | |
| INT4_LOOKUP_AVX, \ | |
| _mm256_and_si256(_mm256_srli_epi32((ymm_rhs), 4), MASK_INT4_AVX)); \ | |
| FMA_INT8_AVX(ymm_lhs_0, ymm_rhs_0, ymm_sum_0); \ | |
| FMA_INT8_AVX(ymm_lhs_1, ymm_rhs_1, ymm_sum_1); \ | |
| FMA_INT8_AVX(ymm_lhs_0, ymm_lhs_0, ymm_sum_norm1); \ | |
| FMA_INT8_AVX(ymm_lhs_1, ymm_lhs_1, ymm_sum_norm1); \ | |
| FMA_INT8_AVX(ymm_rhs_0, ymm_rhs_0, ymm_sum_norm2); \ | |
| FMA_INT8_AVX(ymm_rhs_1, ymm_rhs_1, ymm_sum_norm2); \ | |
| } | |
| #define FMA_INT4_ITER_AVX(ymm_lhs, ymm_rhs, ymm_sum_0, ymm_sum_1, \ | |
| ymm_sum_norm1, ymm_sum_norm2) \ |
| const ValueType *q, | ||
| size_t dim, | ||
| float *out) { | ||
| #if defined(__ARM_NEON) |
There was a problem hiding this comment.
Test performance difference with zvec-turbo style and adjust implementation based on the results. This can be handled either in this PR or in another this PR after this PR is merged.
void SquaredEuclideanDistanceMatrix<float, 1, 1>::Compute(const ValueType *m,
const ValueType *q,
size_t dim,
float *out) {
static float (*impl_func_)(const float*, const float*, size_t);
std::call_once(...{
// set impl_func_
}
);
*out = impl_func_(m, q, dim);
return;
}
march based reorganization
Greptile Summary
This PR is a large-scale "march-based reorganization" that splits monolithic per-type SIMD implementation files (e.g.
euclidean_distance_matrix_fp32.cc) into separate per-ISA translation units (_sse.cc,_avx2.cc,_avx512.cc,_neon.cc,_dispatch.cc), introduces shared macro utility headers (.ifiles), and updates the build system to assign the correct per-file-march=flags to each group. Several previously-flagged bugs (NEON march flag, AVX512 tail guard, off-by-one in FP16 batch, norm2 silent no-op, dead AVX512FP16 branch) appear to have been fixed in this iteration.Key issues found:
FMA intrinsics in SSE translation units (compile error): All
*_sse.ccfiles directly call_mm_fmadd_ps, a Fused Multiply-Add intrinsic that requires__FMA__(Haswell/core-avx2or later). These files are compiled withMATH_MARCH_FLAG_SSE = "-march=corei7"(Nehalem, SSE4.2 only, no FMA), so they will fail to compile. Either changeMATH_MARCH_FLAG_SSEto a march that includes FMA (e.g."-march=haswell"), or replace_mm_fmadd_ps(a, b, c)with_mm_add_ps(_mm_mul_ps(a, b), c)in all SSE paths. This affectseuclidean_distance_matrix_fp32_sse.cc,inner_product_matrix_fp32_sse.cc,mips_euclidean_distance_matrix_fp32_sse.cc, and similar files.FMA_INT4_ITER_AVXparameter name mismatch indistance_matrix_mips_utility.i: The macro's 4th parameter is declaredymm_sum1but the body referencesymm_sum_1, causing a compile error at all AVX2 MIPS int4 call sites.setup_compiler_march_for_x86silently falls back tocore-avx2for the AVX512 group when no AVX512-capable toolchain is found. This should either set the variable to empty or explicitly communicate that AVX512 files will be skipped/omitted.Confidence Score: 1/5
-march=corei7flag, causing a build failure on any x86 build.FMA_INT4_ITER_AVXparameter name mismatch is also a compile error for MIPS int4 AVX2 code. These blocking issues need to be resolved before the PR can land.*_sse.ccfiles (euclidean_distance_matrix_fp32_sse.cc,inner_product_matrix_fp32_sse.cc,mips_euclidean_distance_matrix_fp32_sse.cc, etc.) andsrc/ailego/math/distance_matrix_mips_utility.i.Important Files Changed
_mm_fmadd_ps(FMA intrinsic) throughout, but is compiled with-march=corei7which does not enable FMA — will fail to compile._detect_armv8_best/_detect_x86_bestwith simpler_setup_armv8_march/_setup_x86_marchand newsetup_compiler_march_for_x86that returns per-file flags; AVX512 fallback incorrectly falls back tocore-avx2with confusing warning.MATH_MARCH_FLAG_NEON(previous issue fixed).FMA_INT8_GENERALis defined twice with different signatures (3-param at line 77 and 5-param at line 106), causing a macro redefinition;NEGZEROS_FP32_AVXis commented out but its AVX512 equivalent is also missing.FMA_INT4_ITER_AVXhas a parameter name mismatch (ymm_sum1vsymm_sum_1) that prevents correct substitution and causes compile errors at all call sites.<=(off-by-one previously flagged was fixed); logic and structure look correct.<(previously-flagged always-true issue resolved). AVX512F path has appropriate 16- and 8-element sub-loops after the main 32-wide loop.#ifguards instead of#if/#elifchain, so all tiers are compiled in and selected at runtime correctly.Flowchart
%%{init: {'theme': 'neutral'}}%% flowchart TD SRC["Source Files (*.cc)"] SRC --> SSE["*_sse.cc\n(MATH_MARCH_FLAG_SSE\n= -march=corei7)"] SRC --> AVX2["*_avx2.cc / *_avx.cc\n(MATH_MARCH_FLAG_AVX2\n= -march=core-avx2)"] SRC --> AVX512["*_avx512.cc + *_dispatch.cc\n(MATH_MARCH_FLAG_AVX512\n= best or fallback to core-avx2)"] SRC --> NEON["*_neon.cc + *_dispatch.cc\n(MATH_MARCH_FLAG_NEON\n= -march=armv8-a)"] SSE -->|"Links into"| LIB["zvec_ailego static lib"] AVX2 -->|"Links into"| LIB AVX512 -->|"Links into"| LIB NEON -->|"Links into"| LIB LIB --> DISP["*_dispatch.cc\n(Runtime CPU feature check)"] DISP -->|"AVX512F detected"| CALL_AVX512["Calls *AVX512* functions"] DISP -->|"AVX2 detected"| CALL_AVX2["Calls *AVX2* functions"] DISP -->|"Fallback"| CALL_SSE["Calls *SSE* functions"] SSE -.->|"❌ _mm_fmadd_ps requires FMA\nnot in -march=corei7"| BUG["COMPILE ERROR"]Comments Outside Diff (3)
src/ailego/math_batch/inner_product_distance_batch.h, line 84-110 (link)Missing
GetQueryPreprocessFuncinfloatandFloat16specializations causes a compile errorThe outer
InnerProductDistanceBatch::GetQueryPreprocessFunc()(line 84) unconditionally callsInnerProductDistanceBatchImpl<ValueType, 1>::GetQueryPreprocessFunc()for every value type. However:InnerProductDistanceBatchImpl<float, 1>(line 100) declares noGetQueryPreprocessFunc.InnerProductDistanceBatchImpl<ailego::Float16, 1>(line 91) declares noGetQueryPreprocessFunc.Full template specializations do not inherit members from the primary template. As a result, any code path that instantiates
InnerProductDistanceBatch<float, ...>::GetQueryPreprocessFunc()orInnerProductDistanceBatch<ailego::Float16, ...>::GetQueryPreprocessFunc()will produce a compile error:error: no member named 'GetQueryPreprocessFunc' in 'InnerProductDistanceBatchImpl<float, 1>'.Both specializations need the method added. Since floating-point types require no preprocessing, the implementation should simply return
nullptr:The same fix is needed for
InnerProductDistanceBatchImpl<ailego::Float16, 1>.src/ailego/math_batch/inner_product_distance_batch_impl_int8_avx2.cc, line 600-603 (link)Heap allocation inside hot SIMD loop
std::vector<__m256i> data_regs(dp_batch)is declared inside the innermost computation loop, triggering a dynamic heap allocation on every iteration. Sincedp_batchis a compile-time template parameter, this should be astd::array(and moved outside the loop body). Theaccsvector just above also heap-allocates per call.Contrast with
inner_product_distance_batch_impl_fp32_avx2.ccand the AVX512 variants, which correctly usestd::array<..., dp_batch>for all accumulators. Usingstd::vectorhere defeats the purpose of the SIMD optimisation for this hot path.Also move
accsfrom astd::vectortostd::array<__m256i, dp_batch> accs(outside the loop, as it already is, just change the type).src/ailego/math_batch/inner_product_distance_batch_impl_int8_avx512.cc, line 754 (link)Non-portable POSIX type
u_int8_tu_int8_tis a POSIX extension; it is not part of standard C++. The portable equivalent isuint8_tfrom<cstdint>(or<stdint.h>), which is available on all C++11-and-later targets including MSVC.Last reviewed commit: 39404d1