refactor/march based reorganization by richyreachy · Pull Request #193 · alibaba/zvec

richyreachy · 2026-03-03T08:15:59Z

march based reorganization

Greptile Summary

This PR is a large-scale "march-based reorganization" that splits monolithic per-type SIMD implementation files (e.g. euclidean_distance_matrix_fp32.cc) into separate per-ISA translation units (_sse.cc, _avx2.cc, _avx512.cc, _neon.cc, _dispatch.cc), introduces shared macro utility headers (.i files), and updates the build system to assign the correct per-file -march= flags to each group. Several previously-flagged bugs (NEON march flag, AVX512 tail guard, off-by-one in FP16 batch, norm2 silent no-op, dead AVX512FP16 branch) appear to have been fixed in this iteration.

Key issues found:

FMA intrinsics in SSE translation units (compile error): All *_sse.cc files directly call _mm_fmadd_ps, a Fused Multiply-Add intrinsic that requires __FMA__ (Haswell/core-avx2 or later). These files are compiled with MATH_MARCH_FLAG_SSE = "-march=corei7" (Nehalem, SSE4.2 only, no FMA), so they will fail to compile. Either change MATH_MARCH_FLAG_SSE to a march that includes FMA (e.g. "-march=haswell"), or replace _mm_fmadd_ps(a, b, c) with _mm_add_ps(_mm_mul_ps(a, b), c) in all SSE paths. This affects euclidean_distance_matrix_fp32_sse.cc, inner_product_matrix_fp32_sse.cc, mips_euclidean_distance_matrix_fp32_sse.cc, and similar files.
FMA_INT4_ITER_AVX parameter name mismatch in distance_matrix_mips_utility.i: The macro's 4th parameter is declared ymm_sum1 but the body references ymm_sum_1, causing a compile error at all AVX2 MIPS int4 call sites.
setup_compiler_march_for_x86 silently falls back to core-avx2 for the AVX512 group when no AVX512-capable toolchain is found. This should either set the variable to empty or explicitly communicate that AVX512 files will be skipped/omitted.

Confidence Score: 1/5

Not safe to merge — the SSE translation units use FMA intrinsics incompatible with the assigned -march=corei7 flag, causing a build failure on any x86 build.
The FMA-in-SSE issue is a hard compile error that will break the build for all x86 targets. Additionally, the FMA_INT4_ITER_AVX parameter name mismatch is also a compile error for MIPS int4 AVX2 code. These blocking issues need to be resolved before the PR can land.
Files requiring attention: all *_sse.cc files (euclidean_distance_matrix_fp32_sse.cc, inner_product_matrix_fp32_sse.cc, mips_euclidean_distance_matrix_fp32_sse.cc, etc.) and src/ailego/math/distance_matrix_mips_utility.i.

Important Files Changed

Filename	Overview
src/ailego/math/euclidean_distance_matrix_fp32_sse.cc	New SSE-specific implementation file that directly uses `_mm_fmadd_ps` (FMA intrinsic) throughout, but is compiled with `-march=corei7` which does not enable FMA — will fail to compile.
cmake/option.cmake	Refactored to replace `_detect_armv8_best`/`_detect_x86_best` with simpler `_setup_armv8_march`/`_setup_x86_march` and new `setup_compiler_march_for_x86` that returns per-file flags; AVX512 fallback incorrectly falls back to `core-avx2` with confusing warning.
src/ailego/CMakeLists.txt	Adds per-file march flag logic for x86/ARM: dispatch files are still grouped under the AVX512 bucket (previously flagged), ARM NEON path now correctly sets `MATH_MARCH_FLAG_NEON` (previous issue fixed).
src/ailego/math/distance_matrix_inner_product_utility.i	New shared macro utility for inner product computations; `FMA_INT8_GENERAL` is defined twice with different signatures (3-param at line 77 and 5-param at line 106), causing a macro redefinition; `NEGZEROS_FP32_AVX` is commented out but its AVX512 equivalent is also missing.
src/ailego/math/distance_matrix_mips_utility.i	New shared macro utility for MIPS distance; `FMA_INT4_ITER_AVX` has a parameter name mismatch (`ymm_sum1` vs `ymm_sum_1`) that prevents correct substitution and causes compile errors at all call sites.
src/ailego/math/inner_product_matrix_fp16_dispatch.cc	New dispatch file for FP16 inner product; correctly checks AVX512FP16 before AVX512F before AVX (previously dead-code issue has been resolved).
src/ailego/math_batch/inner_product_distance_batch_impl_fp16_avx2.cc	Tail-element boundary check now correctly uses `<=` (off-by-one previously flagged was fixed); logic and structure look correct.
src/ailego/math_batch/inner_product_distance_batch_impl_fp16_avx512.cc	AVX512FP16 and AVX512F paths cleanly separated; tail guard now uses `<` (previously-flagged always-true issue resolved). AVX512F path has appropriate 16- and 8-element sub-loops after the main 32-wide loop.
src/ailego/math/norm2_matrix_fp32.cc	Silent no-op issue resolved — now uses nested independent `#if` guards instead of `#if/#elif` chain, so all tiers are compiled in and selected at runtime correctly.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    SRC["Source Files (*.cc)"]
    SRC --> SSE["*_sse.cc\n(MATH_MARCH_FLAG_SSE\n= -march=corei7)"]
    SRC --> AVX2["*_avx2.cc / *_avx.cc\n(MATH_MARCH_FLAG_AVX2\n= -march=core-avx2)"]
    SRC --> AVX512["*_avx512.cc + *_dispatch.cc\n(MATH_MARCH_FLAG_AVX512\n= best or fallback to core-avx2)"]
    SRC --> NEON["*_neon.cc + *_dispatch.cc\n(MATH_MARCH_FLAG_NEON\n= -march=armv8-a)"]

    SSE -->|"Links into"| LIB["zvec_ailego static lib"]
    AVX2 -->|"Links into"| LIB
    AVX512 -->|"Links into"| LIB
    NEON -->|"Links into"| LIB

    LIB --> DISP["*_dispatch.cc\n(Runtime CPU feature check)"]
    DISP -->|"AVX512F detected"| CALL_AVX512["Calls *AVX512* functions"]
    DISP -->|"AVX2 detected"| CALL_AVX2["Calls *AVX2* functions"]
    DISP -->|"Fallback"| CALL_SSE["Calls *SSE* functions"]

    SSE -.->|"❌ _mm_fmadd_ps requires FMA\nnot in -march=corei7"| BUG["COMPILE ERROR"]

Comments Outside Diff (3)

src/ailego/math_batch/inner_product_distance_batch.h, line 84-110 (link)

Missing GetQueryPreprocessFunc in float and Float16 specializations causes a compile error

The outer InnerProductDistanceBatch::GetQueryPreprocessFunc() (line 84) unconditionally calls InnerProductDistanceBatchImpl<ValueType, 1>::GetQueryPreprocessFunc() for every value type. However:
- InnerProductDistanceBatchImpl<float, 1> (line 100) declares no GetQueryPreprocessFunc.
- InnerProductDistanceBatchImpl<ailego::Float16, 1> (line 91) declares no GetQueryPreprocessFunc.
Full template specializations do not inherit members from the primary template. As a result, any code path that instantiates InnerProductDistanceBatch<float, ...>::GetQueryPreprocessFunc() or InnerProductDistanceBatch<ailego::Float16, ...>::GetQueryPreprocessFunc() will produce a compile error: error: no member named 'GetQueryPreprocessFunc' in 'InnerProductDistanceBatchImpl<float, 1>'.

Both specializations need the method added. Since floating-point types require no preprocessing, the implementation should simply return nullptr:
```
template <>
struct InnerProductDistanceBatchImpl<float, 1> {
  using ValueType = float;
  static void compute_one_to_many(const float *query, const float **ptrs,
                                  std::array<const float *, 1> &prefetch_ptrs,
                                  size_t dim, float *sums);
  static DistanceBatchQueryPreprocessFunc GetQueryPreprocessFunc() {
    return nullptr;
  }
};
```
The same fix is needed for InnerProductDistanceBatchImpl<ailego::Float16, 1>.
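
To illustrate why the specialization needs its own member, here is a minimal sketch. The names `PreprocessFunc`, `BatchImpl`, and `Batch` are hypothetical stand-ins, not the project's types; the point is only that a full specialization is an independent class and must redeclare anything the wrapper calls.

```cpp
#include <cstddef>

// Hypothetical stand-in for the project's query-preprocess callback type.
using PreprocessFunc = void (*)(void *query, std::size_t dim);

// Primary template: provides a default GetQueryPreprocessFunc().
template <typename T, std::size_t BatchSize>
struct BatchImpl {
  static PreprocessFunc GetQueryPreprocessFunc() { return nullptr; }
};

// Full specialization: a completely separate class. Nothing from the primary
// template is inherited, so the member has to be declared again here.
// Removing it would make Batch<float>::GetQueryPreprocessFunc() ill-formed.
template <>
struct BatchImpl<float, 1> {
  static PreprocessFunc GetQueryPreprocessFunc() { return nullptr; }
};

// Outer wrapper that forwards unconditionally for every value type.
template <typename T>
struct Batch {
  static PreprocessFunc GetQueryPreprocessFunc() {
    return BatchImpl<T, 1>::GetQueryPreprocessFunc();
  }
};
```

With the member present in the specialization, `Batch<float>::GetQueryPreprocessFunc()` compiles and returns `nullptr`.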
src/ailego/math_batch/inner_product_distance_batch_impl_int8_avx2.cc, line 600-603 (link)

Heap allocation inside hot SIMD loop

std::vector<__m256i> data_regs(dp_batch) is declared inside the innermost computation loop, triggering a dynamic heap allocation on every iteration. Since dp_batch is a compile-time template parameter, this should be a std::array (and moved outside the loop body). The accs vector just above also heap-allocates per call.

Contrast with inner_product_distance_batch_impl_fp32_avx2.cc and the AVX512 variants, which correctly use std::array<..., dp_batch> for all accumulators. Using std::vector here defeats the purpose of the SIMD optimisation for this hot path.

Also move accs from a std::vector to std::array<__m256i, dp_batch> accs (outside the loop, as it already is, just change the type).
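
As a rough sketch of the suggested pattern, the following scalar stand-in keeps its accumulators in a `std::array` sized by a compile-time batch parameter. `InnerProductBatch` and `kBatch` are illustrative names, and plain `int32_t` accumulators stand in for the `__m256i` registers of the real kernel.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Illustrative scalar stand-in for a batched int8 inner product. kBatch plays
// the role of the compile-time dp_batch parameter (name assumed).
template <std::size_t kBatch>
void InnerProductBatch(const std::int8_t *query,
                       const std::array<const std::int8_t *, kBatch> &rows,
                       std::size_t dim, std::array<std::int32_t, kBatch> &out) {
  // Fixed-size accumulators live on the stack: no heap allocation per call
  // and none inside the loop.
  std::array<std::int32_t, kBatch> accs{};
  for (std::size_t d = 0; d < dim; ++d) {
    for (std::size_t i = 0; i < kBatch; ++i) {
      accs[i] += static_cast<std::int32_t>(query[d]) * rows[i][d];
    }
  }
  out = accs;
}
```

Because the size is a template argument, the compiler can also fully unroll the inner loop, which a `std::vector` with a runtime size does not guarantee.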
src/ailego/math_batch/inner_product_distance_batch_impl_int8_avx512.cc, line 754 (link)

Non-portable type u_int8_t

u_int8_t is a nonstandard, BSD-derived type that some platforms expose through <sys/types.h>; it is not part of standard C++. The portable equivalent is uint8_t from <cstdint> (or <stdint.h>), which is available on all C++11-and-later targets including MSVC.

Last reviewed commit: 39404d1

Greptile also left 3 inline comments on this PR.

greptile-apps · 2026-03-03T08:19:07Z

Greptile Summary

This PR performs a large-scale refactoring of SIMD math kernel files, splitting monolithic per-type files (e.g. euclidean_distance_matrix_fp32.cc) into separate _sse, _avx, _avx2, _avx512, _neon, and _dispatch translation units, each compiled with targeted -march flags via the new setup_compiler_march_for_x86 CMake helper.

Key issues found:

cmake/option.cmake:79 — Undefined variable in ARM march check: The _setup_armv8_march() function references undefined variable _ver instead of _arch in the compiler flag check, causing the validation to test an empty flag and trivially succeed.
src/ailego/CMakeLists.txt:98 — Unset MATH_MARCH_FLAG_NEON variable: The ARM code path uses ${MATH_MARCH_FLAG_NEON} in per-file compiler flags but never assigns the variable, resulting in empty compile flags for all ARM NEON/dispatch files instead of the intended -march=armv8-a+simd (or similar).
src/ailego/math/inner_product_matrix_fp16_dispatch.cc:145 — Dead code in FP16 sparse dispatch: The preprocessor checks __AVX__ before __AVX512FP16__. Since AVX512FP16 implies AVX (both are defined when compiling with -march=sapphirerapids or similar), the #elif defined(__AVX512FP16__) branch is unreachable, preventing the more optimal InnerProductSparseInSegmentAVX512FP16 path from ever being used.
src/ailego/math_batch/inner_product_distance_batch_impl_fp16_avx2.cc:67 — Off-by-one boundary check: After the 16-wide main loop, when exactly 8 elements remain, the condition dim + 8 < dimensionality is false and those elements fall through to the scalar loop instead of the 8-wide SIMD path. Use <= to capture that boundary case.
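
The boundary case in the last item can be checked with a small model of the loop tiers. This is only a sketch: `SplitTiers` and `TierCounts` are hypothetical names, and the function counts elements per tier instead of doing any SIMD work.

```cpp
#include <cstddef>

// How many elements each tier handles for a vector of length n.
struct TierCounts {
  std::size_t wide16;
  std::size_t wide8;
  std::size_t scalar;
};

inline TierCounts SplitTiers(std::size_t n, bool inclusive_bound) {
  TierCounts c{0, 0, 0};
  std::size_t i = 0;
  for (; i + 16 <= n; i += 16) c.wide16 += 16;  // main 16-wide loop
  // One 8-wide step for the tail: `<=` accepts exactly 8 remaining elements,
  // `<` rejects them.
  const bool take8 = inclusive_bound ? (i + 8 <= n) : (i + 8 < n);
  if (take8) {
    c.wide8 += 8;
    i += 8;
  }
  c.scalar = n - i;  // whatever is left goes to the scalar loop
  return c;
}
```

For a length of 24, the strict comparison sends the last 8 elements to the scalar loop, while the inclusive comparison routes them through the 8-wide step.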

Confidence Score: 2/5

Not safe to merge — build system bugs will silently misconfigure ARM targets, and ISA dispatch/boundary check bugs degrade performance on x86.
Two CMake-level bugs (undefined _ver in ARM flag check, unset MATH_MARCH_FLAG_NEON) mean ARM SIMD files will be compiled without intended march flags. The inverted ISA priority in FP16 sparse dispatch makes the AVX512FP16 path permanently unreachable. The boundary check off-by-one skips the 8-wide SIMD path for edge cases. Together these affect correctness of the build system and performance of production code paths.
cmake/option.cmake and src/ailego/CMakeLists.txt (ARM march variable bugs) must be fixed before any ARM target builds; src/ailego/math/inner_product_matrix_fp16_dispatch.cc needs ISA check order corrected; src/ailego/math_batch/inner_product_distance_batch_impl_fp16_avx2.cc boundary condition fix.
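
The ISA ordering problem comes down to the order of preprocessor checks. A minimal sketch of the corrected ordering follows; `SelectedPath` is a hypothetical helper that only reports which branch was compiled in, not a function from the PR.

```cpp
// Reports the most capable path this translation unit was compiled for.
// Checks run from the most specific feature to the least specific, because
// the broader macros (__AVX__, __AVX512F__) are also defined whenever the
// narrower one (__AVX512FP16__) is. Testing __AVX__ first would make the
// later branches unreachable.
inline const char *SelectedPath() {
#if defined(__AVX512FP16__)
  return "avx512fp16";
#elif defined(__AVX512F__)
  return "avx512f";
#elif defined(__AVX__)
  return "avx";
#else
  return "scalar";
#endif
}
```

The string returned depends on the -march flag used for the file, so the same source yields different answers in differently flagged translation units.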

Last reviewed commit: a353f42

greptile-apps

60 files reviewed, 2 comments

src/ailego/CMakeLists.txt

src/ailego/math/euclidean_distance_matrix_fp32_avx512.cc

src/ailego/math/euclidean_distance_matrix_fp32_dispatch.cc

pyproject.toml

richyreachy · 2026-03-10T12:32:36Z

@greptile

src/ailego/math/distance_matrix_euclidean_utility.i

src/ailego/math/distance_matrix_inner_product_utility.i

src/ailego/math_batch/inner_product_distance_batch_impl_fp16_avx512.cc

richyreachy · 2026-03-10T12:52:57Z

@greptile

src/ailego/math/norm2_matrix_fp32.cc

richyreachy · 2026-03-11T02:14:20Z

@greptile

greptile-apps · 2026-03-11T02:19:45Z

src/ailego/math/distance_matrix_mips_utility.i

+#define FMA_INT4_ITER_AVX(ymm_lhs, ymm_rhs, ymm_sum_0, ymm_sum1,           \
+                          ymm_sum_norm1, ymm_sum_norm2)                    \
+  {                                                                        \
+    __m256i ymm_lhs_0 = _mm256_shuffle_epi8(                               \
+        INT4_LOOKUP_AVX, _mm256_and_si256((ymm_lhs), MASK_INT4_AVX));      \
+    __m256i ymm_rhs_0 = _mm256_shuffle_epi8(                               \
+        INT4_LOOKUP_AVX, _mm256_and_si256((ymm_rhs), MASK_INT4_AVX));      \
+    __m256i ymm_lhs_1 = _mm256_shuffle_epi8(                               \
+        INT4_LOOKUP_AVX,                                                   \
+        _mm256_and_si256(_mm256_srli_epi32((ymm_lhs), 4), MASK_INT4_AVX)); \
+    __m256i ymm_rhs_1 = _mm256_shuffle_epi8(                               \
+        INT4_LOOKUP_AVX,                                                   \
+        _mm256_and_si256(_mm256_srli_epi32((ymm_rhs), 4), MASK_INT4_AVX)); \
+    FMA_INT8_AVX(ymm_lhs_0, ymm_rhs_0, ymm_sum_0);                         \
+    FMA_INT8_AVX(ymm_lhs_1, ymm_rhs_1, ymm_sum_1);                         \
+    FMA_INT8_AVX(ymm_lhs_0, ymm_lhs_0, ymm_sum_norm1);                     \
+    FMA_INT8_AVX(ymm_lhs_1, ymm_lhs_1, ymm_sum_norm1);                     \
+    FMA_INT8_AVX(ymm_rhs_0, ymm_rhs_0, ymm_sum_norm2);                     \
+    FMA_INT8_AVX(ymm_rhs_1, ymm_rhs_1, ymm_sum_norm2);                     \
+  }
+


FMA_INT4_ITER_AVX parameter name mismatch causes compile error

The macro's 4th parameter is named ymm_sum1 (no underscore before 1), but the body at line 154 references ymm_sum_1 (with an underscore). These are different preprocessor tokens, so the parameter is never substituted.

At the call site in mips_euclidean_distance_matrix_int4_avx2.cc lines 39 and 62, the code passes ymm_sum1 as the 4th argument — but the local variable in that function is declared as ymm_sum_1 (with underscore), so ymm_sum1 is undefined at the call site. This will fail to compile with "use of undeclared identifier 'ymm_sum1'".

The fix is to align the parameter name with what the body uses:

Suggested change

-#define FMA_INT4_ITER_AVX(ymm_lhs, ymm_rhs, ymm_sum_0, ymm_sum1,           \
+#define FMA_INT4_ITER_AVX(ymm_lhs, ymm_rhs, ymm_sum_0, ymm_sum_1,          \
                           ymm_sum_norm1, ymm_sum_norm2)                    \
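
The substitution rule behind this comment can be shown with a much smaller macro. `ACCUMULATE_PAIR` and `PairTotal` are hypothetical and only illustrate the mechanism: a name in the body is replaced by the caller's argument only if it is spelled exactly like a parameter. Had the fourth parameter been declared as `sum1` while the body used `sum_1`, the body token would be left as-is and would have to resolve to a variable named `sum_1` in the caller's scope.

```cpp
// Hypothetical macro, not one from the PR. Every name used in the body is
// also a parameter, so each use is replaced by the caller's argument.
#define ACCUMULATE_PAIR(lhs, rhs, sum_0, sum_1) \
  {                                             \
    (sum_0) += (lhs) * (rhs);                   \
    (sum_1) += (lhs) + (rhs);                   \
  }

inline int PairTotal(int a, int b) {
  int acc_mul = 0;
  int acc_add = 0;
  // The caller's variable names differ from the parameter names; substitution
  // still works because the parameter list and the body agree.
  ACCUMULATE_PAIR(a, b, acc_mul, acc_add);
  return acc_mul * 100 + acc_add;
}
```

Keeping caller-side names different from the parameter names, as here, makes this kind of mismatch surface immediately as a compile error instead of silently binding to a like-named local.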

src/ailego/math/distance_matrix_inner_product_utility.i

richyreachy · 2026-03-11T02:28:20Z

@greptile

greptile-apps · 2026-03-11T02:36:02Z

src/ailego/math/euclidean_distance_matrix_fp32_sse.cc

+      xmm_sum_0 = _mm_fmadd_ps(xmm_d_0, xmm_d_0, xmm_sum_0);
+      xmm_sum_1 = _mm_fmadd_ps(xmm_d_1, xmm_d_1, xmm_sum_1);


FMA intrinsic used in SSE-only translation unit

_mm_fmadd_ps is a Fused Multiply-Add (FMA) intrinsic that requires the FMA ISA extension, first available on Haswell (-march=core-avx2). The per-file compile flag MATH_MARCH_FLAG_SSE is set to "-march=corei7" (Intel Nehalem/Core i7, SSE4.2 only), which does not enable FMA and does not define __FMA__. Compiling this file without FMA support will fail, because GCC and Clang only allow _mm_fmadd_ps to be called from code compiled with the fma target feature enabled.

The same problem exists throughout this file (lines 35, 36, 41, 49, 50, 55) and also in the other _sse.cc files: inner_product_matrix_fp32_sse.cc, mips_euclidean_distance_matrix_fp32_sse.cc, etc.

The fix is either to:

Change MATH_MARCH_FLAG_SSE from "-march=corei7" to a march that includes FMA (e.g. "-march=haswell" or "-march=core-avx2"), OR

Replace _mm_fmadd_ps(a, b, c) with an explicit _mm_add_ps(_mm_mul_ps(a, b), c) in the SSE code paths so no FMA instruction is required.

Suggested change

-      xmm_sum_0 = _mm_fmadd_ps(xmm_d_0, xmm_d_0, xmm_sum_0);
-      xmm_sum_1 = _mm_fmadd_ps(xmm_d_1, xmm_d_1, xmm_sum_1);
+      xmm_sum_0 = _mm_add_ps(_mm_mul_ps(xmm_d_0, xmm_d_0), xmm_sum_0);
+      xmm_sum_1 = _mm_add_ps(_mm_mul_ps(xmm_d_1, xmm_d_1), xmm_sum_1);
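
For context, a self-contained sketch of the second option (separate multiply and add) might look like the following. `SquaredDistanceNoFma` and the `SKETCH_HAS_SSE` guard are illustrative names, not code from the PR; the SSE path uses only `_mm_sub_ps`, `_mm_mul_ps`, and `_mm_add_ps`, so it compiles under an SSE-only -march, and a scalar loop covers the tail and non-x86 targets.

```cpp
#include <cstddef>
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>
#define SKETCH_HAS_SSE 1
#endif

// Squared Euclidean distance between two float vectors.
// Uses separate multiply and add, so it needs no FMA support.
inline float SquaredDistanceNoFma(const float *a, const float *b,
                                  std::size_t dim) {
  float total = 0.0f;
  std::size_t i = 0;
#if defined(SKETCH_HAS_SSE)
  __m128 sum = _mm_setzero_ps();
  for (; i + 4 <= dim; i += 4) {
    __m128 d = _mm_sub_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i));
    sum = _mm_add_ps(_mm_mul_ps(d, d), sum);  // mul + add instead of fmadd
  }
  float lanes[4];
  _mm_storeu_ps(lanes, sum);  // horizontal reduction of the four lanes
  total = lanes[0] + lanes[1] + lanes[2] + lanes[3];
#endif
  for (; i < dim; ++i) {  // scalar tail (and full fallback without SSE)
    const float d = a[i] - b[i];
    total += d * d;
  }
  return total;
}
```

Note that the unfused form rounds twice per step, so results can differ in the last bits from an FMA build; whether that matters depends on the tolerance the callers expect.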

greptile-apps · 2026-03-11T02:36:03Z

cmake/option.cmake

+
+function(setup_compiler_march_for_x86 VAR_NAME_SSE VAR_NAME_AVX2 VAR_NAME_AVX512)
+  #sse
+  set(${VAR_NAME_SSE} "-march=corei7" PARENT_SCOPE)
+
+  #avx 2
+  set(${VAR_NAME_AVX2} "-march=core-avx2" PARENT_SCOPE)
+


AVX512 fallback silently upgrades SSE-only builds to AVX2

When no AVX512 architecture flag is accepted by the compiler (i.e. the probing loop for graniterapids, emeraldrapids, sapphirerapids, skylake-avx512 all fail), VAR_NAME_AVX512 is set to "-march=core-avx2". This means that on a machine whose compiler only supports up to AVX2, the "AVX512 group" files (including all _dispatch.cc and _avx512.cc files) will be compiled with core-avx2 flags.

This avoids emitting unavailable AVX512 instructions, but it silently changes what the AVX512 group is built with. The dispatch files contain runtime CPU detection and call the _avx512.cc functions only when AVX512 is confirmed at runtime, so the calls themselves remain guarded. The problem is at build time: the AVX512 intrinsics inside the _avx512.cc files still require AVX512 flags to compile, and core-avx2 does not provide them. The warning message should make clear that the _avx512.cc files may not be fully compiled:

Suggested change

function(setup_compiler_march_for_x86 VAR_NAME_SSE VAR_NAME_AVX2 VAR_NAME_AVX512)
  #sse
  set(${VAR_NAME_SSE} "-march=corei7" PARENT_SCOPE)
  #avx 2
  set(${VAR_NAME_AVX2} "-march=core-avx2" PARENT_SCOPE)
  set(${VAR_NAME_AVX512} "" PARENT_SCOPE)
  message(WARNING "No known avx512 microarchitecture flag found. AVX512 implementations will not be built.")

greptile-apps · 2026-03-11T02:36:04Z

src/ailego/math/distance_matrix_mips_utility.i

+#define FMA_INT4_ITER_AVX(ymm_lhs, ymm_rhs, ymm_sum_0, ymm_sum1,           \
+                          ymm_sum_norm1, ymm_sum_norm2)                    \
+  {                                                                        \
+    __m256i ymm_lhs_0 = _mm256_shuffle_epi8(                               \
+        INT4_LOOKUP_AVX, _mm256_and_si256((ymm_lhs), MASK_INT4_AVX));      \
+    __m256i ymm_rhs_0 = _mm256_shuffle_epi8(                               \
+        INT4_LOOKUP_AVX, _mm256_and_si256((ymm_rhs), MASK_INT4_AVX));      \
+    __m256i ymm_lhs_1 = _mm256_shuffle_epi8(                               \
+        INT4_LOOKUP_AVX,                                                   \
+        _mm256_and_si256(_mm256_srli_epi32((ymm_lhs), 4), MASK_INT4_AVX)); \
+    __m256i ymm_rhs_1 = _mm256_shuffle_epi8(                               \
+        INT4_LOOKUP_AVX,                                                   \
+        _mm256_and_si256(_mm256_srli_epi32((ymm_rhs), 4), MASK_INT4_AVX)); \
+    FMA_INT8_AVX(ymm_lhs_0, ymm_rhs_0, ymm_sum_0);                         \
+    FMA_INT8_AVX(ymm_lhs_1, ymm_rhs_1, ymm_sum_1);                         \
+    FMA_INT8_AVX(ymm_lhs_0, ymm_lhs_0, ymm_sum_norm1);                     \
+    FMA_INT8_AVX(ymm_lhs_1, ymm_lhs_1, ymm_sum_norm1);                     \
+    FMA_INT8_AVX(ymm_rhs_0, ymm_rhs_0, ymm_sum_norm2);                     \
+    FMA_INT8_AVX(ymm_rhs_1, ymm_rhs_1, ymm_sum_norm2);                     \
+  }


FMA_INT4_ITER_AVX 4th parameter name mismatch — compile error

The macro's 4th parameter is declared as ymm_sum1 (no underscore before 1) on the definition line, but the body at line 154 references ymm_sum_1 (with underscore). These are different preprocessor tokens — the parameter will never be substituted, and the macro body will reference an undeclared identifier at every call site.

Suggested change

-#define FMA_INT4_ITER_AVX(ymm_lhs, ymm_rhs, ymm_sum_0, ymm_sum1,           \
+#define FMA_INT4_ITER_AVX(ymm_lhs, ymm_rhs, ymm_sum_0, ymm_sum_1,          \
                           ymm_sum_norm1, ymm_sum_norm2)                    \

egolearner · 2026-03-11T03:16:47Z

src/ailego/math/euclidean_distance_matrix_fp32_dispatch.cc

+                                                          const ValueType *q,
+                                                          size_t dim,
+                                                          float *out) {
+#if defined(__ARM_NEON)


Test the performance difference against the zvec-turbo style and adjust the implementation based on the results. This can be handled either in this PR or in a follow-up PR after this one is merged.

void SquaredEuclideanDistanceMatrix<float, 1, 1>::Compute(const ValueType *m, const ValueType *q, size_t dim, float *out) {
  static float (*impl_func_)(const float*, const float*, size_t);
  std::call_once(...{
    // set impl_func_
  });
  *out = impl_func_(m, q, dim);
  return;
}
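
A compilable sketch of this select-once, call-through-a-pointer style is shown below, under a few stated assumptions. It uses a function-local static (initialized once, thread-safely, since C++11) in place of std::call_once; both give run-once semantics. `SelectKernel`, `SquaredEuclideanScalar`, and the free function `Compute` are hypothetical names, and the selector returns the scalar kernel where a real one would query CPU features.

```cpp
#include <cstddef>

using DistanceFn = float (*)(const float *, const float *, std::size_t);

// Portable reference kernel; SIMD variants would share this signature.
inline float SquaredEuclideanScalar(const float *m, const float *q,
                                    std::size_t dim) {
  float sum = 0.0f;
  for (std::size_t i = 0; i < dim; ++i) {
    const float d = m[i] - q[i];
    sum += d * d;
  }
  return sum;
}

// Hypothetical selector: a real one would check CPU features here and
// return the best available kernel.
inline DistanceFn SelectKernel() { return &SquaredEuclideanScalar; }

inline void Compute(const float *m, const float *q, std::size_t dim,
                    float *out) {
  // Initialized exactly once, on the first call; later calls only pay for
  // the guard check and one indirect call.
  static const DistanceFn impl = SelectKernel();
  *out = impl(m, q, dim);
}
```

Whether this beats a chain of feature checks on every call is exactly what the comment asks to measure, so the sketch should be read as a shape to benchmark, not a recommendation.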

egolearner

LGTM

richyreachy added 19 commits February 26, 2026 19:14

add for march compatible test

a2326a8

update codes

d4c9b6e

update code

8f61b0d

remove avx

2b0742d

update code

c907dc0

add euclidean dispatch

04fabee

add l2 int4 dispatch

9930d64

refactor codes

a8a238e

refactor mips codes

bd0060a

refactor inner product

c8a8126

refactor inner product metric

b64639a

update codes

1f78251

fix params

dd8c170

fix params

e92cf59

fix params

5bfbf80

fix params

8e5af92

fix mips metric

7096507

remove inline

b74d256

fix mips metric

4370619

greptile-apps bot reviewed Mar 3, 2026

src/ailego/CMakeLists.txt (outdated, resolved)

src/ailego/CMakeLists.txt (outdated, resolved)

richyreachy and others added 9 commits March 3, 2026 19:51

fix mips metric

2f2fba5

remove MxN distances

012fa99

update cmake

f0e2bab

refactor one2many distances

205b85a

fix inner product

ee477dd

add specialized function

5c22c92

fix compile

f9d162a

fix euclidean

af8f724

remove inner product MxN interface

fcbbeca

richyreachy and others added 2 commits March 9, 2026 15:39

fix: add macro scope

ad18793

fix for android ci

885c75c

feihongxu0824 requested review from egolearner and feihongxu0824 March 9, 2026 09:06

egolearner requested changes Mar 9, 2026

src/ailego/math/euclidean_distance_matrix_fp32_avx512.cc (outdated, resolved)

egolearner reviewed Mar 9, 2026

src/ailego/math/euclidean_distance_matrix_fp32_dispatch.cc (outdated, resolved)

feihongxu0824 reviewed Mar 10, 2026

pyproject.toml (outdated, resolved)

richyreachy added 5 commits March 10, 2026 14:49

refactor: call euclidean via squared euclidean

7bf6ecb

refactor: add match utility

12fbe1e

refactor: comment config out

dcbc8ff

fix: fix dimension remainder

5cc1bb8

fix: fix condition & config flags

00a8821

fix: remove redundant macros

c9355b9

greptile-apps bot reviewed Mar 10, 2026

src/ailego/math/distance_matrix_euclidean_utility.i (resolved)

src/ailego/math/distance_matrix_inner_product_utility.i (resolved)

src/ailego/math_batch/inner_product_distance_batch_impl_fp16_avx512.cc (outdated, resolved)

fix: fix condition error

7fffa86

greptile-apps bot reviewed Mar 10, 2026

src/ailego/math/norm2_matrix_fp32.cc (resolved)

feihongxu0824 assigned iaojnh, egolearner and feihongxu0824 Mar 11, 2026

fix: fix macros

d81e79e

greptile-apps bot reviewed Mar 11, 2026

fix: remove unnecessary macros

39404d1

greptile-apps bot reviewed Mar 11, 2026

egolearner reviewed Mar 11, 2026

egolearner approved these changes Mar 11, 2026

Merge branch 'main' into refactor/march_based_reorganization

d3f38fa
