refactor/march based reorganization #193

Open
richyreachy wants to merge 71 commits into main from refactor/march_based_reorganization

@richyreachy richyreachy commented Mar 3, 2026

march based reorganization

Greptile Summary

This PR is a large-scale "march-based reorganization" that splits monolithic per-type SIMD implementation files (e.g. euclidean_distance_matrix_fp32.cc) into separate per-ISA translation units (_sse.cc, _avx2.cc, _avx512.cc, _neon.cc, _dispatch.cc), introduces shared macro utility headers (.i files), and updates the build system to assign the correct per-file -march= flags to each group. Several previously-flagged bugs (NEON march flag, AVX512 tail guard, off-by-one in FP16 batch, norm2 silent no-op, dead AVX512FP16 branch) appear to have been fixed in this iteration.

Key issues found:

  • FMA intrinsics in SSE translation units (compile error): All *_sse.cc files directly call _mm_fmadd_ps, a Fused Multiply-Add intrinsic that requires __FMA__ (Haswell/core-avx2 or later). These files are compiled with MATH_MARCH_FLAG_SSE = "-march=corei7" (Nehalem, SSE4.2 only, no FMA), so they will fail to compile. Either change MATH_MARCH_FLAG_SSE to a march that includes FMA (e.g. "-march=haswell"), or replace _mm_fmadd_ps(a, b, c) with _mm_add_ps(_mm_mul_ps(a, b), c) in all SSE paths. This affects euclidean_distance_matrix_fp32_sse.cc, inner_product_matrix_fp32_sse.cc, mips_euclidean_distance_matrix_fp32_sse.cc, and similar files.

  • FMA_INT4_ITER_AVX parameter name mismatch in distance_matrix_mips_utility.i: The macro's 4th parameter is declared ymm_sum1 but the body references ymm_sum_1, causing a compile error at all AVX2 MIPS int4 call sites.

  • setup_compiler_march_for_x86 silently falls back to core-avx2 for the AVX512 group when no AVX512-capable toolchain is found. This should either set the variable to empty or explicitly communicate that AVX512 files will be skipped/omitted.

Confidence Score: 1/5

  • Not safe to merge — the SSE translation units use FMA intrinsics incompatible with the assigned -march=corei7 flag, causing a build failure on any x86 build.
  • The FMA-in-SSE issue is a hard compile error that will break the build for all x86 targets. Additionally, the FMA_INT4_ITER_AVX parameter name mismatch is also a compile error for MIPS int4 AVX2 code. These blocking issues need to be resolved before the PR can land.
  • All *_sse.cc files (euclidean_distance_matrix_fp32_sse.cc, inner_product_matrix_fp32_sse.cc, mips_euclidean_distance_matrix_fp32_sse.cc, etc.) and src/ailego/math/distance_matrix_mips_utility.i.

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/ailego/math/euclidean_distance_matrix_fp32_sse.cc | New SSE-specific implementation file that directly uses _mm_fmadd_ps (FMA intrinsic) throughout, but is compiled with -march=corei7 which does not enable FMA — will fail to compile. |
| cmake/option.cmake | Refactored to replace _detect_armv8_best/_detect_x86_best with simpler _setup_armv8_march/_setup_x86_march and new setup_compiler_march_for_x86 that returns per-file flags; AVX512 fallback incorrectly falls back to core-avx2 with confusing warning. |
| src/ailego/CMakeLists.txt | Adds per-file march flag logic for x86/ARM: dispatch files are still grouped under the AVX512 bucket (previously flagged), ARM NEON path now correctly sets MATH_MARCH_FLAG_NEON (previous issue fixed). |
| src/ailego/math/distance_matrix_inner_product_utility.i | New shared macro utility for inner product computations; FMA_INT8_GENERAL is defined twice with different signatures (3-param at line 77 and 5-param at line 106), causing a macro redefinition; NEGZEROS_FP32_AVX is commented out but its AVX512 equivalent is also missing. |
| src/ailego/math/distance_matrix_mips_utility.i | New shared macro utility for MIPS distance; FMA_INT4_ITER_AVX has a parameter name mismatch (ymm_sum1 vs ymm_sum_1) that prevents correct substitution and causes compile errors at all call sites. |
| src/ailego/math/inner_product_matrix_fp16_dispatch.cc | New dispatch file for FP16 inner product; correctly checks AVX512FP16 before AVX512F before AVX (previously dead-code issue has been resolved). |
| src/ailego/math_batch/inner_product_distance_batch_impl_fp16_avx2.cc | Tail-element boundary check now correctly uses <= (off-by-one previously flagged was fixed); logic and structure look correct. |
| src/ailego/math_batch/inner_product_distance_batch_impl_fp16_avx512.cc | AVX512FP16 and AVX512F paths cleanly separated; tail guard now uses < (previously-flagged always-true issue resolved). AVX512F path has appropriate 16- and 8-element sub-loops after the main 32-wide loop. |
| src/ailego/math/norm2_matrix_fp32.cc | Silent no-op issue resolved — now uses nested independent #if guards instead of #if/#elif chain, so all tiers are compiled in and selected at runtime correctly. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    SRC["Source Files (*.cc)"]
    SRC --> SSE["*_sse.cc\n(MATH_MARCH_FLAG_SSE\n= -march=corei7)"]
    SRC --> AVX2["*_avx2.cc / *_avx.cc\n(MATH_MARCH_FLAG_AVX2\n= -march=core-avx2)"]
    SRC --> AVX512["*_avx512.cc + *_dispatch.cc\n(MATH_MARCH_FLAG_AVX512\n= best or fallback to core-avx2)"]
    SRC --> NEON["*_neon.cc + *_dispatch.cc\n(MATH_MARCH_FLAG_NEON\n= -march=armv8-a)"]

    SSE -->|"Links into"| LIB["zvec_ailego static lib"]
    AVX2 -->|"Links into"| LIB
    AVX512 -->|"Links into"| LIB
    NEON -->|"Links into"| LIB

    LIB --> DISP["*_dispatch.cc\n(Runtime CPU feature check)"]
    DISP -->|"AVX512F detected"| CALL_AVX512["Calls *AVX512* functions"]
    DISP -->|"AVX2 detected"| CALL_AVX2["Calls *AVX2* functions"]
    DISP -->|"Fallback"| CALL_SSE["Calls *SSE* functions"]

    SSE -.->|"❌ _mm_fmadd_ps requires FMA\nnot in -march=corei7"| BUG["COMPILE ERROR"]

Comments Outside Diff (3)

  1. src/ailego/math_batch/inner_product_distance_batch.h, line 84-110 (link)

    Missing GetQueryPreprocessFunc in float and Float16 specializations causes a compile error

    The outer InnerProductDistanceBatch::GetQueryPreprocessFunc() (line 84) unconditionally calls InnerProductDistanceBatchImpl<ValueType, 1>::GetQueryPreprocessFunc() for every value type. However:

    • InnerProductDistanceBatchImpl<float, 1> (line 100) declares no GetQueryPreprocessFunc.
    • InnerProductDistanceBatchImpl<ailego::Float16, 1> (line 91) declares no GetQueryPreprocessFunc.

    Full template specializations do not inherit members from the primary template. As a result, any code path that instantiates InnerProductDistanceBatch<float, ...>::GetQueryPreprocessFunc() or InnerProductDistanceBatch<ailego::Float16, ...>::GetQueryPreprocessFunc() will produce a compile error: error: no member named 'GetQueryPreprocessFunc' in 'InnerProductDistanceBatchImpl<float, 1>'.

    Both specializations need the method added. Since floating-point types require no preprocessing, the implementation should simply return nullptr:

    template <>
    struct InnerProductDistanceBatchImpl<float, 1> {
      using ValueType = float;
      static void compute_one_to_many(const float *query, const float **ptrs,
                                      std::array<const float *, 1> &prefetch_ptrs,
                                      size_t dim, float *sums);
      static DistanceBatchQueryPreprocessFunc GetQueryPreprocessFunc() {
        return nullptr;
      }
    };

    The same fix is needed for InnerProductDistanceBatchImpl<ailego::Float16, 1>.
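    The rule that a full specialization is an unrelated class, not a derived one, can be seen in a reduced sketch. The names below (BatchImpl, Batch, PreprocessFunc, NoopPreprocess) are hypothetical stand-ins, not the types in this PR; the point is only that the wrapper compiles because every specialization it reaches declares the member itself.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for DistanceBatchQueryPreprocessFunc.
using PreprocessFunc = void (*)(void *query, std::size_t dim);

// Placeholder preprocessing routine used by the primary template.
inline void NoopPreprocess(void *, std::size_t) {}

// Primary template: provides a preprocessing hook.
template <typename T, std::size_t N>
struct BatchImpl {
  static PreprocessFunc GetQueryPreprocessFunc() { return &NoopPreprocess; }
};

// Full specialization: a brand-new class. Nothing is inherited from the
// primary template, so the member must be declared again here. Removing it
// makes Batch<float>::GetQueryPreprocessFunc() fail to compile.
template <>
struct BatchImpl<float, 1> {
  static PreprocessFunc GetQueryPreprocessFunc() { return nullptr; }
};

// Wrapper that unconditionally forwards to the implementation type.
template <typename T>
struct Batch {
  static PreprocessFunc GetQueryPreprocessFunc() {
    return BatchImpl<T, 1>::GetQueryPreprocessFunc();
  }
};
```

    Here Batch<float> resolves to the specialization and returns nullptr, while Batch<std::int8_t> falls back to the primary template.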

  2. src/ailego/math_batch/inner_product_distance_batch_impl_int8_avx2.cc, line 600-603 (link)

    Heap allocation inside hot SIMD loop

    std::vector<__m256i> data_regs(dp_batch) is declared inside the innermost computation loop, triggering a dynamic heap allocation on every iteration. Since dp_batch is a compile-time template parameter, this should be a std::array (and moved outside the loop body). The accs vector just above also heap-allocates per call.

    Contrast with inner_product_distance_batch_impl_fp32_avx2.cc and the AVX512 variants, which correctly use std::array<..., dp_batch> for all accumulators. Using std::vector here defeats the purpose of the SIMD optimisation for this hot path.

    Also change accs from a std::vector to std::array<__m256i, dp_batch>. It is already declared outside the loop, so only the type needs to change.
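    As a rough illustration of the suggested shape, the sketch below uses a scalar stand-in for the int8 kernel (InnerProductBatch is a hypothetical name, and the real code operates on __m256i registers). Because the batch size is a template parameter, the accumulators can live in a std::array on the stack and no allocation happens per call or per iteration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar stand-in for the int8 batch kernel: computes the inner
// product of one query against kBatch data vectors.
template <std::size_t kBatch>
void InnerProductBatch(const std::int8_t *query,
                       const std::array<const std::int8_t *, kBatch> &data,
                       std::size_t dim,
                       std::array<std::int32_t, kBatch> &out) {
  // Fixed-size, zero-initialized accumulators: stack storage, no heap traffic.
  std::array<std::int32_t, kBatch> accs{};
  for (std::size_t d = 0; d < dim; ++d) {
    for (std::size_t i = 0; i < kBatch; ++i) {
      accs[i] += static_cast<std::int32_t>(query[d]) * data[i][d];
    }
  }
  out = accs;
}
```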

  3. src/ailego/math_batch/inner_product_distance_batch_impl_int8_avx512.cc, line 754 (link)

    Non-portable POSIX type u_int8_t

    u_int8_t is a POSIX extension; it is not part of standard C++. The portable equivalent is uint8_t from <cstdint> (or <stdint.h>), which is available on all C++11-and-later targets including MSVC.

Last reviewed commit: 39404d1

Greptile also left 3 inline comments on this PR.

greptile-apps bot commented Mar 3, 2026

Greptile Summary

This PR performs a large-scale refactoring of SIMD math kernel files, splitting monolithic per-type files (e.g. euclidean_distance_matrix_fp32.cc) into separate _sse, _avx, _avx2, _avx512, _neon, and _dispatch translation units, each compiled with targeted -march flags via the new setup_compiler_march_for_x86 CMake helper.

Key issues found:

  1. cmake/option.cmake:79 — Undefined variable in ARM march check: The _setup_armv8_march() function references undefined variable _ver instead of _arch in the compiler flag check, causing the validation to test an empty flag and trivially succeed.

  2. src/ailego/CMakeLists.txt:98 — Unset MATH_MARCH_FLAG_NEON variable: The ARM code path uses ${MATH_MARCH_FLAG_NEON} in per-file compiler flags but never assigns the variable, resulting in empty compile flags for all ARM NEON/dispatch files instead of the intended -march=armv8-a+simd (or similar).

  3. src/ailego/math/inner_product_matrix_fp16_dispatch.cc:145 — Dead code in FP16 sparse dispatch: The preprocessor checks __AVX__ before __AVX512FP16__. Since AVX512FP16 implies AVX (both are defined when compiling with -march=sapphirerapids or similar), the #elif defined(__AVX512FP16__) branch is unreachable, preventing the more optimal InnerProductSparseInSegmentAVX512FP16 path from ever being used.

  4. src/ailego/math_batch/inner_product_distance_batch_impl_fp16_avx2.cc:67 — Off-by-one boundary check: After the 16-wide main loop, when exactly 8 elements remain, the condition dim + 8 < dimensionality is false and those elements fall through to the scalar loop instead of the 8-wide SIMD path. Use <= to capture that boundary case.
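The boundary case in item 4 can be checked with a small scalar model of the loop tiers. This is a sketch under assumptions: CountTiers and TailCounts are hypothetical names, and the main-loop condition dim + 16 <= dimensionality is assumed rather than taken from the file. It only counts how many elements each tier would handle under the two comparisons.

```cpp
#include <cassert>
#include <cstddef>

struct TailCounts {
  std::size_t wide16;  // elements handled by the 16-wide main loop
  std::size_t wide8;   // elements handled by the 8-wide step
  std::size_t scalar;  // elements left for the scalar tail
};

// `inclusive` selects `<=` (the suggested fix) versus `<` (the flagged form).
inline TailCounts CountTiers(std::size_t dimensionality, bool inclusive) {
  TailCounts c{0, 0, 0};
  std::size_t dim = 0;
  for (; dim + 16 <= dimensionality; dim += 16) c.wide16 += 16;
  const bool take8 = inclusive ? (dim + 8 <= dimensionality)
                               : (dim + 8 < dimensionality);
  if (take8) {
    c.wide8 += 8;
    dim += 8;
  }
  for (; dim < dimensionality; ++dim) c.scalar += 1;
  return c;
}
```

With 24 elements, the strict comparison sends the last 8 to the scalar tail, while the inclusive one routes them through the 8-wide step.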

Confidence Score: 2/5

  • Not safe to merge — build system bugs will silently misconfigure ARM targets, and ISA dispatch/boundary check bugs degrade performance on x86.
  • Two CMake-level bugs (undefined _ver in ARM flag check, unset MATH_MARCH_FLAG_NEON) mean ARM SIMD files will be compiled without intended march flags. The inverted ISA priority in FP16 sparse dispatch makes the AVX512FP16 path permanently unreachable. The boundary check off-by-one skips the 8-wide SIMD path for edge cases. Together these affect correctness of the build system and performance of production code paths.
  • cmake/option.cmake and src/ailego/CMakeLists.txt (ARM march variable bugs) must be fixed before any ARM target builds; src/ailego/math/inner_product_matrix_fp16_dispatch.cc needs ISA check order corrected; src/ailego/math_batch/inner_product_distance_batch_impl_fp16_avx2.cc boundary condition fix.
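The inverted ISA priority is easy to reproduce in isolation. The sketch below uses hypothetical macros (DEMO_HAS_AVX, DEMO_HAS_AVX512FP16) in place of __AVX__ and __AVX512FP16__ so that it compiles on any target; both are defined to mimic a build where the more specific feature implies the general one.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical feature macros standing in for __AVX__ and __AVX512FP16__.
#define DEMO_HAS_AVX 1
#define DEMO_HAS_AVX512FP16 1

// General feature tested first: the more specific branch can never be taken.
inline const char *PickGeneralFirst() {
#if defined(DEMO_HAS_AVX)
  return "AVX";
#elif defined(DEMO_HAS_AVX512FP16)
  return "AVX512FP16";  // dead whenever DEMO_HAS_AVX is also defined
#else
  return "scalar";
#endif
}

// Most specific feature tested first: each tier stays reachable.
inline const char *PickSpecificFirst() {
#if defined(DEMO_HAS_AVX512FP16)
  return "AVX512FP16";
#elif defined(DEMO_HAS_AVX)
  return "AVX";
#else
  return "scalar";
#endif
}
```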

Last reviewed commit: a353f42

@greptile-apps greptile-apps bot left a comment

60 files reviewed, 2 comments

@richyreachy
Collaborator Author

@greptile

Comment on lines +140 to +160
#define FMA_INT4_ITER_AVX(ymm_lhs, ymm_rhs, ymm_sum_0, ymm_sum1, \
                          ymm_sum_norm1, ymm_sum_norm2) \
  { \
    __m256i ymm_lhs_0 = _mm256_shuffle_epi8( \
        INT4_LOOKUP_AVX, _mm256_and_si256((ymm_lhs), MASK_INT4_AVX)); \
    __m256i ymm_rhs_0 = _mm256_shuffle_epi8( \
        INT4_LOOKUP_AVX, _mm256_and_si256((ymm_rhs), MASK_INT4_AVX)); \
    __m256i ymm_lhs_1 = _mm256_shuffle_epi8( \
        INT4_LOOKUP_AVX, \
        _mm256_and_si256(_mm256_srli_epi32((ymm_lhs), 4), MASK_INT4_AVX)); \
    __m256i ymm_rhs_1 = _mm256_shuffle_epi8( \
        INT4_LOOKUP_AVX, \
        _mm256_and_si256(_mm256_srli_epi32((ymm_rhs), 4), MASK_INT4_AVX)); \
    FMA_INT8_AVX(ymm_lhs_0, ymm_rhs_0, ymm_sum_0); \
    FMA_INT8_AVX(ymm_lhs_1, ymm_rhs_1, ymm_sum_1); \
    FMA_INT8_AVX(ymm_lhs_0, ymm_lhs_0, ymm_sum_norm1); \
    FMA_INT8_AVX(ymm_lhs_1, ymm_lhs_1, ymm_sum_norm1); \
    FMA_INT8_AVX(ymm_rhs_0, ymm_rhs_0, ymm_sum_norm2); \
    FMA_INT8_AVX(ymm_rhs_1, ymm_rhs_1, ymm_sum_norm2); \
  }

FMA_INT4_ITER_AVX parameter name mismatch causes compile error

The macro's 4th parameter is named ymm_sum1 (no underscore before 1), but the body at line 154 references ymm_sum_1 (with an underscore). These are different preprocessor tokens, so the parameter is never substituted.

At the call site in mips_euclidean_distance_matrix_int4_avx2.cc lines 39 and 62, the code passes ymm_sum1 as the 4th argument — but the local variable in that function is declared as ymm_sum_1 (with underscore), so ymm_sum1 is undefined at the call site. This will fail to compile with "use of undeclared identifier 'ymm_sum1'".

The fix is to align the parameter name with what the body uses:

Suggested change
-#define FMA_INT4_ITER_AVX(ymm_lhs, ymm_rhs, ymm_sum_0, ymm_sum1, \
+#define FMA_INT4_ITER_AVX(ymm_lhs, ymm_rhs, ymm_sum_0, ymm_sum_1, \
                           ymm_sum_norm1, ymm_sum_norm2) \
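A reduced model of the mismatch, using plain integers and hypothetical macro names (ACCUM_MISMATCHED, ACCUM_FIXED), shows why the spelling matters. In this sketch the caller happens to have a variable named sum_1 in scope, so the broken macro compiles and silently updates the wrong variable; in the PR the caller's argument name is undeclared, so the same mistake surfaces as a compile error instead.

```cpp
#include <cassert>

// The third parameter is spelled `sum1`, but the body says `sum_1`: the
// argument is never substituted, and `sum_1` is looked up in the caller's
// scope instead.
#define ACCUM_MISMATCHED(x, sum_0, sum1) \
  {                                      \
    (sum_0) += (x);                      \
    (sum_1) += (x);                      \
  }

// Parameter name and body agree, so the third argument is substituted.
#define ACCUM_FIXED(x, sum_0, sum_1) \
  {                                  \
    (sum_0) += (x);                  \
    (sum_1) += (x);                  \
  }

struct Sums {
  int a;
  int b;
  int sum_1;
};

inline Sums RunMismatched() {
  int a = 0, b = 0, sum_1 = 0;
  ACCUM_MISMATCHED(5, a, b);  // intends to update b, but updates sum_1
  return {a, b, sum_1};
}

inline Sums RunFixed() {
  int a = 0, b = 0, sum_1 = 0;
  ACCUM_FIXED(5, a, b);  // updates a and b as intended
  return {a, b, sum_1};
}
```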

Comment on lines +35 to +36
xmm_sum_0 = _mm_fmadd_ps(xmm_d_0, xmm_d_0, xmm_sum_0);
xmm_sum_1 = _mm_fmadd_ps(xmm_d_1, xmm_d_1, xmm_sum_1);

FMA intrinsic used in SSE-only translation unit

_mm_fmadd_ps is a Fused Multiply-Add (FMA) intrinsic that requires the FMA ISA extension, first available on Haswell (-march=core-avx2). The per-file compile flag MATH_MARCH_FLAG_SSE is set to "-march=corei7" (Intel Nehalem/Core i7, SSE4.2 only), which does not define __FMA__. Compiling this file without FMA support will fail: depending on the compiler and header version, _mm_fmadd_ps is either not declared at all or is rejected because the calling function is built without the fma target feature.

The same problem exists throughout this file (lines 35, 36, 41, 49, 50, 55) and also in the other _sse.cc files: inner_product_matrix_fp32_sse.cc, mips_euclidean_distance_matrix_fp32_sse.cc, etc.

The fix is either to:

  1. Change MATH_MARCH_FLAG_SSE from "-march=corei7" to a march that includes FMA (e.g. "-march=haswell" or "-march=core-avx2"), OR
  2. Replace _mm_fmadd_ps(a, b, c) with an explicit _mm_add_ps(_mm_mul_ps(a, b), c) in the SSE code paths so no FMA instruction is required.
Suggested change
-xmm_sum_0 = _mm_fmadd_ps(xmm_d_0, xmm_d_0, xmm_sum_0);
-xmm_sum_1 = _mm_fmadd_ps(xmm_d_1, xmm_d_1, xmm_sum_1);
+xmm_sum_0 = _mm_add_ps(_mm_mul_ps(xmm_d_0, xmm_d_0), xmm_sum_0);
+xmm_sum_1 = _mm_add_ps(_mm_mul_ps(xmm_d_1, xmm_d_1), xmm_sum_1);
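For reference, option 2 can be sketched as a self-contained sum-of-squares loop that uses only SSE1 intrinsics. SumOfSquares and DEMO_HAVE_SSE are hypothetical names, and the scalar fallback exists only so the sketch compiles on non-x86 targets. Note that separate multiply and add round twice where FMA rounds once, so results may differ from an FMA build in the last bits.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__SSE__) || defined(_M_X64) || defined(_M_AMD64)
#include <xmmintrin.h>
#define DEMO_HAVE_SSE 1
#endif

// Sum of squares using multiply-then-add, so no FMA instruction is required.
inline float SumOfSquares(const float *d, std::size_t n) {
  float total = 0.0f;
  std::size_t i = 0;
#if defined(DEMO_HAVE_SSE)
  __m128 acc = _mm_setzero_ps();
  for (; i + 4 <= n; i += 4) {
    __m128 v = _mm_loadu_ps(d + i);
    acc = _mm_add_ps(_mm_mul_ps(v, v), acc);  // replaces _mm_fmadd_ps(v, v, acc)
  }
  float lanes[4];
  _mm_storeu_ps(lanes, acc);
  total = lanes[0] + lanes[1] + lanes[2] + lanes[3];
#endif
  // Scalar tail (and the whole loop on targets without SSE).
  for (; i < n; ++i) total += d[i] * d[i];
  return total;
}
```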

Comment on lines +104 to +111

function(setup_compiler_march_for_x86 VAR_NAME_SSE VAR_NAME_AVX2 VAR_NAME_AVX512)
#sse
set(${VAR_NAME_SSE} "-march=corei7" PARENT_SCOPE)

#avx 2
set(${VAR_NAME_AVX2} "-march=core-avx2" PARENT_SCOPE)

AVX512 fallback silently downgrades the AVX512 group to core-avx2

When no AVX512 architecture flag is accepted by the compiler (i.e. the probing loop for graniterapids, emeraldrapids, sapphirerapids, skylake-avx512 all fail), VAR_NAME_AVX512 is set to "-march=core-avx2". This means that with a compiler that only supports up to AVX2, the "AVX512 group" files (including all _dispatch.cc and _avx512.cc files) will be compiled with core-avx2 flags.

This avoids passing an unsupported -march flag, but it silently changes how the AVX512 group is built. The dispatch files call the _avx512.cc functions only when AVX512 is confirmed at runtime, yet the AVX512 intrinsics inside the _avx512.cc files still require AVX512 flags to compile in the first place. The fallback should leave the variable empty, and the warning message should make clear that the _avx512.cc implementations will not be built:

Suggested change
function(setup_compiler_march_for_x86 VAR_NAME_SSE VAR_NAME_AVX2 VAR_NAME_AVX512)
#sse
set(${VAR_NAME_SSE} "-march=corei7" PARENT_SCOPE)
#avx 2
set(${VAR_NAME_AVX2} "-march=core-avx2" PARENT_SCOPE)
set(${VAR_NAME_AVX512} "" PARENT_SCOPE)
message(WARNING "No known avx512 microarchitecture flag found. AVX512 implementations will not be built.")

Comment on lines +140 to +159
#define FMA_INT4_ITER_AVX(ymm_lhs, ymm_rhs, ymm_sum_0, ymm_sum1, \
                          ymm_sum_norm1, ymm_sum_norm2) \
  { \
    __m256i ymm_lhs_0 = _mm256_shuffle_epi8( \
        INT4_LOOKUP_AVX, _mm256_and_si256((ymm_lhs), MASK_INT4_AVX)); \
    __m256i ymm_rhs_0 = _mm256_shuffle_epi8( \
        INT4_LOOKUP_AVX, _mm256_and_si256((ymm_rhs), MASK_INT4_AVX)); \
    __m256i ymm_lhs_1 = _mm256_shuffle_epi8( \
        INT4_LOOKUP_AVX, \
        _mm256_and_si256(_mm256_srli_epi32((ymm_lhs), 4), MASK_INT4_AVX)); \
    __m256i ymm_rhs_1 = _mm256_shuffle_epi8( \
        INT4_LOOKUP_AVX, \
        _mm256_and_si256(_mm256_srli_epi32((ymm_rhs), 4), MASK_INT4_AVX)); \
    FMA_INT8_AVX(ymm_lhs_0, ymm_rhs_0, ymm_sum_0); \
    FMA_INT8_AVX(ymm_lhs_1, ymm_rhs_1, ymm_sum_1); \
    FMA_INT8_AVX(ymm_lhs_0, ymm_lhs_0, ymm_sum_norm1); \
    FMA_INT8_AVX(ymm_lhs_1, ymm_lhs_1, ymm_sum_norm1); \
    FMA_INT8_AVX(ymm_rhs_0, ymm_rhs_0, ymm_sum_norm2); \
    FMA_INT8_AVX(ymm_rhs_1, ymm_rhs_1, ymm_sum_norm2); \
  }

FMA_INT4_ITER_AVX 4th parameter name mismatch — compile error

The macro's 4th parameter is declared as ymm_sum1 (no underscore before 1) on the definition line, but the body at line 154 references ymm_sum_1 (with underscore). These are different preprocessor tokens — the parameter will never be substituted, and the macro body will reference an undeclared identifier at every call site.

Suggested change
-#define FMA_INT4_ITER_AVX(ymm_lhs, ymm_rhs, ymm_sum_0, ymm_sum1, \
+#define FMA_INT4_ITER_AVX(ymm_lhs, ymm_rhs, ymm_sum_0, ymm_sum_1, \
                           ymm_sum_norm1, ymm_sum_norm2) \

const ValueType *q,
size_t dim,
float *out) {
#if defined(__ARM_NEON)

Benchmark this against the zvec-turbo style and adjust the implementation based on the results. This can be handled either in this PR or in a follow-up PR after this one is merged.

void SquaredEuclideanDistanceMatrix<float, 1, 1>::Compute(const ValueType *m,
                                                          const ValueType *q,
                                                          size_t dim,
                                                          float *out) {
  // Resolve the kernel once, then reuse the cached function pointer.
  static float (*impl_func_)(const float *, const float *, size_t);
  static std::once_flag once;
  std::call_once(once, [] {
    // set impl_func_
  });
  *out = impl_func_(m, q, dim);
}
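A minimal runnable variant of this cache-once idea is sketched below. It swaps std::call_once for a function-local static, whose initialization is thread-safe since C++11, and it selects only a scalar kernel; SelectKernel, SquaredEuclideanScalar, and the free function Compute are hypothetical names, and real code would pick a kernel from runtime CPU feature checks at that point.

```cpp
#include <cassert>
#include <cstddef>

using DistanceFn = float (*)(const float *, const float *, std::size_t);

// Portable reference kernel: squared Euclidean distance.
inline float SquaredEuclideanScalar(const float *m, const float *q,
                                    std::size_t dim) {
  float sum = 0.0f;
  for (std::size_t i = 0; i < dim; ++i) {
    const float d = m[i] - q[i];
    sum += d * d;
  }
  return sum;
}

// Hypothetical selection hook. A real implementation would test CPU
// features here and return an SSE/AVX2/AVX512/NEON kernel accordingly.
inline DistanceFn SelectKernel() { return &SquaredEuclideanScalar; }

inline void Compute(const float *m, const float *q, std::size_t dim,
                    float *out) {
  // Initialized exactly once, on first call, even under concurrent callers.
  static const DistanceFn impl = SelectKernel();
  *out = impl(m, q, dim);
}
```

After the first call, each invocation costs one indirect call through the cached pointer, which is the overhead the suggested benchmark would measure against compile-time dispatch.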

@egolearner egolearner left a comment

LGTM
