Skip to content

feat: enable icelake and l2 batch distance for int8 quantization.#213

Open
luoxiaojian wants to merge 5 commits intoalibaba:mainfrom
luoxiaojian:ab
Open

feat: enable icelake and l2 batch distance for int8 quantization.#213
luoxiaojian wants to merge 5 commits intoalibaba:mainfrom
luoxiaojian:ab

Conversation

@luoxiaojian
Copy link
Collaborator

@luoxiaojian luoxiaojian commented Mar 11, 2026

Greptile Summary

This PR enables IcelAke (AVX-512 VNNI) batch distance computation for INT8 quantization across all metrics, and extends the L2 (SquaredEuclidean) metric to use the same VNNI-accelerated batch path that already existed for MinusInnerProduct and CosineMinusInnerProduct.

To support the VNNI kernel's sign-correction requirement (the kernel shifts query values by +128, so raw dot-products must be adjusted by 128 × Σmᵢ), the stored INT8 vector format is extended: all INT8 vectors now carry both squared_sum and int8_sum fields in their metadata tail, growing the extra dimension from 16 → 20 bytes (and cosine from 20 → 24). The RecordQuantizer, reformers, converters, and all metric offset constants are updated consistently.

Key changes and concerns:

  • New IcelAke CMake option: ENABLE_ICELAKE is added and icelake-server is inserted at the correct priority in the auto-detect list, but the manual add_arch_flag block is placed out of the expected architecture ordering (after Skylake instead of after Skylake-AVX512).
  • L2 batch dispatch added: BaseDistanceBatchWithScoreUnquantized::ComputeBatch now correctly routes SquaredEuclidean<int8_t> to the specialized struct, which applies the VNNI correction term when IcelAke is active.
  • MipsSquaredEuclidean latent bug: MipsSquaredEuclideanDistanceBatchWithScoreUnquantized<int8_t>::ComputeBatch uses InnerProductDistanceBatch (VNNI-capable) but does not apply the 128 × int8_sum correction. While this code is currently unreachable (the batch dispatch falls to the scalar _ComputeBatch path), it would silently produce incorrect MIPS distances if a direct dispatch path were ever added.
  • Breaking format change: The int8_sum field is now stored for all INT8 metrics regardless of whether it is needed. Existing serialized indexes built with the old 16-byte tail format are silently incompatible — no version guard or load-time check prevents serving stale indexes with the new code.
  • std::array → C-style arrays: Cosmetic refactoring in all three inner_product_distance_batch_impl_*.h headers; no functional impact.

Confidence Score: 2/5

  • This PR is risky to merge without addressing the latent MIPS correction bug and the silent on-disk format incompatibility.
  • The core L2 VNNI path (SquaredEuclidean) is correctly implemented. However, MipsSquaredEuclideanDistanceBatchWithScoreUnquantized<int8_t>::ComputeBatch is structurally incorrect — it calls the VNNI-capable inner product kernel without applying the required 128×int8_sum bias correction. Even if currently dead code, it is a bug waiting to surface. More critically, the INT8 vector format change (16→20 extra bytes) breaks backward compatibility with existing stored indexes with no detection mechanism at load time, meaning previously built indexes will silently produce wrong search results if served by the updated binary.
  • src/core/metric/quantized_integer_metric_batch.h (MipsSquaredEuclidean correction), src/core/quantizer/record_quantizer.h (format break), src/core/quantizer/integer_quantizer_reformer.cc (format version guard)

Important Files Changed

Filename Overview
cmake/option.cmake Adds ENABLE_ICELAKE option and icelake-server arch detection; add_arch_flag block is placed after Skylake rather than after Skylake-AVX512, inconsistent with the rest of the file's ordering.
src/core/metric/quantized_integer_metric.cc Adds GetQueryPreprocessFunc() dispatch for SquaredEuclidean INT8 (IcelAke VNNI preprocessing); always returns non-null (a no-op on non-IcelAke targets), consistent with the existing cosine pattern.
src/core/metric/quantized_integer_metric_batch.h Core changes: adds SquaredEuclidean dispatch in BaseDistanceBatch, updates dim offsets (16→20 for L2/MIPS, 20→24 for Cosine), adds int8_sum VNNI correction for SquaredEuclidean, fixes uint8 signature bug. Critical latent bug: MipsSquaredEuclideanDistanceBatchWithScoreUnquantized::ComputeBatch uses VNNI-capable InnerProductDistanceBatch without applying the 128*int8_sum correction.
src/core/metric/quantized_integer_metric_matrix.h Updates dim offsets consistently: SquaredEuclidean/MinusInnerProduct/MipsSquaredEuclidean INT8 from dim-16 to dim-20; CosineMinusInnerProduct from dim-20 to dim-24. Matrix path does not read int8_sum (correct, since InnerProductMatrix is non-VNNI).
src/core/quantizer/cosine_converter.cc Updates INT8 cosine ExtraDimension from 20 to 24 in both CosineConverterHolder and CosineConverter (consistently in both places). The converter appends 4-byte norm after RecordQuantizer's 20-byte layout, totalling 24 bytes.
src/core/quantizer/cosine_reformer.cc Updates INT8 cosine ExtraDimension from 20 to 24 consistently with cosine_converter.cc.
src/core/quantizer/integer_quantizer_converter.cc Updates ExtraDimension for INT8 from 4sizeof(float)=16 to 4sizeof(float)+sizeof(int32_t)=20, consistent with the new record layout storing both squared_sum and int8_sum.
src/core/quantizer/integer_quantizer_reformer.cc Updates extra_dimension_ for INT8 from 16 to 20 in IntegerStreamingReformer constructor, matching the new format.
src/core/quantizer/record_quantizer.h Now unconditionally writes squared_sum (extras[3]) and int8_sum (extras+4 as int32) for all INT8 vectors regardless of metric. This is a backward-incompatible format change (extra bytes grow 16→20) with no version guard or migration path for existing stored indexes.

Sequence Diagram

sequenceDiagram
    participant App
    participant Metric as QuantizedIntegerMetric
    participant Base as BaseDistanceBatch
    participant SE as SquaredEuclideanBatch<int8>
    participant IP as InnerProductDistanceBatch<int8>
    participant VNNI as IcelAke VNNI Kernel

    App->>Metric: query_preprocess_func()
    Metric-->>App: SE::QueryPreprocess (always non-null)

    App->>SE: QueryPreprocess(query, dim)
    SE->>IP: GetQueryPreprocessFunc()
    alt IcelAke available
        IP-->>SE: vnni_preprocess_fn
        SE->>VNNI: vnni_preprocess_fn(query, dim-20)
        Note over VNNI: shifts int8 → uint8 (+128)
    else Non-IcelAke
        IP-->>SE: nullptr (no-op)
    end

    App->>Metric: batch_distance()
    Metric-->>App: Base::ComputeBatch<SquaredEuclidean, int8, 12, 2>

    App->>Base: ComputeBatch(vecs, query, num, dim, out)
    Base->>SE: ComputeBatch(vecs, query, num, dim-20, out)
    SE->>IP: ComputeBatch (VNNI or scalar)
    IP-->>SE: raw dot-products in out[]

    alt IcelAke active (GetQueryPreprocessFunc != nullptr)
        SE->>SE: out[i] -= 128 * int8_sum (correction)
    end

    SE->>SE: Apply scale/bias/sum/sq_sum → L2 distance
    SE-->>App: corrected SquaredEuclidean distances
Loading

Comments Outside Diff (2)

  1. src/core/metric/quantized_integer_metric_batch.h, line 293-320 (link)

    MipsSquaredEuclidean batch path missing int8_sum correction

    MipsSquaredEuclideanDistanceBatchWithScoreUnquantized<int8_t>::ComputeBatch uses ailego::DistanceBatch::InnerProductDistanceBatch<int8_t> — the same underlying VNNI-capable kernel used by SquaredEuclideanDistanceBatchWithScoreUnquantized<int8_t>. On IcelAke, this kernel computes the inner product treating int8 as uint8 (offset-by-128 trick), which requires the post-computation correction:

    result -= 128 * int8_sum;

    SquaredEuclideanDistanceBatchWithScoreUnquantized<int8_t> correctly applies this correction (line ~222) when ImplType::GetQueryPreprocessFunc() != nullptr. MipsSquaredEuclideanDistanceBatchWithScoreUnquantized<int8_t> does not apply any such correction, even though it uses the same kernel. If invoked on IcelAke, the computed distances will be systematically wrong.

    Note: this path is currently unreachable through BaseDistanceBatchWithScoreUnquantized (which dispatches MipsSquaredEuclidean to _ComputeBatch), so there is no immediate runtime regression. However, the struct should either be fixed for consistency or clearly annotated as intentionally incomplete/unused.

  2. src/core/metric/quantized_integer_metric_batch.h, line 293-320 (link)

    Missing int8_sum correction in MipsSquaredEuclidean batch path

    MipsSquaredEuclideanDistanceBatchWithScoreUnquantized<int8_t>::ComputeBatch calls ailego::DistanceBatch::InnerProductDistanceBatch<int8_t> (the same VNNI-capable kernel used by SquaredEuclidean), but unlike its sibling, it never applies the 128 * int8_sum correction.

    When the IcelAke VNNI kernel is active, InnerProductDistanceBatch processes the query as unsigned (after preprocessing adds 128 to each element). The raw inner-product result is therefore Σ(qᵢ + 128) × mᵢ = Σqᵢmᵢ + 128 × int8_sum. Without the subtraction of 128 * int8_sum, the *results value fed into the final Euclidean formula is wrong, and the computed distance will be incorrect.

    SquaredEuclideanDistanceBatchWithScoreUnquantized<int8_t>::ComputeBatch applies this correction correctly:

    if (ImplType::GetQueryPreprocessFunc() != nullptr) {
        int int8_sum = reinterpret_cast<const int *>(m_tail)[4];
        result -= 128 * int8_sum;
    }

    The same guard + subtraction is missing from MipsSquaredEuclideanDistanceBatchWithScoreUnquantized<int8_t>::ComputeBatch. While the current dispatch path falls to _ComputeBatch (the scalar/matrix path) for MipsSquaredEuclidean and this struct is not reachable today, it is a latent bug: the struct's ComputeBatch calls the VNNI-capable kernel and would produce silently wrong distances if a dispatch path is ever added.

Last reviewed commit: 81ee36a

Greptile also left 2 inline comments on this PR.

@luoxiaojian
Copy link
Collaborator Author

@greptile

Comment on lines +166 to +168
if(ENABLE_ICELAKE)
add_arch_flag("-march=icelake-server" ICELAKE ENABLE_ICELAKE)
endif()
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ENABLE_ICELAKE add_arch_flag is out of architectural order

In the ARCH_OPTIONS list (line 36), ENABLE_ICELAKE is correctly positioned between ENABLE_SKYLAKE_AVX512 and ENABLE_SAPPHIRERAPIDS, reflecting its ISA capability level. However, in the manual-arch add_arch_flag section below, the IcelAke block is placed after ENABLE_SKYLAKE and before ENABLE_BROADWELL — both of which are significantly weaker microarchitectures.

While the independent if(ENABLE_*) blocks are functionally correct regardless of order, placing IcelAke after the weaker Skylake is inconsistent with the rest of the file and could mislead readers about the intended capability ordering. Consider moving the IcelAke block to sit directly after the ENABLE_SKYLAKE_AVX512 block:

if(ENABLE_SKYLAKE_AVX512)
  add_arch_flag("-march=skylake-avx512" SKYLAKE_AVX512 ENABLE_SKYLAKE_AVX512)
endif()

if(ENABLE_ICELAKE)
  add_arch_flag("-march=icelake-server" ICELAKE ENABLE_ICELAKE)
endif()

if(ENABLE_SKYLAKE)
  add_arch_flag("-march=skylake" SKYLAKE ENABLE_SKYLAKE)
endif()

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Comment on lines +78 to 87
if (type == IndexMeta::DataType::DT_INT8) {
extras[3] = squared_sum;
reinterpret_cast<int32_t *>(extras + 4)[0] = int8_sum;
} else {
reinterpret_cast<int *>(extras)[3] = int8_sum;
if (is_euclidean) {
extras[3] = squared_sum;
} else {
reinterpret_cast<int *>(extras)[3] = int8_sum;
}
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unconditional squared_sum write for all INT8 metrics is a silent format break

Previously, the extras[3] slot was either squared_sum (euclidean) or int8_sum (inner-product), controlled by the is_euclidean flag. The new code always writes squared_sum at extras[3] and int8_sum at extras[4] for all INT8 vectors, regardless of metric.

This is a backward-incompatible on-disk format change: any existing INT8 index (with 16 extra bytes per vector) will silently read wrong metric params with the new code (which now expects 20 bytes). Similarly, a cosine INT8 index migrates from 20 → 24 extra bytes.

There is no explicit version guard, migration helper, or assertion that will catch a mismatch at load time. If existing indexes are expected to continue working, a format-version check or a clear incompatibility error would prevent silent corruption of search results.

@luoxiaojian luoxiaojian requested a review from richyreachy March 11, 2026 10:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant