feat: enable icelake and l2 batch distance for int8 quantization.#213
feat: enable icelake and l2 batch distance for int8 quantization.#213luoxiaojian wants to merge 5 commits intoalibaba:mainfrom
Conversation
|
@greptile |
cmake/option.cmake
Outdated
| if(ENABLE_ICELAKE) | ||
| add_arch_flag("-march=icelake-server" ICELAKE ENABLE_ICELAKE) | ||
| endif() |
There was a problem hiding this comment.
ENABLE_ICELAKE add_arch_flag is out of architectural order
In the ARCH_OPTIONS list (line 36), ENABLE_ICELAKE is correctly positioned between ENABLE_SKYLAKE_AVX512 and ENABLE_SAPPHIRERAPIDS, reflecting its ISA capability level. However, in the manual-arch add_arch_flag section below, the IcelAke block is placed after ENABLE_SKYLAKE and before ENABLE_BROADWELL — both of which are significantly weaker microarchitectures.
While the independent if(ENABLE_*) blocks are functionally correct regardless of order, placing IcelAke after the weaker Skylake is inconsistent with the rest of the file and could mislead readers about the intended capability ordering. Consider moving the IcelAke block to sit directly after the ENABLE_SKYLAKE_AVX512 block:
if(ENABLE_SKYLAKE_AVX512)
add_arch_flag("-march=skylake-avx512" SKYLAKE_AVX512 ENABLE_SKYLAKE_AVX512)
endif()
if(ENABLE_ICELAKE)
add_arch_flag("-march=icelake-server" ICELAKE ENABLE_ICELAKE)
endif()
if(ENABLE_SKYLAKE)
add_arch_flag("-march=skylake" SKYLAKE ENABLE_SKYLAKE)
endif()Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
| if (type == IndexMeta::DataType::DT_INT8) { | ||
| extras[3] = squared_sum; | ||
| reinterpret_cast<int32_t *>(extras + 4)[0] = int8_sum; | ||
| } else { | ||
| reinterpret_cast<int *>(extras)[3] = int8_sum; | ||
| if (is_euclidean) { | ||
| extras[3] = squared_sum; | ||
| } else { | ||
| reinterpret_cast<int *>(extras)[3] = int8_sum; | ||
| } | ||
| } |
There was a problem hiding this comment.
Unconditional squared_sum write for all INT8 metrics is a silent format break
Previously, the extras[3] slot was either squared_sum (euclidean) or int8_sum (inner-product), controlled by the is_euclidean flag. The new code always writes squared_sum at extras[3] and int8_sum at extras[4] for all INT8 vectors, regardless of metric.
This is a backward-incompatible on-disk format change: any existing INT8 index (with 16 extra bytes per vector) will silently read wrong metric params with the new code (which now expects 20 bytes). Similarly, a cosine INT8 index migrates from 20 → 24 extra bytes.
There is no explicit version guard, migration helper, or assertion that will catch a mismatch at load time. If existing indexes are expected to continue working, a format-version check or a clear incompatibility error would prevent silent corruption of search results.
Greptile Summary
This PR enables IcelAke (AVX-512 VNNI) batch distance computation for INT8 quantization across all metrics, and extends the L2 (
SquaredEuclidean) metric to use the same VNNI-accelerated batch path that already existed forMinusInnerProductandCosineMinusInnerProduct.To support the VNNI kernel's sign-correction requirement (the kernel shifts query values by +128, so raw dot-products must be adjusted by
128 × Σmᵢ), the stored INT8 vector format is extended: all INT8 vectors now carry bothsquared_sumandint8_sumfields in their metadata tail, growing the extra dimension from 16 → 20 bytes (and cosine from 20 → 24). TheRecordQuantizer, reformers, converters, and all metric offset constants are updated consistently.Key changes and concerns:
ENABLE_ICELAKEis added andicelake-serveris inserted at the correct priority in the auto-detect list, but the manualadd_arch_flagblock is placed out of the expected architecture ordering (after Skylake instead of after Skylake-AVX512).BaseDistanceBatchWithScoreUnquantized::ComputeBatchnow correctly routesSquaredEuclidean<int8_t>to the specialized struct, which applies the VNNI correction term when IcelAke is active.MipsSquaredEuclideanlatent bug:MipsSquaredEuclideanDistanceBatchWithScoreUnquantized<int8_t>::ComputeBatchusesInnerProductDistanceBatch(VNNI-capable) but does not apply the128 × int8_sumcorrection. While this code is currently unreachable (the batch dispatch falls to the scalar_ComputeBatchpath), it would silently produce incorrect MIPS distances if a direct dispatch path were ever added.int8_sumfield is now stored for all INT8 metrics regardless of whether it is needed. Existing serialized indexes built with the old 16-byte tail format are silently incompatible — no version guard or load-time check prevents serving stale indexes with the new code.std::array→ C-style arrays: Cosmetic refactoring in all threeinner_product_distance_batch_impl_*.hheaders; no functional impact.Confidence Score: 2/5
MipsSquaredEuclideanDistanceBatchWithScoreUnquantized<int8_t>::ComputeBatchis structurally incorrect — it calls the VNNI-capable inner product kernel without applying the required 128×int8_sum bias correction. Even if currently dead code, it is a bug waiting to surface. More critically, the INT8 vector format change (16→20 extra bytes) breaks backward compatibility with existing stored indexes with no detection mechanism at load time, meaning previously built indexes will silently produce wrong search results if served by the updated binary.src/core/metric/quantized_integer_metric_batch.h(MipsSquaredEuclidean correction),src/core/quantizer/record_quantizer.h(format break),src/core/quantizer/integer_quantizer_reformer.cc(format version guard)Important Files Changed
ENABLE_ICELAKEoption andicelake-serverarch detection;add_arch_flagblock is placed after Skylake rather than after Skylake-AVX512, inconsistent with the rest of the file's ordering.GetQueryPreprocessFunc()dispatch for SquaredEuclidean INT8 (IcelAke VNNI preprocessing); always returns non-null (a no-op on non-IcelAke targets), consistent with the existing cosine pattern.Sequence Diagram
sequenceDiagram participant App participant Metric as QuantizedIntegerMetric participant Base as BaseDistanceBatch participant SE as SquaredEuclideanBatch<int8> participant IP as InnerProductDistanceBatch<int8> participant VNNI as IcelAke VNNI Kernel App->>Metric: query_preprocess_func() Metric-->>App: SE::QueryPreprocess (always non-null) App->>SE: QueryPreprocess(query, dim) SE->>IP: GetQueryPreprocessFunc() alt IcelAke available IP-->>SE: vnni_preprocess_fn SE->>VNNI: vnni_preprocess_fn(query, dim-20) Note over VNNI: shifts int8 → uint8 (+128) else Non-IcelAke IP-->>SE: nullptr (no-op) end App->>Metric: batch_distance() Metric-->>App: Base::ComputeBatch<SquaredEuclidean, int8, 12, 2> App->>Base: ComputeBatch(vecs, query, num, dim, out) Base->>SE: ComputeBatch(vecs, query, num, dim-20, out) SE->>IP: ComputeBatch (VNNI or scalar) IP-->>SE: raw dot-products in out[] alt IcelAke active (GetQueryPreprocessFunc != nullptr) SE->>SE: out[i] -= 128 * int8_sum (correction) end SE->>SE: Apply scale/bias/sum/sq_sum → L2 distance SE-->>App: corrected SquaredEuclidean distancesComments Outside Diff (2)
src/core/metric/quantized_integer_metric_batch.h, line 293-320 (link)MipsSquaredEuclideanbatch path missingint8_sumcorrectionMipsSquaredEuclideanDistanceBatchWithScoreUnquantized<int8_t>::ComputeBatchusesailego::DistanceBatch::InnerProductDistanceBatch<int8_t>— the same underlying VNNI-capable kernel used bySquaredEuclideanDistanceBatchWithScoreUnquantized<int8_t>. On IcelAke, this kernel computes the inner product treatingint8asuint8(offset-by-128 trick), which requires the post-computation correction:result -= 128 * int8_sum;SquaredEuclideanDistanceBatchWithScoreUnquantized<int8_t>correctly applies this correction (line ~222) whenImplType::GetQueryPreprocessFunc() != nullptr.MipsSquaredEuclideanDistanceBatchWithScoreUnquantized<int8_t>does not apply any such correction, even though it uses the same kernel. If invoked on IcelAke, the computed distances will be systematically wrong.Note: this path is currently unreachable through
BaseDistanceBatchWithScoreUnquantized(which dispatchesMipsSquaredEuclideanto_ComputeBatch), so there is no immediate runtime regression. However, the struct should either be fixed for consistency or clearly annotated as intentionally incomplete/unused.src/core/metric/quantized_integer_metric_batch.h, line 293-320 (link)Missing
int8_sumcorrection inMipsSquaredEuclideanbatch pathMipsSquaredEuclideanDistanceBatchWithScoreUnquantized<int8_t>::ComputeBatchcallsailego::DistanceBatch::InnerProductDistanceBatch<int8_t>(the same VNNI-capable kernel used bySquaredEuclidean), but unlike its sibling, it never applies the128 * int8_sumcorrection.When the IcelAke VNNI kernel is active,
InnerProductDistanceBatchprocesses the query as unsigned (after preprocessing adds 128 to each element). The raw inner-product result is thereforeΣ(qᵢ + 128) × mᵢ = Σqᵢmᵢ + 128 × int8_sum. Without the subtraction of128 * int8_sum, the*resultsvalue fed into the final Euclidean formula is wrong, and the computed distance will be incorrect.SquaredEuclideanDistanceBatchWithScoreUnquantized<int8_t>::ComputeBatchapplies this correction correctly:The same guard + subtraction is missing from
MipsSquaredEuclideanDistanceBatchWithScoreUnquantized<int8_t>::ComputeBatch. While the current dispatch path falls to_ComputeBatch(the scalar/matrix path) forMipsSquaredEuclideanand this struct is not reachable today, it is a latent bug: the struct'sComputeBatchcalls the VNNI-capable kernel and would produce silently wrong distances if a dispatch path is ever added.Last reviewed commit: 81ee36a