feat: add hnsw-rabitq support #69

Open
egolearner wants to merge 41 commits into main from feat/rabitq

@egolearner
Collaborator

resolve #42

@egolearner
Collaborator Author

Depends on VectorDB-NTU/RaBitQ-Library#36

@egolearner marked this pull request as ready for review February 10, 2026 13:31
@Cuiyus
Collaborator

Cuiyus commented Feb 27, 2026

@greptile

@greptile-apps

greptile-apps bot commented Feb 27, 2026

Greptile Summary

This PR adds comprehensive HNSW-RaBitQ support to the zvec project, implementing a vector search index that combines Hierarchical Navigable Small World (HNSW) graph structure with RaBitQ quantization for memory-efficient approximate nearest neighbor search.

Major changes:

  • New HNSW-RaBitQ algorithm implementation with builder, searcher, and streamer components
  • RaBitQ vector quantization with configurable bits (default 7-bit) and k-means clustering
  • Integration with existing zvec framework including Python bindings and protocol buffers
  • Comprehensive test coverage at both C++ and Python levels
  • Added RaBitQ-Library as git submodule dependency

Issues found:

  • Variable-length arrays (VLAs) used in algorithm files are not standard C++ and will cause compilation failures on MSVC
  • Typo in .gitmodules with duplicate "thirdparty/" prefix in submodule name

Confidence Score: 3/5

  • This PR requires fixes for compilation issues before merging.
  • The implementation is comprehensive with good test coverage, but it uses VLAs, a non-standard extension that will prevent compilation on MSVC, and it has a configuration error in .gitmodules. Once these are fixed, the code appears well-structured.
  • Pay close attention to src/core/algorithm/hnsw-rabitq/hnsw_rabitq_algorithm.cc and hnsw_rabitq_query_algorithm.cc, which have VLA issues, and to .gitmodules, which has the submodule name typo.

Important Files Changed

| Filename | Overview |
|---|---|
| .gitmodules | Added RaBitQ-Library submodule with typo in name (duplicate thirdparty/) |
| src/core/algorithm/hnsw-rabitq/hnsw_rabitq_algorithm.cc | Implements HNSW graph algorithm with node insertion and search; contains multiple VLAs (non-standard C++) |
| src/core/algorithm/hnsw-rabitq/hnsw_rabitq_query_algorithm.cc | Query algorithm implementation with VLA that needs replacement with std::vector |
| src/core/algorithm/hnsw-rabitq/hnsw_rabitq_builder.cc | Builder implementation with proper parameter validation and initialization |
| src/core/algorithm/hnsw-rabitq/rabitq_converter.cc | RaBitQ converter with k-means clustering for vector quantization training |
| src/include/zvec/db/index_params.h | Added HnswRabitqIndexParams class with proper integration into type system |
| python/tests/test_collection_hnsw_rabitq.py | Comprehensive Python tests for HNSW-RaBitQ collection operations |
| src/core/algorithm/hnsw-rabitq/hnsw_rabitq_searcher.cc | Search implementation with proper parameter validation and reformer initialization |

Class Diagram

%%{init: {'theme': 'neutral'}}%%
classDiagram
    class HnswRabitqIndex {
        +build()
        +search()
        +stream()
    }
    
    class HnswRabitqBuilder {
        +init()
        +train()
        +build()
        -rabitq_converter_
    }
    
    class HnswRabitqSearcher {
        +init()
        +search()
        -entity_
        -reformer_
    }
    
    class HnswRabitqStreamer {
        +init()
        +add()
        +search()
        -entity_
        -reformer_
    }
    
    class HnswRabitqAlgorithm {
        +add_node()
        +search()
        -entity_
    }
    
    class HnswRabitqEntity {
        +get_neighbors()
        +update_neighbors()
        -graph_structure_
    }
    
    class RabitqConverter {
        +train()
        +transform()
        +to_reformer()
        -rotator_
        -centroids_
    }
    
    class RabitqReformer {
        +reform()
        +get_quantized()
        -rotator_
        -centroids_
    }
    
    class HnswRabitqContext {
        +dist_calculator()
        +visit_filter()
    }
    
    HnswRabitqIndex --> HnswRabitqBuilder
    HnswRabitqIndex --> HnswRabitqSearcher
    HnswRabitqIndex --> HnswRabitqStreamer
    
    HnswRabitqBuilder --> RabitqConverter
    HnswRabitqBuilder --> HnswRabitqAlgorithm
    HnswRabitqBuilder --> HnswRabitqEntity
    
    HnswRabitqSearcher --> HnswRabitqAlgorithm
    HnswRabitqSearcher --> HnswRabitqEntity
    HnswRabitqSearcher --> RabitqReformer
    
    HnswRabitqStreamer --> HnswRabitqAlgorithm
    HnswRabitqStreamer --> HnswRabitqEntity
    HnswRabitqStreamer --> RabitqReformer
    
    HnswRabitqAlgorithm --> HnswRabitqEntity
    HnswRabitqAlgorithm --> HnswRabitqContext
    
    RabitqConverter --> RabitqReformer

Last reviewed commit: 42a30d6

@greptile-apps bot left a comment

93 files reviewed, 5 comments

.gitmodules Outdated
path = thirdparty/magic_enum/magic_enum-0.9.7
url = https://github.com/Neargye/magic_enum.git
ignore = all
[submodule "thirdparty/thirdparty/RaBitQ-Library/RaBitQ-Library-0.1"]

Submodule name has duplicate thirdparty/ prefix

Suggested change
- [submodule "thirdparty/thirdparty/RaBitQ-Library/RaBitQ-Library-0.1"]
+ [submodule "thirdparty/RaBitQ-Library/RaBitQ-Library-0.1"]

Collaborator

please check and handle this.

You're right, the submodule name has a duplicate thirdparty/ prefix. The path is correct but the submodule name itself is wrong. Here's the fix:

Suggested change
- [submodule "thirdparty/thirdparty/RaBitQ-Library/RaBitQ-Library-0.1"]
+ [submodule "thirdparty/RaBitQ-Library/RaBitQ-Library-0.1"]

This will match the pattern used by all other submodules in the file and correctly remove the duplicate prefix.

Comment on lines +139 to +140
float dists[size];
const void *neighbor_vecs[size];

Variable-length arrays (VLAs) are not standard C++ and won't compile with MSVC. Use std::vector instead

Suggested change
- float dists[size];
- const void *neighbor_vecs[size];
+ std::vector<float> dists(size);
+ std::vector<const void*> neighbor_vecs(size);

Collaborator

please check and handle this.

Based on the provided context, the issue is clear. Lines 139-140 use variable-length arrays (VLAs) which aren't standard C++ and won't compile on MSVC. Here's the fix:

Suggested change
- float dists[size];
- const void *neighbor_vecs[size];
+ std::vector<float> dists(size);
+ std::vector<const void*> neighbor_vecs(size);

This replaces the VLAs with std::vector, making the code portable across all C++ compilers including MSVC.

Collaborator Author

done

(*ctx->mutable_stats_get_neighbors())++;
}

node_id_t neighbor_ids[neighbors.size()];

VLA not standard C++, won't compile with MSVC

Suggested change
- node_id_t neighbor_ids[neighbors.size()];
+ std::vector<node_id_t> neighbor_ids(neighbors.size());

Collaborator

please check and handle this.

You're right, this is a VLA which isn't standard C++. The fix is to use std::vector:

Suggested change
- node_id_t neighbor_ids[neighbors.size()];
+ std::vector<node_id_t> neighbor_ids(neighbors.size());

This will compile on MSVC and all standard C++ compilers while maintaining the same functionality.

Collaborator Author

done

Comment on lines +256 to +257
float dists[size];
const void *neighbor_vecs[size];

VLAs not standard C++, won't compile with MSVC

Suggested change
- float dists[size];
- const void *neighbor_vecs[size];
+ std::vector<float> dists(size);
+ std::vector<const void*> neighbor_vecs(size);

Collaborator

please check and handle this.

You're right, those are VLAs which won't compile on MSVC. Replace them with std::vector:

Suggested change
- float dists[size];
- const void *neighbor_vecs[size];
+ std::vector<float> dists(size);
+ std::vector<const void*> neighbor_vecs(size);

This is standard C++ and will work across all compilers. You'll also need to update the references to use .data() when passing to batch_dist:

dc.batch_dist(neighbor_vecs.data(), size, dists.data());
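
A minimal, self-contained sketch of the same pattern follows. The `batch_dist` function here is a hypothetical stand-in, not the project's distance calculator: it treats every "vector" as a single float so the example stays short. The point is only that runtime-sized `std::vector` buffers replace the VLAs and that `.data()` supplies the raw pointers a C-style batch API expects.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a pointer-based batch distance routine.
// Each "vector" is a single float; the distance is |value - query|.
static void batch_dist(const void *const *vecs, std::size_t n, float query,
                       float *out) {
  for (std::size_t i = 0; i < n; ++i) {
    const float v = *static_cast<const float *>(vecs[i]);
    out[i] = v > query ? v - query : query - v;
  }
}

// Computes the distance from `query` to every point using runtime-sized,
// heap-backed buffers instead of `float dists[size]` style VLAs.
std::vector<float> neighbor_distances(const std::vector<float> &points,
                                      float query) {
  const std::size_t size = points.size();
  std::vector<float> dists(size);
  std::vector<const void *> neighbor_vecs(size);
  for (std::size_t i = 0; i < size; ++i) {
    neighbor_vecs[i] = &points[i];
  }
  // .data() exposes the contiguous storage as a raw pointer.
  batch_dist(neighbor_vecs.data(), size, query, dists.data());
  return dists;
}
```

One trade-off worth noting: `std::vector` allocates on the heap, so on a hot search path it may be worth reusing a buffer held in the per-query context rather than constructing a new vector on every call.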

Collaborator Author

done

(*ctx->mutable_stats_get_neighbors())++;
}

node_id_t neighbor_ids[neighbors.size()];

VLA not standard C++, won't compile with MSVC

Suggested change
- node_id_t neighbor_ids[neighbors.size()];
+ std::vector<node_id_t> neighbor_ids(neighbors.size());

Collaborator

need to pay attention to these VLA issues.

Collaborator Author

done

@@ -0,0 +1,4 @@
add_library(rabitqlib INTERFACE)
target_include_directories(
sparsehash INTERFACE "${CMAKE_CURRENT_SOURCE_DIR}/RaBitQ-Library-0.1/include"

Collaborator

sparsehash -> rabitqlib?

Collaborator

@richyreachy Mar 10, 2026

this class can be moved to quantizer folder with rabitq_reformer


void HnswRabitqQueryAlgorithm::expand_neighbors_by_group(
TopkHeap &topk, HnswRabitqContext *ctx) const {
// if (!ctx->group_by().is_valid()) {

Collaborator

@richyreachy Mar 11, 2026

Should this return with a message when group-by is set up?

Collaborator Author

Implemented group-by search; will test it once the framework supports group by.

continue;
}
} else {
// Candidate cand{ResultRecord(candest.est_dist, candest.low_dist),

Collaborator

no need to check ex bits here? can it be checked earlier?

Collaborator Author

It's necessary to check ex_bits here.

  1. First, get_bin_est uses a single bit to estimate the score.
  2. Then check whether ex_bits > 0:
    1. Yes: check whether a full estimate with the extra bits is needed.
    2. No: nothing more to do, since all the information has already been used.
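
The control flow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: `Estimate`, `two_stage_estimate`, and the two callbacks are hypothetical names, and the rule for deciding when a full estimate is needed (refine only if the lower bound can still beat the current threshold) is an assumption based on the `est_dist` / `low_dist` fields that appear in the diff.

```cpp
#include <functional>

// Hypothetical result type: an estimated distance plus a lower bound on it.
struct Estimate {
  float est_dist;
  float low_dist;
};

// Returns true if the candidate should be kept. `bin_estimate` stands in for
// the single-bit estimate and `full_estimate` for the estimate that also uses
// the extra bits. `threshold` is the distance a candidate must beat.
inline bool two_stage_estimate(unsigned ex_bits, float threshold,
                               const std::function<Estimate()> &bin_estimate,
                               const std::function<Estimate()> &full_estimate,
                               Estimate *out) {
  // Step 1: the cheap single-bit estimate is always computed.
  *out = bin_estimate();
  if (ex_bits == 0) {
    // No extra bits exist, so this estimate already uses all the information.
    return out->est_dist < threshold;
  }
  // Step 2 (assumed rule): skip the expensive estimate when even the lower
  // bound cannot beat the threshold.
  if (out->low_dist >= threshold) {
    return false;
  }
  *out = full_estimate();
  return out->est_dist < threshold;
}
```

Under this reading, the ex_bits check cannot be hoisted out of the candidate loop entirely, because the decision to refine depends on each candidate's own single-bit estimate.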


// TODO: check loop type

// if ((!topk.full()) || cur_dist < topk[0].second) {

Collaborator

Legacy code here and elsewhere can be cleaned up.

Collaborator Author

done

query_wrapper.set_g_add(q_to_centroids[cluster_id]);
float est_dist = rabitqlib::split_distance_boosting(
ex_data, ip_func_, query_wrapper, padded_dim_, ex_bits_, res.ip_x0_qr);
float low_dist = est_dist - (res.est_dist - res.low_dist) / (1 << ex_bits_);

Collaborator

Can precompute the reciprocal to turn the division into a multiplication:
double inv_divisor = 1.0 / (1 << ex_bits_);
auto result = (res.est_dist - res.low_dist) * inv_divisor;
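
As a small illustration of why this rewrite is safe, here is a sketch with hypothetical helper names (`shrink_by_division`, `make_inv_divisor`, `shrink_by_reciprocal`). Because the divisor is a power of two, its reciprocal is exactly representable in binary floating point, so under IEEE 754 arithmetic the multiplication should give the same result as the division. The sketch assumes ex_bits is small enough that `1u << ex_bits` does not overflow.

```cpp
// Original form: divide the gap by 2^ex_bits on every call.
inline float shrink_by_division(float gap, unsigned ex_bits) {
  return gap / static_cast<float>(1u << ex_bits);
}

// Computed once, for example when ex_bits is configured.
// Assumes ex_bits < 32 so the shift is well defined.
inline float make_inv_divisor(unsigned ex_bits) {
  return 1.0f / static_cast<float>(1u << ex_bits);
}

// Suggested form: multiply by the precomputed reciprocal.
inline float shrink_by_reciprocal(float gap, float inv_divisor) {
  return gap * inv_divisor;
}
```

For divisors that are not powers of two, the reciprocal is generally rounded, and the two forms can differ in the last bit.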

Collaborator Author

get_ex_est is not used, deleted.

Development

Successfully merging this pull request may close these issues.

[Feature]: Integrate RaBitQ Quantization into HNSW Index

6 participants