Skip to content

enhance: rm useless hnsw module [WIP]#211

Open
iaojnh wants to merge 8 commits intomainfrom
test/rm-hnsw-entity
Open

enhance: rm useless hnsw module [WIP]#211
iaojnh wants to merge 8 commits intomainfrom
test/rm-hnsw-entity

Conversation

@iaojnh
Copy link
Collaborator

@iaojnh iaojnh commented Mar 10, 2026

Greptile Summary

This PR consolidates the HNSW module by removing the separate HnswBuilder, HnswSearcher, and their corresponding entity classes (HnswBuilderEntity, HnswSearcherEntity) and replacing them with a single unified HnswStreamerEntityNew class. All dependent classes (HnswAlgorithm, HnswContext, HnswDistCalculator, HnswIndexProvider, HnswStreamer) are updated to use the new entity type.

Key changes and concerns:

  • Latent heap buffer overflow in update_neighbors (hnsw_streamer_entity_new.cc:142): The buffer is always allocated using the level-0 neighbor size (neighbor_size_), but the function is called for all levels. If upper_neighbor_cnt is ever configured to exceed l0_neighbor_cnt, writing neighbors.size() entries into the under-sized buffer would overflow the heap.
  • Silent success on append in read-only storage (file_read_storage.cc:291): Changing append from returning IndexError_NotImplemented to 0 can mask accidental write operations against a read-only storage. A comment explaining the intentional no-op is needed at minimum.
  • Type mismatch on kUpperHashMemoryInflateRatio (hnsw_streamer_entity_new.h:706): Declared as constexpr uint64_t but initialised with the float literal 2.0f — a narrowing conversion that should be corrected to constexpr float.
  • Debug code in production header (hnsw_streamer_entity_new.h:396): print_key_map() writes to std::cout and pulls in <iostream> — should use the project's LOG_* macros or be removed.
  • Significant test coverage loss: hnsw_searcher_test.cpp (2775 lines, covering RNN search, cosine, brute-force, group search, etc.) and hnsw_builder_test.cc (543 lines) are entirely deleted with no replacement tests added in this PR. Given the PR is marked WIP, this may be addressed later, but it is worth tracking.

Confidence Score: 2/5

  • Not safe to merge — the PR is marked WIP, contains a latent heap buffer overflow, a type mismatch on a constant, debug code in a production header, and removes over 3000 lines of tests with no replacements.
  • The architectural consolidation is well-motivated, but the implementation has a real latent heap-buffer-overflow risk in update_neighbors, a silent error-suppression change in FileReadStorage::append, a mistyped constant, and leftover debug I/O. The elimination of the builder and searcher test suites without replacement leaves the new unified entity without adequate regression coverage.
  • src/core/algorithm/hnsw/hnsw_streamer_entity_new.cc (buffer sizing bug), src/core/utility/file_read_storage.cc (silent append), and src/core/algorithm/hnsw/hnsw_streamer_entity_new.h (type mismatch + debug code).

Important Files Changed

Filename Overview
src/core/algorithm/hnsw/hnsw_streamer_entity_new.h New consolidated entity header replacing the old HnswEntity hierarchy. Contains a type mismatch (constexpr uint64_t assigned 2.0f) and a debug print_key_map() function using std::cout with a stray <iostream> include that should not be in production code.
src/core/algorithm/hnsw/hnsw_streamer_entity_new.cc Implementation of the new unified entity. The update_neighbors function allocates a buffer sized for level-0 (neighbor_size_) regardless of level, which is a latent heap-buffer-overflow if upper neighbor count is ever configured larger than l0 neighbor count.
src/core/utility/file_read_storage.cc append() changed from returning IndexError_NotImplemented to 0, silently succeeding on a read-only storage and potentially masking callers that should not be writing to it.
tests/core/algorithm/hnsw/hnsw_builder_test.cc 543-line builder test file entirely deleted. Test coverage for the old HnswBuilder is removed; it is unclear whether equivalent coverage is provided elsewhere.
tests/core/algorithm/hnsw/hnsw_searcher_test.cpp 2775-line searcher test file entirely deleted. Large regression in test coverage for search scenarios (RNN, cosine, brute-force, group search, etc.); no replacement tests are added in this PR.

Sequence Diagram

sequenceDiagram
    participant S as HnswStreamer
    participant E as HnswStreamerEntityNew
    participant A as HnswAlgorithm
    participant B as ChunkBroker
    participant STG as IndexStorage

    Note over S,STG: Before PR: HnswBuilder/HnswSearcher/HnswStreamer each held their own entity

    S->>E: init(max_doc_cnt)
    E->>B: make_shared<ChunkBroker>(stats_)
    S->>E: open(storage, max_index_size, check_crc)
    E->>B: open(storage, max_index_size_, chunk_size_, check_crc)
    B->>STG: open chunks

    S->>A: HnswAlgorithm(entity_) [was HnswEntity, now HnswStreamerEntityNew]
    A->>E: add_vector(level, key, vec, &id)
    E->>B: alloc_chunk(NODE, idx, chunk_size)
    A->>E: update_neighbors(level, id, neighbors)
    E->>B: get_neighbor_chunk_loc(level, id)

    S->>E: flush(checkpoint)
    E->>B: flush(checkpoint)
    B->>STG: flush to disk

    S->>E: dump(dumper)
    E->>E: dump_segments(dumper, keys, get_level)
    E->>E: dump_header / dump_vectors / dump_graph_neighbors / dump_upper_neighbors

    S->>E: clone()
    E-->>S: HnswStreamerEntityNew::Pointer (shared chunks, isolated search state)
Loading

Last reviewed commit: 7a62567

Greptile also left 4 inline comments on this PR.

Comment on lines +142 to +158
std::vector<char> buffer(neighbor_size_);
NeighborsHeader *hd = reinterpret_cast<NeighborsHeader *>(buffer.data());
hd->neighbor_cnt = neighbors.size();
size_t i = 0;
for (; i < neighbors.size(); ++i) {
hd->neighbors[i] = neighbors[i].first;
}

auto loc = get_neighbor_chunk_loc(level, id);
size_t size = reinterpret_cast<char *>(&hd->neighbors[i]) - &buffer[0];
size_t ret = loc.first->write(loc.second, hd, size);
if (ailego_unlikely(ret != size)) {
LOG_ERROR("Write neighbor header failed, ret=%zu", ret);

return IndexError_Runtime;
}

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Buffer undersized for upper-level neighbors

update_neighbors always allocates a buffer of neighbor_size_ (the level-0 size, computed from l0_neighbor_cnt()). When called with level > 0, the write uses this same buffer, but the relevant upper-level capacity is upper_neighbor_cnt(). By default l0_neighbor_cnt = 100 and upper_neighbor_cnt = 50, so the buffer happens to be large enough in practice — however, if a user ever configures upper_max_neighbor_cnt > l0_max_neighbor_cnt (or the multiplier kDefaultL0MaxNeighborCntMultiplier is set to a value less than 1), the buffer will be too small, resulting in a heap buffer overflow when iterating over neighbors.

The buffer should be sized by the maximum of the two neighbor sizes, or selected dynamically based on the level argument:

std::vector<char> buffer(level == 0 ? neighbor_size_ : upper_neighbor_size_);

Comment on lines 291 to 293
int append(const std::string & /*id*/, size_t /*size*/) override {
return IndexError_NotImplemented;
return 0;
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Silent success on append in read-only storage

FileReadStorage is a read-only storage implementation. Previously append returned IndexError_NotImplemented, which would surface any caller that accidentally invoked append on it. This change makes append silently succeed (return 0), which could mask bugs where write operations are erroneously routed to a read-only storage. The only callers that should reach this path are ones that genuinely expect a no-op here — if that is the intent, a comment explaining why this is safe (and why the previous error was wrong) would make this much clearer.

private:
HnswStreamerEntityNew(const HnswStreamerEntityNew &) = delete;
HnswStreamerEntityNew &operator=(const HnswStreamerEntityNew &) = delete;
static constexpr uint64_t kUpperHashMemoryInflateRatio = 2.0f;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Type mismatch: float literal assigned to uint64_t constant

kUpperHashMemoryInflateRatio is declared as constexpr uint64_t but initialised with the float literal 2.0f. This is a narrowing conversion (the value truncates to 2). The constant name ("ratio") and value (2.0f) strongly suggest it should be a floating-point type; the uint64_t declaration is incorrect and will emit a compiler narrowing-conversion warning.

Suggested change
static constexpr uint64_t kUpperHashMemoryInflateRatio = 2.0f;
static constexpr float kUpperHashMemoryInflateRatio = 2.0f;

Comment on lines +396 to +408
void print_key_map() const {
std::cout << "key map begins" << std::endl;

auto iter = keys_map_->begin();
while (iter != keys_map_->end()) {
std::cout << "key: " << iter->first << ", id: " << iter->second
<< std::endl;
;
iter++;
}

std::cout << "key map ends" << std::endl;
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Debug print function using std::cout leaks into production code

print_key_map() writes directly to std::cout (and drags in <iostream> at line 17) rather than using the project's LOG_* macros. This is likely a leftover debug helper and is not suitable for production use. Additionally, there is a redundant semicolon on the line after std::endl. Either remove this function or replace std::cout with LOG_DEBUG / LOG_INFO and drop the <iostream> include.

void print_key_map() const {
  LOG_DEBUG("key map begins");
  auto iter = keys_map_->begin();
  while (iter != keys_map_->end()) {
    LOG_DEBUG("key: %lld, id: %u", (long long)iter->first, iter->second);
    iter++;
  }
  LOG_DEBUG("key map ends");
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant