Conversation
| std::vector<char> buffer(neighbor_size_); | ||
| NeighborsHeader *hd = reinterpret_cast<NeighborsHeader *>(buffer.data()); | ||
| hd->neighbor_cnt = neighbors.size(); | ||
| size_t i = 0; | ||
| for (; i < neighbors.size(); ++i) { | ||
| hd->neighbors[i] = neighbors[i].first; | ||
| } | ||
|
|
||
| auto loc = get_neighbor_chunk_loc(level, id); | ||
| size_t size = reinterpret_cast<char *>(&hd->neighbors[i]) - &buffer[0]; | ||
| size_t ret = loc.first->write(loc.second, hd, size); | ||
| if (ailego_unlikely(ret != size)) { | ||
| LOG_ERROR("Write neighbor header failed, ret=%zu", ret); | ||
|
|
||
| return IndexError_Runtime; | ||
| } | ||
|
|
There was a problem hiding this comment.
Buffer undersized for upper-level neighbors
update_neighbors always allocates a buffer of neighbor_size_ (the level-0 size, computed from l0_neighbor_cnt()). When called with level > 0, the write uses this same buffer, but the relevant upper-level capacity is upper_neighbor_cnt(). By default l0_neighbor_cnt = 100 and upper_neighbor_cnt = 50, so the buffer happens to be large enough in practice — however, if a user ever configures upper_max_neighbor_cnt > l0_max_neighbor_cnt (or the multiplier kDefaultL0MaxNeighborCntMultiplier is set to a value less than 1), the buffer will be too small, resulting in a heap buffer overflow when iterating over neighbors.
The buffer should be sized by the maximum of the two neighbor sizes, or selected dynamically based on the level argument:
std::vector<char> buffer(level == 0 ? neighbor_size_ : upper_neighbor_size_);| int append(const std::string & /*id*/, size_t /*size*/) override { | ||
| return IndexError_NotImplemented; | ||
| return 0; | ||
| } |
There was a problem hiding this comment.
Silent success on append in read-only storage
FileReadStorage is a read-only storage implementation. Previously append returned IndexError_NotImplemented, which would surface any caller that accidentally invoked append on it. This change makes append silently succeed (return 0), which could mask bugs where write operations are erroneously routed to a read-only storage. The only callers that should reach this path are ones that genuinely expect a no-op here — if that is the intent, a comment explaining why this is safe (and why the previous error was wrong) would make this much clearer.
| private: | ||
| HnswStreamerEntityNew(const HnswStreamerEntityNew &) = delete; | ||
| HnswStreamerEntityNew &operator=(const HnswStreamerEntityNew &) = delete; | ||
| static constexpr uint64_t kUpperHashMemoryInflateRatio = 2.0f; |
There was a problem hiding this comment.
Type mismatch: float literal assigned to uint64_t constant
kUpperHashMemoryInflateRatio is declared as constexpr uint64_t but initialised with the float literal 2.0f. This is a narrowing conversion (the value truncates to 2). The constant name ("ratio") and value (2.0f) strongly suggest it should be a floating-point type; the uint64_t declaration is incorrect and will emit a compiler narrowing-conversion warning.
| static constexpr uint64_t kUpperHashMemoryInflateRatio = 2.0f; | |
| static constexpr float kUpperHashMemoryInflateRatio = 2.0f; |
| void print_key_map() const { | ||
| std::cout << "key map begins" << std::endl; | ||
|
|
||
| auto iter = keys_map_->begin(); | ||
| while (iter != keys_map_->end()) { | ||
| std::cout << "key: " << iter->first << ", id: " << iter->second | ||
| << std::endl; | ||
| ; | ||
| iter++; | ||
| } | ||
|
|
||
| std::cout << "key map ends" << std::endl; | ||
| } |
There was a problem hiding this comment.
Debug print function using std::cout leaks into production code
print_key_map() writes directly to std::cout (and drags in <iostream> at line 17) rather than using the project's LOG_* macros. This is likely a leftover debug helper and is not suitable for production use. Additionally, there is a redundant semicolon on the line after std::endl. Either remove this function or replace std::cout with LOG_DEBUG / LOG_INFO and drop the <iostream> include.
void print_key_map() const {
LOG_DEBUG("key map begins");
auto iter = keys_map_->begin();
while (iter != keys_map_->end()) {
LOG_DEBUG("key: %lld, id: %u", (long long)iter->first, iter->second);
iter++;
}
LOG_DEBUG("key map ends");
}
Greptile Summary
This PR consolidates the HNSW module by removing the separate
HnswBuilder,HnswSearcher, and their corresponding entity classes (HnswBuilderEntity,HnswSearcherEntity) and replacing them with a single unifiedHnswStreamerEntityNewclass. All dependent classes (HnswAlgorithm,HnswContext,HnswDistCalculator,HnswIndexProvider,HnswStreamer) are updated to use the new entity type.Key changes and concerns:
update_neighbors(hnsw_streamer_entity_new.cc:142): The buffer is always allocated using the level-0 neighbor size (neighbor_size_), but the function is called for all levels. Ifupper_neighbor_cntis ever configured to exceedl0_neighbor_cnt, writingneighbors.size()entries into the under-sized buffer would overflow the heap.appendin read-only storage (file_read_storage.cc:291): Changingappendfrom returningIndexError_NotImplementedto0can mask accidental write operations against a read-only storage. A comment explaining the intentional no-op is needed at minimum.kUpperHashMemoryInflateRatio(hnsw_streamer_entity_new.h:706): Declared asconstexpr uint64_tbut initialised with the float literal2.0f— a narrowing conversion that should be corrected toconstexpr float.hnsw_streamer_entity_new.h:396):print_key_map()writes tostd::coutand pulls in<iostream>— should use the project'sLOG_*macros or be removed.hnsw_searcher_test.cpp(2775 lines, covering RNN search, cosine, brute-force, group search, etc.) andhnsw_builder_test.cc(543 lines) are entirely deleted with no replacement tests added in this PR. Given the PR is marked WIP, this may be addressed later, but it is worth tracking.Confidence Score: 2/5
update_neighbors, a silent error-suppression change inFileReadStorage::append, a mistyped constant, and leftover debug I/O. The elimination of the builder and searcher test suites without replacement leaves the new unified entity without adequate regression coverage.src/core/algorithm/hnsw/hnsw_streamer_entity_new.cc(buffer sizing bug),src/core/utility/file_read_storage.cc(silent append), andsrc/core/algorithm/hnsw/hnsw_streamer_entity_new.h(type mismatch + debug code).Important Files Changed
constexpr uint64_tassigned2.0f) and a debugprint_key_map()function usingstd::coutwith a stray<iostream>include that should not be in production code.update_neighborsfunction allocates a buffer sized for level-0 (neighbor_size_) regardless oflevel, which is a latent heap-buffer-overflow if upper neighbor count is ever configured larger than l0 neighbor count.append()changed from returningIndexError_NotImplementedto0, silently succeeding on a read-only storage and potentially masking callers that should not be writing to it.HnswBuilderis removed; it is unclear whether equivalent coverage is provided elsewhere.Sequence Diagram
sequenceDiagram participant S as HnswStreamer participant E as HnswStreamerEntityNew participant A as HnswAlgorithm participant B as ChunkBroker participant STG as IndexStorage Note over S,STG: Before PR: HnswBuilder/HnswSearcher/HnswStreamer each held their own entity S->>E: init(max_doc_cnt) E->>B: make_shared<ChunkBroker>(stats_) S->>E: open(storage, max_index_size, check_crc) E->>B: open(storage, max_index_size_, chunk_size_, check_crc) B->>STG: open chunks S->>A: HnswAlgorithm(entity_) [was HnswEntity, now HnswStreamerEntityNew] A->>E: add_vector(level, key, vec, &id) E->>B: alloc_chunk(NODE, idx, chunk_size) A->>E: update_neighbors(level, id, neighbors) E->>B: get_neighbor_chunk_loc(level, id) S->>E: flush(checkpoint) E->>B: flush(checkpoint) B->>STG: flush to disk S->>E: dump(dumper) E->>E: dump_segments(dumper, keys, get_level) E->>E: dump_header / dump_vectors / dump_graph_neighbors / dump_upper_neighbors S->>E: clone() E-->>S: HnswStreamerEntityNew::Pointer (shared chunks, isolated search state)Last reviewed commit: 7a62567