
lib: implement hashset #470

Open
pzmarzly wants to merge 3 commits into facebook:main from pzmarzly:push-mlqkpqspnupl

Conversation

@pzmarzly
Contributor

@pzmarzly pzmarzly commented Mar 12, 2026

Currently, bf_set uses bf_list under the hood, which is suboptimal for performance. This PR changes bf_set to be backed by a new bf_hashset.

  • lib: helper: add FNV-1a hash function - a simple hash function, imported from another PR.
  • lib: add bf_hashset - main commit. Implements a simple hashset (tombstones on deletion, grows 2x when load factor exceeds 70%, never shrinks, linear probing on collision). See https://en.wikipedia.org/wiki/Linear_probing (with "special flag value" strategy in "Deletion" section).
  • lib: set: use bf_hashset for elems field - changes bf_set's inner structure
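The probing scheme described in the second bullet can be sketched as follows. This is a deliberately simplified, int-keyed table — the names, sentinel values, and fixed capacity are illustrative only, not bf_hashset's actual API:

```c
#include <stdbool.h>
#include <stddef.h>

#define CAP 8
#define EMPTY (-1)
#define TOMBSTONE (-2)

static int slots[CAP] = {EMPTY, EMPTY, EMPTY, EMPTY, EMPTY, EMPTY, EMPTY, EMPTY};

static size_t hash_of(int key)
{
    return (size_t)key % CAP;
}

static bool probe_find(int key)
{
    for (size_t i = 0; i < CAP; ++i) {
        size_t idx = (hash_of(key) + i) % CAP;
        if (slots[idx] == EMPTY)
            return false; /* an empty slot ends the probe chain */
        if (slots[idx] == key)
            return true; /* tombstones are skipped: probing continues past them */
    }
    return false;
}

static bool probe_insert(int key)
{
    if (probe_find(key))
        return false; /* duplicate (mirrors bf_hashset_add's contains pre-check) */
    for (size_t i = 0; i < CAP; ++i) {
        size_t idx = (hash_of(key) + i) % CAP;
        if (slots[idx] == EMPTY || slots[idx] == TOMBSTONE) {
            slots[idx] = key; /* tombstone slots are reused for new elements */
            return true;
        }
    }
    return false; /* table full */
}

static bool probe_remove(int key)
{
    for (size_t i = 0; i < CAP; ++i) {
        size_t idx = (hash_of(key) + i) % CAP;
        if (slots[idx] == EMPTY)
            return false;
        if (slots[idx] == key) {
            slots[idx] = TOMBSTONE; /* keep the chain intact for later finds */
            return true;
        }
    }
    return false;
}
```

The tombstone marker is what keeps lookups correct after a deletion in the middle of a probe chain: a find must keep walking past deleted slots rather than stop at the first gap.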

See #460 for previous attempt.

Fixes #418
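For context, the 64-bit FNV-1a hash added in the first commit can be sketched as below. The constants are the standard FNV-1a offset basis and prime; the function name and seed-parameter shape are assumptions, not bpfilter's exact API:

```c
#include <stddef.h>
#include <stdint.h>

#define FNV1A_INIT 0xcbf29ce484222325ULL
#define FNV1A_PRIME 0x100000001b3ULL

/* 64-bit FNV-1a over an arbitrary byte buffer. Passing the previous result
 * as `seed` allows chaining hashes over multiple buffers. */
static uint64_t fnv1a(const void *data, size_t len, uint64_t seed)
{
    const uint8_t *bytes = data;
    uint64_t hash = seed;

    for (size_t i = 0; i < len; ++i) {
        hash ^= bytes[i];    /* XOR the byte in first... */
        hash *= FNV1A_PRIME; /* ...then multiply: that order is the "1a" variant */
    }

    return hash;
}
```

With these constants, hashing the single byte 'a' from the standard offset basis yields 0xaf63dc4c8601ec8c — the known-answer value the review later refers to.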

@pzmarzly pzmarzly requested a review from qdeslandes as a code owner March 12, 2026 15:00
@pzmarzly pzmarzly marked this pull request as draft March 12, 2026 15:00
@meta-cla meta-cla bot added the cla signed label Mar 12, 2026
@github-actions

github-actions bot commented Mar 12, 2026

Claude review of PR #470 (a34983a)

Must fix

  • foreach_break test asserts on unmodified variable — Fixed: count is now incremented before break, so count == 1 is correct.
  • No test for bf_hashset_take — Fixed: take test now in tests/unit/libbpfilter/core/hashset.c:272.
  • void * pointer arithmetic in bf_vector_foreach will break pedantic C17 integration test (dismissed) — Maintainer explicitly requested not using char * casts; both maintainer and author agreed this is a future topic.
  • Uninitialized old_slots in _bf_hashset_grow causes heap corruption on OOM — Fixed: old_slots is now a plain void ** without cleanup attribute (bf_vector removed from PR). Assigned before use.
  • Unchecked bf_hashset_add return in _bf_hashset_grow — Fixed: return value is now checked with if (r) { free(old_slots); return r; }. See new suggestion below about the error path's element-loss subtlety.
  • Missing <assert.h> in vector.h — Fixed: vector.h was removed from this PR (bf_vector split into a separate PR per reviewer feedback).
  • Load factor threshold is 50%, not 70% as described (src/libbpfilter/core/hashset.c:18) — _BF_HASHSET_MAX_LOAD_NUM=5 / _BF_HASHSET_MAX_LOAD_DEN=10 triggers growth at 50% occupancy, not the 70% stated in the PR description. Either update the constants to 7/10 for 70%, or correct the documentation. This also affects bf_hashset_reserve, which computes count * 2 slots instead of count * 1.43.

Suggestions

  • Tombstone sentinel (void *)1 hardcoded in public macro — Fixed: bf_set_foreach now delegates to bf_hashset_foreach, which uses the bf_hashset_slot_is_tombstone() function. No literal sentinel remains in public headers.
  • Inconsistent typedef convention between bf_hashset and bf_vector — Fixed: both now use typedef struct bf_* pattern consistently.
  • Missing boundary test for bf_vector_new with elem_size == 0 — Fixed: new_zero_elem_size test now in tests/unit/libbpfilter/core/vector.c.
  • No test coverage for bf_vector_take and bf_vector_set_len — Fixed: take and set_len tests now in tests/unit/libbpfilter/core/vector.c.
  • Commits lack body explaining "why" — The style guide says "Use the commit description to explain why the change is necessary." The PR now has 4 commits; commit 1 (lib: core: move bf_list into subfolder), commit 3 (lib: core: add bf_hashset), and commit 4 (lib: set: use bf_hashset for elems field) have no body. Commit 2 has a body, but it describes "what" more than "why".
  • No test for bf_vector_data — Fixed: data test now in tests/unit/libbpfilter/core/vector.c.
  • No test for bf_hashset with NULL free callback (tests/unit/libbpfilter/core/hashset.c) — All tests use a non-NULL .free callback. The if (set->ops.free) branch in bf_hashset_clean is never exercised.
  • Redundant forward typedef and non-standard cleanup macro placement in vector.h — Fixed: forward typedef removed; cleanup macros now placed after struct definition.
  • bf_set_add_elem_raw performs a redundant allocation and copy — Fixed: bf_set_add_elem_raw now calls bf_hashset_add directly instead of going through bf_set_add_elem, avoiding the extra malloc+memcpy.
  • bf_hashset_take doc exposes internal tombstone sentinel value — Fixed: doc now references bf_hashset_slot_is_tombstone instead of the concrete (void *)1 value.
  • No test for bf_set_name, bf_set_n_comps, bf_set_key — Fixed: getters test now in tests/unit/libbpfilter/set.c:433-441.
  • No test for tombstone accumulation forcing rehash (tests/unit/libbpfilter/core/hashset.c) — The remove_and_readd test covers a single remove-readd cycle but does not exercise the scenario where many removals cause tombstone accumulation, increasing n_used until the load factor triggers _bf_hashset_grow. Consider a stress test that inserts, removes many elements, then inserts again to verify the grow-and-compact path.
  • bf_set_is_empty not explicitly tested (tests/unit/libbpfilter/set.c) — bf_set_is_empty was modified in this PR (switched from bf_list_is_empty to bf_hashset_is_empty) but no test calls it directly. Consider adding assertions in the existing test functions.
  • API break: bpfilter/list.h renamed without compat header (dismissed) — The bf_list header rename is no longer part of this PR; it was split into a separate change.
  • bf_set_foreach not tested in set unit tests — Moot: bf_set_foreach does not exist in the current diff; callers use bf_hashset_foreach directly on set->elems.
  • _bf_hashset_insert_unchecked implicit tombstone-free precondition — Fixed: _bf_hashset_insert_unchecked was removed; _bf_hashset_grow now calls bf_hashset_add directly.
  • void * pointer arithmetic in vector.c functions (dismissed) — Maintainer explicitly requested not using char * casts.
  • bf_set_add_elem_raw not directly tested (tests/unit/libbpfilter/set.c) — This public function is only called indirectly through bf_set_new_from_raw. Its error paths and duplicate-element early return lack direct test coverage.
  • _bf_hashset_grow error path leaves hashset in partially-migrated state — Fixed: error path now correctly restores old_slots, cap, len, and slots_in_use before freeing new_slots. No elements are lost on failure. The error path is unreachable in practice (doubled capacity prevents nested grow, no duplicates possible), but the rollback logic is correct.
  • FNV-1a test lacks known-answer reference value — Fixed: test now asserts FNV-1a of 'a' equals 0xaf63dc4c8601ec8cULL.
  • _clean_bf_hashset_ has no safe default initializer — Fixed: bf_hashset_default() macro now defined at hashset.h:53.
  • _bf_hashset_index lacks cap > 0 precondition (dismissed) — Author states asserts are only for NULL-ness checks.
  • bf_hashset_slot_is_tombstone could be static inline in the header — Fixed: function removed from public API. Now _bf_hashset_slot_is_tombstone (static) in hashset.c. bf_hashset_foreach uses linked-list traversal and no longer needs tombstone checking.
  • bf_hashset_foreach should document iteration safety — Fixed: hashset.h:76 now reads "Do not add or remove elements during iteration."
  • Integer overflow in _bf_hashset_needs_grow load-factor check — Fixed: the _BF_HASHSET_MAX_ELEMS cap (100M elements) ensures slots_in_use * 10 cannot overflow size_t even on 32-bit platforms. Maximum capacity is ~256M, so 256M * 10 = 2.56B, which is below the 32-bit SIZE_MAX of ~4.29B.
  • Tombstone sentinel (void *)1 not rejected by bf_hashset_add — Fixed: bf_hashset_add now checks if (data == _BF_HASHSET_TOMBSTONE) and returns -EINVAL.
  • File-static _bf_set_elem_size makes _bf_program_load_sets_maps not thread-safe — Fixed: global removed. _bf_program_load_sets_maps now uses bf_set_foreach and set->elem_size directly, no qsort needed.
  • No test for removing all elements and verifying empty state — Fixed: the foreach_remove test adds 5 elements, removes all during iteration, and asserts bf_hashset_is_empty(&set).
  • Unbounded probing loop in bf_hashset_add — Fixed: the slot-search in bf_hashset_add now uses a bounded for (size_t i = 0; i < set->cap; ++i) loop, consistent with _bf_hashset_find.
  • Tombstone encoding exposed in public struct field doc — Fixed: bf_hashset_elem.data doc now reads "Data pointer, tombstone, or NULL" without exposing the concrete sentinel value.
  • bf_set_add_elem duplicate-silencing behavior not tested — Fixed: add_duplicate test in tests/unit/libbpfilter/set.c inserts the same element twice via bf_set_add_elem and asserts size remains 1.
  • bf_hashset_remove dangling pointer after free callback (src/libbpfilter/core/hashset.c:277) — After set->ops.free(&elem->data, ...) runs, the data parameter may be a dangling pointer if the caller passed elem->data from the slot being removed (as foreach_remove does). The code is correct today because data is not used after the free call, but a future edit could introduce a use-after-free. Consider adding a safety comment.
  • bft_set_eq compares insertion order, not set equality — Fixed: renamed to bft_set_eq_ordered, which accurately conveys the positional comparison semantics.
  • Double probe in bf_hashset_add (src/libbpfilter/core/hashset.c:217) — bf_hashset_add calls bf_hashset_contains (a full probe via _bf_hashset_find), then performs a second linear-probe scan to locate a free slot for insertion. Consider extending _bf_hashset_find to also track the first available slot, allowing bf_hashset_add to skip the second scan.
  • _bf_hashset_grow rehash performs redundant checks — Fixed: _bf_hashset_resize now iterates the linked list directly and inserts into new_slots without going through bf_hashset_add, avoiding redundant duplicate/load-factor checks.
  • Document silent -EEXIST handling in set add functions (dismissed) — Author states the behavior was already present.
  • _BF_HASHSET_MAX_ELEMS lacks rationale — Fixed: now _BF_HASHSET_MAX_CAP with a comment explaining it's the largest power-of-two that leaves headroom for load-factor arithmetic without overflowing size_t.
  • bf_hashset_reserve overflow in needed computation (src/libbpfilter/core/hashset.c:206) — count * _BF_HASHSET_MAX_LOAD_DEN can overflow size_t when count > SIZE_MAX / 10, wrapping needed to a small value and silently under-allocating. An overflow guard (if (count > SIZE_MAX / _BF_HASHSET_MAX_LOAD_DEN) return -ENOMEM;) would harden this public API.
  • bf_hashset_elem Doxygen says "slot" but it's an element node (src/libbpfilter/include/bpfilter/core/hashset.h:42) — The @brief says "A single slot in the hashset's backing array" but bf_hashset_elem is not a slot. The slots array holds bf_hashset_elem * pointers; the struct itself is a separately-allocated element node with data plus linked-list pointers.

Nits

  • Commit 1 body describes "what" not "why" — Body is "A simple data hashing function. C stdlib lacks one." which restates the title. Should explain why FNV-1a was chosen or why a hash function is needed.
  • Commit 4 scope mismatch — Commit lib: set: use bf_hashset for elems field modifies files under src/bfcli/, src/bpfilter/cgen/, and tests/ but uses the lib: component prefix. Project history uses comma-separated prefixes when multiple components are touched.
  • Use bitmask instead of modulo for power-of-two capacity (dismissed) — Author prefers keeping % to avoid breakage if capacity stops being power-of-two.
  • Three getter declarations packed without blank line separators — Fixed: each getter now has a full Doxygen docblock acting as separator.
  • bf_hashset_foreach cannot be nested — Fixed: current implementation uses only the user-provided elem_var name with no hidden internal variables, so nesting with different variable names works correctly.
  • core/vector.* entries not grouped with other core/ entries in CMakeLists.txt — Fixed: all core/ entries now grouped together in both src/libbpfilter/CMakeLists.txt and tests/unit/CMakeLists.txt.
  • Commit 2 scope mismatch (dismissed) — The lib: core: move bf_list into subfolder commit is no longer part of this PR.
  • Markdown bold in Doxygen — Fixed: **not** no longer appears in the current code.
  • Line length in vector.h comment — Fixed: no line exceeds 80 characters.
  • Commit 4 title uses vague "some getters" — Fixed: title is now "lib: set: use bf_hashset for elems field".
  • Unnecessary (void **) cast on calloc — Fixed: the (void **) cast on calloc in _bf_hashset_grow has been replaced with (bf_hashset_elem *).
  • Mid-function variable declaration in _bf_hashset_grow — Fixed: all variables now declared at top of function.
  • _bf_hashset_slot_is_live takes void * instead of const void * — Fixed: now takes const void *.
  • bf_hashset_ops and bf_hashset structs lack @brief Doxygen — Fixed: both now have @brief doc blocks.
  • bf_hashset_init doc says "stack-allocated" — Fixed: now reads "Initialise a hashset in place".
  • BF_FNV1A_PRIME exposed in public header (dismissed) — Author wants constants together.
  • Copyright year missing in new files (dismissed) — Project moving to yearless model.
  • bf_fnv1a with len = 0 edge case not tested — Fixed: test at helper.c:371 now asserts bf_fnv1a(&val_a, 0, BF_FNV1A_INIT) == BF_FNV1A_INIT.
  • Mid-block variable declaration in fnv1a_hash test — Fixed: all variables are now declared at the top of the block before any executable statements.
  • Missing explicit #include <errno.h> in hashset.c — Fixed: #include <errno.h> is now present at hashset.c:8.
  • bf_hashset_default macro missing Doxygen — Fixed: the macro now has a full @brief, @param, and @return Doxygen block.
  • Commit 3 scope mismatch — Commit lib: core: add bf_hashset (6cca447) also modifies tests/unit/libbpfilter/core/hashset.c and .clang-format. Per project convention, the prefix should include all affected components: lib,tests: core: add bf_hashset.
  • Multi-line comment uses // style — Fixed: load factor comment at hashset.c:16 now uses /* */ style.
  • _dump_n naming convention — Fixed: variable renamed to dump_idx.
  • bf_hashset_elem.data doc incorrectly mentions tombstone and NULL (src/libbpfilter/include/bpfilter/core/hashset.h:46) — The inline comment says "Data pointer, tombstone, or NULL" but elem->data never holds a tombstone. Tombstones are sentinel values at the slots[] level, not inside the data field. Suggest: "User-provided data pointer."
  • Unnecessary calloc cast in _bf_hashset_resize (src/libbpfilter/core/hashset.c:48) — The calloc return is cast to (bf_hashset_elem **), but the other two allocation sites in this file and the wider codebase never cast malloc/calloc returns.
  • _bf_hashset_resize missing assert(set) (src/libbpfilter/core/hashset.c:44) — The only static function in this file without assert(set) at the top. All other static helpers have it.

CLAUDE.md improvements

  • Style guide says "Use backticks to reference function, variable, and parameter names" in Doxygen, but @c is used in new code at core/hashset.h (7 occurrences). The existing codebase has ~16 @c uses and ~203 backtick uses — backticks are the dominant convention. Consider adding an explicit note in CLAUDE.md about preferring backticks over @c and @ref for inline code references.
  • The codebase is inconsistent about __attribute__((cleanup(...))) vs __attribute__((__cleanup__(...))) — the new core/hashset.h uses the non-underscored form (matching core/list.h), while the majority of other headers use the double-underscore form. Consider documenting the preferred form in CLAUDE.md.
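For reference, both attribute spellings behave identically in GCC and Clang, so the choice is purely a style-guide matter. A minimal sketch (helper names are illustrative, not bpfilter's):

```c
#include <stdlib.h>

/* Cleanup handler: receives a pointer to the variable going out of scope. */
static void freep(void **p)
{
    free(*p);
}

static int cleanup_demo(void)
{
    /* `cleanup(freep)` and `__cleanup__(freep)` are interchangeable here:
     * GCC/Clang accept attribute names with or without double underscores.
     * `buf` is freed automatically when it goes out of scope. */
    __attribute__((cleanup(freep))) void *buf = malloc(16);

    if (!buf)
        return -1;
    return 0;
}
```

The double-underscore form is conventionally preferred in headers because `cleanup` could collide with a user-defined macro, which may be worth noting in CLAUDE.md if that is the rationale.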

Resolved from prior review

  • FNV-1a hash seeded with 0 instead of BF_FNV1A_INIT — Fixed: all calls now use BF_FNV1A_INIT.
  • bf_hashset_add ownership semantics ambiguous for duplicates — Fixed: doc now clarifies -EEXIST return and ownership.
  • bf_set_dump lost bf_dump_prefix_last call for last element — Fixed: uses _dump_n counter.
  • Duplicated slot-liveness check between header and implementation — Resolved: inline removed from header.
  • _BF_HASHSET_TOMBSTONE exposed in public header — Resolved: constant moved to .c file; bf_hashset_slot_is_tombstone() function used in macro.
  • Memory leak in bf_set_add_elem when bf_hashset_add fails during grow — Re-examined: _cleanup_free_ on _elem handles the failure path correctly. TAKE_PTR runs only on success. No leak.
  • Variable declared after executable statement in bf_set_dump — Re-examined: _dump_n is at the top of the function, before assert(). Placement is correct.
  • assert() on non-pointer value in bf_vector_new — Re-examined: code uses if (!elem_size) return -EINVAL, not assert. Correct per style guide.
  • bf_hashset_init ctx references elem_size before computation — Re-examined: bf_hashset_init just stores the pointer. elem_size is fully computed before any hash/equal calls. Safe.
  • Serialization order now hash-table-dependent — Re-examined: pack/unpack is order-independent. bft_set_eq now uses bf_hashset_contains instead of ordered comparison. No regression.
  • bf_set_get_key_comp lacks Doxygen — Dismissed: bf_set_key is a trivial getter; style guide permits skipping Doxygen for these.
  • Redundant bf_hashset_contains check in bf_set_add_elem — Dismissed: the pre-check avoids an unnecessary malloc + memcpy + free cycle when the element already exists. Intentional optimization.


@pzmarzly pzmarzly force-pushed the push-mlqkpqspnupl branch from 314af18 to 02938a3 on March 12, 2026 23:56
Contributor

@qdeslandes qdeslandes left a comment


Alright, first pass of review and a few things to fix. I'll do a second, deeper pass, when those are solved. That being said, it's a very welcome addition! :D

@qdeslandes qdeslandes mentioned this pull request Mar 13, 2026
@pzmarzly pzmarzly force-pushed the push-mlqkpqspnupl branch from 02938a3 to 1aae907 on March 13, 2026 15:07
@pzmarzly pzmarzly changed the title from "lib: create data_structures directory, implement hashset data structure" to "lib: create directory for data structures, implement hashset data structure" Mar 13, 2026
@pzmarzly pzmarzly changed the title from "lib: create directory for data structures, implement hashset data structure" to "lib: create directory for data structures, implement hashset" Mar 13, 2026
@pzmarzly pzmarzly force-pushed the push-mlqkpqspnupl branch from 1aae907 to 550c254 on March 13, 2026 15:46
@pzmarzly pzmarzly force-pushed the push-mlqkpqspnupl branch from 550c254 to d6883e1 on March 16, 2026 22:13
Contributor

@yaakov-stein yaakov-stein left a comment


Super neat implementation! I haven't seen an insertion-order hashset like this implemented before. The first two commits and the last commit LGTM. I have a few nits and some questions on hashset.h/c, but I think this is pretty much ready. (Haven't reviewed the tests yet, will do that tomorrow.)

Contributor


Now that we keep insertion order, should we add a simple bf_hashset_take?

Contributor Author


What should it give us, bf_hashset_elem*? I'm not sure how useful that is.

Contributor


We have bf_hashset_remove, which already takes ownership of a single element of data. I was thinking it would be more along the lines of bf_dynbuf_take (where it takes all of the data), but I think we can wait to see if Quentin thinks that'd be helpful

@pzmarzly pzmarzly changed the title from "lib: create directory for data structures, implement hashset" to "lib: implement hashset" Mar 25, 2026
@pzmarzly pzmarzly force-pushed the push-mlqkpqspnupl branch from 8c9959f to 0e7595e on March 25, 2026 22:35
if (!_bf_hashset_find(set, data, &idx))
    return -ENOENT;

slot = &set->slots[idx];


Claude: suggestion: In bf_hashset_remove, after set->ops.free(&slot->data, ...) runs, the data parameter passed to this function may be a dangling pointer if the caller passed elem->data from a bf_hashset_foreach loop (as the foreach_remove test does). The code is currently correct because data is not used after the free callback runs, but this is fragile — a future change that reads data after this point would introduce a use-after-free.

Consider adding a comment:

    // Note: the free callback may invalidate the `data` parameter if it
    // points into the same allocation as slot->data. Do not use `data`
    // after this point.
    if (set->ops.free)
        set->ops.free(&slot->data, set->ctx);

Contributor Author


bf_list_foreach(list, elem) bf_list_delete(elem) has the same problem. I think that after you call a function called *_delete, you should expect the element to be unusable.

A simple data hashing function. C stdlib lacks one.
@pzmarzly pzmarzly force-pushed the push-mlqkpqspnupl branch from 0e7595e to ef58a62 on March 26, 2026 14:26

if (data == _BF_HASHSET_TOMBSTONE) {
    return bf_err_r(
        -EINVAL, "data pointer collides with internal tombstone sentinel");


Claude: suggestion: bf_hashset_add probes the table twice for every non-duplicate insertion: once via bf_hashset_contains (which calls _bf_hashset_find), then again in the for loop at line 228 to locate a free slot. Both walks start from the same hash index and traverse the same probe chain.

Consider extending _bf_hashset_find to also return the index of the first tombstone or empty slot it encounters during probing, so bf_hashset_add can reuse that index and skip the second scan.
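The single-pass idea above can be sketched over a simplified int-keyed table (names, sentinels, and the fixed capacity are illustrative, not bf_hashset's types):

```c
#include <stdbool.h>
#include <stddef.h>

#define CAP 8
#define EMPTY (-1)
#define TOMBSTONE (-2)

/* One probe pass that reports both whether `key` is present and where it
 * could be inserted. Returns true if found (*insert_idx = its slot);
 * otherwise false, with *insert_idx = the first reusable slot seen
 * (tombstone preferred over trailing empty), or CAP if the table is full. */
static bool find_or_slot(const int *slots, int key, size_t *insert_idx)
{
    size_t first_free = CAP; /* CAP means "no free slot seen yet" */

    for (size_t i = 0; i < CAP; ++i) {
        size_t idx = ((size_t)key % CAP + i) % CAP;

        if (slots[idx] == key) {
            *insert_idx = idx;
            return true;
        }
        if (slots[idx] == TOMBSTONE && first_free == CAP)
            first_free = idx; /* remember it, but keep probing for the key */
        if (slots[idx] == EMPTY) {
            *insert_idx = first_free == CAP ? idx : first_free;
            return false; /* an empty slot ends the chain: key is absent */
        }
    }
    *insert_idx = first_free;
    return false;
}
```

An add then becomes: call find_or_slot once, return -EEXIST on true, otherwise insert at *insert_idx — no second scan needed, and tombstone slots get reclaimed for free.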

grown.cap = new_cap;

bf_hashset_foreach (set, elem) {
    r = bf_hashset_add(&grown, elem->data);


Claude: suggestion: During _bf_hashset_grow, bf_hashset_add(&grown, elem->data) performs a bf_hashset_contains duplicate check and a _bf_hashset_needs_grow load-factor check on every re-inserted element. Both are redundant during rehashing: elements are known unique (they come from the original set), and the new table was just allocated at 2 * old_cap so the load factor cannot trigger a nested grow.

Consider a private _bf_hashset_insert_unchecked helper that skips both checks to avoid the overhead on every rehash.

@pzmarzly
Contributor Author

pzmarzly commented Mar 26, 2026

I let Claude search for optimizations overnight, and it proposed a different representation that has much better cache locality - bf_hashset_elem** elems instead of bf_hashset_elem* elems, i.e. each node is individually allocated, like in bf_list. It makes reads ~40% faster on large sets. Big change coming.

@pzmarzly pzmarzly force-pushed the push-mlqkpqspnupl branch from ef58a62 to a34983a on March 26, 2026 15:38
#define _BF_HASHSET_INIT_CAP 16
/* Maximum load factor before growing. Lowering this number reduces collisions
* but causes higher memory usage. */
#define _BF_HASHSET_MAX_LOAD_NUM 5


Claude: must-fix: The load factor threshold implemented here is 50%, not 70% as stated in the PR description.

_BF_HASHSET_MAX_LOAD_NUM = 5 and _BF_HASHSET_MAX_LOAD_DEN = 10 produce the check slots_in_use * 10 >= cap * 5, which triggers growth when occupancy reaches 50%. The PR description says "grows 2x when load factor exceeds 70%".

If 70% is intended, the constants should be:

#define _BF_HASHSET_MAX_LOAD_NUM 7
#define _BF_HASHSET_MAX_LOAD_DEN 10

This also affects bf_hashset_reserve (line 206), which currently computes count * 10 / 5 = 2 * count slots instead of the count * 10 / 7 ~ 1.43 * count that a 70% threshold would require.

If 50% is intentional, the PR description should be corrected.
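The arithmetic above can be checked with a tiny helper, written here with the PR's current 5/10 constants (the names are illustrative, matching the review's description of the check rather than the exact source):

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_LOAD_NUM 5
#define MAX_LOAD_DEN 10

/* With NUM=5, DEN=10 the condition used*DEN >= cap*NUM fires at exactly
 * 50% occupancy; swapping in 7/10 would defer it to 70%. */
static bool needs_grow(size_t slots_in_use, size_t cap)
{
    return slots_in_use * MAX_LOAD_DEN >= cap * MAX_LOAD_NUM;
}
```

For cap = 16, the check first fires when slots_in_use reaches 8 (8 * 10 >= 16 * 5), i.e. half-full, confirming the 50% threshold.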

if (count == 0)
    return 0;

needed = count * _BF_HASHSET_MAX_LOAD_DEN / _BF_HASHSET_MAX_LOAD_NUM;


Claude: suggestion: count * _BF_HASHSET_MAX_LOAD_DEN can overflow size_t when count is large (greater than SIZE_MAX / 10). For example, count = SIZE_MAX / 10 + 1 wraps needed to a small value, causing the function to silently return 0 without reserving capacity.

A simple guard before the multiplication would prevent this:

if (count > SIZE_MAX / _BF_HASHSET_MAX_LOAD_DEN)
    return -ENOMEM;

In practice the packet-filtering use case won't hit this, but since bf_hashset_reserve is a public API in a generic container, the guard would make the contract more robust.
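A sketch of the guarded computation, using the PR's 5/10 constants and a hypothetical helper name (not the real bf_hashset_reserve):

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_LOAD_NUM 5
#define MAX_LOAD_DEN 10

/* Computes how many slots are needed to hold `count` elements below the
 * load-factor threshold, failing early instead of letting the
 * multiplication wrap size_t. */
static int compute_needed(size_t count, size_t *needed)
{
    if (count == 0) {
        *needed = 0;
        return 0;
    }
    if (count > SIZE_MAX / MAX_LOAD_DEN)
        return -ENOMEM; /* count * MAX_LOAD_DEN would overflow size_t */

    *needed = count * MAX_LOAD_DEN / MAX_LOAD_NUM;
    return 0;
}
```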

} bf_hashset_ops;

/**
* @brief A single slot in the hashset's backing array.


Claude: suggestion: The @brief says "A single slot in the hashset's backing array" but bf_hashset_elem is not a slot. The slots array holds bf_hashset_elem * pointers -- each pointer can be NULL, the tombstone sentinel, or point to a heap-allocated bf_hashset_elem. The struct itself is a separately-allocated element node containing data plus linked-list pointers for insertion-order traversal.

Suggested: @brief Element node stored in a hashset, threaded through an insertion-order linked list.

*/
typedef struct bf_hashset_elem
{
    /// Data pointer, tombstone, or NULL.


Claude: nit: The inline comment says "Data pointer, tombstone, or NULL" but elem->data never holds a tombstone. Tombstones are sentinel values stored at the slots[] level (where each slot is a bf_hashset_elem *), not inside the data field of a live element. A bf_hashset_elem is only allocated for live entries.

Suggested: /// User-provided data pointer.

{
    bf_hashset_elem **new_slots;

    new_slots = (bf_hashset_elem **)calloc(new_cap, sizeof(*new_slots));


Claude: nit: The calloc return is cast to (bf_hashset_elem **), but the two other allocation sites in this file (lines 129, 242) and the wider codebase (e.g. core/list.c, chain.c, cgen.c) never cast malloc/calloc returns. In C, the void * return converts implicitly. Drop the cast for consistency:

new_slots = calloc(new_cap, sizeof(*new_slots));

Contributor


+1 to this as well

    return set->ops.hash(data, set->ctx) % set->cap;
}

static int _bf_hashset_resize(bf_hashset *set, size_t new_cap)


Claude: nit: _bf_hashset_resize is the only static function in this file without an assert(set) at the top. The other static helpers (_bf_hashset_index, _bf_hashset_grow, _bf_hashset_needs_grow, _bf_hashset_find) all start with assert(set). Adding it here would be consistent.

Contributor


+1 on this

Contributor

@yaakov-stein yaakov-stein left a comment


Claude has a few valid nits/suggestions and I have one last comment on the tests. Overall LGTM once those points are taken care of!

/**
* @brief Pre-allocate capacity for at least @p count elements.
*
* Ensures the backing array is large enough to hold @p count elements without
Contributor


nit: backticks a few times in this doxygen

* @return 0 on success, -ENOENT if the element is not found, or a negative
* errno value on failure.
*/
int bf_hashset_delete(bf_hashset *set, const void *data);
Contributor


Just wondering - why delete instead of remove?



Contributor


Regarding the tests, I think there are a couple of areas that are especially complex that we'd want to test (some of which you already have covered):

  • Ensure insertion order is maintained across a grow operation (already covered)
  • Validate that this works for several elements that collide, i.e. their hashes are equal. On top of this, it'd be worth testing that removing some of those colliding elements and creating tombstones within that linear probe doesn't cause any buggy behavior. (I don't believe these cases are covered)

@yaakov-stein
Contributor

I let Claude search for optimizations overnight, and it proposed a different representation that has much better cache locality - bf_hashset_elem** elems instead of bf_hashset_elem* elems, i.e. each node is individually allocated, like in bf_list. It makes reads ~40% faster on large sets. Big change coming.

I'm not necessarily opposed to the change as it simplifies some parts of the code, but I'm confused by the claim that this has better cache locality - shouldn't the cache locality here be much worse? Whenever we need to check for equality we need to load a non-contiguous piece of memory. We also can't take advantage of the spatial locality anymore. Can you explain to me what cases you saw ~40% speedup on?
