
lib: implement hashset #470

Open
pzmarzly wants to merge 3 commits into facebook:main from pzmarzly:push-mlqkpqspnupl

Conversation

@pzmarzly
Contributor

@pzmarzly pzmarzly commented Mar 12, 2026

Currently, bf_set uses bf_list under the hood, which is suboptimal for performance. This PR changes bf_set to be backed by a new bf_hashset.

  • lib: helper: add FNV-1a hash function - a simple hash function, imported from another PR.
  • lib: add bf_hashset - main commit. Implements a simple hashset (tombstones on deletion, grows 2x when load factor exceeds 70%, never shrinks, linear probing on collision). See https://en.wikipedia.org/wiki/Linear_probing (with "special flag value" strategy in "Deletion" section).
  • lib: set: use bf_hashset for elems field - changes bf_set's inner structure
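The probing scheme described in the second bullet can be sketched as follows. This is a deliberately simplified, int-keyed table — the names, sentinel values, and fixed capacity are illustrative only, not bf_hashset's actual API:

```c
#include <stdbool.h>
#include <stddef.h>

#define CAP 8
#define EMPTY (-1)
#define TOMBSTONE (-2)

static int slots[CAP] = {EMPTY, EMPTY, EMPTY, EMPTY, EMPTY, EMPTY, EMPTY, EMPTY};

static size_t hash_of(int key)
{
    return (size_t)key % CAP;
}

static bool probe_find(int key)
{
    for (size_t i = 0; i < CAP; ++i) {
        size_t idx = (hash_of(key) + i) % CAP;
        if (slots[idx] == EMPTY)
            return false; /* an empty slot ends the probe chain */
        if (slots[idx] == key)
            return true; /* tombstones are skipped: probing continues past them */
    }
    return false;
}

static bool probe_insert(int key)
{
    if (probe_find(key))
        return false; /* duplicate (mirrors bf_hashset_add's contains pre-check) */
    for (size_t i = 0; i < CAP; ++i) {
        size_t idx = (hash_of(key) + i) % CAP;
        if (slots[idx] == EMPTY || slots[idx] == TOMBSTONE) {
            slots[idx] = key; /* tombstone slots are reused for new elements */
            return true;
        }
    }
    return false; /* table full */
}

static bool probe_remove(int key)
{
    for (size_t i = 0; i < CAP; ++i) {
        size_t idx = (hash_of(key) + i) % CAP;
        if (slots[idx] == EMPTY)
            return false;
        if (slots[idx] == key) {
            slots[idx] = TOMBSTONE; /* keep the chain intact for later finds */
            return true;
        }
    }
    return false;
}
```

The tombstone marker is what keeps lookups correct after a deletion in the middle of a probe chain: a find must keep walking past deleted slots rather than stop at the first gap.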

See #460 for previous attempt.

Fixes #418
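For context, the 64-bit FNV-1a hash added in the first commit can be sketched as below. The constants are the standard FNV-1a offset basis and prime; the function name and seed-parameter shape are assumptions, not bpfilter's exact API:

```c
#include <stddef.h>
#include <stdint.h>

#define FNV1A_INIT 0xcbf29ce484222325ULL
#define FNV1A_PRIME 0x100000001b3ULL

/* 64-bit FNV-1a over an arbitrary byte buffer. Passing the previous result
 * as `seed` allows chaining hashes over multiple buffers. */
static uint64_t fnv1a(const void *data, size_t len, uint64_t seed)
{
    const uint8_t *bytes = data;
    uint64_t hash = seed;

    for (size_t i = 0; i < len; ++i) {
        hash ^= bytes[i];    /* XOR the byte in first... */
        hash *= FNV1A_PRIME; /* ...then multiply: that order is the "1a" variant */
    }

    return hash;
}
```

With these constants, hashing the single byte 'a' from the standard offset basis yields 0xaf63dc4c8601ec8c — the known-answer value the review later refers to.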

@pzmarzly pzmarzly requested a review from qdeslandes as a code owner March 12, 2026 15:00
@pzmarzly pzmarzly marked this pull request as draft March 12, 2026 15:00
@meta-cla meta-cla bot added the cla signed label Mar 12, 2026
@github-actions

github-actions bot commented Mar 12, 2026

Claude review of PR #470 (a34983a)

Must fix

  • foreach_break test asserts on unmodified variable — Fixed: count is now incremented before break, so count == 1 is correct.
  • No test for bf_hashset_take — Fixed: take test now in tests/unit/libbpfilter/core/hashset.c:272.
  • void * pointer arithmetic in bf_vector_foreach will break pedantic C17 integration test (dismissed) — Maintainer explicitly requested not using char * casts; both maintainer and author agreed this is a future topic.
  • Uninitialized old_slots in _bf_hashset_grow causes heap corruption on OOM — Fixed: old_slots is now a plain void ** without cleanup attribute (bf_vector removed from PR). Assigned before use.
  • Unchecked bf_hashset_add return in _bf_hashset_grow — Fixed: return value is now checked with if (r) { free(old_slots); return r; }. See new suggestion below about the error path's element-loss subtlety.
  • Missing <assert.h> in vector.h — Fixed: vector.h was removed from this PR (bf_vector split into a separate PR per reviewer feedback).
  • Load factor threshold is 50%, not 70% as described (src/libbpfilter/core/hashset.c:18) — _BF_HASHSET_MAX_LOAD_NUM=5 / _BF_HASHSET_MAX_LOAD_DEN=10 triggers growth at 50% occupancy, not the 70% stated in the PR description. Either update the constants to 7/10 for 70%, or correct the documentation. This also affects bf_hashset_reserve, which computes count * 2 slots instead of count * 1.43.

Suggestions

  • Tombstone sentinel (void *)1 hardcoded in public macro — Fixed: bf_set_foreach now delegates to bf_hashset_foreach, which uses the bf_hashset_slot_is_tombstone() function. No literal sentinel remains in public headers.
  • Inconsistent typedef convention between bf_hashset and bf_vector — Fixed: both now use typedef struct bf_* pattern consistently.
  • Missing boundary test for bf_vector_new with elem_size == 0 — Fixed: new_zero_elem_size test now in tests/unit/libbpfilter/core/vector.c.
  • No test coverage for bf_vector_take and bf_vector_set_len — Fixed: take and set_len tests now in tests/unit/libbpfilter/core/vector.c.
  • Commits lack body explaining "why" — The style guide says "Use the commit description to explain why the change is necessary." The PR now has 4 commits; commit 1 (lib: core: move bf_list into subfolder), commit 3 (lib: core: add bf_hashset), and commit 4 (lib: set: use bf_hashset for elems field) have no body. Commit 2 has a body, but it describes "what" more than "why".
  • No test for bf_vector_data — Fixed: data test now in tests/unit/libbpfilter/core/vector.c.
  • No test for bf_hashset with NULL free callback (tests/unit/libbpfilter/core/hashset.c) — All tests use a non-NULL .free callback. The if (set->ops.free) branch in bf_hashset_clean is never exercised.
  • Redundant forward typedef and non-standard cleanup macro placement in vector.h — Fixed: forward typedef removed; cleanup macros now placed after struct definition.
  • bf_set_add_elem_raw performs a redundant allocation and copy — Fixed: bf_set_add_elem_raw now calls bf_hashset_add directly instead of going through bf_set_add_elem, avoiding the extra malloc+memcpy.
  • bf_hashset_take doc exposes internal tombstone sentinel value — Fixed: doc now references bf_hashset_slot_is_tombstone instead of the concrete (void *)1 value.
  • No test for bf_set_name, bf_set_n_comps, bf_set_key — Fixed: getters test now in tests/unit/libbpfilter/set.c:433-441.
  • No test for tombstone accumulation forcing rehash (tests/unit/libbpfilter/core/hashset.c) — The remove_and_readd test covers a single remove-readd cycle but does not exercise the scenario where many removals cause tombstone accumulation, increasing n_used until the load factor triggers _bf_hashset_grow. Consider a stress test that inserts, removes many elements, then inserts again to verify the grow-and-compact path.
  • bf_set_is_empty not explicitly tested (tests/unit/libbpfilter/set.c) — bf_set_is_empty was modified in this PR (switched from bf_list_is_empty to bf_hashset_is_empty) but no test calls it directly. Consider adding assertions in the existing test functions.
  • API break: bpfilter/list.h renamed without compat header (dismissed) — The bf_list header rename is no longer part of this PR; it was split into a separate change.
  • bf_set_foreach not tested in set unit tests — Moot: bf_set_foreach does not exist in the current diff; callers use bf_hashset_foreach directly on set->elems.
  • _bf_hashset_insert_unchecked implicit tombstone-free precondition — Fixed: _bf_hashset_insert_unchecked was removed; _bf_hashset_grow now calls bf_hashset_add directly.
  • void * pointer arithmetic in vector.c functions (dismissed) — Maintainer explicitly requested not using char * casts.
  • bf_set_add_elem_raw not directly tested (tests/unit/libbpfilter/set.c) — This public function is only called indirectly through bf_set_new_from_raw. Its error paths and duplicate-element early return lack direct test coverage.
  • _bf_hashset_grow error path leaves hashset in partially-migrated state — Fixed: error path now correctly restores old_slots, cap, len, and slots_in_use before freeing new_slots. No elements are lost on failure. The error path is unreachable in practice (doubled capacity prevents nested grow, no duplicates possible), but the rollback logic is correct.
  • FNV-1a test lacks known-answer reference value — Fixed: test now asserts FNV-1a of 'a' equals 0xaf63dc4c8601ec8cULL.
  • _clean_bf_hashset_ has no safe default initializer — Fixed: bf_hashset_default() macro now defined at hashset.h:53.
  • _bf_hashset_index lacks cap > 0 precondition (dismissed) — Author states asserts are only for NULL-ness checks.
  • bf_hashset_slot_is_tombstone could be static inline in the header — Fixed: function removed from public API. Now _bf_hashset_slot_is_tombstone (static) in hashset.c. bf_hashset_foreach uses linked-list traversal and no longer needs tombstone checking.
  • bf_hashset_foreach should document iteration safety — Fixed: hashset.h:76 now reads "Do not add or remove elements during iteration."
  • Integer overflow in _bf_hashset_needs_grow load-factor check — Fixed: the _BF_HASHSET_MAX_ELEMS cap (100M elements) ensures slots_in_use * 10 cannot overflow size_t even on 32-bit platforms. Maximum capacity is ~256M, so 256M * 10 = 2.56B, which is below the 32-bit SIZE_MAX of ~4.29B.
  • Tombstone sentinel (void *)1 not rejected by bf_hashset_add — Fixed: bf_hashset_add now checks if (data == _BF_HASHSET_TOMBSTONE) and returns -EINVAL.
  • File-static _bf_set_elem_size makes _bf_program_load_sets_maps not thread-safe — Fixed: global removed. _bf_program_load_sets_maps now uses bf_set_foreach and set->elem_size directly, no qsort needed.
  • No test for removing all elements and verifying empty state — Fixed: the foreach_remove test adds 5 elements, removes all during iteration, and asserts bf_hashset_is_empty(&set).
  • Unbounded probing loop in bf_hashset_add — Fixed: the slot-search in bf_hashset_add now uses a bounded for (size_t i = 0; i < set->cap; ++i) loop, consistent with _bf_hashset_find.
  • Tombstone encoding exposed in public struct field doc — Fixed: bf_hashset_elem.data doc now reads "Data pointer, tombstone, or NULL" without exposing the concrete sentinel value.
  • bf_set_add_elem duplicate-silencing behavior not tested — Fixed: add_duplicate test in tests/unit/libbpfilter/set.c inserts the same element twice via bf_set_add_elem and asserts size remains 1.
  • bf_hashset_remove dangling pointer after free callback (src/libbpfilter/core/hashset.c:277) — After set->ops.free(&elem->data, ...) runs, the data parameter may be a dangling pointer if the caller passed elem->data from the slot being removed (as foreach_remove does). The code is correct today because data is not used after the free call, but a future edit could introduce a use-after-free. Consider adding a safety comment.
  • bft_set_eq compares insertion order, not set equality — Fixed: renamed to bft_set_eq_ordered, which accurately conveys the positional comparison semantics.
  • Double probe in bf_hashset_add (src/libbpfilter/core/hashset.c:217) — bf_hashset_add calls bf_hashset_contains (a full probe via _bf_hashset_find), then performs a second linear-probe scan to locate a free slot for insertion. Consider extending _bf_hashset_find to also track the first available slot, allowing bf_hashset_add to skip the second scan.
  • _bf_hashset_grow rehash performs redundant checks — Fixed: _bf_hashset_resize now iterates the linked list directly and inserts into new_slots without going through bf_hashset_add, avoiding redundant duplicate/load-factor checks.
  • Document silent -EEXIST handling in set add functions (dismissed) — Author states the behavior was already present.
  • _BF_HASHSET_MAX_ELEMS lacks rationale — Fixed: now _BF_HASHSET_MAX_CAP with a comment explaining it's the largest power-of-two that leaves headroom for load-factor arithmetic without overflowing size_t.
  • bf_hashset_reserve overflow in needed computation (src/libbpfilter/core/hashset.c:206) — count * _BF_HASHSET_MAX_LOAD_DEN can overflow size_t when count > SIZE_MAX / 10, wrapping needed to a small value and silently under-allocating. An overflow guard (if (count > SIZE_MAX / _BF_HASHSET_MAX_LOAD_DEN) return -ENOMEM;) would harden this public API.
  • bf_hashset_elem Doxygen says "slot" but it's an element node (src/libbpfilter/include/bpfilter/core/hashset.h:42) — The @brief says "A single slot in the hashset's backing array" but bf_hashset_elem is not a slot. The slots array holds bf_hashset_elem * pointers; the struct itself is a separately-allocated element node with data plus linked-list pointers.

Nits

  • Commit 1 body describes "what" not "why" — Body is "A simple data hashing function. C stdlib lacks one." which restates the title. Should explain why FNV-1a was chosen or why a hash function is needed.
  • Commit 4 scope mismatch — Commit lib: set: use bf_hashset for elems field modifies files under src/bfcli/, src/bpfilter/cgen/, and tests/ but uses the lib: component prefix. Project history uses comma-separated prefixes when multiple components are touched.
  • Use bitmask instead of modulo for power-of-two capacity (dismissed) — Author prefers keeping % to avoid breakage if capacity stops being power-of-two.
  • Three getter declarations packed without blank line separators — Fixed: each getter now has a full Doxygen docblock acting as separator.
  • bf_hashset_foreach cannot be nested — Fixed: current implementation uses only the user-provided elem_var name with no hidden internal variables, so nesting with different variable names works correctly.
  • core/vector.* entries not grouped with other core/ entries in CMakeLists.txt — Fixed: all core/ entries now grouped together in both src/libbpfilter/CMakeLists.txt and tests/unit/CMakeLists.txt.
  • Commit 2 scope mismatch (dismissed) — The lib: core: move bf_list into subfolder commit is no longer part of this PR.
  • Markdown bold in Doxygen — Fixed: **not** no longer appears in the current code.
  • Line length in vector.h comment — Fixed: no line exceeds 80 characters.
  • Commit 4 title uses vague "some getters" — Fixed: title is now "lib: set: use bf_hashset for elems field".
  • Unnecessary (void **) cast on calloc — Fixed: the (void **) cast on calloc in _bf_hashset_grow has been replaced with (bf_hashset_elem *).
  • Mid-function variable declaration in _bf_hashset_grow — Fixed: all variables now declared at top of function.
  • _bf_hashset_slot_is_live takes void * instead of const void * — Fixed: now takes const void *.
  • bf_hashset_ops and bf_hashset structs lack @brief Doxygen — Fixed: both now have @brief doc blocks.
  • bf_hashset_init doc says "stack-allocated" — Fixed: now reads "Initialise a hashset in place".
  • BF_FNV1A_PRIME exposed in public header (dismissed) — Author wants constants together.
  • Copyright year missing in new files (dismissed) — Project moving to yearless model.
  • bf_fnv1a with len = 0 edge case not tested — Fixed: test at helper.c:371 now asserts bf_fnv1a(&val_a, 0, BF_FNV1A_INIT) == BF_FNV1A_INIT.
  • Mid-block variable declaration in fnv1a_hash test — Fixed: all variables are now declared at the top of the block before any executable statements.
  • Missing explicit #include <errno.h> in hashset.c — Fixed: #include <errno.h> is now present at hashset.c:8.
  • bf_hashset_default macro missing Doxygen — Fixed: the macro now has a full @brief, @param, and @return Doxygen block.
  • Commit 3 scope mismatch — Commit lib: core: add bf_hashset (6cca447) also modifies tests/unit/libbpfilter/core/hashset.c and .clang-format. Per project convention, the prefix should include all affected components: lib,tests: core: add bf_hashset.
  • Multi-line comment uses // style — Fixed: load factor comment at hashset.c:16 now uses /* */ style.
  • _dump_n naming convention — Fixed: variable renamed to dump_idx.
  • bf_hashset_elem.data doc incorrectly mentions tombstone and NULL (src/libbpfilter/include/bpfilter/core/hashset.h:46) — The inline comment says "Data pointer, tombstone, or NULL" but elem->data never holds a tombstone. Tombstones are sentinel values at the slots[] level, not inside the data field. Suggest: "User-provided data pointer."
  • Unnecessary calloc cast in _bf_hashset_resize (src/libbpfilter/core/hashset.c:48) — The calloc return is cast to (bf_hashset_elem **), but the other two allocation sites in this file and the wider codebase never cast malloc/calloc returns.
  • _bf_hashset_resize missing assert(set) (src/libbpfilter/core/hashset.c:44) — The only static function in this file without assert(set) at the top. All other static helpers have it.

CLAUDE.md improvements

  • Style guide says "Use backticks to reference function, variable, and parameter names" in Doxygen, but @c is used in new code at core/hashset.h (7 occurrences). The existing codebase has ~16 @c uses and ~203 backtick uses — backticks are the dominant convention. Consider adding an explicit note in CLAUDE.md about preferring backticks over @c and @ref for inline code references.
  • The codebase is inconsistent about __attribute__((cleanup(...))) vs __attribute__((__cleanup__(...))) — the new core/hashset.h uses the non-underscored form (matching core/list.h), while the majority of other headers use the double-underscore form. Consider documenting the preferred form in CLAUDE.md.
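For reference, both attribute spellings behave identically in GCC and Clang, so the choice is purely a style-guide matter. A minimal sketch (helper names are illustrative, not bpfilter's):

```c
#include <stdlib.h>

/* Cleanup handler: receives a pointer to the variable going out of scope. */
static void freep(void **p)
{
    free(*p);
}

static int cleanup_demo(void)
{
    /* `cleanup(freep)` and `__cleanup__(freep)` are interchangeable here:
     * GCC/Clang accept attribute names with or without double underscores.
     * `buf` is freed automatically when it goes out of scope. */
    __attribute__((cleanup(freep))) void *buf = malloc(16);

    if (!buf)
        return -1;
    return 0;
}
```

The double-underscore form is conventionally preferred in headers because `cleanup` could collide with a user-defined macro, which may be worth noting in CLAUDE.md if that is the rationale.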

Resolved from prior review

  • FNV-1a hash seeded with 0 instead of BF_FNV1A_INIT — Fixed: all calls now use BF_FNV1A_INIT.
  • bf_hashset_add ownership semantics ambiguous for duplicates — Fixed: doc now clarifies -EEXIST return and ownership.
  • bf_set_dump lost bf_dump_prefix_last call for last element — Fixed: uses _dump_n counter.
  • Duplicated slot-liveness check between header and implementation — Resolved: inline removed from header.
  • _BF_HASHSET_TOMBSTONE exposed in public header — Resolved: constant moved to .c file; bf_hashset_slot_is_tombstone() function used in macro.
  • Memory leak in bf_set_add_elem when bf_hashset_add fails during grow — Re-examined: _cleanup_free_ on _elem handles the failure path correctly. TAKE_PTR runs only on success. No leak.
  • Variable declared after executable statement in bf_set_dump — Re-examined: _dump_n is at the top of the function, before assert(). Placement is correct.
  • assert() on non-pointer value in bf_vector_new — Re-examined: code uses if (!elem_size) return -EINVAL, not assert. Correct per style guide.
  • bf_hashset_init ctx references elem_size before computation — Re-examined: bf_hashset_init just stores the pointer. elem_size is fully computed before any hash/equal calls. Safe.
  • Serialization order now hash-table-dependent — Re-examined: pack/unpack is order-independent. bft_set_eq now uses bf_hashset_contains instead of ordered comparison. No regression.
  • bf_set_get_key_comp lacks Doxygen — Dismissed: bf_set_key is a trivial getter; style guide permits skipping Doxygen for these.
  • Redundant bf_hashset_contains check in bf_set_add_elem — Dismissed: the pre-check avoids an unnecessary malloc + memcpy + free cycle when the element already exists. Intentional optimization.


@pzmarzly pzmarzly force-pushed the push-mlqkpqspnupl branch from 314af18 to 02938a3 on March 12, 2026 23:56
Contributor

@qdeslandes qdeslandes left a comment


Alright, first pass of review and a few things to fix. I'll do a second, deeper pass, when those are solved. That being said, it's a very welcome addition! :D

@qdeslandes qdeslandes mentioned this pull request Mar 13, 2026
@pzmarzly pzmarzly force-pushed the push-mlqkpqspnupl branch from 02938a3 to 1aae907 on March 13, 2026 15:07
@pzmarzly pzmarzly changed the title from "lib: create data_structures directory, implement hashset data structure" to "lib: create directory for data structures, implement hashset data structure" Mar 13, 2026
@pzmarzly pzmarzly changed the title from "lib: create directory for data structures, implement hashset data structure" to "lib: create directory for data structures, implement hashset" Mar 13, 2026
@pzmarzly pzmarzly force-pushed the push-mlqkpqspnupl branch from 1aae907 to 550c254 on March 13, 2026 15:46
@pzmarzly pzmarzly force-pushed the push-mlqkpqspnupl branch from 550c254 to d6883e1 on March 16, 2026 22:13
Contributor

@yaakov-stein yaakov-stein left a comment


Super neat implementation! I haven't seen an insertion-order hashset like this implemented before. The first two commits and the last commit LGTM. I have a few nits and some questions on hashset.h/c, but I think this is pretty much ready. (Haven't reviewed the tests yet, will do that tomorrow.)

Contributor


Now that we keep insertion order, should we add a simple bf_hashset_take?

Contributor Author


What should it give us, bf_hashset_elem*? I'm not sure how useful that is.

Contributor


We have bf_hashset_remove, which already takes ownership of a single element of data. I was thinking it would be more along the lines of bf_dynbuf_take (where it takes all of the data), but I think we can wait to see if Quentin thinks that'd be helpful

@pzmarzly pzmarzly changed the title from "lib: create directory for data structures, implement hashset" to "lib: implement hashset" Mar 25, 2026
@pzmarzly pzmarzly force-pushed the push-mlqkpqspnupl branch from 8c9959f to 0e7595e on March 25, 2026 22:35
if (!_bf_hashset_find(set, data, &idx))
    return -ENOENT;

slot = &set->slots[idx];


Claude: suggestion: In bf_hashset_remove, after set->ops.free(&slot->data, ...) runs, the data parameter passed to this function may be a dangling pointer if the caller passed elem->data from a bf_hashset_foreach loop (as the foreach_remove test does). The code is currently correct because data is not used after the free callback runs, but this is fragile — a future change that reads data after this point would introduce a use-after-free.

Consider adding a comment:

    // Note: the free callback may invalidate the `data` parameter if it
    // points into the same allocation as slot->data. Do not use `data`
    // after this point.
    if (set->ops.free)
        set->ops.free(&slot->data, set->ctx);

Contributor Author


bf_list_foreach(list, elem) bf_list_delete(elem) has the same problem. I think that after you call a function called *_delete, you should expect the element to be unusable.

A simple data hashing function. C stdlib lacks one.
@pzmarzly pzmarzly force-pushed the push-mlqkpqspnupl branch from 0e7595e to ef58a62 on March 26, 2026 14:26

if (data == _BF_HASHSET_TOMBSTONE) {
    return bf_err_r(
        -EINVAL, "data pointer collides with internal tombstone sentinel");


Claude: suggestion: bf_hashset_add probes the table twice for every non-duplicate insertion: once via bf_hashset_contains (which calls _bf_hashset_find), then again in the for loop at line 228 to locate a free slot. Both walks start from the same hash index and traverse the same probe chain.

Consider extending _bf_hashset_find to also return the index of the first tombstone or empty slot it encounters during probing, so bf_hashset_add can reuse that index and skip the second scan.
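The single-pass idea above can be sketched over a simplified int-keyed table (names, sentinels, and the fixed capacity are illustrative, not bf_hashset's types):

```c
#include <stdbool.h>
#include <stddef.h>

#define CAP 8
#define EMPTY (-1)
#define TOMBSTONE (-2)

/* One probe pass that reports both whether `key` is present and where it
 * could be inserted. Returns true if found (*insert_idx = its slot);
 * otherwise false, with *insert_idx = the first reusable slot seen
 * (tombstone preferred over trailing empty), or CAP if the table is full. */
static bool find_or_slot(const int *slots, int key, size_t *insert_idx)
{
    size_t first_free = CAP; /* CAP means "no free slot seen yet" */

    for (size_t i = 0; i < CAP; ++i) {
        size_t idx = ((size_t)key % CAP + i) % CAP;

        if (slots[idx] == key) {
            *insert_idx = idx;
            return true;
        }
        if (slots[idx] == TOMBSTONE && first_free == CAP)
            first_free = idx; /* remember it, but keep probing for the key */
        if (slots[idx] == EMPTY) {
            *insert_idx = first_free == CAP ? idx : first_free;
            return false; /* an empty slot ends the chain: key is absent */
        }
    }
    *insert_idx = first_free;
    return false;
}
```

An add then becomes: call find_or_slot once, return -EEXIST on true, otherwise insert at *insert_idx — no second scan needed, and tombstone slots get reclaimed for free.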

grown.cap = new_cap;

bf_hashset_foreach (set, elem) {
    r = bf_hashset_add(&grown, elem->data);


Claude: suggestion: During _bf_hashset_grow, bf_hashset_add(&grown, elem->data) performs a bf_hashset_contains duplicate check and a _bf_hashset_needs_grow load-factor check on every re-inserted element. Both are redundant during rehashing: elements are known unique (they come from the original set), and the new table was just allocated at 2 * old_cap so the load factor cannot trigger a nested grow.

Consider a private _bf_hashset_insert_unchecked helper that skips both checks to avoid the overhead on every rehash.

@pzmarzly
Contributor Author

pzmarzly commented Mar 26, 2026

I let Claude search for optimizations overnight, and it proposed a different representation that has much better cache locality - bf_hashset_elem** elems instead of bf_hashset_elem* elems, i.e. each node is individually allocated, like in bf_list. It makes reads ~40% faster on large sets. Big change coming.

@pzmarzly pzmarzly force-pushed the push-mlqkpqspnupl branch from ef58a62 to a34983a on March 26, 2026 15:38
#define _BF_HASHSET_INIT_CAP 16
/* Maximum load factor before growing. Lowering this number reduces collisions
* but causes higher memory usage. */
#define _BF_HASHSET_MAX_LOAD_NUM 5


Claude: must-fix: The load factor threshold implemented here is 50%, not 70% as stated in the PR description.

_BF_HASHSET_MAX_LOAD_NUM = 5 and _BF_HASHSET_MAX_LOAD_DEN = 10 produce the check slots_in_use * 10 >= cap * 5, which triggers growth when occupancy reaches 50%. The PR description says "grows 2x when load factor exceeds 70%".

If 70% is intended, the constants should be:

#define _BF_HASHSET_MAX_LOAD_NUM 7
#define _BF_HASHSET_MAX_LOAD_DEN 10

This also affects bf_hashset_reserve (line 206), which currently computes count * 10 / 5 = 2 * count slots instead of the count * 10 / 7 ~ 1.43 * count that a 70% threshold would require.

If 50% is intentional, the PR description should be corrected.
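The arithmetic above can be checked with a tiny helper, written here with the PR's current 5/10 constants (the names are illustrative, matching the review's description of the check rather than the exact source):

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_LOAD_NUM 5
#define MAX_LOAD_DEN 10

/* With NUM=5, DEN=10 the condition used*DEN >= cap*NUM fires at exactly
 * 50% occupancy; swapping in 7/10 would defer it to 70%. */
static bool needs_grow(size_t slots_in_use, size_t cap)
{
    return slots_in_use * MAX_LOAD_DEN >= cap * MAX_LOAD_NUM;
}
```

For cap = 16, the check first fires when slots_in_use reaches 8 (8 * 10 >= 16 * 5), i.e. half-full, confirming the 50% threshold.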

if (count == 0)
    return 0;

needed = count * _BF_HASHSET_MAX_LOAD_DEN / _BF_HASHSET_MAX_LOAD_NUM;


Claude: suggestion: count * _BF_HASHSET_MAX_LOAD_DEN can overflow size_t when count is large (greater than SIZE_MAX / 10). For example, count = SIZE_MAX / 10 + 1 wraps needed to a small value, causing the function to silently return 0 without reserving capacity.

A simple guard before the multiplication would prevent this:

if (count > SIZE_MAX / _BF_HASHSET_MAX_LOAD_DEN)
    return -ENOMEM;

In practice the packet-filtering use case won't hit this, but since bf_hashset_reserve is a public API in a generic container, the guard would make the contract more robust.
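A sketch of the guarded computation, using the PR's 5/10 constants and a hypothetical helper name (not the real bf_hashset_reserve):

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_LOAD_NUM 5
#define MAX_LOAD_DEN 10

/* Computes how many slots are needed to hold `count` elements below the
 * load-factor threshold, failing early instead of letting the
 * multiplication wrap size_t. */
static int compute_needed(size_t count, size_t *needed)
{
    if (count == 0) {
        *needed = 0;
        return 0;
    }
    if (count > SIZE_MAX / MAX_LOAD_DEN)
        return -ENOMEM; /* count * MAX_LOAD_DEN would overflow size_t */

    *needed = count * MAX_LOAD_DEN / MAX_LOAD_NUM;
    return 0;
}
```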

} bf_hashset_ops;

/**
* @brief A single slot in the hashset's backing array.


Claude: suggestion: The @brief says "A single slot in the hashset's backing array" but bf_hashset_elem is not a slot. The slots array holds bf_hashset_elem * pointers -- each pointer can be NULL, the tombstone sentinel, or point to a heap-allocated bf_hashset_elem. The struct itself is a separately-allocated element node containing data plus linked-list pointers for insertion-order traversal.

Suggested: @brief Element node stored in a hashset, threaded through an insertion-order linked list.

*/
typedef struct bf_hashset_elem
{
    /// Data pointer, tombstone, or NULL.


Claude: nit: The inline comment says "Data pointer, tombstone, or NULL" but elem->data never holds a tombstone. Tombstones are sentinel values stored at the slots[] level (where each slot is a bf_hashset_elem *), not inside the data field of a live element. A bf_hashset_elem is only allocated for live entries.

Suggested: /// User-provided data pointer.

{
    bf_hashset_elem **new_slots;

    new_slots = (bf_hashset_elem **)calloc(new_cap, sizeof(*new_slots));


Claude: nit: The calloc return is cast to (bf_hashset_elem **), but the two other allocation sites in this file (lines 129, 242) and the wider codebase (e.g. core/list.c, chain.c, cgen.c) never cast malloc/calloc returns. In C, the void * return converts implicitly. Drop the cast for consistency:

new_slots = calloc(new_cap, sizeof(*new_slots));

Contributor


+1 to this as well

    return set->ops.hash(data, set->ctx) % set->cap;
}

static int _bf_hashset_resize(bf_hashset *set, size_t new_cap)


Claude: nit: _bf_hashset_resize is the only static function in this file without an assert(set) at the top. The other static helpers (_bf_hashset_index, _bf_hashset_grow, _bf_hashset_needs_grow, _bf_hashset_find) all start with assert(set). Adding it here would be consistent.

Contributor


+1 on this

Contributor

@yaakov-stein yaakov-stein left a comment


Claude has a few valid nits/suggestions and I have one last comment on the tests. Overall LGTM once those points are taken care of!

/**
* @brief Pre-allocate capacity for at least @p count elements.
*
* Ensures the backing array is large enough to hold @p count elements without
Contributor


nit: backticks a few times in this doxygen

* @return 0 on success, -ENOENT if the element is not found, or a negative
* errno value on failure.
*/
int bf_hashset_delete(bf_hashset *set, const void *data);
Contributor


Just wondering - why delete instead of remove?



Contributor


Regarding the tests, I think there are a couple of areas that are especially complex that we'd want to test (some of which you already have covered):

  • Ensure insertion order is maintained across a grow operation (already covered)
  • Validate that this works for several elements that collide, i.e. their hashes are equal. On top of this, it'd be worth testing that removing some of those colliding elements and creating tombstones within that linear probe doesn't cause any buggy behavior. (I don't believe these cases are covered)

@yaakov-stein
Contributor

I let Claude search for optimizations overnight, and it proposed a different representation that has much better cache locality - bf_hashset_elem** elems instead of bf_hashset_elem* elems, i.e. each node is individually allocated, like in bf_list. It makes reads ~40% faster on large sets. Big change coming.

I'm not necessarily opposed to the change as it simplifies some parts of the code, but I'm confused by the claim that this has better cache locality - shouldn't the cache locality here be much worse? Whenever we need to check for equality we need to load a non-contiguous piece of memory. We also can't take advantage of the spatial locality anymore. Can you explain to me what cases you saw ~40% speedup on?
