fix(serialization): use HashingTag so BSL metadata contributes to content hash #264
Merged
hussainsultan merged 1 commit on May 19, 2026
…tent hash

Closes boringdata#263

Both `to_tagged()` and `reemit()` previously wrapped BSL expressions in a plain Tag node, which xorq's `opaque_node_replacer` strips during content-hash computation. The result: `source` and `source.tag("bsl", **metadata)` produced identical hashes, so two `xorq build` invocations (one for the raw source, one for a BSL model over it) silently overwrote each other under `builds/<hash>/`. The same regression applied to rebuilt artifacts coming out of `reemit`.

Switch both call sites to `Table.hashing_tag(...)` so BSL metadata participates in the hash. HashingTag is a Tag subclass, so existing `isinstance(op, Tag)` checks in the reconstruct path are unaffected.

Tests:
- Drop `@pytest.mark.xfail` from the pre-existing `test_different_measures_produce_different_hashes` in test_xorq_convert.py. That test was xfailed against the bug and now passes (uses xorq's authoritative `compute_expr_hash`).
- Add two new tests in test_xorq_tag_handler.py:
  * test_tagged_op_is_hashing_tag: outer op type is HashingTag.
  * test_tagged_hash_differs_from_untagged_source: tagged hash differs from bare source (covers untagged-vs-tagged, no pre-existing duplicate).
- Add two new reemit regression tests:
  * test_reemit_preserves_hashing_tag: pin that `reemit` re-stamps with HashingTag, so rebuild paths keep the hash contract.
  * test_reemit_hash_distinguishes_metadata: round-trip via `to_tagged → reemit` still produces distinct hashes for distinct metadata.

All new tests use `compute_expr_hash` (xorq's content-hash function) rather than `dask.base.tokenize` for consistency with the existing tests in test_xorq_convert.py.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
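The collision described in this commit message can be illustrated with a small toy model. This is only a sketch, not xorq's implementation: `Source`, `normalize`, `content_hash`, and `tagged` are hypothetical stand-ins, and the `Tag`/`HashingTag` classes here merely mimic the subclass relationship the commit relies on. It assumes the hash is computed over a normalized form of the expression in which plain tags are transparent.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    """Hypothetical stand-in for a source table expression."""
    name: str


@dataclass(frozen=True)
class Tag:
    """Toy plain tag node: carries metadata but is ignored by the hash."""
    parent: object
    tag: str
    metadata: tuple  # sorted (key, value) pairs


class HashingTag(Tag):
    """Toy Tag subclass whose metadata participates in the content hash."""


def normalize(node):
    """Reduce a node to the structure that actually gets hashed."""
    if isinstance(node, HashingTag):
        return ("HashingTag", node.tag, node.metadata, normalize(node.parent))
    if isinstance(node, Tag):
        # A plain Tag is transparent: hash whatever it wraps.
        return normalize(node.parent)
    return ("Source", node.name)


def content_hash(node):
    return hashlib.sha256(repr(normalize(node)).encode()).hexdigest()[:12]


def tagged(cls, source, **metadata):
    """Wrap `source` in a tag node of type `cls` under the "bsl" tag."""
    return cls(source, "bsl", tuple(sorted(metadata.items())))


source = Source("orders")
plain = tagged(Tag, source, measures="total_revenue")
hashing = tagged(HashingTag, source, measures="total_revenue")
other = tagged(HashingTag, source, measures="order_count")

print(content_hash(plain) == content_hash(source))    # True: the collision
print(content_hash(hashing) == content_hash(source))  # False
print(content_hash(hashing) == content_hash(other))   # False
print(isinstance(hashing, Tag))                       # True
```

In this toy model the last line shows why existing `isinstance(op, Tag)` checks keep working after the switch: the hashing variant is still a `Tag`.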
6576c32 to 9beb3e2
ghoersti pushed a commit to ghoersti/boring-semantic-layer that referenced this pull request on May 20, 2026
Patch bump covering join fixes on plain ibis backends (boringdata#222), grain mismatch on `with_measures` over joins (boringdata#261), the ibis-native calc-measure classifier refactor (boringdata#262), and the HashingTag serialization fix (boringdata#264).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes #263
Summary
`to_tagged()` wraps BSL expressions in a tag node so xorq build/catalog tooling can recover the BSL metadata. Previously it used `Table.tag(...)`, which produces a plain `Tag` op. xorq's hash machinery (`opaque_node_replacer` in `dask_normalize_expr`) is transparent to plain `Tag`; only `HashingTag` participates in the content hash. As a result, `source` and `source.tag("bsl", **metadata)` produced identical content hashes, so two `xorq build` invocations (one for a raw source, one for a BSL model on top of the same source) silently overwrote each other under `builds/<hash>/`.

Switching to `Table.hashing_tag(...)` makes BSL metadata participate in the content hash. `HashingTag` is a `Tag` subclass, so existing `isinstance(op, Tag)` checks in `serialization/reconstruct.py` continue to work unchanged.

Change
`src/boring_semantic_layer/serialization/__init__.py`

Tests
Four new regression tests in `tests/test_xorq_tag_handler.py` pin the contract:
- `test_tagged_op_is_hashing_tag`: concrete op type is `HashingTag`, not bare `Tag`.
- `test_tagged_hash_differs_from_untagged_source`: `dask.base.tokenize` of tagged vs. untagged differs.
- `test_tagged_hash_distinguishes_models`: two different BSL models on the same ibis table produce different hashes (the user-visible bug from #263, "to_tagged() uses Tag instead of HashingTag, causing build hash collisions").
- `test_tagged_hash_is_deterministic`: `to_tagged()` is stable across calls.

Verified the first three fail on the pre-fix code (hashes match, collision reproduced) and all four pass on the fix. Full suite: 989 passed, 15 skipped, 10 xfailed, 6 xpassed (was 985; +4 new tests).
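The user-visible failure these tests guard against (two builds landing in the same `builds/<hash>/` directory) can be reproduced with a toy content-addressed store. This is a sketch under stated assumptions, not the `xorq build` implementation: `build`, `count_builds`, the two key functions, and the `expr.json` file name are all hypothetical, chosen only to contrast a key that ignores metadata with one that includes it.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def key_ignoring_metadata(source, metadata):
    # Models the pre-fix behaviour: only the source reaches the hash.
    return hashlib.sha256(source.encode()).hexdigest()[:12]


def key_with_metadata(source, metadata):
    # Models the post-fix behaviour: metadata is hashed together with the source.
    payload = json.dumps({"source": source, "metadata": metadata}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()[:12]


def build(root, key_fn, source, metadata):
    """Write one artifact under <root>/builds/<hash>/ and return its directory."""
    out = Path(root) / "builds" / key_fn(source, metadata)
    out.mkdir(parents=True, exist_ok=True)
    (out / "expr.json").write_text(json.dumps(metadata, sort_keys=True))
    return out


def count_builds(key_fn):
    """Build a raw source and a model over it; count the surviving directories."""
    with tempfile.TemporaryDirectory() as root:
        build(root, key_fn, "orders", {})                        # raw source
        build(root, key_fn, "orders", {"measures": ["total"]})   # model over it
        return len(list((Path(root) / "builds").iterdir()))


print(count_builds(key_ignoring_metadata))  # 1: the second build overwrote the first
print(count_builds(key_with_metadata))      # 2: both artifacts survive
```

The metadata-aware key is also deterministic for equal inputs, which is the property `test_tagged_hash_is_deterministic` pins for `to_tagged()`.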
Test plan
- `pytest src/boring_semantic_layer/tests/` green
- `to_tagged`/`from_tagged` round-trip unchanged (existing tag-handler tests pass)

Follow-up
A separate PR (`george/update/bump-xorq-0-3-24`) will bump `xorq>=0.3.24` and address the test failures surfaced by that bump (renamed `xo.read_parquet` → `xo.deferred_read_parquet` and tightened type inference on `xorq.vendor.ibis` literals).

🤖 Generated with Claude Code