Skip to content

fix(serialization): use HashingTag so BSL metadata contributes to content hash#264

Merged
hussainsultan merged 1 commit into
boringdata:mainfrom
ghoersti:george/fix/tag-to-hashing-tag
May 19, 2026
Merged

fix(serialization): use HashingTag so BSL metadata contributes to content hash#264
hussainsultan merged 1 commit into
boringdata:mainfrom
ghoersti:george/fix/tag-to-hashing-tag

Conversation

@ghoersti
Copy link
Copy Markdown
Contributor

Closes #263

Summary

to_tagged() wraps BSL expressions in a tag node so xorq build/catalog tooling can recover the BSL metadata. Previously it used Table.tag(...), which produces a plain Tag op. xorq's hash machinery (opaque_node_replacer in dask_normalize_expr) is transparent to plain Tag — only HashingTag participates in the content hash. As a result, source and source.tag("bsl", **metadata) produced identical content hashes, so two xorq build invocations — one for a raw source, one for a BSL model on top of the same source — silently overwrote each other under builds/<hash>/.

Switching to Table.hashing_tag(...) makes BSL metadata participate in the content hash. HashingTag is a Tag subclass, so existing isinstance(op, Tag) checks in serialization/reconstruct.py continue to work unchanged.

Change

src/boring_semantic_layer/serialization/__init__.py

- xorq_table = xorq_table.tag(tag="bsl", **tag_data)
+ xorq_table = xorq_table.hashing_tag(tag="bsl", **tag_data)

Tests

Four new regression tests in tests/test_xorq_tag_handler.py pin the contract:

  • test_tagged_op_is_hashing_tag — concrete op type is HashingTag, not bare Tag.
  • test_tagged_hash_differs_from_untagged_sourcedask.base.tokenize of tagged vs. untagged differs.
  • test_tagged_hash_distinguishes_models — two different BSL models on the same ibis table produce different hashes (the user-visible bug from to_tagged() uses Tag instead of HashingTag, causing build hash collisions #263).
  • test_tagged_hash_is_deterministicto_tagged() is stable across calls.

Verified the first three fail on the pre-fix code (hashes match, collision reproduced) and all four pass on the fix. Full suite: 989 passed, 15 skipped, 10 xfailed, 6 xpassed (was 985 — +4 new tests).

Test plan

  • New regression tests fail without the fix and pass with it
  • Full pytest src/boring_semantic_layer/tests/ green
  • to_tagged / from_tagged round-trip unchanged (existing tag-handler tests pass)

Follow-up

A separate PR (george/update/bump-xorq-0-3-24) will bump xorq>=0.3.24 and address the test failures surfaced by that bump (renamed xo.read_parquetxo.deferred_read_parquet and tightened type inference on xorq.vendor.ibis literals).

🤖 Generated with Claude Code

…tent hash

Closes boringdata#263

Both `to_tagged()` and `reemit()` previously wrapped BSL expressions in
a plain Tag node, which xorq's `opaque_node_replacer` strips during
content-hash computation. The result: `source` and
`source.tag("bsl", **metadata)` produced identical hashes, so two
`xorq build` invocations (one for the raw source, one for a BSL model
over it) silently overwrote each other under `builds/<hash>/`. The
same regression applied to rebuilt artifacts coming out of `reemit`.

Switch both call sites to `Table.hashing_tag(...)` so BSL metadata
participates in the hash. HashingTag is a Tag subclass, so existing
`isinstance(op, Tag)` checks in the reconstruct path are unaffected.

Tests:
  - Drop `@pytest.mark.xfail` from the pre-existing
    `test_different_measures_produce_different_hashes` in
    test_xorq_convert.py — that test was xfailed against the bug and
    now passes (uses xorq's authoritative `compute_expr_hash`).
  - Add two new tests in test_xorq_tag_handler.py:
      * test_tagged_op_is_hashing_tag — outer op type is HashingTag.
      * test_tagged_hash_differs_from_untagged_source — tagged hash
        differs from bare source (covers untagged-vs-tagged, no
        pre-existing duplicate).
  - Add two new reemit regression tests:
      * test_reemit_preserves_hashing_tag — pin that `reemit` re-stamps
        with HashingTag, so rebuild paths keep the hash contract.
      * test_reemit_hash_distinguishes_metadata — round-trip via
        `to_tagged → reemit` still produces distinct hashes for distinct
        metadata.

All new tests use `compute_expr_hash` (xorq's content-hash function)
rather than `dask.base.tokenize` for consistency with the existing
tests in test_xorq_convert.py.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ghoersti ghoersti force-pushed the george/fix/tag-to-hashing-tag branch from 6576c32 to 9beb3e2 Compare May 19, 2026 04:12
@hussainsultan hussainsultan merged commit 68a65d3 into boringdata:main May 19, 2026
3 checks passed
ghoersti pushed a commit to ghoersti/boring-semantic-layer that referenced this pull request May 20, 2026
Patch bump covering join fixes on plain ibis backends (boringdata#222), grain
mismatch on `with_measures` over joins (boringdata#261), the ibis-native calc-measure
classifier refactor (boringdata#262), and the HashingTag serialization fix (boringdata#264).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

to_tagged() uses Tag instead of HashingTag, causing build hash collisions

2 participants