Migrate DISJOIN to a registered expander and solve the star-REPLACE portability limitation (Closes #143, #153) #159

Merged: conradbzura merged 1 commit into main from 143-migrate-disjoin-expander on Jun 29, 2026

@conradbzura (Collaborator) commented Jun 28, 2026

Summary

Migrate DISJOIN from the emit-time BaseGIQLGenerator.giqldisjoin_sql string special-case (left by epic #114) to a generic registered AST expander, and make the full-row passthrough capability-driven so non-canonical encodings produce portable SQL on engines that lack SELECT * REPLACE.

Register one expand_disjoin against GenericTarget, so every target resolves to it through the registry's generic chain. The expander assembles the same __giql_dj_* WITH-CTE subquery, parses it back into a sqlglot expression, and returns that node for the active target's serializer to render — dissolving the emit-time string special-case. A single capability branch on ctx.capabilities.supports_star_replace selects the passthrough projection form: emit t.* REPLACE (...) on a target that supports it (DuckDB), and the portable t.* EXCEPT (start, end), <recomputed start>, <recomputed end> form otherwise, which every * EXCEPT-capable engine plans. The portable branch is what adds DataFusion support for non-canonical DISJOIN passthrough. Input canonicalization stays owned by CanonicalizeCoordinates (pass 2, #122) — the expander consumes already-canonical 0-based half-open columns and only round-trips the output back into the target's declared encoding.
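The capability branch can be pictured with a small string-building sketch. This is not the expander's code: the real implementation assembles the SQL and parses it with sqlglot, while `Capabilities`, `passthrough_projection`, and the example column expressions below are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capabilities:
    """Hypothetical stand-in for the flags exposed on ctx.capabilities."""

    supports_star_replace: bool


def passthrough_projection(
    caps: Capabilities,
    start_col: str,
    end_col: str,
    start_expr: str,
    end_expr: str,
    identity: bool = False,
) -> str:
    """Return the full-row passthrough projection for one target."""
    if identity:
        # 0-based half-open targets need no de-canonicalization.
        return "t.*"
    if caps.supports_star_replace:
        # Swap the interval columns in place.
        return f"t.* REPLACE ({start_expr} AS {start_col}, {end_expr} AS {end_col})"
    # Portable form: drop the interval columns, then re-project them.
    return (
        f"t.* EXCEPT ({start_col}, {end_col}), "
        f"{start_expr} AS {start_col}, {end_expr} AS {end_col}"
    )


# Assumed example: a 1-based closed target, so canonical start + 1 and end unchanged.
args = ('"start"', '"end"', "d.disjoin_start + 1", "d.disjoin_end")
print(passthrough_projection(Capabilities(True), *args))
print(passthrough_projection(Capabilities(False), *args))
```

The two forms are not column-order equivalent: REPLACE keeps the interval columns in their original positions, whereas the EXCEPT form appends the recomputed columns after the remaining ones.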

Fix #153 by aliasing all four projected columns (kc / ks / ke / pos) in every __giql_dj_cuts UNION branch. Previously the bare de-canonicalized end column and the end-cut expression collided under one output name in the default 0-based half-open identity case; DuckDB tolerated the duplicate, but DataFusion rejected it as a non-unique projection name. With every branch aliased, the projection is internally unique on strict engines and behaviour-preserving on DuckDB. As a result, the cross-target oracle's previously-pinned _pending_153 expected-failure is promoted to a real three-target identity test.

Part of epic #137, wave 3. This PR carries the shared ExpanderRegistry.snapshot()/restore() seam that sibling wave-3 PRs also include; deduplicate on merge.

Closes #143, #153

Proposed changes

Registry save/restore seam (src/giql/expander.py)

Add the public ExpanderRegistry.snapshot() / ExpanderRegistry.restore() methods, first introduced for these fixtures. snapshot() returns a fresh shallow copy of the (target, operator) → expander registrations; restore() drops all current entries and re-installs exactly the snapshot contents. This lets an isolating test fixture (or a plugin) capture the import-time baseline, mutate the process-wide REGISTRY around a body, and hand the baseline back afterward so the built-in expanders survive a fixture that would otherwise clear() them permanently.
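A minimal sketch of the save/restore seam, assuming a plain dict keyed by (target, operator) strings. The class below only mirrors the method names from this PR; its internals, the string keys, and the `isolated_registry` helper are assumptions for illustration, not the code in src/giql/expander.py.

```python
from contextlib import contextmanager
from typing import Callable, Dict, Iterator, Optional, Tuple

Key = Tuple[str, str]  # (target, operator); strings here purely for illustration
Expander = Callable[..., object]


class ExpanderRegistry:
    """Toy registry mirroring the snapshot()/restore() contract."""

    def __init__(self) -> None:
        self._entries: Dict[Key, Expander] = {}

    def register(self, target: str, operator: str, expander: Expander) -> None:
        self._entries[(target, operator)] = expander

    def resolve(self, target: str, operator: str) -> Optional[Expander]:
        return self._entries.get((target, operator))

    def clear(self) -> None:
        self._entries.clear()

    def snapshot(self) -> Dict[Key, Expander]:
        # A fresh shallow copy, so later registrations do not leak into it.
        return dict(self._entries)

    def restore(self, snapshot: Dict[Key, Expander]) -> None:
        # Drop everything, then re-install exactly the snapshot contents in place.
        self._entries.clear()
        self._entries.update(snapshot)


@contextmanager
def isolated_registry(registry: ExpanderRegistry) -> Iterator[ExpanderRegistry]:
    """Hypothetical fixture helper: empty the registry for a body, then restore it."""
    baseline = registry.snapshot()
    try:
        registry.clear()
        yield registry
    finally:
        registry.restore(baseline)
```

Used as `with isolated_registry(REGISTRY): ...`, a test can register throwaway expanders and still leave the import-time baseline intact afterward.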

giql.expanders package + DISJOIN expander (src/giql/expanders/__init__.py, src/giql/expanders/disjoin.py)

Add the giql.expanders package whose __init__ auto-imports every submodule via pkgutil.iter_modules, so dropping a <operator>.py into the package registers its expander as an import side effect without editing the package file. Add disjoin.py with the @register(GenericTarget, GIQLDisjoin) expander and its helpers (_build_disjoin_sql, _disjoin_passthrough, _disjoin_output_encoding, _disjoin_resolution), carrying over the original resolution-unpacking and historical diagnostics verbatim. The passthrough is the capability-driven form described in the summary; the identity 0-based half-open case stays a plain t.* fast path.
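The auto-import pattern can be sketched with the standard library alone. The helper name `import_submodules` is hypothetical; a package `__init__` could run the same loop over its own `__path__` and `__name__`, which is roughly what the description above implies, though the actual file may differ.

```python
import importlib
import pkgutil
from types import ModuleType
from typing import List


def import_submodules(package: ModuleType) -> List[str]:
    """Import every direct submodule of `package` and return their dotted names.

    Importing a submodule runs its module-level code, so any decorator-based
    registration inside it happens as a side effect.
    """
    imported = []
    for info in pkgutil.iter_modules(package.__path__):
        full_name = f"{package.__name__}.{info.name}"
        importlib.import_module(full_name)
        imported.append(full_name)
    return sorted(imported)
```

The trade-off of this design is that registration order follows module discovery order, and an import error in any one submodule surfaces when the package itself is imported.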

#153 alias fix (isolated in disjoin.py's __giql_dj_cuts assembly)

Alias kc / ks / ke / pos in all three __giql_dj_cuts UNION branches. This is an isolated, cherry-pickable change: it only adds aliases to existing projections and does not depend on the migration or the capability branch.
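To illustrate why per-branch aliases matter, here is a hedged sketch that builds one cuts branch with every column aliased. The alias names come from this PR; the helper names, the column expressions, and the source table are illustrative assumptions, since the actual branch bodies are not shown here.

```python
from typing import Sequence

CUT_ALIASES = ("kc", "ks", "ke", "pos")


def cuts_branch(exprs: Sequence[str], source: str) -> str:
    """Project exactly four expressions, each under its fixed alias."""
    if len(exprs) != len(CUT_ALIASES):
        raise ValueError("expected one expression per cut column")
    select_list = ", ".join(
        f"{expr} AS {alias}" for expr, alias in zip(exprs, CUT_ALIASES)
    )
    return f"SELECT {select_list} FROM {source}"


def cuts_union(branches: Sequence[str]) -> str:
    """Join already-aliased branches into one UNION body."""
    return " UNION ".join(branches)


# Assumed identity-encoding end-cut branch: the end column appears twice,
# once as the interval end (ke) and once as the cut position (pos).
end_cut = cuts_branch(('t.chrom', 't."start"', 't."end"', 't."end"'), "target AS t")
print(end_cut)
```

Without the aliases, both occurrences of the end column would surface under the same output name, which is the collision a strict engine rejects.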

GIQLDisjoin.GIQL_EXPAND flip + legacy deletion (src/giql/expressions.py, src/giql/generators/base.py, src/giql/transpile.py)

Flip GIQLDisjoin.GIQL_EXPAND from the disabled sentinel to True, so the ExpandOperators pass replaces the node with the expander's AST. Delete BaseGIQLGenerator.giqldisjoin_sql and the DISJOIN-only generator helpers (_disjoin_resolution, _disjoin_passthrough, _disjoin_output_encoding) plus the now-unused GIQLDisjoin import. Wire import giql.expanders in transpile.py so the registry is populated before the first transpile.
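The flag-driven dispatch can be sketched with a toy tree. Everything here is a simplified assumption: the real pass operates on sqlglot expressions and class-based targets, and the generic fallback below is only one plausible reading of the registry's generic chain.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Node:
    """Hypothetical minimal AST node."""

    name: str
    children: List["Node"] = field(default_factory=list)
    GIQL_EXPAND = False  # class-level flag, not a dataclass field


class Disjoin(Node):
    GIQL_EXPAND = True  # opt in to expansion


REGISTRY: Dict[Tuple[str, type], Callable[[Node], Node]] = {}


def register(target: str, node_type: type):
    def decorator(fn: Callable[[Node], Node]) -> Callable[[Node], Node]:
        REGISTRY[(target, node_type)] = fn
        return fn

    return decorator


def resolve(target: str, node_type: type) -> Optional[Callable[[Node], Node]]:
    # Assumed fallback: an exact target match wins, otherwise the generic entry.
    return REGISTRY.get((target, node_type)) or REGISTRY.get(("generic", node_type))


def expand_operators(node: Node, target: str) -> Node:
    """Replace every flagged node with the AST its expander returns."""
    node.children = [expand_operators(child, target) for child in node.children]
    if type(node).GIQL_EXPAND:
        expander = resolve(target, type(node))
        if expander is None:
            raise LookupError(f"no expander registered for {type(node).__name__}")
        return expander(node)
    return node


@register("generic", Disjoin)
def expand_disjoin(node: Node) -> Node:
    # Stand-in for the parsed WITH-CTE subquery.
    return Node("subquery", [Node("with_cte")])
```

Calling `expand_operators(Node("select", [Disjoin("disjoin")]), "duckdb")` would resolve through the generic entry and swap the `Disjoin` node for the returned subquery node.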

Test updates

Update test_disjoin_transpilation.py, test_canonicalizer.py, and test_expander.py for the registry-driven path, and add the capability-passthrough and snapshot()/restore() coverage. Two execute-on-engine harnesses now transpile with the engine dialect: test_usage_patterns.py (_execute) and coordinate_space/conftest.py (giql_query) pass dialect=engine / dialect="duckdb". A non-canonical DISJOIN passthrough emits * EXCEPT for the generic target, and * EXCEPT is not DuckDB-runnable, so the SQL must be shaped for the engine it executes on. Promote the cross-target oracle's test_disjoin_on_datafusion_unsupported_pending_153 expected-failure to test_disjoin_agrees_across_all_targets, a real three-target identity test.

Test cases

| # | Test Suite | Given | When | Then | Coverage Target |
|---|---|---|---|---|---|
| 1 | TestDisjoinCanonicalization | A self-mode DISJOIN over a 1-based closed target, transpiled for the generic target | Transpiling to SQL | The passthrough de-canonicalizes via a portable * EXCEPT projection that drops and re-projects the interval columns | Portable passthrough on engines without REPLACE |
| 2 | TestDisjoinCanonicalization | A self-mode DISJOIN over a 1-based closed target, transpiled for the DuckDB target | Transpiling to SQL | The passthrough de-canonicalizes via a star REPLACE on the final projection | REPLACE passthrough on DuckDB |
| 3 | TestDisjoinTranspilation | A DISJOIN over a registered target | Transpiling to SQL | Emits a parenthesized WITH-CTE subquery with the disjoin_chrom / disjoin_start / disjoin_end columns | Registry-expanded CTE shape |
| 4 | TestDisjoinTranspilation | A DISJOIN with the reference omitted | Transpiling to SQL | Defaults the reference to the target set and skips the coverage EXISTS clause | Self-reference coverage skip |
| 5 | TestDisjoinTranspilation | A DISJOIN whose reference is a distinct table or shadowing CTE | Transpiling to SQL | Emits the coverage EXISTS clause against the reference | Coverage filter emission |
| 6 | TestDisjoinTranspilation | A DISJOIN target or reference using the reserved __giql_dj_ prefix, or an unknown reference name | Transpiling to SQL | Re-raises DISJOIN's historical diagnostics verbatim | Diagnostic parity |
| 7 | TestExpanderRegistry | A registry with one entry captured by snapshot() | A second entry is registered afterward | The snapshot holds only the first entry, being a copy not a live view | snapshot() independence |
| 8 | TestExpanderRegistry | A snapshot taken, then the registry cleared and a different entry registered | restore() is called with the snapshot | The original entry resolves again and the post-snapshot entry is gone | restore() semantics |
| 9 | TestExpandOperatorsPass | A flagged operator with a registered expander | The pass transforms the AST | Dispatches to the registered expander and replaces the node | GIQL_EXPAND dispatch |
| 10 | TestNoOpWhenFlagsOff | A DISJOIN query with pass 2 bypassed but pass 3 kept | Comparing canonicalizer output to the expanded baseline | Pass 2 contributes nothing, the byte-identical comparison isolating it | Canonicalizer no-op isolation |
| 11 | TestCrossTargetOracleDisjoin | Two overlapping intervals on chr1 | The oracle runs generic, datafusion, and duckdb targets | Every target returns identical sub-segments, proving DISJOIN runs on DataFusion and agrees with DuckDB | Three-target identity (#143 / #153) |
| 12 | TestCrossTargetOracleDisjoin | Two overlapping intervals | DISJOIN splits them on DuckDB | Returns the expected split sub-segments | DuckDB split correctness |
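For readers unfamiliar with the operation exercised by the last two oracle cases, the sketch below shows the usual interval-disjoin idea on 0-based half-open coordinates: cut at every start and end, then keep the pieces that some input interval covers. It is an assumption that this matches GIQL's DISJOIN row for row (the real operator also carries chromosome, reference coverage, and passthrough columns), and the coordinates are made up.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 0-based half-open [start, end)


def disjoin(intervals: List[Interval]) -> List[Interval]:
    """Split intervals into non-overlapping sub-segments at every boundary."""
    # Every start and end is a potential cut position.
    cuts = sorted({pos for start, end in intervals for pos in (start, end)})
    segments = []
    for lo, hi in zip(cuts, cuts[1:]):
        # Keep a piece only if at least one input interval fully covers it.
        if any(start <= lo and hi <= end for start, end in intervals):
            segments.append((lo, hi))
    return segments


print(disjoin([(100, 200), (150, 300)]))
# [(100, 150), (150, 200), (200, 300)]
```

A cross-target identity test in this spirit would run the same query on each engine and compare the sorted sub-segments for equality.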

@conradbzura self-assigned this Jun 28, 2026
conradbzura added a commit that referenced this pull request Jun 28, 2026
Add a non-canonical cross-target oracle case so the portable star-EXCEPT passthrough executes on DataFusion, plus an engine-free regression pinning the per-branch cuts-CTE aliases. Document the REPLACE-vs-EXCEPT column-order divergence, centralize the DISJOIN prefix in a constants module, parse with parse_one, type the expander node, and restore the dropped rationale comments. Apply the shared registry-docstring, restore-in-place, and auto-discovery fixes.
@conradbzura conradbzura changed the title Migrate DISJOIN to a registered expander and solve the star-REPLACE portability limitation — Closes #143 Migrate DISJOIN to a registered expander and solve the star-REPLACE portability limitation — Closes #143, #153 Jun 29, 2026
Move DISJOIN off the legacy giqldisjoin_sql emitter onto the operator-expander
registry (epic #137 wave 3), the last operator migration. The expander assembles
the __giql_dj_* WITH-CTE subquery as AST and selects the full-row passthrough by
capability: SELECT * REPLACE where supports_star_replace holds (DuckDB), the
portable * EXCEPT projection otherwise (DataFusion family). Flip GIQL_EXPAND on
GIQLDisjoin and delete giqldisjoin_sql and its DISJOIN-only helpers from the
generator.

Also fix the duplicate-column bug: alias all four columns in every __giql_dj_cuts
UNION branch so the de-canonicalized end column no longer collides with the
end-cut under one output name, which DataFusion rejected — promoting the
previously pending cross-target DISJOIN case to a real three-target identity
test.

Squashed rebase onto main (post-#156/#157/#158) incorporating both review rounds:
non-canonical * EXCEPT oracle coverage, an engine-free cuts-CTE alias regression,
the DJ_PREFIX constant shared with the resolver, parse_one over maybe_parse,
typed expander node, refreshed comments, and the shared registry-seam
reconciliation.
@conradbzura force-pushed the 143-migrate-disjoin-expander branch from 451a277 to 1aecfa3 June 29, 2026 19:06
@conradbzura marked this pull request as ready for review June 29, 2026 19:11
@conradbzura merged commit e62f413 into main Jun 29, 2026
3 checks passed