Migrate INTERSECTS, CONTAINS, WITHIN, and set predicates to registered expanders - Closes #141 (#157)

Merged: conradbzura merged 1 commit into main from 141-migrate-spatial-predicate-expanders on Jun 29, 2026

@conradbzura

Collaborator

Summary

Migrate the spatial predicates (INTERSECTS / CONTAINS / WITHIN) and the quantified set predicates (ANY / ALL) off the legacy *_sql emitters on BaseGIQLGenerator and onto the generic operator-expander registry (epic #137, wave 3). Each predicate now expands to standard boolean sqlglot AST in the new giql.expanders package during pass 3 (ExpandOperators), built from the pass-1 ResolvedColumn metadata already canonicalized to 0-based half-open by pass 2, so the emitted predicate SQL is byte-identical to what the deleted emitters produced.

Restructure transpile.py so the join-strategy rewrites are capability-gated on the target's range_join_strategy and defer to the registry. The binned equi-join and DuckDB IEJoin transformers run as pre-pass transformers that consume a column-to-column INTERSECTS join before expansion; a literal-range or residual column-to-column INTERSECTS predicate survives to pass 3 and is rendered by the new expander exactly as the legacy emitter rendered it. Add a registry-deferral seam: a target-specific (target, Intersects) registry entry overrides the built-in join strategy entirely, letting the INTERSECTS node flow untouched into ExpandOperators. This removes the old dialect="duckdb" IEJoin early-return that skipped the expansion pipeline.

Deferred: folding the whole-query IEJoin string emitter itself into a node-level expander. Per the issue's design note, IntersectsDuckDBIEJoinTransformer.transform_to_sql emits a whole-query SET VARIABLE …; SELECT … string and rewrites the top-level statement, which a node-replacing OperatorExpander.expand(node, ctx) -> exp.Expression contract cannot express. The join-strategy rewrites therefore stay as capability-gated pre-pass transformers rather than (target, op) registry entries; only the registry-deferral hook is added so a future target-specific override can supersede them.

This is epic #137 wave 3. It carries the shared ExpanderRegistry.snapshot()/restore() seam that sibling wave-3 PRs also introduce, so the duplicate definitions must be reconciled when those branches merge.

Closes #141

Proposed changes

ExpanderRegistry save/restore seam (src/giql/expander.py)

Add snapshot() and restore() to ExpanderRegistry. snapshot() returns a shallow copy of the current (target, operator) -> expander registrations; restore() drops every current entry and re-installs exactly the snapshot contents. This is the public seam an isolating test fixture (or a plugin) uses to mutate the process-wide REGISTRY around a body and return it to a captured baseline — so the built-in expanders registered at import survive a fixture that would otherwise clear() them permanently. Shared with sibling wave-3 PRs.
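The save/restore seam described above can be sketched as follows. This is a minimal illustration, not the code in src/giql/expander.py: the class body, the string keys standing in for target and operator classes, and the clean_registry helper are all assumptions made for the example.

```python
from contextlib import contextmanager
from typing import Callable, Dict, Iterator, Tuple

# Assumption: strings stand in for the real target and operator classes.
Key = Tuple[str, str]
Expander = Callable[[object], object]


class ExpanderRegistry:
    """Hypothetical stand-in for the real registry, reduced to the seam."""

    def __init__(self) -> None:
        self._entries: Dict[Key, Expander] = {}

    def register(self, target: str, operator: str, expander: Expander) -> None:
        self._entries[(target, operator)] = expander

    def clear(self) -> None:
        self._entries.clear()

    def snapshot(self) -> Dict[Key, Expander]:
        # Shallow copy: registrations made later do not appear in the snapshot.
        return dict(self._entries)

    def restore(self, snapshot: Dict[Key, Expander]) -> None:
        # Drop every current entry, then re-install exactly the snapshot.
        self._entries.clear()
        self._entries.update(snapshot)


@contextmanager
def clean_registry(registry: ExpanderRegistry) -> Iterator[ExpanderRegistry]:
    """Give the body an empty registry, then return it to its baseline."""
    baseline = registry.snapshot()
    registry.clear()
    try:
        yield registry
    finally:
        registry.restore(baseline)
```

With this shape, a fixture can clear the registry for the duration of a test and the built-in entries registered at import are back in place afterwards, even if the test body raises.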

giql.expanders package and predicate expanders (src/giql/expanders/__init__.py, src/giql/expanders/intersects.py)

Add the giql.expanders package. Importing it registers every built-in expander as a side effect: __init__.py uses pkgutil.iter_modules to import each submodule, which decorates its expanders with @register(...) at import time, so new operator modules are picked up by dropping a file in without editing the package.
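A package can discover its own submodules roughly like this. The sketch assumes nothing about the actual contents of giql/expanders/__init__.py beyond the use of pkgutil.iter_modules; the helper name import_submodules is hypothetical.

```python
import importlib
import pkgutil
from types import ModuleType
from typing import List


def import_submodules(package: ModuleType) -> List[str]:
    """Import every direct submodule of ``package`` and return their names.

    Importing a submodule runs its top-level code, which is where
    decorator-based registration would take effect.
    """
    imported = []
    prefix = package.__name__ + "."
    for info in pkgutil.iter_modules(package.__path__, prefix=prefix):
        importlib.import_module(info.name)
        imported.append(info.name)
    return sorted(imported)
```

The trade-off of this pattern is that registration becomes an import side effect, so the package has to be imported before the first lookup against the registry.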

Add giql.expanders.intersects with four GenericTarget expanders: expand_intersects, expand_contains, expand_within, and expand_spatial_set (ANY / ALL). Each turns one predicate node into a parenthesized boolean built from ResolvedColumn fragments parsed through the GIQL dialect, reproducing the deleted emitter helpers as AST: _range_predicate (literal-range form, including the point-query special case for CONTAINS), _column_join (column-to-column residual form), and the dispatch-on-right-operand logic of the old _generate_spatial_op. The literal-range path reproduces the legacy parse-and-wrap-error behavior verbatim (the historical "Could not parse genomic range" message). Only generic expanders are registered, since spatial-predicate emission is portable SQL-92 and does not vary by engine.
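For readers unfamiliar with the column-to-column form, the following sketch shows the textbook comparisons for 0-based half-open intervals. It is illustrative only: the real expanders build sqlglot AST rather than strings, the Cols class is a hypothetical stand-in for ResolvedColumn, and the exact SQL text giql emits is not shown in this PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cols:
    """Hypothetical stand-in for ResolvedColumn: qualified SQL fragments."""

    chrom: str
    start: str
    end: str


def intersects(a: Cols, b: Cols) -> str:
    # Half-open intervals overlap when each one starts before the other ends.
    return (
        f"({a.chrom} = {b.chrom} AND {a.start} < {b.end} "
        f"AND {a.end} > {b.start})"
    )


def contains(a: Cols, b: Cols) -> str:
    # a contains b when a starts at or before b and ends at or after b.
    return (
        f"({a.chrom} = {b.chrom} AND {a.start} <= {b.start} "
        f"AND {a.end} >= {b.end})"
    )


def within(a: Cols, b: Cols) -> str:
    # WITHIN is CONTAINS with the operands swapped.
    return contains(b, a)
```

Each function returns a parenthesized boolean so the result can be dropped into a larger WHERE or ON clause without changing operator precedence.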

Capability-gated join transformers and registry-deferral (src/giql/transpile.py)

Import giql.expanders once so the registry is populated before the first transpile. Compute target_overrides_intersects — true only for an exact non-generic (target, Intersects) entry, deliberately excluding the built-in (GenericTarget, Intersects) predicate expander so it does not disable the join rewrite. Gate both the IEJoin path (if uses_iejoin and not target_overrides_intersects) and the binned-join transformer on this flag: when a target-specific override is registered, the join rewrite is skipped and the INTERSECTS node flows into ExpandOperators. Remove the dialect="duckdb" early-return's pipeline-skip warning block — the IEJoin transformer still short-circuits with a whole-query string when it produces output (safe, since an IEJoin-eligible query carries exactly one INTERSECTS and leaves no residual predicate), but the registry is now consulted on the deferral path it used to preclude.
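The gating logic can be pictured as a small decision function. This is a sketch under stated assumptions: the string values for targets and for range_join_strategy, the set-of-tuples registry, and the function names are all hypothetical, not the code in transpile.py.

```python
from typing import Set, Tuple

GENERIC = "generic"  # assumption: stands in for GenericTarget


def target_overrides_intersects(
    registered: Set[Tuple[str, str]], target: str
) -> bool:
    # Only an exact, non-generic (target, Intersects) entry is an override.
    # The built-in generic predicate expander must not disable the join rewrite.
    return target != GENERIC and (target, "Intersects") in registered


def choose_intersects_path(
    registered: Set[Tuple[str, str]], target: str, range_join_strategy: str
) -> str:
    """Return which path a column-to-column INTERSECTS join would take."""
    if target_overrides_intersects(registered, target):
        return "expand"  # node flows untouched into the expansion pass
    if range_join_strategy == "iejoin":
        return "iejoin"  # pre-pass transformer emits the whole-query string
    if range_join_strategy == "binned":
        return "binned"  # pre-pass transformer rewrites to a binned equi-join
    return "expand"
```

The key design point is the first branch: the override check runs before either join strategy, so a target-specific entry always wins.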

GIQL_EXPAND flips and emitter deletion (src/giql/expressions.py, src/giql/generators/base.py)

Flip GIQL_EXPAND from the shared inert default to True on Intersects, Contains, Within, and SpatialSetPredicate so the four predicates opt into pass 3. Delete the intersects_sql / contains_sql / within_sql / spatialsetpredicate_sql emitters from BaseGIQLGenerator and their _generate_spatial_op / _generate_spatial_set / _generate_range_predicate / _generate_column_join / _predicate_operand helpers, plus the now-unused imports.
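The per-class opt-in can be sketched as a class attribute that the pass checks before dispatching. The class hierarchy and the expand_operators function below are simplified assumptions for illustration; the real pass walks a sqlglot AST rather than a flat list.

```python
from typing import Callable, Dict, List


class Operator:
    GIQL_EXPAND = False  # shared inert default: stay on the legacy emitter


class Intersects(Operator):
    GIQL_EXPAND = True  # migrated: opt into the expansion pass


class Distance(Operator):
    pass  # unmigrated: inherits the inert default


def expand_operators(
    nodes: List[object], expanders: Dict[type, Callable[[object], object]]
) -> List[object]:
    """Replace each opted-in node that has an expander; leave the rest."""
    out = []
    for node in nodes:
        expander = expanders.get(type(node))
        if getattr(type(node), "GIQL_EXPAND", False) and expander is not None:
            out.append(expander(node))
        else:
            out.append(node)
    return out
```

Because both conditions must hold, registering an expander for an unflagged operator has no effect, which is what keeps the pass inert for operators that have not been migrated yet.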

Test updates (tests/test_expander.py, tests/generators/test_base.py)

Rework the registry/flag leak guards to compare against a captured baseline (REGISTRY.snapshot()) rather than asserting emptiness, since the registry now ships built-in expanders at import; clean_registry saves and restores that baseline through the new seam. Add _SHIPPED_EXPAND_FLAGS and derive _MIGRATED_OPERATORS / _UNMIGRATED_OPERATORS dynamically so the flag-leak guard restores each operator to its shipped default and the opt-out parametrization stays merge-stable across wave-3 branches. Add an _opted_out context manager (complement of _opted_in) for control tests that need a migrated operator to behave as unflagged. Replace the old strict-xfail TestIEJoinEarlyReturnSkipsExpansion with TestIEJoinRegistryDeferral, add snapshot/restore coverage, and route the generator-level spatial tests through pass 3 via the updated _generate_through_passes helper (now runs passes 1-3).
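A context manager in the spirit of _opted_out and _opted_in might look like the following. The names and the implementation are hypothetical, since the PR describes these helpers without showing them; the sketch only demonstrates how a class-level flag can be overridden and then returned to its shipped value.

```python
from contextlib import contextmanager
from typing import Iterator

_MISSING = object()


@contextmanager
def flag_set_to(operator_cls: type, value: bool) -> Iterator[type]:
    """Temporarily set GIQL_EXPAND on a class, then put it back."""
    # Look only at the class's own namespace so an inherited default is
    # restored by deletion rather than by pinning a copy on the subclass.
    previous = operator_cls.__dict__.get("GIQL_EXPAND", _MISSING)
    operator_cls.GIQL_EXPAND = value
    try:
        yield operator_cls
    finally:
        if previous is _MISSING:
            del operator_cls.GIQL_EXPAND
        else:
            operator_cls.GIQL_EXPAND = previous


def opted_out(operator_cls: type):
    return flag_set_to(operator_cls, False)


def opted_in(operator_cls: type):
    return flag_set_to(operator_cls, True)
```

Restoring to the previous value, rather than always resetting to False, is what lets a control test switch a migrated operator off without leaking that state into later tests.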

Test cases

| # | Test Suite | Given | When | Then | Coverage Target |
|---|------------|-------|------|------|-----------------|
| 1 | TestExpanderRegistryFallbackGaps | A registry with one entry captured by snapshot | A second entry is registered after the snapshot | The snapshot still holds only the first entry | snapshot() is a copy, not a live view |
| 2 | TestExpanderRegistryFallbackGaps | A snapshot taken, then the registry cleared and a different entry registered | Restoring the snapshot | The original entry resolves again and the post-snapshot entry is gone | restore() replaces entries with snapshot contents |
| 3 | TestIEJoinRegistryDeferral | A column-to-column INTERSECTS join eligible for the duckdb IEJoin path with a (DuckDBTarget, Intersects) override registered | Transpiling with dialect='duckdb' | The override expander's sentinel appears and no SET VARIABLE IEJoin SQL is emitted | IEJoin path defers to a target-specific override |
| 4 | TestIEJoinRegistryDeferral | The same IEJoin-eligible duckdb query with no target-specific override registered | Transpiling with dialect='duckdb' | The built-in IEJoin SET VARIABLE SQL is emitted | Default duckdb path keeps the built-in IEJoin strategy |
| 5 | TestOperatorOptOut | A GIQL operator class not migrated onto the pass | Reading its GIQL_EXPAND class attribute | It is False | Unmigrated operators stay on the legacy emitter |
| 6 | TestOperatorOptOut | A GIQL operator class migrated onto the pass | Reading its GIQL_EXPAND class attribute | It is True | Migrated operators opt into expansion |
| 7 | TestExpandOperatorsPass | An expander registered for (GenericTarget, GIQLDisjoin) but the operator's flag held off via _opted_out | Running the pass | The operator node is left unexpanded | Per-type GIQL_EXPAND gate isolates dispatch |
| 8 | TestNoOpWhenInert | A DISTANCE query (unmigrated operator) with the default registry | Transpiling with the wired-in pass versus a pass-bypassed reference | The SQL matches exactly with no expander alias prefix | Pass is inert for any unmigrated operator |
| 9 | TestExpandOperatorsWalk | A query with INTERSECTS opted in and DISJOIN opted out as the control | Running the pass | Only the flagged operator is expanded | Pass walks and expands per opted-in type |
| 10 | TestBaseGIQLGenerator | An invalid genomic range string in INTERSECTS | The INTERSECTS predicate is expanded through passes 1-3 | A ValueError matching "Could not parse genomic range" is raised | Expander reproduces the legacy parse-error message |
| 11 | TestBaseGIQLGenerator | A malformed range string ('chr:a-b') in INTERSECTS | The predicate is expanded through passes 1-3 | A ValueError is raised | Malformed-range error surfaces via the expander |

@conradbzura conradbzura self-assigned this Jun 28, 2026
conradbzura added a commit that referenced this pull request Jun 28, 2026
Remove unreachable dispatch branches in the predicate expanders, add ExpanderRegistry.has_override and route the join-deferral gate through it, guard intersects_bin_size under a target override, and preserve tracebacks on the parse-error wrap. Add direct expander tests, binned-target deferral coverage, and error-message characterization. Make the registry docstrings mechanistic and node-local, restore the registry in place, harden auto-discovery, and key the opt-out control on a dynamically derived migrated operator.
… expanders

Move INTERSECTS, CONTAINS, WITHIN, and the ANY/ALL set predicates off the legacy
*_sql emitters onto the operator-expander registry (epic #137 wave 3). Each
predicate expands to standard boolean AST built from the pass-1 resolved-column
metadata, byte-identical to the deleted emitters. Capability-gate the binned and
IEJoin join transformers on range_join_strategy and add a registry deferral
(ExpanderRegistry.has_override) so a target-specific INTERSECTS override
supersedes the built-in join rewrite, removing the IEJoin early-return's
pipeline skip. Flip GIQL_EXPAND on the four predicate classes and delete their
emitters and helpers.

Squashed rebase onto main (post-#156) incorporating both review rounds: remove
unreachable dispatch branches, add has_override with deferral-gate tests, guard
intersects_bin_size under an override, relocate the direct expander tests to
tests/expanders/test_intersects.py with CONTAINS/WITHIN column-join coverage,
add error-message characterization, and reconcile the shared registry seam and
docstrings.
@conradbzura conradbzura force-pushed the 141-migrate-spatial-predicate-expanders branch from 47212f6 to 5565263 June 29, 2026 16:55
@conradbzura conradbzura marked this pull request as ready for review June 29, 2026 17:02
@conradbzura conradbzura merged commit 933ce0b into main Jun 29, 2026
3 checks passed
conradbzura added a commit that referenced this pull request Jun 29, 2026
…fallback

Move NEAREST off the legacy giqlnearest_sql emitter onto the operator-expander
registry (epic #137 wave 3). Lateral-capable targets get the portable correlated
LATERAL subquery; on DataFusion (no correlated-LATERAL physical plan) NEAREST
expands to a decorrelated ROW_NUMBER() window fallback that returns identical
rows, with a deterministic (start, end) tiebreaker and a synthesized subquery
alias so an unaliased correlated NEAREST also runs there (and under python -O).
Flip GIQL_EXPAND on GIQLNearest, delete giqlnearest_sql, SUPPORTS_LATERAL, and
_nearest_resolution, and make the shared distance/nearest helpers staticmethods.

Squashed rebase onto main (post-#156/#157) incorporating both review rounds:
cross-target oracle coverage (k>1 ties, opposite-strand co-located rows,
max_distance survivors), the SELECT * column-leak claim narrowed and xfail-pinned
(tracked by #160), annotated helpers, reserved names derived from
EXPAND_ALIAS_PREFIX, and the shared registry-seam reconciliation.
conradbzura added a commit that referenced this pull request Jun 29, 2026
Move DISJOIN off the legacy giqldisjoin_sql emitter onto the operator-expander
registry (epic #137 wave 3), the last operator migration. The expander assembles
the __giql_dj_* WITH-CTE subquery as AST and selects the full-row passthrough by
capability: SELECT * REPLACE where supports_star_replace holds (DuckDB), the
portable * EXCEPT projection otherwise (DataFusion family). Flip GIQL_EXPAND on
GIQLDisjoin and delete giqldisjoin_sql and its DISJOIN-only helpers from the
generator.

Also fix the duplicate-column bug: alias all four columns in every __giql_dj_cuts
UNION branch so the de-canonicalized end column no longer collides with the
end-cut under one output name, which DataFusion rejected — promoting the
previously pending cross-target DISJOIN case to a real three-target identity
test.

Squashed rebase onto main (post-#156/#157/#158) incorporating both review rounds:
non-canonical * EXCEPT oracle coverage, an engine-free cuts-CTE alias regression,
the DJ_PREFIX constant shared with the resolver, parse_one over maybe_parse,
typed expander node, refreshed comments, and the shared registry-seam
reconciliation.