Migrate NEAREST to a registered expander with capability-driven LATERAL fallback — Closes #142 #158

Merged: conradbzura merged 1 commit into main from 142-migrate-nearest-expander on Jun 29, 2026

@conradbzura (Collaborator)

Summary

Migrate NEAREST off the string emitter and onto the ExpandOperators pass as a capability-driven expander. Emit the portable correlated LATERAL subquery where capabilities.supports_lateral holds (DuckDB, the generic target), and a decorrelated ROW_NUMBER() window-function form where it does not. This adds DataFusion support for correlated NEAREST, which previously failed because DataFusion has no correlated-LATERAL physical plan. Prove that the fallback returns row-for-row identical results to the LATERAL form across the k=1/2, max_distance, stranded, signed, and duplicate-reference-row cases, and promote the cross-target oracle's previously pinned _unsupported_pending_142 expected failure to a real three-target identity test. Flip GIQL_EXPAND on GIQLNearest and delete the legacy giqlnearest_sql emitter, keeping its shared _generate_distance_case / _nearest_* helpers.

This is wave 3 of epic #137. It carries the same shared ExpanderRegistry.snapshot() / restore() seam as the sibling wave-3 PRs, so the duplicate copies need deduplicating on merge.

Closes #142

Proposed changes

Registry save/restore seam

Add ExpanderRegistry.snapshot() and ExpanderRegistry.restore(), a public save/restore pair. snapshot() returns a fresh shallow copy of the (target, operator) -> expander registrations; restore() drops the current entries and re-installs exactly the captured snapshot. A test fixture (or plugin) that mutates the process-wide REGISTRY around a body captures the baseline first and hands it back afterward, so the built-in expanders registered at import survive an isolating fixture that would otherwise clear() them permanently.
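As a rough illustration of the seam described above, here is a minimal sketch. The class below is a toy stand-in, not the actual giql implementation: the internal `_entries` dict, the `register` / `resolve` signatures, and the `isolated` helper are all assumptions made for the example.

```python
from contextlib import contextmanager
from typing import Callable, Dict, Iterator, Tuple

Key = Tuple[str, str]  # (target, operator)
Expander = Callable[..., object]


class ExpanderRegistry:
    """Toy stand-in for the real registry; only the save/restore seam is modelled."""

    def __init__(self) -> None:
        self._entries: Dict[Key, Expander] = {}  # hypothetical storage

    def register(self, target: str, operator: str, fn: Expander) -> None:
        self._entries[(target, operator)] = fn

    def resolve(self, target: str, operator: str) -> Expander:
        return self._entries[(target, operator)]

    def clear(self) -> None:
        self._entries.clear()

    def snapshot(self) -> Dict[Key, Expander]:
        # Fresh shallow copy: later registrations do not show up in it.
        return dict(self._entries)

    def restore(self, snap: Dict[Key, Expander]) -> None:
        # Drop the current entries, then re-install exactly the captured ones.
        self._entries.clear()
        self._entries.update(snap)


@contextmanager
def isolated(registry: ExpanderRegistry) -> Iterator[ExpanderRegistry]:
    """Hypothetical fixture body: run with an empty registry, then hand the baseline back."""
    baseline = registry.snapshot()
    registry.clear()
    try:
        yield registry
    finally:
        registry.restore(baseline)
```

A pytest fixture could wrap `isolated(REGISTRY)` in the same way, so entries registered at import time are back in place once the test body finishes.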

giql.expanders package and capability-driven nearest.py

Add the giql.expanders package. Its __init__ walks its own submodules with pkgutil.iter_modules and imports each, so dropping a <operator>.py into the package registers its @register(...) expander as an import side effect with no edit to the package file. giql.transpile imports the package once so REGISTRY is populated before the first transpile.
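The auto-discovery pattern can be sketched with a throwaway package written to a temporary directory. Everything here is illustrative: `demo_expanders`, its dict-based `REGISTRY`, and the `register` decorator are hypothetical names standing in for the real package, and only `pkgutil.iter_modules` and `importlib.import_module` are taken from the description above.

```python
import importlib
import sys
import tempfile
import textwrap
from pathlib import Path

PACKAGE_NAME = "demo_expanders"  # hypothetical package name

# Package __init__: defines the registry, then imports every submodule.
PACKAGE_INIT = textwrap.dedent('''
    import importlib
    import pkgutil

    REGISTRY = {}

    def register(target, operator):
        def decorator(fn):
            REGISTRY[(target, operator)] = fn
            return fn
        return decorator

    # Importing each submodule runs its @register(...) decorators.
    for _info in pkgutil.iter_modules(__path__):
        importlib.import_module(f"{__name__}.{_info.name}")
''')

# A dropped-in operator module: registers itself as an import side effect.
NEAREST_MODULE = textwrap.dedent('''
    from . import register

    @register("generic", "NEAREST")
    def expand_nearest(node):
        return f"expanded({node})"
''')


def build_demo_package(root: Path) -> None:
    pkg = root / PACKAGE_NAME
    pkg.mkdir()
    (pkg / "__init__.py").write_text(PACKAGE_INIT)
    (pkg / "nearest.py").write_text(NEAREST_MODULE)


def load_demo_registry() -> dict:
    """Import the demo package once and return a copy of what it registered."""
    with tempfile.TemporaryDirectory() as tmp:
        build_demo_package(Path(tmp))
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            package = importlib.import_module(PACKAGE_NAME)
            return dict(package.REGISTRY)
        finally:
            # Leave the interpreter as we found it.
            sys.path.remove(tmp)
            for name in [m for m in sys.modules if m.split(".")[0] == PACKAGE_NAME]:
                del sys.modules[name]
```

Adding a second file next to `nearest.py` would register another expander without touching `__init__.py`, which is the property the package relies on.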

Add nearest.py. expand_nearest branches on ctx.capabilities.supports_lateral and on whether the node is correlated (parent is a LATERAL): lateral-capable targets and every standalone literal-reference placement get the portable LATERAL/standalone subquery, byte-identical to the legacy emitter; a correlated NEAREST on a target without LATERAL support gets the decorrelated fallback. Three load-bearing design points in the fallback:

  • Pre-projected reference relation — the outer relation's reference columns are projected under fresh __giql_x_rk_* names into a renamed derived relation that the target is cross-joined against. DataFusion's planner cannot resolve a window ordering over a join whose two sides share column names (both expose start / end), so the renamed columns keep every reference column distinct from the target's.
  • Separate query levels for join and window — the cross-join, distance, and reference-key projection are computed in an inner subquery, and ROW_NUMBER() is added in the enclosing one. Fused into one level, DataFusion's optimizer mis-derives the window's sort order from the chromosome-equality prefilter and trips SanityCheckPlan.
  • DISTINCT-on-key with top-k fan-out — the reference relation is de-duplicated on the reference key (position, plus strand in stranded mode) with DISTINCT, candidates are ranked once per distinct reference value, and the rewritten join re-associates the top-k back to every outer row sharing that key. Ranking depends only on the reference value, so ranking once and re-joining is identical to the per-row LATERAL form even when the outer table holds duplicate reference rows.

The fallback rewrites <outer> AS a CROSS JOIN LATERAL (nearest) AS b into <outer> AS a JOIN (<ranked subquery>) AS b ON <ref-key match> AND b.<rn> <= k in place.
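The shape of the decorrelated form can be tried out with SQLite from the Python standard library (window functions need SQLite 3.25 or newer). This is a sketch under stated assumptions, not the SQL the expander emits: the `rk_*` names stand in for the `__giql_x_rk_*` columns, the table layout and the half-open gap distance are simplifications, and the per-row Python loop plays the role of the LATERAL form, which SQLite cannot run.

```python
import sqlite3

# (id, chrom, start, end); ids 1 and 2 are duplicate reference rows.
PEAKS = [(1, "chr1", 100, 200), (2, "chr1", 100, 200),
         (3, "chr1", 1000, 1100), (4, "chr2", 50, 60)]
# (name, chrom, start, end)
GENES = [("g1", "chr1", 0, 50), ("g2", "chr1", 150, 300),
         ("g3", "chr1", 400, 500), ("g4", "chr1", 1200, 1300),
         ("g5", "chr2", 10, 20)]

FALLBACK_SQL = """
SELECT a.id, b.name, b.distance
FROM peaks AS a
JOIN (
    -- Outer level: rank candidates once per distinct reference key.
    SELECT *, ROW_NUMBER() OVER (
        PARTITION BY rk_chrom, rk_start, rk_end
        ORDER BY distance, g_start, g_end
    ) AS rn
    FROM (
        -- Inner level: cross-join, distance, and renamed reference key.
        SELECT r.rk_chrom, r.rk_start, r.rk_end,
               g.name, g."start" AS g_start, g."end" AS g_end,
               CASE WHEN g."end" <= r.rk_start THEN r.rk_start - g."end"
                    WHEN g."start" >= r.rk_end THEN g."start" - r.rk_end
                    ELSE 0 END AS distance
        FROM (SELECT DISTINCT chrom AS rk_chrom, "start" AS rk_start,
                     "end" AS rk_end FROM peaks) AS r
        CROSS JOIN genes AS g
        WHERE g.chrom = r.rk_chrom
    ) AS candidates
) AS b
  ON b.rk_chrom = a.chrom AND b.rk_start = a."start"
 AND b.rk_end = a."end" AND b.rn <= ?
ORDER BY a.id, b.rn
"""


def interval_distance(ref_start, ref_end, start, end):
    """Gap between two half-open intervals; 0 when they overlap (assumed metric)."""
    if end <= ref_start:
        return ref_start - end
    if start >= ref_end:
        return start - ref_end
    return 0


def nearest_per_row(peaks, genes, k):
    """Reference behaviour: top-k per outer row, as a correlated subquery would give."""
    rows = []
    for pid, chrom, p_start, p_end in sorted(peaks):
        candidates = sorted(
            (interval_distance(p_start, p_end, g_start, g_end), g_start, g_end, name)
            for name, g_chrom, g_start, g_end in genes if g_chrom == chrom
        )
        rows.extend((pid, name, dist) for dist, _, _, name in candidates[:k])
    return rows


def nearest_decorrelated(peaks, genes, k):
    """Run the DISTINCT-key, ROW_NUMBER(), re-join form in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute('CREATE TABLE peaks (id INTEGER, chrom TEXT, "start" INTEGER, "end" INTEGER)')
    conn.execute('CREATE TABLE genes (name TEXT, chrom TEXT, "start" INTEGER, "end" INTEGER)')
    conn.executemany("INSERT INTO peaks VALUES (?, ?, ?, ?)", peaks)
    conn.executemany("INSERT INTO genes VALUES (?, ?, ?, ?)", genes)
    rows = conn.execute(FALLBACK_SQL, (k,)).fetchall()
    conn.close()
    return rows
```

Calling `nearest_decorrelated(PEAKS, GENES, 2)` and `nearest_per_row(PEAKS, GENES, 2)` on this data gives the same rows, including identical results for the two duplicate peaks, because the ranking depends only on the reference key.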

GIQLNearest.GIQL_EXPAND flip and emitter deletion

Flip GIQLNearest.GIQL_EXPAND to True so NEAREST expands through its registered expander, and delete BaseGIQLGenerator.giqlnearest_sql (including the old SQLite "LATERAL not supported" ValueError branch). The self-free _generate_distance_case (shared with DISTANCE, #140) and the _nearest_* resolution / passthrough / output-encoding helpers stay on BaseGIQLGenerator and are reused by the expander, so distance, passthrough, and encoding round-tripping remain byte-for-byte identical.
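To make the opt-in flag and the capability branch concrete, here is a small sketch. The classes, the registry keying, and the returned strings are hypothetical placeholders; only the `GIQL_EXPAND` attribute, `capabilities.supports_lateral`, and the correlated versus standalone distinction come from the description in this PR.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass(frozen=True)
class Capabilities:
    supports_lateral: bool


@dataclass(frozen=True)
class Context:
    target: str
    capabilities: Capabilities


class Nearest:
    GIQL_EXPAND = True  # migrated: expands through its registered expander

    def __init__(self, correlated: bool) -> None:
        self.correlated = correlated  # True when the parent is a LATERAL


class LegacyOperator:
    GIQL_EXPAND = False  # unmigrated: left for the legacy emitter


def expand_nearest(node: Nearest, ctx: Context) -> str:
    # Lateral-capable targets and standalone placements share one form;
    # only a correlated NEAREST without LATERAL support needs the fallback.
    if ctx.capabilities.supports_lateral or not node.correlated:
        return "lateral-or-standalone subquery"
    return "decorrelated ROW_NUMBER() fallback"


# Hypothetical registry keyed by (target, operator class).
REGISTRY: Dict[Tuple[str, type], Callable[..., str]] = {
    ("duckdb", Nearest): expand_nearest,
    ("datafusion", Nearest): expand_nearest,
}


def expand_operators(node: object, ctx: Context) -> Optional[str]:
    """Expand a node only if its class opts in; otherwise leave it untouched."""
    if not getattr(type(node), "GIQL_EXPAND", False):
        return None
    return REGISTRY[(ctx.target, type(node))](node, ctx)
```

Under these assumptions, a correlated `Nearest` expands to the LATERAL form on a target with `supports_lateral=True` and to the window fallback otherwise, while a literal-reference (uncorrelated) node takes the standalone form on every target.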

Test updates and the promoted oracle test

Update the emitter-level NEAREST tests to run pass 3 (ExpandOperators) before generating, matching transpile. Add snapshot / restore registry tests and route the registry-isolation fixtures through the new seam. Add a parametrized check that every migrated operator ships GIQL_EXPAND=True. Replace the cross-target oracle's test_nearest_on_datafusion_unsupported_pending_142, which pinned the failure with pytest.raises(match="OuterReferenceColumn"), with test_correlated_nearest_k1_agrees_across_all_targets, a real three-target identity test (generic and duckdb run the LATERAL form on DuckDB; datafusion runs the decorrelated fallback).

Test cases

| # | Test Suite | Given | When | Then | Coverage Target |
|---|---|---|---|---|---|
| 1 | TestNearestTranspilation | A query with NEAREST(genes, reference := peaks.interval, k := 3) | Transpiling through passes 1-3 | A LATERAL join with a distance column, ORDER BY, and LIMIT 3 is generated | Correlated LATERAL form |
| 2 | TestNearestTranspilation | A query with NEAREST(..., k := 5, max_distance := 100000) | Transpiling | A LATERAL subquery carrying the 100000 distance filter and LIMIT 5 is generated | max_distance filter |
| 3 | TestNearestTranspilation | A query with a literal reference := 'chr1:1000-2000' | Transpiling | A standalone subquery with no LATERAL and the literal coordinates is generated | Standalone literal reference |
| 4 | TestNearestTranspilation | A query with stranded := true | Transpiling | A LATERAL subquery with strand filtering is generated | Stranded mode |
| 5 | TestNearestTranspilation | A query with signed := true | Transpiling | A LATERAL subquery with the signed-distance calculation is generated | Signed distance |
| 6 | TestExpanderRegistryFallbackGaps | A registry with one entry captured by snapshot() | A second entry is registered after the snapshot | The snapshot holds only the first entry, being a copy not a live view | snapshot() independence |
| 7 | TestExpanderRegistryFallbackGaps | A snapshot taken before the registry is cleared and a different entry registered | Calling restore() | The original entry resolves again and the post-snapshot entry is gone | restore() semantics |
| 8 | TestOperatorOptOut | Each migrated GIQL operator class | Reading its GIQL_EXPAND attribute | It is True, so the operator expands through its registered expander | Migrated opt-in flag |
| 9 | TestOperatorOptOut | Each unmigrated GIQL operator class | Reading its GIQL_EXPAND attribute | It is False, so the operator still uses the legacy emitter | Unmigrated opt-out flag |
| 10 | TestCrossTargetOracleNearest | A single-row peaks table and three candidate genes at varying distances on chr1 | A correlated CROSS JOIN LATERAL NEAREST(..., k := 1) runs on generic, duckdb, and datafusion | Every target returns the single nearest gene and agrees, with datafusion on the decorrelated window-function fallback | Cross-target identity (promoted from _unsupported_pending_142) |

conradbzura self-assigned this on Jun 28, 2026
conradbzura added a commit that referenced this pull request Jun 28, 2026
Fix the literal-reference NEAREST crash on DataFusion by gating the decorrelated fallback on genuine correlation and materializing the distance in a two-level subquery. Add executing cross-target oracle cases (k>1, duplicate references, multi-key, max_distance, stranded, signed) and a deterministic tiebreaker so the LATERAL and window forms are set-equivalent. Delete dead helpers and SUPPORTS_LATERAL, make borrowed helpers static, mint fallback aliases via ctx.alias, add invariant asserts, and document DataFusion support. Apply the shared registry-docstring, restore-in-place, and auto-discovery fixes.
…fallback

Move NEAREST off the legacy giqlnearest_sql emitter onto the operator-expander
registry (epic #137 wave 3). Lateral-capable targets get the portable correlated
LATERAL subquery; on DataFusion (no correlated-LATERAL physical plan) NEAREST
expands to a decorrelated ROW_NUMBER() window fallback that returns identical
rows, with a deterministic (start, end) tiebreaker and a synthesized subquery
alias so an unaliased correlated NEAREST also runs there (and under python -O).
Flip GIQL_EXPAND on GIQLNearest, delete giqlnearest_sql, SUPPORTS_LATERAL, and
_nearest_resolution, and make the shared distance/nearest helpers staticmethods.

Squashed rebase onto main (post-#156/#157) incorporating both review rounds:
cross-target oracle coverage (k>1 ties, opposite-strand co-located rows,
max_distance survivors), the SELECT * column-leak claim narrowed and xfail-pinned
(tracked by #160), annotated helpers, reserved names derived from
EXPAND_ALIAS_PREFIX, and the shared registry-seam reconciliation.
conradbzura force-pushed the 142-migrate-nearest-expander branch from ed16839 to 6a4e33b on June 29, 2026 18:26
conradbzura marked this pull request as ready for review on June 29, 2026 18:51
conradbzura merged commit a5f2bbc into main on Jun 29, 2026
3 checks passed
conradbzura added a commit that referenced this pull request Jun 29, 2026
Move DISJOIN off the legacy giqldisjoin_sql emitter onto the operator-expander
registry (epic #137 wave 3), the last operator migration. The expander assembles
the __giql_dj_* WITH-CTE subquery as AST and selects the full-row passthrough by
capability: SELECT * REPLACE where supports_star_replace holds (DuckDB), the
portable * EXCEPT projection otherwise (DataFusion family). Flip GIQL_EXPAND on
GIQLDisjoin and delete giqldisjoin_sql and its DISJOIN-only helpers from the
generator.

Also fix the duplicate-column bug: alias all four columns in every __giql_dj_cuts
UNION branch so the de-canonicalized end column no longer collides with the
end-cut under one output name, which DataFusion rejected — promoting the
previously pending cross-target DISJOIN case to a real three-target identity
test.

Squashed rebase onto main (post-#156/#157/#158) incorporating both review rounds:
non-canonical * EXCEPT oracle coverage, an engine-free cuts-CTE alias regression,
the DJ_PREFIX constant shared with the resolver, parse_one over maybe_parse,
typed expander node, refreshed comments, and the shared registry-seam
reconciliation.