Migrate NEAREST to a registered expander with capability-driven LATERAL fallback — Closes #142#158
Merged
Merged
Conversation
conradbzura
added a commit
that referenced
this pull request
Jun 28, 2026
Fix the literal-reference NEAREST crash on DataFusion by gating the decorrelated fallback on genuine correlation and materializing the distance in a two-level subquery. Add executing cross-target oracle cases (k>1, duplicate references, multi-key, max_distance, stranded, signed) and a deterministic tiebreaker so the LATERAL and window forms are set-equivalent. Delete dead helpers and SUPPORTS_LATERAL, make borrowed helpers static, mint fallback aliases via ctx.alias, add invariant asserts, and document DataFusion support. Apply the shared registry-docstring, restore-in-place, and auto-discovery fixes.
…fallback Move NEAREST off the legacy giqlnearest_sql emitter onto the operator-expander registry (epic #137 wave 3). Lateral-capable targets get the portable correlated LATERAL subquery; on DataFusion (no correlated-LATERAL physical plan) NEAREST expands to a decorrelated ROW_NUMBER() window fallback that returns identical rows, with a deterministic (start, end) tiebreaker and a synthesized subquery alias so an unaliased correlated NEAREST also runs there (and under python -O). Flip GIQL_EXPAND on GIQLNearest, delete giqlnearest_sql, SUPPORTS_LATERAL, and _nearest_resolution, and make the shared distance/nearest helpers staticmethods. Squashed rebase onto main (post-#156/#157) incorporating both review rounds: cross-target oracle coverage (k>1 ties, opposite-strand co-located rows, max_distance survivors), the SELECT * column-leak claim narrowed and xfail-pinned (tracked by #160), annotated helpers, reserved names derived from EXPAND_ALIAS_PREFIX, and the shared registry-seam reconciliation.
ed16839 to
6a4e33b
Compare
conradbzura
added a commit
that referenced
this pull request
Jun 29, 2026
Move DISJOIN off the legacy giqldisjoin_sql emitter onto the operator-expander registry (epic #137 wave 3), the last operator migration. The expander assembles the __giql_dj_* WITH-CTE subquery as AST and selects the full-row passthrough by capability: SELECT * REPLACE where supports_star_replace holds (DuckDB), the portable * EXCEPT projection otherwise (DataFusion family). Flip GIQL_EXPAND on GIQLDisjoin and delete giqldisjoin_sql and its DISJOIN-only helpers from the generator. Also fix the duplicate-column bug: alias all four columns in every __giql_dj_cuts UNION branch so the de-canonicalized end column no longer collides with the end-cut under one output name, which DataFusion rejected — promoting the previously pending cross-target DISJOIN case to a real three-target identity test. Squashed rebase onto main (post-#156/#157/#158) incorporating both review rounds: non-canonical * EXCEPT oracle coverage, an engine-free cuts-CTE alias regression, the DJ_PREFIX constant shared with the resolver, parse_one over maybe_parse, typed expander node, refreshed comments, and the shared registry-seam reconciliation.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Migrate NEAREST off the string emitter and onto the
ExpandOperatorspass as a capability-driven expander. Emit the portable correlatedLATERALsubquery wherecapabilities.supports_lateralholds (DuckDB, the generic target), and a decorrelatedROW_NUMBER()window-function form where it does not. This ADDS DataFusion support for correlated NEAREST, which previously failed because DataFusion has no correlated-LATERAL physical plan. Prove the fallback returns row-for-row identical results to the LATERAL form across the k=1/2,max_distance,stranded,signed, and duplicate-reference-row cases, and promote the cross-target oracle's previously-pinned_unsupported_pending_142expected-failure to a real three-target identity test. FlipGIQL_EXPANDonGIQLNearestand delete the legacygiqlnearest_sqlemitter, keeping its shared_generate_distance_case/_nearest_*helpers.This is epic #137 wave 3; it carries the shared
ExpanderRegistry.snapshot()/restore()seam that sibling wave-3 PRs also have (dedupe on merge).Closes #142
Proposed changes
Registry save/restore seam
Add
ExpanderRegistry.snapshot()andExpanderRegistry.restore()— a public save/restore pair.snapshot()returns a fresh shallow copy of the(target, operator) -> expanderregistrations;restore()drops the current entries and re-installs exactly a captured snapshot. A test fixture (or plugin) that mutates the process-wideREGISTRYaround a body captures the baseline first and hands it back afterward, so the built-in expanders registered at import survive an isolating fixture that would otherwiseclear()them permanently.giql.expanderspackage and capability-drivennearest.pyAdd the
giql.expanderspackage. Its__init__walks its own submodules withpkgutil.iter_modulesand imports each, so dropping a<operator>.pyinto the package registers its@register(...)expander as an import side effect with no edit to the package file.giql.transpileimports the package once soREGISTRYis populated before the first transpile.Add
nearest.py.expand_nearestbranches onctx.capabilities.supports_lateraland on whether the node is correlated (parent is aLATERAL): lateral-capable targets and every standalone literal-reference placement get the portable LATERAL/standalone subquery, byte-identical to the legacy emitter; a correlated NEAREST on a target without LATERAL support gets the decorrelated fallback. Three load-bearing design points in the fallback:__giql_x_rk_*names into a renamed derived relation that the target is cross-joined against. DataFusion's planner cannot resolve a window ordering over a join whose two sides share column names (both exposestart/end), so the renamed columns keep every reference column distinct from the target's.ROW_NUMBER()is added in the enclosing one. Fused into one level, DataFusion's optimizer mis-derives the window's sort order from the chromosome-equality prefilter and tripsSanityCheckPlan.DISTINCT, candidates are ranked once per distinct reference value, and the rewritten join re-associates the top-kback to every outer row sharing that key. Ranking depends only on the reference value, so ranking once and re-joining is identical to the per-row LATERAL form even when the outer table holds duplicate reference rows.The fallback rewrites
<outer> AS a CROSS JOIN LATERAL (nearest) AS binto<outer> AS a JOIN (<ranked subquery>) AS b ON <ref-key match> AND b.<rn> <= kin place.GIQLNearest.GIQL_EXPANDflip and emitter deletionFlip
GIQLNearest.GIQL_EXPANDtoTrueso NEAREST expands through its registered expander, and deleteBaseGIQLGenerator.giqlnearest_sql(including the old SQLite "LATERAL not supported"ValueErrorbranch). The self-free_generate_distance_case(shared with DISTANCE, #140) and the_nearest_*resolution / passthrough / output-encoding helpers stay onBaseGIQLGeneratorand are reused by the expander, so distance, passthrough, and encoding round-tripping remain byte-for-byte identical.Test updates and the promoted oracle test
Update the emitter-level NEAREST tests to run pass 3 (
ExpandOperators) before generating, matchingtranspile. Addsnapshot/restoreregistry tests and route the registry-isolation fixtures through the new seam. Add a parametrized check that every migrated operator shipsGIQL_EXPAND=True. Promote the cross-target oracle'stest_nearest_on_datafusion_unsupported_pending_142pytest.raises(match="OuterReferenceColumn")pin totest_correlated_nearest_k1_agrees_across_all_targets, a real three-target identity test (generic and duckdb on the LATERAL form via DuckDB, datafusion on the decorrelated fallback).Test cases
TestNearestTranspilationNEAREST(genes, reference := peaks.interval, k := 3)ORDER BY, andLIMIT 3is generatedTestNearestTranspilationNEAREST(..., k := 5, max_distance := 100000)100000distance filter andLIMIT 5is generatedmax_distancefilterTestNearestTranspilationreference := 'chr1:1000-2000'TestNearestTranspilationstranded := trueTestNearestTranspilationsigned := trueTestExpanderRegistryFallbackGapssnapshot()snapshot()independenceTestExpanderRegistryFallbackGapsrestore()restore()semanticsTestOperatorOptOutGIQL_EXPANDattributeTrue, so the operator expands through its registered expanderTestOperatorOptOutGIQL_EXPANDattributeFalse, so the operator still uses the legacy emitterTestCrossTargetOracleNearestCROSS JOIN LATERAL NEAREST(..., k := 1)runs on generic, duckdb, and datafusion_unsupported_pending_142)