Description
The DataFusion NEAREST path uses a decorrelated ROW_NUMBER() fallback (added in #142 / PR #158) because DataFusion has no correlated-LATERAL plan. The rewritten relation b must expose internal helper columns: the reference/partition key (__giql_x_rk_chrom, __giql_x_rk_start, __giql_x_rk_end) and the rank (__giql_x_rn). They have to stay visible because the decorrelated equi-join's ON a.<key> = b.__giql_x_rk_* AND b.__giql_x_rn <= k resolves against them. Consequently, SELECT * / SELECT b.* over a correlated NEAREST returns four reserved-prefix columns on DataFusion that the DuckDB LATERAL form does not return. The nearest rows and values themselves are correct; the divergence is that extra, internally named columns leak into user output, so the output schema differs per target.
Example: SELECT b.* FROM peaks a CROSS JOIN LATERAL NEAREST(genes, reference := a.interval, k := 1) AS b
- DuckDB / generic (LATERAL):
genes.*, distance
- DataFusion (fallback):
genes.*, distance, __giql_x_rk_chrom, __giql_x_rk_start, __giql_x_rk_end, __giql_x_rn
The interim mitigation in #158 narrows the cross-target identity claim to explicitly-projected queries and pins the divergence with an xfail/skip oracle case; this issue tracks the real fix.
Expected behavior
SELECT * / SELECT b.* over a correlated NEAREST returns the same column set on DataFusion as on DuckDB — the genuine passthrough columns plus distance, with no reserved __giql_x_* columns. The interim xfail oracle case is promoted to a real cross-target SELECT * identity test.
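The check that the promoted identity test would make can be sketched roughly as follows. This is an illustration only: the function names and the genes column names (chrom, start, end, name) are assumptions, not taken from the codebase. Only the __giql_x_ prefix and the four helper column names come from this issue. The sketch compares ordered column lists, which is stricter than a pure set comparison.

```python
# Hypothetical helpers; not part of the existing test suite.
RESERVED_PREFIX = "__giql_x_"


def leaked_columns(columns):
    """Return the output columns that carry the reserved internal prefix."""
    return [c for c in columns if c.startswith(RESERVED_PREFIX)]


def schemas_match(duckdb_columns, datafusion_columns):
    """True when both targets expose the same columns and nothing leaks."""
    return (
        list(duckdb_columns) == list(datafusion_columns)
        and not leaked_columns(datafusion_columns)
    )


# Assumed passthrough columns for genes.*, plus distance.
duckdb_cols = ["chrom", "start", "end", "name", "distance"]

# Today's DataFusion fallback output for SELECT b.*.
datafusion_cols = duckdb_cols + [
    "__giql_x_rk_chrom",
    "__giql_x_rk_start",
    "__giql_x_rk_end",
    "__giql_x_rn",
]

current_leak = leaked_columns(datafusion_cols)
matches_today = schemas_match(duckdb_cols, datafusion_cols)
matches_after_fix = schemas_match(duckdb_cols, list(duckdb_cols))
```

With the current fallback, matches_today is False and current_leak lists the four helper columns. Once the fix lands, the DataFusion column list should equal the DuckDB one.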
Root cause
The leak cannot be removed at the operator-local level: b.* is the user's projection, which a node-local OperatorExpander.expand(node, ctx) -> exp.Expression does not own, and the decorrelated join requires the key and rank columns to stay visible on b. Removing them needs a rewrite of the enclosing SELECT — a query-level capability the expander contract does not have today.
This depends on the query-level expander seam introduced as part of #146 (finalize and document the extension hook). Once that seam exists, NEAREST's DataFusion fallback registers a query-level finalizer that wraps the rewritten statement in SELECT * EXCEPT (__giql_x_rk_chrom, __giql_x_rk_start, __giql_x_rk_end, __giql_x_rn) FROM (...), gated to the DataFusion fallback path (the DuckDB LATERAL form never leaks). Verified on DataFusion 53.0.0: SELECT * EXCEPT (cols) is supported, and DataFusion tolerates duplicate output names from a.*, b.* through an * EXCEPT wrapper, so no column enumeration or full-schema knowledge is required (the "Projections require unique expression names" error from #153 is specific to UNION/set-operation branches, not plain derived-table star projection). The same query-level seam is shared with #141's deferred IEJoin whole-query fold.
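A minimal sketch of the wrapper the finalizer would produce, assuming the rewritten statement is available as a SQL string. The function name, its parameters, and the string-based approach are all hypothetical; the real finalizer would presumably operate on the expression tree through the #146 seam, whose API is not yet defined. Only the helper column names and the SELECT * EXCEPT (...) FROM (...) shape come from this issue.

```python
# Helper column names as listed in this issue.
HELPER_COLUMNS = (
    "__giql_x_rk_chrom",
    "__giql_x_rk_start",
    "__giql_x_rk_end",
    "__giql_x_rn",
)


def wrap_excluding_helpers(rewritten_sql, target, used_fallback):
    """Hypothetical finalizer: hide helper columns on the DataFusion fallback only.

    Any other target, or a DataFusion query that did not take the
    decorrelated fallback, is returned unchanged.
    """
    if target != "datafusion" or not used_fallback:
        return rewritten_sql
    # Drop a trailing semicolon so the statement can sit inside a derived table.
    inner = rewritten_sql.strip().rstrip(";").rstrip()
    excluded = ", ".join(HELPER_COLUMNS)
    return f"SELECT * EXCEPT ({excluded}) FROM ({inner})"


wrapped = wrap_excluding_helpers("SELECT b.* FROM t AS b;", "datafusion", True)
untouched = wrap_excluding_helpers("SELECT b.* FROM t AS b;", "duckdb", False)
```

Gating on both the target and the fallback flag keeps the DuckDB LATERAL form untouched, matching the point above that it never leaks.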