
Hide NEAREST decorrelated-fallback internal columns from SELECT * on DataFusion #160

Description

@conradbzura

The DataFusion NEAREST path uses a decorrelated ROW_NUMBER() fallback (added in #142 / PR #158) because DataFusion has no correlated-LATERAL plan. The rewritten relation b must expose four internal helper columns: the reference/partition key (__giql_x_rk_chrom, __giql_x_rk_start, __giql_x_rk_end) and the rank (__giql_x_rn). They have to stay visible because the decorrelated equi-join's condition, ON a.<key> = b.__giql_x_rk_* AND b.__giql_x_rn <= k, resolves against them. Consequently, SELECT * / SELECT b.* over a correlated NEAREST returns those four reserved-prefix columns on DataFusion, whereas the DuckDB LATERAL form does not. The nearest rows and their values are correct; the divergence is limited to extra, internally named columns leaking into user output, which makes the output schema differ per target.

Example: SELECT b.* FROM peaks a CROSS JOIN LATERAL NEAREST(genes, reference := a.interval, k := 1) AS b

  • DuckDB / generic (LATERAL): genes.*, distance
  • DataFusion (fallback): genes.*, distance, __giql_x_rk_chrom, __giql_x_rk_start, __giql_x_rk_end, __giql_x_rn
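To make the leak concrete, the following is a minimal pure-Python emulation of the decorrelated fallback described above, not the project's implementation. The helper column names come from this issue; the function names, the sample rows, the `name` column on genes, and the distance metric (gap between intervals, 0 on overlap) are assumptions made for illustration only.

```python
# Helper column names as listed in this issue.
HELPER_COLS = (
    "__giql_x_rk_chrom",
    "__giql_x_rk_start",
    "__giql_x_rk_end",
    "__giql_x_rn",
)


def interval_distance(a, b):
    """Gap between two intervals, 0 when they overlap (assumed metric)."""
    return max(0, b["start"] - a["end"], a["start"] - b["end"])


def ranked_candidates(peaks, genes):
    """Emulate the rewritten relation b: candidates ranked per reference key.

    Mirrors ROW_NUMBER() OVER (PARTITION BY <reference key> ORDER BY distance).
    """
    out = []
    for p in peaks:
        cands = [g for g in genes if g["chrom"] == p["chrom"]]
        cands.sort(key=lambda g: interval_distance(p, g))
        for rn, g in enumerate(cands, start=1):
            out.append({
                **g,
                "distance": interval_distance(p, g),
                "__giql_x_rk_chrom": p["chrom"],
                "__giql_x_rk_start": p["start"],
                "__giql_x_rk_end": p["end"],
                "__giql_x_rn": rn,
            })
    return out


def nearest_b_star(peaks, genes, k):
    """Emulate SELECT b.* with ON a.<key> = b.__giql_x_rk_* AND b.__giql_x_rn <= k."""
    b = ranked_candidates(peaks, genes)
    rows = []
    for p in peaks:
        for r in b:
            same_key = (
                r["__giql_x_rk_chrom"] == p["chrom"]
                and r["__giql_x_rk_start"] == p["start"]
                and r["__giql_x_rk_end"] == p["end"]
            )
            if same_key and r["__giql_x_rn"] <= k:
                rows.append(r)  # b.* carries the helper columns along
    return rows
```

Because the join condition reads the key and rank columns from b, they have to be part of b's row shape, and a star projection over b picks them up unchanged.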

The interim mitigation in #158 narrows the cross-target identity claim to explicitly-projected queries and pins the divergence with an xfail/skip oracle case; this issue tracks the real fix.

Expected behavior

SELECT * / SELECT b.* over a correlated NEAREST returns the same column set on DataFusion as on DuckDB — the genuine passthrough columns plus distance, with no reserved __giql_x_* columns. The interim xfail oracle case is promoted to a real cross-target SELECT * identity test.
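The promoted identity test could check the output schema along these lines. This is a sketch under assumptions: the function names are hypothetical, the `__giql_x_` prefix is taken from the column names in this issue, and the column lists would come from whatever result objects the real oracle harness produces.

```python
RESERVED_PREFIX = "__giql_x_"  # prefix as it appears in this issue's column names


def leaked_columns(columns):
    """Return the output columns that carry the reserved internal prefix."""
    return [c for c in columns if c.startswith(RESERVED_PREFIX)]


def assert_same_star_schema(duckdb_cols, datafusion_cols):
    """Check that both targets expose the same column set, with no helper columns."""
    leaked = leaked_columns(datafusion_cols)
    assert not leaked, f"internal columns leaked into output: {leaked}"
    assert list(duckdb_cols) == list(datafusion_cols), (
        f"schema mismatch: {list(duckdb_cols)} != {list(datafusion_cols)}"
    )
```

Checking for the prefix first gives a more specific failure message than a bare schema comparison when the leak regresses.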

Root cause

The leak cannot be removed at the operator-local level: b.* is the user's projection, which a node-local OperatorExpander.expand(node, ctx) -> exp.Expression does not own, and the decorrelated join requires the key and rank columns to stay visible on b. Removing them needs a rewrite of the enclosing SELECT — a query-level capability the expander contract does not have today.

This depends on the query-level expander seam introduced as part of #146 (finalize and document the extension hook). Once that seam exists, NEAREST's DataFusion fallback registers a query-level finalizer that wraps the rewritten statement in SELECT * EXCEPT (__giql_x_rk_chrom, __giql_x_rk_start, __giql_x_rk_end, __giql_x_rn) FROM (...), gated to the DataFusion fallback path (the DuckDB LATERAL form never leaks). Verified on DataFusion 53.0.0: SELECT * EXCEPT (cols) is supported, and DataFusion tolerates duplicate output names from a.*, b.* through an * EXCEPT wrapper, so no column enumeration or full-schema knowledge is required (the "Projections require unique expression names" error from #153 is specific to UNION/set-operation branches, not plain derived-table star projection). The same query-level seam is shared with #141's deferred IEJoin whole-query fold.
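The wrapper itself is small. The sketch below works on SQL strings purely for illustration; the real finalizer would presumably operate on the expression tree through the #146 seam, whose interface is not defined in this issue. The names `wrap_star_except` and `finalize`, and the `target` / `used_fallback` parameters, are hypothetical.

```python
# Helper column names as listed in this issue.
HELPER_COLUMNS = (
    "__giql_x_rk_chrom",
    "__giql_x_rk_start",
    "__giql_x_rk_end",
    "__giql_x_rn",
)


def wrap_star_except(sql, hidden=HELPER_COLUMNS):
    """Wrap a statement so the helper columns are dropped from its star output."""
    inner = sql.strip().rstrip(";").rstrip()
    return f"SELECT * EXCEPT ({', '.join(hidden)}) FROM ({inner})"


def finalize(sql, target, used_fallback):
    """Apply the wrapper only on the DataFusion fallback path (hypothetical gate)."""
    if target == "datafusion" and used_fallback:
        return wrap_star_except(sql)
    return sql  # the DuckDB LATERAL form never leaks, so leave it untouched
```

Gating on both the target and the fallback flag keeps statements that never produced helper columns from gaining an extra derived-table layer.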

Labels

bug (Something isn't working)
