Hide NEAREST decorrelated-fallback internal columns from SELECT * on DataFusion (Closes #160) #171
Draft
conradbzura wants to merge 1 commit into
A correlated NEAREST on DataFusion uses a decorrelated ROW_NUMBER fallback whose join must expose reserved rank/key columns for its ON clause; a SELECT * or SELECT b.* over it leaked those internal columns into user output, diverging from the DuckDB LATERAL form's schema.
Node-local expanders cannot rewrite the enclosing statement, so this adds a query-level seam: ExpansionContext.add_statement_finalizer registers a StatementFinalizer that expand_operators applies to the statement root after all node-local replacements (it may return a new root). The NEAREST DataFusion fallback registers one that wraps the enclosing SELECT in SELECT * EXCEPT (...) when, and only when, a surfacing star projection would expose the reserved columns, so explicit projections are left untouched. StatementFinalizer is exported from giql for parity with OperatorExpander.
Serialization is unchanged for every other query and target; the cross-target identity claim now holds for star projections too.
Claude-Session: https://claude.ai/code/session_01ALxmQysPad4W68wuWuft6W
Summary
A correlated NEAREST on DataFusion has no correlated-LATERAL plan, so it expands to a decorrelated ROW_NUMBER() fallback whose rewritten join b must expose reserved rank/key helper columns (__giql_x_rk_*, __giql_x_rn) for its ON clause. A SELECT * / SELECT b.* over it leaked those internal columns into user output, diverging from the DuckDB LATERAL form's schema.
A node-local OperatorExpander cannot rewrite the enclosing query, so this adds a query-level statement-finalizer seam to the expander framework and uses it to project the reserved columns away. The seam is a supported public extension point; the NEAREST fix is its first consumer. Serialization is unchanged for every other query and target, and the cross-target result-identity claim now holds for star projections too (the interim xfail is promoted to a real identity test).
Closes #160
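The leak described above depends only on the shape of the projection list: an unqualified * or a b.* surfaces the helper columns, while explicit columns and a.* do not. A minimal sketch of that decision follows. It works on plain strings rather than the project's expression tree, and the names surfaces_reserved, wrap_if_needed, and the concrete reserved list are hypothetical, chosen only for illustration; the emitted SQL text is likewise illustrative, not the PR's actual serialization.

```python
from typing import Iterable, Sequence


def surfaces_reserved(projections: Iterable[str], fallback_alias: str = "b") -> bool:
    """Return True if a projection would expose the fallback join's helper columns.

    Only an unqualified star or a star qualified by the fallback join's alias
    can surface them; explicit columns and stars on other aliases cannot.
    """
    for projection in projections:
        item = projection.strip()
        if item == "*" or item == f"{fallback_alias}.*":
            return True
    return False


def wrap_if_needed(
    select_sql: str,
    projections: Sequence[str],
    reserved: Sequence[str],
    fallback_alias: str = "b",
) -> str:
    """Wrap the SELECT in SELECT * EXCEPT (...) only when reserved columns surface.

    Wrapping unconditionally would name columns that are absent from the
    output, which the engine would reject at runtime.
    """
    if not surfaces_reserved(projections, fallback_alias):
        return select_sql
    return f"SELECT * EXCEPT ({', '.join(reserved)}) FROM ({select_sql})"


# Illustrative reserved names; the real set is derived by the fallback itself.
RESERVED = ["__giql_x_rk_strand", "__giql_x_rn"]
```

With these helpers, a b.* projection is wrapped and an a.*-only projection passes through unchanged, which mirrors the behaviour the summary describes.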
Proposed changes
Add a query-level statement-finalizer seam
Introduce StatementFinalizer (a Callable[[Expression], Expression]) and ExpansionContext.add_statement_finalizer. expand_operators threads one shared finalizer list through the run and applies each registered finalizer to the statement root, in registration order, after all node-local replacements, so a finalizer may return a new root. Each expand_operators call owns its own list, so a re-entrant CLUSTER/MERGE call finalizes its own subtree. Export StatementFinalizer from giql for parity with OperatorExpander, and document the seam in the extension guide.
Hide the NEAREST fallback's reserved columns
Register a finalizer from the DataFusion fallback that wraps the join's enclosing SELECT (resolved lazily via join.parent_select) in SELECT * EXCEPT (...), but only when the projection surfaces the reserved columns (an unqualified * or b.*). Leave explicit projections and a.* untouched, since wrapping absent columns would fail at engine runtime. Target the enclosing select rather than the statement root so a SELECT * over an explicit-only inner query stays correct.
Reconcile docs
Close the "documented gap" notes in the DataFusionTarget docstring and distance-operators.rst, and document add_statement_finalizer as the query-level boundary in extending.rst.
Test cases
TestStatementFinalizer: expand_operators runs
TestStatementFinalizer: expand_operators runs
TestStatementFinalizer: expand_operators runs
TestStatementFinalizer: expand_operators calls
TestNearestFallbackReservedColumnProjection: SELECT b.* or SELECT * NEAREST on DataFusion; SELECT * EXCEPT (...) whose EXCEPT set equals the reserved columns exactly
TestNearestFallbackReservedColumnProjection: SELECT b.* NEAREST on DataFusion; __giql_x_rk_strand
TestNearestFallbackReservedColumnProjection: a.*-only, and nested-star-over-explicit projections
TestNearestFallbackReservedColumnProjection: SELECT * over an inner SELECT b.*, and the generic LATERAL target
TestPublicApiSurface: giql package; StatementFinalizer is imported from the root and giql.expander __all__s, and is the same object
TestCrossTargetOracleNearest: b.*, *, stranded b.*, or k := 3 b.* NEAREST
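The statement-finalizer seam described under "Proposed changes" can be sketched as follows. This is a toy model, not the giql implementation: Expression is stood in for by a plain string, expand_operators takes a hypothetical list of expander callables, and node-local replacement is reduced to a whole-string rewrite. Only the ordering contract is modelled: each call owns its own finalizer list, and finalizers run on the root in registration order after all expanders have run.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence

# Stand-in for the real expression type; a string keeps the sketch runnable.
Expression = str
StatementFinalizer = Callable[[Expression], Expression]


@dataclass
class ExpansionContext:
    """Hypothetical minimal context: only the finalizer list is modelled."""

    finalizers: List[StatementFinalizer] = field(default_factory=list)

    def add_statement_finalizer(self, finalizer: StatementFinalizer) -> None:
        self.finalizers.append(finalizer)


# Assumed expander shape for this sketch: (root, context) -> new root.
Expander = Callable[[Expression, ExpansionContext], Expression]


def expand_operators(root: Expression, expanders: Sequence[Expander]) -> Expression:
    ctx = ExpansionContext()  # each call owns its own finalizer list
    for expand in expanders:
        root = expand(root, ctx)  # node-local work, simplified
    for finalize in ctx.finalizers:  # registration order
        root = finalize(root)  # a finalizer may return a new root
    return root


def toy_nearest_expander(root: Expression, ctx: ExpansionContext) -> Expression:
    """Rewrite NEAREST and register a finalizer that hides one helper column."""
    if "NEAREST" not in root:
        return root

    def hide_reserved(stmt: Expression) -> Expression:
        # Wrap only when a star projection would surface the helper column.
        if not stmt.startswith("SELECT * "):
            return stmt
        return f"SELECT * EXCEPT (__giql_x_rn) FROM ({stmt})"

    ctx.add_statement_finalizer(hide_reserved)
    return root.replace("NEAREST", "FALLBACK_JOIN")
```

Calling expand_operators("SELECT * FROM a, NEAREST(b)", [toy_nearest_expander]) wraps the rewritten statement, while a query with an explicit projection is returned with only the node-local rewrite applied. Keeping the list per call is what lets a re-entrant call finalize only its own subtree.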