
Complete the DataFusion target and finalize capability-driven canonicalization — Closes #145 #166

Merged: conradbzura merged 1 commit into main from 145-complete-datafusion-target on Jun 30, 2026

Summary

Choose the coordinate-canonicalization emit strategy from the active target's capabilities instead of hardcoding a DuckDB-only star REPLACE, completing the DataFusion target (epic #137). The CanonicalizeCoordinates pass now receives the target's Capabilities, and both the wrapper-CTE projection and the NEAREST row passthrough emit * REPLACE when supports_star_replace holds (DuckDB) and the portable * EXCEPT (start, end), <start>, <end> form otherwise (the generic / DataFusion family). This removes the two remaining hardcoded REPLACE assumptions, so a non-canonical coordinate encoding transpiles to engine-runnable SQL on DataFusion. DuckDB output is byte-unchanged; the generic/DataFusion EXCEPT form is row-equivalent (column order differs, since EXCEPT re-appends the recomputed interval columns) and is verified end-to-end on the real DataFusion engine across all four coordinate encodings, custom interval-column names, and strand. DataFusion serialization is finalized on the generic sqlglot output and its capability values promoted from provisional to verified. The SELECT *-over-correlated-NEAREST column leak (#160) remains deferred to the later query-level seam (#146).
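The two projection shapes named in the summary can be sketched as plain string builders. This is a minimal illustration, not the project's implementation: `canonical_projection` and its parameters are hypothetical names, the real pass works on a sqlglot AST rather than strings, and the example conversion (shifting only the start column) is an assumed encoding chosen for illustration.

```python
def canonical_projection(
    supports_star_replace: bool,
    start_sql: str,
    end_sql: str,
    start_col: str = "start",
    end_col: str = "end",
) -> str:
    """Return a SELECT-list fragment that swaps in recomputed interval columns.

    Hypothetical helper: names and signature are assumptions for illustration.
    """
    start_id = f'"{start_col}"'
    end_id = f'"{end_col}"'
    if supports_star_replace:
        # DuckDB-style: the columns are replaced in place, so column order is kept.
        return f"* REPLACE ({start_sql} AS {start_id}, {end_sql} AS {end_id})"
    # Portable form: drop the originals, then re-append the recomputed columns,
    # which moves them to the end of the projection.
    return (
        f"* EXCEPT ({start_id}, {end_id}), "
        f"{start_sql} AS {start_id}, {end_sql} AS {end_id}"
    )


# Assumed example: only the start coordinate needs shifting.
print(canonical_projection(True, '"start" - 1', '"end"'))
print(canonical_projection(False, '"start" - 1', '"end"'))
```

The second form is why the summary calls the output row-equivalent but not column-order-equivalent: the recomputed columns land after every other column.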

Closes #145

Proposed changes

  • canonicalizer.py: canonicalize_coordinates(expression, capabilities=None) threads capabilities into _canonical_projection, which branches between * REPLACE and the portable * EXCEPT; None preserves the historical REPLACE form for direct callers.
  • transpile.py: passes target.capabilities into the pass.
  • generators/base.py: _nearest_passthrough gains the same capability gate (resolving its TODO(#142)).
  • expanders/nearest.py: _distance_and_filters threads ctx.capabilities to the passthrough from both the LATERAL and decorrelated-fallback forms.
  • targets.py: DataFusionTarget docstring promoted from "provisional" to verified, recording which capability values the cross-target oracle validates.
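The threading described in the list above, including the `capabilities=None` default, might look roughly like the sketch below. Everything here is a stand-in: `Capabilities`, `Target`, `TARGETS`, `choose_projection_form`, and `transpile_projection_form` are hypothetical names, and only the `supports_star_replace` flag and its per-target values come from the PR description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Capabilities:
    """Hypothetical stand-in for the project's Capabilities object."""
    supports_star_replace: bool


@dataclass(frozen=True)
class Target:
    """Hypothetical stand-in for a transpilation target."""
    name: str
    capabilities: Capabilities


# Assumed registry; the flag values follow the PR description
# (DuckDB supports star REPLACE, the generic / DataFusion family does not).
TARGETS = {
    "duckdb": Target("duckdb", Capabilities(supports_star_replace=True)),
    "generic": Target("generic", Capabilities(supports_star_replace=False)),
    "datafusion": Target("datafusion", Capabilities(supports_star_replace=False)),
}


def choose_projection_form(capabilities: Optional[Capabilities] = None) -> str:
    """Pick the emit strategy; None keeps the historical REPLACE form."""
    if capabilities is None or capabilities.supports_star_replace:
        return "replace"
    return "except"


def transpile_projection_form(dialect: str) -> str:
    """Mimic transpile passing the active target's capabilities into the pass."""
    target = TARGETS[dialect]
    return choose_projection_form(target.capabilities)
```

Treating `None` as REPLACE-capable means a direct caller that predates the capability argument sees no change in output, which matches the historical-default contract stated in the list.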

Test cases

| # | Test Suite | Given | When | Then | Coverage Target |
|---|------------|-------|------|------|-----------------|
| 1 | TestCanonicalProjectionCapabilities | A non-canonical operand and capabilities lacking star-replace | canonicalize_coordinates is called directly | The wrapper emits the portable * EXCEPT form | EXCEPT branch via the public capabilities arg |
| 2 | TestCanonicalProjectionCapabilities | A non-canonical operand and REPLACE-capable capabilities | canonicalize_coordinates is called directly | The wrapper emits * REPLACE | REPLACE branch via explicit capabilities |
| 3 | TestCanonicalProjectionCapabilities | A non-canonical operand and no capabilities argument | canonicalize_coordinates is called | The wrapper defaults to * REPLACE | capabilities=None historical-default contract |
| 4 | TestCanonicalProjectionCapabilities | An AST already canonicalized once | The pass runs a second time | No second wrapper CTE is synthesized | Idempotence of the EXCEPT wrapper |
| 5 | TestCanonicalProjectionCapabilities | A non-canonical DISJOIN | Transpiling with dialect="datafusion" | The wrapper and passthrough emit * EXCEPT with no leaked operator | DataFusion dialect threads its capabilities |
| 6 | TestNearestTargetCanonicalization / TestBaseGIQLGenerator | A non-canonical NEAREST target | Transpiling for DuckDB and generic | DuckDB emits * REPLACE, generic emits * EXCEPT, sharing the distance CASE | Wrapper + passthrough both branches via transpile |
| 7 | TestDisjoinEncodingSweepOnDataFusion | Overlapping intervals in each of the four encodings | The DISJOIN oracle runs all three targets | Every target returns identical de-canonicalized sub-segments | EXCEPT canonicalization on the real DataFusion engine, all encodings |
| 8 | TestNearestEncodingSweepOnDataFusion | A non-canonical NEAREST target per encoding (incl. custom columns) | The oracle runs datafusion vs duckdb | Both return the same nearest interval and distance | EXCEPT NEAREST passthrough on DataFusion, all encodings + custom columns |
| 9 | TestDataFusionSchemaAxes | Custom-column DISJOIN and stranded MERGE | The oracle runs all three targets | Every target agrees | Custom-column and strand axes on DataFusion |
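Because the EXCEPT form re-appends the interval columns, a cross-target comparison has to match rows by column name rather than by position. The sketch below shows one way such a check could be written. It is not the project's oracle: `rows_by_name` and `row_equivalent` are hypothetical helpers, and the sample column names and rows are made up for illustration.

```python
from collections import Counter
from typing import Any, Sequence, Tuple

ResultSet = Tuple[Sequence[str], Sequence[Sequence[Any]]]


def rows_by_name(columns: Sequence[str], rows: Sequence[Sequence[Any]]) -> Counter:
    """Key each row by (column name, value) pairs so column order is irrelevant.

    Column names are assumed unique and values hashable. A Counter is used so
    duplicate rows are still counted while row order is ignored.
    """
    return Counter(
        tuple(sorted(zip(columns, row), key=lambda pair: pair[0])) for row in rows
    )


def row_equivalent(left: ResultSet, right: ResultSet) -> bool:
    """True when both result sets hold the same rows, ignoring column and row order."""
    return rows_by_name(*left) == rows_by_name(*right)


# Illustrative data only: REPLACE keeps the interval columns in place,
# EXCEPT moves them to the end of the projection.
replace_result = (
    ["chrom", "start", "end", "name"],
    [("chr1", 0, 10, "a"), ("chr1", 5, 20, "b")],
)
except_result = (
    ["chrom", "name", "start", "end"],
    [("chr1", "b", 5, 20), ("chr1", "a", 0, 10)],
)
print(row_equivalent(replace_result, except_result))
```

A positional comparison of the same two result sets would fail even though every row carries identical values, which is the distinction the oracle rows in the table rely on.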

conradbzura self-assigned this on Jun 30, 2026

Complete the DataFusion target (epic #137) by choosing the coordinate
canonicalization emit strategy from the active target's capabilities
rather than hardcoding a DuckDB-only star REPLACE.

The CanonicalizeCoordinates pass now takes the target's Capabilities, and
both the wrapper-CTE projection and the NEAREST row passthrough emit a
star REPLACE when supports_star_replace holds (DuckDB) and the portable
star EXCEPT form otherwise (the generic / DataFusion family). This
removes the two remaining hardcoded REPLACE assumptions — the
canonicalizer wrapper and the NEAREST passthrough — so a non-canonical
coordinate encoding transpiles to engine-runnable SQL on DataFusion. The
capability is threaded from transpile through the pass and through the
NEAREST expander; a direct caller that passes no capabilities keeps the
REPLACE form, preserving historical behavior.

DuckDB output is byte-unchanged. The generic and DataFusion targets now
emit the portable EXCEPT form, which is row-equivalent but not
column-order-equivalent (EXCEPT re-appends the recomputed interval
columns). This is verified end-to-end on the real DataFusion engine
across all four coordinate encodings, custom interval-column names, and
strand.

DataFusion serialization is finalized: it uses the generic sqlglot
output, validated by the cross-target oracle, and its capability values
are promoted from provisional to verified. The SELECT * over a correlated
NEAREST column leak remains deferred to a later query-level seam.

conradbzura force-pushed the 145-complete-datafusion-target branch from a25bee9 to 07d6cad on June 30, 2026 18:16
conradbzura marked this pull request as ready for review on June 30, 2026 19:21
conradbzura merged commit 0efcfc4 into main on Jun 30, 2026
3 checks passed