4 changes: 2 additions & 2 deletions docs/dialect/distance-operators.rst
@@ -312,11 +312,11 @@ Find nearby same-strand features within distance constraints:
Target support
~~~~~~~~~~~~~~

A correlated ``NEAREST`` (its reference is an outer-row column) runs on lateral-capable engines — DuckDB and the generic target — via a correlated ``LATERAL`` subquery, and on Apache DataFusion, which has no correlated-``LATERAL`` physical plan, via a decorrelated window-function rewrite. For an **explicitly-projected** query (one that selects named columns, e.g. ``SELECT a.start, b.start, b.distance``) the two forms return identical results: the ``(start, end)`` tiebreaker orders rows tied at the k-th distance the same way on every engine, deterministically whenever ``(start, end)`` distinguishes the tied candidates. A standalone ``NEAREST`` with a literal reference is uncorrelated and uses the same ordered, limited subquery on every target.
A correlated ``NEAREST`` (its reference is an outer-row column) runs on lateral-capable engines — DuckDB and the generic target — via a correlated ``LATERAL`` subquery, and on Apache DataFusion, which has no correlated-``LATERAL`` physical plan, via a decorrelated window-function rewrite. For a single correlated ``NEAREST`` per query the two forms return the same result set, including under ``SELECT *`` / ``SELECT b.*`` (on DataFusion the internal helper columns are projected away — see the note below): the ``(start, end)`` tiebreaker orders rows tied at the k-th distance the same way on every engine, deterministically whenever ``(start, end)`` distinguishes the tied candidates. A trailing top-level ``ORDER BY`` is preserved as a top-level ordering only on the ``LATERAL`` path; under the DataFusion star-projection wrapper it sinks into the wrapped subquery, so rely on the *row set* rather than its order there. Not covered — both leak the helper columns on DataFusion, parallel to the residual noted for ``DataFusionTarget``: two correlated ``NEAREST`` fallbacks in one query, and a correlated ``NEAREST`` whose reserved columns are re-surfaced by an enclosing ``SELECT *`` *outside* its own SELECT (e.g. a wrapping ``CLUSTER``). A standalone ``NEAREST`` with a literal reference is uncorrelated and uses the same ordered, limited subquery on every target.

.. note::

**Known limitation —** ``SELECT *`` **/** ``SELECT b.*`` **over a correlated NEAREST on DataFusion.** The decorrelated window-function rewrite needs its reference-key and rank columns (``__giql_x_rk_*``, ``__giql_x_rn``) visible on the rewritten join, so a ``SELECT *`` or ``SELECT b.*`` over a correlated NEAREST exposes those reserved internal columns on DataFusion — a different output schema than the LATERAL form emits on DuckDB. The cross-target identity claim above therefore holds for **explicitly-projected** queries only. Projecting named columns avoids the leak entirely. A query-level wrapper that projects the reserved columns away on the DataFusion path is tracked by `#160 <https://github.com/abdenlab/giql/issues/160>`_ (it depends on the query-level expander seam from #146).
``SELECT *`` **/** ``SELECT b.*`` **over a correlated NEAREST on DataFusion.** The decorrelated window-function rewrite must expose its reference-key and rank columns (``__giql_x_rk_*``, ``__giql_x_rn``) on the rewritten join for the equi-join to resolve. To keep them out of user output, the DataFusion path wraps the enclosing ``SELECT`` in ``SELECT * EXCEPT (<reference-key and rank columns>) FROM (...)`` (the ``EXCEPT`` list is the explicit reserved column names) when a ``SELECT *`` / ``SELECT b.*`` would surface them, so those projections return the same columns as the DuckDB LATERAL form (#160). Explicitly-projected queries never surface the reserved columns and get no wrapper.
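As a rough illustration of the wrapper shape described in this note, the sketch below builds the outer ``SELECT * EXCEPT (...) FROM (...)`` text from a list of reserved column names. It works on plain strings only; ``wrap_star_except`` is a hypothetical helper, not part of giql, and ``__giql_x_rk_0`` is an assumed concrete instance of the documented ``__giql_x_rk_*`` pattern.

```python
def wrap_star_except(inner_sql: str, reserved: list[str]) -> str:
    """Build a wrapper that projects the reserved helper columns away.

    Hypothetical string-level helper for illustration only; the real
    DataFusion path rewrites the parsed statement, not SQL text.
    """
    except_list = ", ".join(reserved)
    return f"SELECT * EXCEPT ({except_list}) FROM ({inner_sql})"


# Assumed shape of a rewritten join that surfaces the helper columns.
inner = "SELECT * FROM a JOIN ranked AS b ON a.id = b.__giql_x_rk_0"
wrapped = wrap_star_except(inner, ["__giql_x_rk_0", "__giql_x_rn"])
print(wrapped)
```

The ``EXCEPT`` list names each reserved column explicitly, which matches the note's description of the list as "the explicit reserved column names".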

Notes
~~~~~
34 changes: 29 additions & 5 deletions docs/transpilation/extending.rst
@@ -143,11 +143,35 @@ one expression that replaces the operator node in place. It cannot *return* a
reshaped enclosing query. An expander may still restructure the query it sits in
as a side effect and then return the node unchanged — the built-in CLUSTER and
MERGE expanders do exactly this, rewriting their single-table ``SELECT`` in place.
What no expander can express is a rewrite that **adds or reshapes joins** across
relations: the DuckDB IEJoin plan for column-to-column INTERSECTS joins is handled
by a capability-gated pre-pass transformer, not an expander, because it restructures
the surrounding join. A general query-level expander seam for such join rewrites is
planned future work.

When an expander must rewrite the **enclosing statement** — wrap the top-level
``SELECT``, or reshape a projection it does not own — it registers a *statement
finalizer* via :meth:`~giql.expander.ExpansionContext.add_statement_finalizer`.
The pass applies every registered finalizer to the statement, in registration
order, after all node-local replacements complete; each receives the current
statement root and returns the (possibly new) root. The built-in NEAREST
DataFusion fallback uses this to wrap its output in
``SELECT * EXCEPT (...)`` and hide the reserved rank/key columns its decorrelated
join must expose:

.. code-block:: python

    def expand(self, node, ctx):
        # ... rewrite the node / enclosing join in place ...
        ctx.add_statement_finalizer(lambda root: wrap_or_return(root))
        return node

A finalizer's returned root is emitted **as-is** — the pass does not re-validate
it — so a finalizer that reshapes a projection must not reference columns or
relations absent from what it rewrites. Wrapping a projection in
``SELECT * EXCEPT (missing_col)``, for instance, builds without error at transpile
time but fails at engine runtime. The built-in fallback guards this by wrapping
only when the projection genuinely surfaces the columns it excepts; a custom
finalizer should apply the same discipline.
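The guard described above can be sketched with a toy statement model. Everything here is hypothetical: ``Select`` is a stand-in dataclass rather than a sqlglot node, and ``make_hiding_finalizer`` is an illustrative name. The point is only the discipline: wrap when the projection actually surfaces the reserved columns, and return the root untouched otherwise.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Select:
    """Toy statement root (hypothetical stand-in, not a sqlglot expression)."""

    columns: list[str]                  # column names the projection surfaces
    excluded: tuple[str, ...] = ()      # names listed in SELECT * EXCEPT (...)
    source: Optional["Select"] = None   # wrapped inner statement, if any


def make_hiding_finalizer(reserved: set[str]) -> Callable[[Select], Select]:
    """Return a finalizer that hides ``reserved`` columns only when surfaced."""

    def finalize(root: Select) -> Select:
        surfaced = [c for c in root.columns if c in reserved]
        if not surfaced:
            # Nothing reserved is visible: excepting an absent column would
            # build fine but fail at engine runtime, so leave the root alone.
            return root
        kept = [c for c in root.columns if c not in reserved]
        return Select(columns=kept, excluded=tuple(surfaced), source=root)

    return finalize


finalize = make_hiding_finalizer({"__giql_x_rn"})
star = Select(columns=["start", "end", "__giql_x_rn"])
explicit = Select(columns=["start", "end"])
wrapped = finalize(star)
untouched = finalize(explicit)
```

In this sketch the explicitly-projected statement comes back as the same object, while the star projection comes back as a new root whose ``source`` is the original.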

The one query-level rewrite that is *not* an expander is a fold that **adds or
reshapes joins** across relations: the DuckDB IEJoin plan for column-to-column
INTERSECTS joins stays a capability-gated pre-pass transformer by design.


Undoing a registration
2 changes: 2 additions & 0 deletions src/giql/__init__.py
@@ -8,6 +8,7 @@
from giql.expander import ExpanderRegistry
from giql.expander import ExpansionContext
from giql.expander import OperatorExpander
from giql.expander import StatementFinalizer
from giql.expander import register
from giql.table import Table
from giql.targets import Capabilities
@@ -30,6 +31,7 @@
"ExpanderRegistry",
"ExpansionContext",
"OperatorExpander",
"StatementFinalizer",
"Target",
"Capabilities",
"GenericTarget",
101 changes: 89 additions & 12 deletions src/giql/expander.py
@@ -71,6 +71,7 @@
"EXPAND_ALIAS_PREFIX",
"ExpansionContext",
"OperatorExpander",
"StatementFinalizer",
"ExpanderRegistry",
"RegistrySnapshot",
"REGISTRY",
@@ -113,9 +114,24 @@ class ExpansionContext:
SELECT it just restructured and expand sibling operators it copied into
it, honoring a custom-registry pass run. ``None`` for a standalone
context built outside the pass.

A node-local expander that needs to rewrite the *enclosing* statement (rather
than just replace its own node) registers a :data:`StatementFinalizer` via
:meth:`add_statement_finalizer`; the pass applies every finalizer to the
statement after all node-local replacements. This is the query-level seam for
a target whose expansion must reshape the enclosing statement — for example to
project internal helper columns out of a surfacing ``SELECT *``.
"""

__slots__ = ("node", "resolution", "target", "tables", "registry", "_alias_seq")
__slots__ = (
"node",
"resolution",
"target",
"tables",
"registry",
"_alias_seq",
"_finalizers",
)

def __init__(
self,
@@ -125,6 +141,7 @@ def __init__(
tables: Tables,
alias_seq: Callable[[], str] | None = None,
registry: ExpanderRegistry | None = None,
finalizers: list[StatementFinalizer] | None = None,
) -> None:
self.node = node
self.resolution = resolution
@@ -135,12 +152,37 @@ def __init__(
# ``ExpandOperators`` run so aliases minted for sibling operators never
# collide; a standalone context falls back to its own sequence.
self._alias_seq = alias_seq or name_sequence(EXPAND_ALIAS_PREFIX)
# A single finalizer list is likewise shared across one run's contexts so
# a finalizer registered while expanding one node is applied once, after
# every node-local replacement; a standalone context gets its own (inert
# unless someone drives it manually).
self._finalizers = finalizers if finalizers is not None else []

@property
def capabilities(self):
"""The active target's :class:`~giql.targets.Capabilities`."""
return self.target.capabilities

def add_statement_finalizer(self, finalizer: StatementFinalizer) -> None:
"""Register a query-level :data:`StatementFinalizer` for this run.

The **query-level seam**: an expander is node-local — its return value
replaces only its own node — so a target that must rewrite the *enclosing*
statement (for example to project internal helper columns out of a
surfacing ``SELECT *``) registers a finalizer here instead.
:func:`expand_operators` applies every registered finalizer to the
statement, in registration order, *after* all node-local replacements
complete; each receives the current statement root and returns the
(possibly new) root.

A finalizer's returned root is emitted verbatim — it is **not** re-validated
— so it must not reference columns or relations absent from the projection
it rewrites: a wrapper over an absent column builds without error but fails
at engine runtime. Finalizers registered on a standalone context (one built
outside the pass) are collected but never applied.
"""
self._finalizers.append(finalizer)

def alias(self) -> str:
"""Mint a fresh, query-unique alias with the reserved expander prefix.

@@ -164,13 +206,15 @@ class OperatorExpander(Protocol):
``OperatorExpander`` (it has no ``expand`` method); register one by wrapping
it (see :func:`register`, which accepts either form).

An expander is **node-local**: ``expand(node, ctx) -> exp.Expression`` sees
one operator node and returns the expression that replaces it in place. It
cannot express a whole-query rewrite such as the INTERSECTS IEJoin fold,
which restructures the surrounding query (joins, CTEs) rather than a single
node. That fold is therefore deferred — it would need a separate
query-level mechanism — and is handled by the pre-pass join transformers, not
by an expander.
An expander's **return value** is node-local: ``expand(node, ctx) ->
exp.Expression`` returns the one expression that replaces the operator node in
place. When a target additionally needs to rewrite the *enclosing* statement —
for example to project internal helper columns away from a surfacing
``SELECT *`` — the expander registers a query-level :data:`StatementFinalizer`
via :meth:`ExpansionContext.add_statement_finalizer`, applied to the statement
after every node-local replacement. The INTERSECTS IEJoin whole-query fold is a
separate concern still handled by the pre-pass join transformers, not by an
expander.
"""

def expand(self, node: exp.Expression, ctx: ExpansionContext) -> exp.Expression: ...
@@ -180,6 +224,14 @@ def expand(self, node: exp.Expression, ctx: ExpansionContext) -> exp.Expression:
#: registry stores either an :class:`OperatorExpander` object or one of these.
ExpanderFn = Callable[[exp.Expression, ExpansionContext], exp.Expression]

#: A query-level statement finalizer: ``finalize(root) -> root``. An expander
#: registers one via :meth:`ExpansionContext.add_statement_finalizer` to wrap or
#: rewrite the enclosing statement after every node-local replacement; the pass
#: applies each in registration order and threads the (possibly new) root through.
#: Used, for example, to project internal helper columns out of a surfacing
#: ``SELECT *`` / ``b.*``.
StatementFinalizer = Callable[[exp.Expression], exp.Expression]


def _as_callable(expander: OperatorExpander | ExpanderFn) -> ExpanderFn:
"""Normalize an expander to a plain ``(node, ctx) -> Expression`` callable."""
@@ -524,7 +576,11 @@ def expand_operators(
internal invariant violation (a built-in operator always has at least a
``(generic, op)`` expander) and raises — there is no legacy ``*_sql`` fallback.

The pass mutates and returns *expression* in place.
The pass mutates *expression* in place for the node-local replacements, then
applies any :data:`StatementFinalizer` an expander registered (via
:meth:`ExpansionContext.add_statement_finalizer`) to the statement in
registration order. A finalizer may return a *new* root, so callers must use
the return value rather than assume in-place mutation.

Parameters
----------
@@ -541,12 +597,16 @@ def expand_operators(
Returns
-------
exp.Expression
The same *expression*, with each operator node replaced by its
target-specific expansion.
*expression* with each operator node replaced by its target-specific
expansion, after any registered statement finalizers have run — the same
object when no finalizer replaced the root, otherwise the finalized root.
"""
reg = registry if registry is not None else REGISTRY
operators = _GIQL_OPERATORS
alias_seq = name_sequence(EXPAND_ALIAS_PREFIX)
# Shared across every context this run builds: an expander that must rewrite
# the enclosing statement appends a finalizer here, applied after the walk.
finalizers: list[StatementFinalizer] = []

# Collect first, then mutate: replacing nodes mid-walk is unsafe.
pending: list[tuple[exp.Expression, ExpanderFn]] = []
@@ -591,7 +651,10 @@ def expand_operators(
"valid resolution metadata; pass 1 (resolve_operator_refs) must "
"run first and annotate every operator node."
)
ctx = ExpansionContext(node, resolution, target, tables, alias_seq, registry=reg)
ctx = ExpansionContext(
node, resolution, target, tables, alias_seq, registry=reg,
finalizers=finalizers,
)
replacement = fn(node, ctx)
if not isinstance(replacement, exp.Expression):
raise TypeError(
@@ -601,6 +664,20 @@ def expand_operators(
if replacement is not node:
node.replace(replacement)

    # Apply any query-level finalizers an expander registered, in registration
    # order, once every node-local replacement is in place. A finalizer receives
    # the current root and returns the (possibly new) root to thread forward.
    for finalize in finalizers:
        expression = finalize(expression)
        if not isinstance(expression, exp.Expression):
            # Mirror the node-local return guard: a finalizer that forgets to
            # return the root would otherwise make the pass silently return the
            # non-Expression far from the cause.
            raise TypeError(
                f"statement finalizer {finalize!r} returned "
                f"{type(expression).__name__}, not exp.Expression"
            )

    return expression
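For reviewers, a self-contained toy model of the loop above may help show why callers must use the return value. It is a sketch under stated assumptions: ``Node``, ``apply_finalizers``, ``wrap`` and ``forgets_return`` are hypothetical names, and ``Node`` stands in for ``exp.Expression``.

```python
from typing import Callable, List


class Node:
    """Toy statement root (stand-in for an expression tree node)."""

    def __init__(self, label: str, child: "Node | None" = None) -> None:
        self.label = label
        self.child = child


Finalizer = Callable[[Node], Node]


def apply_finalizers(root: Node, finalizers: List[Finalizer]) -> Node:
    """Thread the root through each finalizer in registration order."""
    for finalize in finalizers:
        root = finalize(root)
        if not isinstance(root, Node):
            # Same guard as the pass: fail at the offending finalizer.
            raise TypeError(
                f"statement finalizer {finalize!r} returned "
                f"{type(root).__name__}, not Node"
            )
    return root


def wrap(root: Node) -> Node:
    """A finalizer that returns a new root wrapping the old one."""
    return Node("wrapper", child=root)


def forgets_return(root: Node):
    """A buggy finalizer: mutates in place and implicitly returns None."""
    root.label = "mutated"


original = Node("select")
final = apply_finalizers(original, [wrap])
```

Here ``final`` is a different object from ``original``, so a caller that kept using ``original`` would silently drop the wrapper. Passing ``forgets_return`` raises ``TypeError`` at the loop instead of returning ``None`` to a later stage.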


8 changes: 8 additions & 0 deletions src/giql/expanders/cluster.py
@@ -136,6 +136,14 @@ def expand_cluster(node: GIQLCluster, ctx: ExpansionContext) -> exp.Expression:
# re-run the pass over the restructured SELECT to expand any sibling pass-3
# operators (spatial predicates, DISTANCE) carried into it. Safe from
# recursion: the CLUSTER node is already replaced by its SUM window. (#144 B1)
#
# The nested pass's return value is intentionally discarded: expand_operators
# returns a new root only when a registered statement finalizer wraps it, and
# the sole finalizer-registering operator (a correlated NEAREST fallback) is
# already a plain join by this re-walk (deepest-first), so this run registers
# none and returns `select` unchanged. A NEAREST fallback whose reserved
# columns are re-surfaced by this enclosing CLUSTER `SELECT *` is a documented
# residual (it is not wrapped), not a lost root.
expand_operators(select, ctx.target, ctx.tables, ctx.registry)
return node

7 changes: 7 additions & 0 deletions src/giql/expanders/merge.py
@@ -97,6 +97,13 @@ def expand_merge(node: GIQLMerge, ctx: ExpansionContext) -> exp.Expression:
# subquery; the originals the pass collected are now unreachable, so re-run the
# pass over the restructured SELECT to expand any sibling pass-3 operators
# carried into it. Safe from recursion: the MERGE is already gone. (#144 B1)
#
# The nested pass's return value is intentionally discarded: expand_operators
# returns a new root only when a registered statement finalizer wraps it, and
# the sole finalizer-registering operator (a correlated NEAREST fallback) is
# already a plain join by this re-walk (deepest-first), so this run registers
# none and returns `select` unchanged. (MERGE's final projection is explicit,
# so a nested NEAREST fallback never surfaces its reserved columns here anyway.)
expand_operators(select, ctx.target, ctx.tables, ctx.registry)
return node
