36 changes: 20 additions & 16 deletions docs/transpilation/schema-mapping.rst
@@ -175,24 +175,28 @@ If your data uses 1-based coordinates (like VCF or GFF), configure the

.. note::

**Non-canonical encodings currently require a DuckDB-compatible engine.**
**Non-canonical encodings emit a capability-driven canonicalization wrapper.**
When a table declares an encoding other than the default 0-based half-open
(for example ``coordinate_system="1based"`` or ``interval_type="closed"``),
GIQL canonicalizes its coordinates by wrapping the relation in a hidden CTE
that uses ``SELECT * REPLACE (...)`` syntax. That syntax is supported by
DuckDB, BigQuery, Snowflake, and ClickHouse, but **not** by PostgreSQL,
SQLite, or DataFusion. Tables in the default 0-based half-open encoding are
unaffected -- they take an identity fast path that emits portable SQL.

To target a non-``REPLACE`` engine today, store your data in 0-based
half-open form, or convert it explicitly in a CTE and reference that CTE
(which GIQL treats as already canonical). Such a CTE -- and any CTE or
subquery passed as an operator reference -- must project the canonical
``chrom`` / ``start`` / ``end`` columns; GIQL validates this contract at
transpile time and raises a ``ValueError`` naming the missing column(s)
rather than emitting SQL that fails with an engine ``column not found``
error. Making canonicalization emit portable SQL on every engine is tracked
in `#132 <https://github.com/abdenlab/giql/issues/132>`_.
GIQL canonicalizes its coordinates by wrapping the relation in a hidden CTE.
The wrapper's projection form is chosen from the target's capabilities: the
``"duckdb"`` target emits ``SELECT * REPLACE (...)`` (also supported by
BigQuery, Snowflake, and ClickHouse), while the generic (``dialect=None``)
and ``"datafusion"`` targets emit the portable ``SELECT * EXCEPT (start, end),
<start>, <end>`` form. The ``* EXCEPT`` form runs on ``* EXCEPT``-capable
engines (the DataFusion family) but is **not** SQL-92 and is **not**
DuckDB-runnable; it is row-equivalent to the ``* REPLACE`` form but re-appends
the recomputed interval columns at the end of the projection. Tables in the
default 0-based half-open encoding are unaffected -- they take an identity fast
path that emits portable SQL on every target.

Neither form is SQL-92. To target a strict SQL-92 engine (PostgreSQL, SQLite),
store your data in 0-based half-open form, or convert it explicitly in a CTE
and reference that CTE (which GIQL treats as already canonical). Such a CTE --
and any CTE or subquery passed as an operator reference -- must project the
canonical ``chrom`` / ``start`` / ``end`` columns; GIQL validates this contract
at transpile time and raises a ``ValueError`` naming the missing column(s)
rather than emitting SQL that fails with an engine ``column not found`` error.
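
The transpile-time contract described in the note can be pictured with a small stand-alone check. This is a sketch only, not GIQL's implementation: `require_canonical_columns` and the wording of its message are hypothetical, and the default `chrom` / `start` / `end` column names are assumed.

```python
# Hypothetical helper: illustrates the contract, not GIQL's actual code.
CANONICAL_COLUMNS = ("chrom", "start", "end")


def require_canonical_columns(name: str, projected: list[str]) -> None:
    """Raise ValueError naming every canonical column the CTE fails to project."""
    missing = [col for col in CANONICAL_COLUMNS if col not in projected]
    if missing:
        raise ValueError(
            f"CTE {name!r} must project {', '.join(CANONICAL_COLUMNS)}; "
            f"missing: {', '.join(missing)}"
        )


# An explicit conversion CTE that keeps all three columns passes the check.
require_canonical_columns("features_0based", ["chrom", "start", "end", "score"])

# One that drops ``end`` fails before any SQL reaches the engine.
try:
    require_canonical_columns("broken", ["chrom", "start"])
except ValueError as err:
    message = str(err)
```

Failing at transpile time keeps the error in terms of the GIQL contract instead of surfacing later as an engine-specific ``column not found`` message.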

Working with Point Features
~~~~~~~~~~~~~~~~~~~~~~~~~~~
128 changes: 83 additions & 45 deletions src/giql/canonicalizer.py
@@ -34,31 +34,44 @@
tradeoff the epic calls out (only synthesize a wrapper when canonicalization
actually changes columns).

Engine portability (known limitation)
-------------------------------------
The wrapper projection uses ``SELECT * REPLACE (...)`` to canonicalize the
interval columns in place while passing every other source column through
untouched (the registry declares only the genomic columns, so an explicit
full-column projection is not available). ``* REPLACE`` is supported by DuckDB,
BigQuery, Snowflake, and ClickHouse, but **not** by PostgreSQL, SQLite, or
DataFusion — so a non-canonical encoding currently transpiles to
engine-incompatible SQL on those targets. Identity-encoded (default 0-based
half-open) relations are unaffected: they skip wrapping entirely and emit
portable SQL. Making the emit strategy dialect-aware (an explicit portable
projection when the target lacks ``REPLACE`` or the full schema is declared) is
tracked in https://github.com/abdenlab/giql/issues/132.
Engine portability (capability-driven, issue #145)
---------------------------------------------------
The wrapper projection canonicalizes the interval columns while passing every
other source column through untouched (the registry declares only the genomic
columns, so an explicit full-column projection is not available). The emit
strategy is chosen from the active target's :class:`~giql.targets.Capabilities`,
following the DISJOIN passthrough's precedent (#143) — the same two emit forms,
though this wrapper additionally accepts ``capabilities=None`` (a direct caller
default, see :func:`canonicalize_coordinates`):

* ``SELECT * REPLACE (...)`` when ``capabilities.supports_star_replace`` holds
(DuckDB / BigQuery / Snowflake / ClickHouse) — substitutes start/end in place,
preserving source column order;
* the portable ``SELECT * EXCEPT (start, end), <start>, <end>`` form otherwise
(the generic baseline / DataFusion family), which every ``* EXCEPT``-capable
engine plans. This form is row-equivalent but **not column-order-equivalent**:
``* EXCEPT`` drops the interval columns and re-appends the recomputed ones at
the end of the projection. It is also not SQL-92 and not DuckDB-runnable.

Identity-encoded (default 0-based half-open) relations are unaffected either way:
they skip wrapping entirely and emit portable SQL. The capability is threaded in
from :func:`giql.transpile.transpile` via the active target; a direct caller that
passes no capabilities defaults to the ``* REPLACE`` form (the historical
behavior). This finalizes the dialect-aware emit strategy formerly tracked by
https://github.com/abdenlab/giql/issues/132.

Gating (epic #114, step 6)
--------------------------
The pass is gated per operator by a ``GIQL_CANONICALIZE`` class attribute on the
operator's expression class. An operator opts in by setting
``GIQL_CANONICALIZE = True``; absent or ``False`` (the default for every operator
as of this issue) the pass ignores it entirely. The operator port issues — #122
(DISJOIN) and #123 (NEAREST / DISTANCE / predicates) — flip these flags as each
operator's emitter is moved off in-emitter canonicalization
(:mod:`giql.canonical`) and onto this pass's output. **With every flag off the
pass is a strict no-op and the emitted SQL is byte-identical**, so the existing
suite is the migration oracle.
``GIQL_CANONICALIZE = True``; absent or ``False`` the pass ignores it entirely.
The operator port issues — #122 (DISJOIN) and #123 (NEAREST / DISTANCE /
predicates) — flipped these flags as each operator's emitter moved off in-emitter
canonicalization (:mod:`giql.canonical`) and onto this pass's output. As of those
ports every migrated operator opts in by default, so the pass actively
synthesizes wrappers; an operator can still toggle its flag off (a test or a
not-yet-migrated operator), in which case the pass leaves it untouched and the
emitted SQL is byte-identical for it.
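
The per-operator gate described above amounts to a class-attribute lookup with a ``False`` default. The sketch below assumes a plain ``getattr`` on the node's class; how the pass actually reads the flag is not shown in this hunk, and the two operator classes are hypothetical stand-ins.

```python
# Hypothetical operator classes; only the attribute name comes from the text.
class Disjoin:
    GIQL_CANONICALIZE = True  # opted in


class LegacyOperator:
    pass  # no attribute: treated as opted out


def opts_in(node: object) -> bool:
    """Return True when the node's class sets GIQL_CANONICALIZE to a truthy value."""
    return bool(getattr(type(node), "GIQL_CANONICALIZE", False))


# Only the opted-in operator is selected for wrapper synthesis.
opted = [type(n).__name__ for n in (Disjoin(), LegacyOperator()) if opts_in(n)]
```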

De-canonicalization hook
-------------------------
@@ -124,6 +137,7 @@
from giql.resolver import ResolvedInterval
from giql.resolver import ResolvedRef
from giql.table import Table
from giql.targets import Capabilities

__all__ = [
"CANON_PREFIX",
@@ -149,7 +163,9 @@
)


def canonicalize_coordinates(expression: exp.Expression) -> exp.Expression:
def canonicalize_coordinates(
expression: exp.Expression, capabilities: Capabilities | None = None
) -> exp.Expression:
"""Synthesize canonical wrapper CTEs for non-canonical operator operands.

Walks *expression* for opted-in GIQL operators (those whose expression class
@@ -159,20 +175,28 @@ def canonicalize_coordinates(expression: exp.Expression) -> exp.Expression:
half-open coordinates and rewrites the slot (AST node + ``ResolvedRef``
metadata) to point at the canonical CTE.

The pass mutates and returns *expression* in place. **When no operator opts
in — the state as of issue #121 — it is a strict no-op: no node is touched
and the emitted SQL is byte-identical.**
The pass mutates and returns *expression* in place. For an operator whose
``GIQL_CANONICALIZE`` flag is off, or whose operands are already in the
canonical 0-based half-open encoding, it touches nothing and leaves the
emitted SQL byte-identical.

Parameters
----------
expression : exp.Expression
The pass-1-annotated AST.
capabilities : Capabilities | None
The active target's capabilities, used to choose the wrapper projection's
emit strategy (``* REPLACE`` vs the portable ``* EXCEPT`` form — see the
module docstring). :func:`giql.transpile.transpile` passes the active
target's capabilities; ``None`` (a direct caller) defaults to the
``* REPLACE`` form, preserving the historical behavior.

Returns
-------
exp.Expression
The same *expression*, with canonical wrapper CTEs inserted and migrated
operator slots rewritten (none, while every flag is off).
The same *expression*, with canonical wrapper CTEs inserted and the
opted-in operator slots that reference non-canonical tables rewritten to
point at them.
"""
# Column / interval operands (DISTANCE, predicates, NEAREST's non-table
# reference) canonicalize their metadata in place; this is independent of the
@@ -192,7 +216,7 @@ def canonicalize_coordinates(expression: exp.Expression) -> exp.Expression:
new_ctes: list[exp.CTE] = []

for node, arg, ref in targets:
body = _canonical_projection(ref)
body = _canonical_projection(ref, capabilities)
body_sql = body.sql()
name = body_to_name.get(body_sql)
if name is None:
@@ -394,15 +418,28 @@ def _fresh_name(next_name, taken: set[str]) -> str:
return candidate


def _canonical_projection(ref: ResolvedRef) -> exp.Select:
def _canonical_projection(
ref: ResolvedRef, capabilities: Capabilities | None
) -> exp.Select:
"""Build the ``SELECT`` body that projects *ref*'s table to canonical form.

The projection is a **full-row passthrough**: ``SELECT *`` keeps every
physical column of the source relation, and a star ``REPLACE`` rewrites only
the two interval columns — ``start`` / ``end``, under their original physical
names — with the :mod:`giql.canonical` arithmetic for the table's declared
encoding. ``chrom`` and every non-interval column flow through the star
untouched.
physical column of the source relation, and only the two interval columns —
``start`` / ``end``, under their original physical names — are rewritten with
the :mod:`giql.canonical` arithmetic for the table's declared encoding.
``chrom`` and every non-interval column flow through the star untouched.

The emit strategy is chosen from *capabilities* (issue #145), following the
precedent of :func:`giql.expanders.disjoin._disjoin_passthrough` — the same
two emit forms, with an added ``capabilities is None`` arm for direct callers
(the passthrough always receives a concrete ``ctx.capabilities``):

* ``SELECT * REPLACE (...)`` when ``supports_star_replace`` holds (or no
capabilities are supplied) — substitutes the interval columns in place,
preserving source column order;
* the portable ``SELECT * EXCEPT (start, end), <start>, <end>`` form otherwise
— drops the interval columns from the star and re-appends them recomputed.
Row-equivalent but not column-order-equivalent, and not DuckDB-runnable.

The full row (rather than a bare ``chrom`` / ``start`` / ``end`` triple) is
required by table-function operators whose final projection passes the whole
@@ -417,19 +454,20 @@ def _canonical_projection(ref: ResolvedRef) -> exp.Select:
# Quote the interval identifiers: the canonical column names are physical and
# routinely reserved words (the default genomic layout's ``start`` / ``end``),
# so the executed wrapper must quote them.
star = exp.Star(
replace=[
exp.alias_(
_canonical_start_expr(start, table),
exp.to_identifier(start, quoted=True),
),
exp.alias_(
_canonical_end_expr(end, table),
exp.to_identifier(end, quoted=True),
),
]
start_id = exp.to_identifier(start, quoted=True)
end_id = exp.to_identifier(end, quoted=True)
start_proj = exp.alias_(_canonical_start_expr(start, table), start_id)
end_proj = exp.alias_(_canonical_end_expr(end, table), end_id)
if capabilities is None or capabilities.supports_star_replace:
star = exp.Star(replace=[start_proj, end_proj])
return exp.Select(expressions=[star]).from_(exp.to_table(relation))
# Portable form: drop the interval columns from the star and re-project them
# recomputed under their own names. EXCEPT removes them from the row; the
# trailing projections add them back in canonical form.
star = exp.Star(except_=[exp.column(start_id), exp.column(end_id)])
return exp.Select(expressions=[star, start_proj, end_proj]).from_(
exp.to_table(relation)
)
return exp.Select(expressions=[star]).from_(exp.to_table(relation))
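
The capability branch in this hunk can be sketched as plain string building. Everything here is illustrative: `Caps` is a hypothetical one-field stand-in for `giql.targets.Capabilities`, `wrapper_select` is not a GIQL function, the `"start" - 1` arithmetic is the usual 1-based closed to 0-based half-open shift and is assumed (not taken from `giql.canonical`), and the exact SQL text sqlglot renders may differ.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Caps:
    """Hypothetical stand-in for giql.targets.Capabilities (one field only)."""

    supports_star_replace: bool


def wrapper_select(
    relation: str,
    start: str,
    end: str,
    start_sql: str,
    end_sql: str,
    caps: Caps | None = None,
) -> str:
    """Render the wrapper body as a string, mirroring the branch in the diff."""
    start_proj = f'{start_sql} AS "{start}"'
    end_proj = f'{end_sql} AS "{end}"'
    if caps is None or caps.supports_star_replace:
        # In-place substitution keeps the source column order.
        return f"SELECT * REPLACE ({start_proj}, {end_proj}) FROM {relation}"
    # Portable form: drop the interval columns, then re-append them recomputed.
    return (
        f'SELECT * EXCEPT ("{start}", "{end}"), '
        f"{start_proj}, {end_proj} FROM {relation}"
    )


# Assumed arithmetic for a 1-based closed table: shift start down by one.
args = ("features", "start", "end", '"start" - 1', '"end"')
replace_sql = wrapper_select(*args)  # caps=None falls back to * REPLACE
except_sql = wrapper_select(*args, caps=Caps(supports_star_replace=False))
```

Treating `None` the same as a `* REPLACE`-capable target is what keeps direct callers on the historical output while `transpile` opts each target into the form it can run.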


def _canonical_start_expr(start: str, table: Table | None) -> exp.Expression:
20 changes: 17 additions & 3 deletions src/giql/expanders/nearest.py
@@ -44,6 +44,7 @@
from giql.generators.base import BaseGIQLGenerator
from giql.resolver import ResolvedInterval
from giql.resolver import ResolvedRef
from giql.targets import Capabilities
from giql.targets import GenericTarget

#: Reserved column names the window-function fallback synthesizes inside its
@@ -75,6 +76,7 @@ def _distance_and_filters(
table_name: str,
target_ref: ResolvedRef,
ref: ResolvedInterval,
capabilities: Capabilities,
ref_fragments: tuple[str, str, str, str | None] | None = None,
) -> tuple[str, str, list[str], str]:
"""Build the shared distance SQL, the qualified target columns, and WHERE.
@@ -86,6 +88,11 @@ def _distance_and_filters(
``giqlnearest_sql`` emitter exactly. Each form derives its deterministic
ORDER BY tiebreaker from the target columns itself.

``capabilities`` is the active target's :class:`~giql.targets.Capabilities`,
forwarded to :meth:`BaseGIQLGenerator._nearest_passthrough` to choose the
target's de-canonicalization emit form (``* REPLACE`` vs the portable
``* EXCEPT``); both call sites pass ``ctx.capabilities``.

``ref_fragments`` optionally overrides the reference ``(chrom, start, end,
strand)`` SQL fragments. The LATERAL form consumes the resolution's
outer-qualified fragments verbatim; the fallback passes fragments pointing at
@@ -98,7 +105,7 @@ def _distance_and_filters(

output_table = BaseGIQLGenerator._nearest_output_encoding(expression, target_ref)
passthrough = BaseGIQLGenerator._nearest_passthrough(
table_name, target_start, target_end, output_table
table_name, target_start, target_end, output_table, capabilities
)

if ref_fragments is not None:
@@ -189,7 +196,9 @@ def _lateral_form(
_abs_distance_expr,
where_clauses,
passthrough,
) = _distance_and_filters(expression, table_name, target_ref, ref)
) = _distance_and_filters(
expression, table_name, target_ref, ref, ctx.capabilities
)
where_sql = " AND ".join(where_clauses)
# The wrapping level reads the inner row's *bare* column names (the passthrough
# projected ``<target>.*``), so the tiebreaker qualifies them by the wrapper
@@ -358,7 +367,12 @@ def _fallback_form(
where_clauses,
passthrough,
) = _distance_and_filters(
expression, table_name, target_ref, ref, ref_fragments=ref_fragments
expression,
table_name,
target_ref,
ref,
ctx.capabilities,
ref_fragments=ref_fragments,
)

# Surface the reference-key columns so the rewritten join can match each
42 changes: 28 additions & 14 deletions src/giql/generators/base.py
@@ -12,6 +12,7 @@
from giql.resolver import ResolvedRef
from giql.table import Table
from giql.table import Tables
from giql.targets import Capabilities


class BaseGIQLGenerator(Generator):
@@ -71,6 +72,7 @@ def _nearest_passthrough(
target_start: str,
target_end: str,
output_table: Table | None,
capabilities: Capabilities | None = None,
) -> str:
"""Project the target's full row, de-canonicalizing the interval columns.

@@ -80,10 +82,19 @@ def _nearest_passthrough(
through as a plain ``{table_name}.*`` — the byte-identical identity fast
path. When it is non-canonical the interval columns, canonical inside the
``__giql_canon_*`` CTE the target was rewritten to, are de-canonicalized
back into that encoding via a star ``REPLACE`` so the passed-through
interval matches the target's own convention. (Only non-canonical targets
are wrapped, so the ``REPLACE`` appears only where a canonical CTE already
shapes the SQL.)
back into that encoding so the passed-through interval matches the target's
own convention.

The emit strategy is chosen from *capabilities*, following the precedent
of :func:`giql.expanders.disjoin._disjoin_passthrough` (issue #145) — the
same two emit forms, with an added ``capabilities is None`` arm for direct
callers (in production the sole caller always passes ``ctx.capabilities``):

* ``{table_name}.* REPLACE (...)`` when ``supports_star_replace`` holds (or
no capabilities are supplied) — substitutes start/end in place;
* the portable ``{table_name}.* EXCEPT (start, end), <start>, <end>`` form
otherwise (the generic baseline / DataFusion family). Row-equivalent but
not column-order-equivalent, and not DuckDB-runnable.

:param table_name:
The relation the row is selected from (the canon CTE name when wrapped,
@@ -94,9 +105,11 @@ def _nearest_passthrough(
Physical end column name
:param output_table:
The target's declared :class:`~giql.table.Table`, or ``None``
:param capabilities:
The active target's :class:`~giql.targets.Capabilities`; ``None``
defaults to the ``* REPLACE`` form.
:return:
The passthrough projection fragment (``{table_name}.*`` or a star
``REPLACE``)
The passthrough projection fragment
"""
if output_table is None or (
output_table.coordinate_system == "0based"
@@ -105,15 +118,16 @@ def _nearest_passthrough(
return f"{table_name}.*"
pt_start = decanonical_start(f'{table_name}."{target_start}"', output_table)
pt_end = decanonical_end(f'{table_name}."{target_end}"', output_table)
# TODO(#142): this emits an unconditional ``* REPLACE`` (DuckDB-only).
# When DataFusion gains correlated LATERAL, adopt the capability branch the
# DISJOIN expander uses (``giql.expanders.disjoin._disjoin_passthrough``):
# ``* REPLACE`` where ``supports_star_replace`` holds, the portable
# ``* EXCEPT`` form otherwise, so a non-canonical NEAREST passthrough runs
# on the DataFusion family too.
if capabilities is None or capabilities.supports_star_replace:
return (
f"{table_name}.* REPLACE "
f'({pt_start} AS "{target_start}", {pt_end} AS "{target_end}")'
)
# Portable form for engines without ``* REPLACE`` (generic / DataFusion):
# drop the interval columns from the star and re-project them recomputed.
return (
f"{table_name}.* REPLACE "
f'({pt_start} AS "{target_start}", {pt_end} AS "{target_end}")'
f'{table_name}.* EXCEPT ("{target_start}", "{target_end}"), '
f'{pt_start} AS "{target_start}", {pt_end} AS "{target_end}"'
)
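
The "row-equivalent but not column-order-equivalent" caveat can be made concrete by modelling only the column order each form produces. This is a toy model of projection order, not engine behaviour verified against DuckDB or DataFusion; both helper functions and the sample column list are hypothetical.

```python
def replace_order(columns: list[str]) -> list[str]:
    """``* REPLACE`` substitutes values in place, so the order is unchanged."""
    return list(columns)


def except_order(columns: list[str], start: str, end: str) -> list[str]:
    """``* EXCEPT (start, end), <start>, <end>`` re-appends the pair last."""
    kept = [col for col in columns if col not in (start, end)]
    return kept + [start, end]


source = ["chrom", "start", "end", "name", "score"]
in_place = replace_order(source)
re_appended = except_order(source, "start", "end")
```

Both results carry the same set of columns, so a consumer that reads columns by name sees no difference; one that reads by position would see `start` and `end` move to the end on the portable targets.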

@staticmethod