Skip to content

Relocate CLUSTER and MERGE into the operator-expander registry — Closes #144#162

Merged
conradbzura merged 1 commit into
mainfrom
144-relocate-cluster-merge-expanders
Jun 30, 2026
Merged

Relocate CLUSTER and MERGE into the operator-expander registry — Closes #144#162
conradbzura merged 1 commit into
mainfrom
144-relocate-cluster-merge-expanders

Conversation

@conradbzura

@conradbzura conradbzura commented Jun 30, 2026

Copy link
Copy Markdown
Collaborator

Summary

Migrate the last two operators, CLUSTER and MERGE, off the pre-pass transformer chain and onto the ExpandOperators registry (epic #137), so every GIQL operator now dispatches through one mechanism. CLUSTER and MERGE are whole-query rewrites, so each expander navigates to its enclosing SELECT, restructures a detached copy, and transplants the result back onto the original SELECT in place — returning the operator node unchanged (a no-op replace). This root-preserving transplant is required because the canonical form puts the operator at the root SELECT, which has no parent to replace through. The pass's deepest-first walk subsumes the transformers' manual recursion into CTEs and subqueries, and the standalone ClusterTransformer / MergeTransformer classes and their pre-pass calls are removed.

Emitted SQL is semantically identical to the legacy output and now deterministic; the only intended byte difference is the injected lag_calc column order, which the legacy output left hash-seed-dependent. Because the in-place rewrite copies the enclosing WHERE/HAVING into the new inner subquery, the expander re-runs the pass over the restructured SELECT (threading the active registry through ExpansionContext) so a sibling operator carried into that subquery — a spatial predicate or DISTANCE in a filtered cluster/merge — is expanded rather than leaking to the generator.

A few shapes the legacy pipeline accepted only to emit non-executable or invalid SQL now raise a clear ValueError instead: CLUSTER and MERGE in one SELECT, multiple CLUSTER expressions in one SELECT (as multiple MERGE already did), and a CLUSTER/MERGE nested inside a larger projection expression. The MERGE GROUP BY chromosome term is now quoted, so a reserved-word custom chrom column emits valid SQL. No working query is affected.

Two non-blocking follow-ups are deferred: extracting the shared CLUSTER/MERGE toolkit into a neutral module (#163) and resolving genomic columns over a derived-table FROM (#164).

Closes #144

Proposed changes

  • New giql/expanders/cluster.py and giql/expanders/merge.py — generic (GenericTarget) expanders registered for GIQLCluster / GIQLMerge, porting the legacy two-level lag_calc and clustered-aggregation rewrites. MERGE composes CLUSTER's expand_cluster_query. Shared helpers (genomic_columns, transplant, find_projected, extract_stranded, reject_cluster_merge_mix, require_top_level_projection, GenomicColumns) form the cross-operator toolkit.
  • giql/expander.pyExpansionContext now carries the active registry, letting a whole-query rewrite re-enter expand_operators over the SELECT it just restructured.
  • giql/expressions.pyGIQL_EXPAND = True on GIQLCluster / GIQLMerge; they deliberately skip canonicalization and carry an empty resolution.
  • giql/resolver.pyGIQLCluster / GIQLMerge added to the _OPERATORS roster; module scope note reconciled to the post-migration state.
  • giql/transformer.py / giql/transpile.pyClusterTransformer / MergeTransformer and their pre-pass calls removed (−661 lines in transformer.py).
  • docs/dialect/aggregation-operators.rst — document the CLUSTER+MERGE-in-one-SELECT ValueError and the separate-queries workaround.

Test cases

# Test Suite Given When Then Coverage Target
1 TestClusterExpander A CLUSTER nested in a subquery, CTE, or UNION branch Transpiling It expands to the lag_calc form with no leaked operator Pass-walk recursion
2 TestClusterExpander An aliased or bare CLUSTER filtered by a spatial predicate in WHERE Transpiling The predicate copied into lag_calc is itself expanded — no leak B1 regression
3 TestClusterExpander A CLUSTER nested in a projection expression Transpiling It raises a top-level-projection ValueError A16 guard
4 TestClusterExpander Two CLUSTER expressions in one SELECT Transpiling It raises a multiple-CLUSTER ValueError A15 guard
5 TestClusterExpander An explicit-projection CLUSTER under two PYTHONHASHSEED values Transpiling in subprocesses Output is byte-identical Determinism
6 TestMergeExpander An aliased or bare MERGE filtered by a spatial predicate in WHERE Transpiling The copied predicate is expanded — no leak B1 regression
7 TestMergeExpander A MERGE over a Table whose chrom column is a reserved word Transpiling The GROUP BY chrom term is quoted A13 fix
8 TestMergeExpander CLUSTER and MERGE in one SELECT, or multiple/nested MERGE Transpiling It raises the matching ValueError Guard coverage
9 TestExpander* All nine operators migrated to the registry Running the pass No operator remains unmigrated; CLUSTER/MERGE rewrite in place Migration completeness

@conradbzura conradbzura self-assigned this Jun 30, 2026
Migrate the last two operators off the pre-pass transformer chain and onto
the ExpandOperators registry (epic #137), completing the operator migration
so every GIQL operator dispatches through one mechanism.

CLUSTER and MERGE are whole-query rewrites, so each expander navigates to its
enclosing SELECT and mutates it in place, returning the operator node
unchanged (a no-op replace) — the same pattern NEAREST's decorrelated fallback
uses, required because the canonical form puts the operator at the root SELECT,
which has no parent to replace through. The pass's deepest-first walk subsumes
the transformers' manual recursion into CTEs and subqueries. Columns are
derived from the FROM table, so the operators carry an empty resolution and
deliberately skip canonicalization. Emitted SQL is semantically identical to
the legacy output and now deterministic; the only intended byte difference is
the injected lag_calc column order, which the legacy output left
hash-seed-dependent.

Because the in-place rewrite copies the enclosing WHERE/HAVING into the new
inner subquery, the expander re-runs the pass over the restructured SELECT (the
active registry is threaded through ExpansionContext) so a sibling operator
carried into that subquery — a spatial predicate or DISTANCE in a filtered
cluster/merge — is expanded rather than leaking to the generator.

The standalone ClusterTransformer and MergeTransformer classes are removed and
the pre-pass calls are dropped from the transpile pipeline.

Some shapes that the legacy pipeline accepted only to emit non-executable or
invalid SQL now raise a clear ValueError instead: CLUSTER and MERGE in a single
SELECT (in-place mutation cannot express both), multiple CLUSTER expressions in
one SELECT (as multiple MERGE already did), and a CLUSTER or MERGE nested inside
a larger projection expression. The MERGE GROUP BY chromosome term is now quoted
like every other column reference, so a reserved-word custom chrom column emits
valid SQL. No working query is affected.

Synthesized identifiers that do not yet use the reserved __giql_ prefix are
left as-is for a follow-up; renaming them would change emitted SQL.
@conradbzura conradbzura force-pushed the 144-relocate-cluster-merge-expanders branch from 06db98c to 3950b19 Compare June 30, 2026 11:38
@conradbzura conradbzura marked this pull request as ready for review June 30, 2026 14:57
@conradbzura conradbzura merged commit 84379be into main Jun 30, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Relocate CLUSTER and MERGE into the operator-expander registry

1 participant