Description
Add a coverage-window sweep as a shape-conditional strategy inside the default GenericTarget INTERSECTS expander, for the point-vs-interval SEMI/ANTI (filtering / masking) class. This is the filtering-side counterpart to #167 (which handles the combining inner/outer default via the naive predicate); together they make the default INTERSECTS expander shape-aware across all join kinds.
For a point-like left A filtered against an interval set B (canonical 0-based half-open), the strategy emits pure standard SQL:
- merge endpoints into one event stream —
A points (delta 0), B starts (+1), B ends (-1);
- running coverage
SUM(delta) OVER (PARTITION BY chrom ORDER BY pos, <half-open tie-break>);
- keep
A rows where cov = 0 (ANTI) or cov > 0 (SEMI).
Reuse DISJOIN's breakpoint / half-open logic (expanders/disjoin.py, _build_disjoin_sql) for the tie-break so a point at a B start is covered and at a B end is not.
Shape gate. Fire the sweep only when (a) the join is a filtering shape — SEMI/ANTI JOIN, or a provably anti/semi spelling (NOT EXISTS / EXISTS, LEFT JOIN … IS NULL) — and (b) the left side is point-like (the genomic pseudo-column resolves to a single position / unit interval, or a point literal). Otherwise fall back to the naive SEMI/ANTI predicate. Interval-vs-interval SEMI/ANTI is not a single-window coverage lookup and uses the naive predicate.
Masked-aggregate fusion. When the filtering join feeds a downstream GROUP BY, fuse the coverage predicate into the aggregate (gate the GROUP BY by cov = 0 / > 0) so mask + tally run in one sorted pass with no materialized filtered relation.
This issue also provides the naive SEMI/ANTI predicate as the general filtering default — generalizing #167's naive-predicate technique to SEMI/ANTI (left-only output) — used as the fallback when the sweep's shape gate does not fire.
Motivation
The GenericTarget default should hold the best robustly-portable strategy per query shape. For point-vs-interval SEMI/ANTI (SNPs/variants/bases ANTI/SEMI JOIN an interval set — masking against a blacklist, keeping variants overlapping exons, excluding dbSNP sites), the naive predicate is portable in correctness but not in performance: on an engine without a range-join optimizer it degrades to a per-chromosome nested loop O(n·m).
The coverage sweep is:
O((n+m) log(n+m)) on any engine — sort + window are universally, efficiently implemented (SQL:2003), with no reliance on the engine's range-join / anti-join optimizer;
- free of data-dependent blowup — each interval is exactly 2 events, each point 1 event (no row inflation, unlike binned);
- engine-agnostic — no engine-specific features (unlike IEJoin).
So it is the correct default for this shape: it gives every target — DuckDB, DataFusion, or a plain SQL engine — a good masking plan out of the box, and it is the engine-independent fix for the original masking slowness (which came from IEJoin coercion, removed by #167).
Expected outcome
Relationships
Relevant code
src/giql/expanders/intersects.py — _column_join, _expand_spatial_op, expand_spatial_set (SEMI/ANTI filtering predicate rendering; the sweep is the point-masking specialization)
src/giql/expander.py — ExpanderRegistry, ExpansionContext (Target + Capabilities), has_override
src/giql/expanders/disjoin.py — _build_disjoin_sql breakpoint / coverage-filter / half-open logic to reuse
src/giql/resolver.py — genomic pseudo-column resolution (point-likeness detection for the shape gate)
src/giql/targets.py — GenericTarget, Capabilities
Out of scope
Description
Add a coverage-window sweep as a shape-conditional strategy inside the default
GenericTargetINTERSECTS expander, for the point-vs-interval SEMI/ANTI (filtering / masking) class. This is the filtering-side counterpart to #167 (which handles the combining inner/outer default via the naive predicate); together they make the default INTERSECTS expander shape-aware across all join kinds.For a point-like left
Afiltered against an interval setB(canonical 0-based half-open), the strategy emits pure standard SQL:Apoints (delta0),Bstarts (+1),Bends (-1);SUM(delta) OVER (PARTITION BY chrom ORDER BY pos, <half-open tie-break>);Arows wherecov = 0(ANTI) orcov > 0(SEMI).Reuse DISJOIN's breakpoint / half-open logic (
expanders/disjoin.py,_build_disjoin_sql) for the tie-break so a point at aBstart is covered and at aBend is not.Shape gate. Fire the sweep only when (a) the join is a filtering shape —
SEMI/ANTI JOIN, or a provably anti/semi spelling (NOT EXISTS/EXISTS,LEFT JOIN … IS NULL) — and (b) the left side is point-like (the genomic pseudo-column resolves to a single position / unit interval, or a point literal). Otherwise fall back to the naive SEMI/ANTI predicate. Interval-vs-interval SEMI/ANTI is not a single-window coverage lookup and uses the naive predicate.Masked-aggregate fusion. When the filtering join feeds a downstream
GROUP BY, fuse the coverage predicate into the aggregate (gate theGROUP BYbycov = 0/> 0) so mask + tally run in one sorted pass with no materialized filtered relation.This issue also provides the naive SEMI/ANTI predicate as the general filtering default — generalizing #167's naive-predicate technique to SEMI/ANTI (left-only output) — used as the fallback when the sweep's shape gate does not fire.
Motivation
The
GenericTargetdefault should hold the best robustly-portable strategy per query shape. For point-vs-interval SEMI/ANTI (SNPs/variants/basesANTI/SEMI JOINan interval set — masking against a blacklist, keeping variants overlapping exons, excluding dbSNP sites), the naive predicate is portable in correctness but not in performance: on an engine without a range-join optimizer it degrades to a per-chromosome nested loopO(n·m).The coverage sweep is:
O((n+m) log(n+m))on any engine — sort + window are universally, efficiently implemented (SQL:2003), with no reliance on the engine's range-join / anti-join optimizer;So it is the correct default for this shape: it gives every target — DuckDB, DataFusion, or a plain SQL engine — a good masking plan out of the box, and it is the engine-independent fix for the original masking slowness (which came from IEJoin coercion, removed by #167).
Expected outcome
GenericTargetemits the portable coverage-window sweep (UNION ALL+SUM() OVER+ filter), not the naive per-row range predicate.ANTIkeepscov = 0,SEMIkeepscov > 0; half-open boundaries correct (a point at aBstart is covered, at aBend is not), verified against a real engine (oracle / integration test).GROUP BY) fuses into a single sorted pass.(target, Intersects)override (e.g. Promote the DuckDB IEJoin INTERSECTS implementation to a registered (DuckDBTarget, Intersects) override expander #169) still supersedes it.Relationships
_column_joinmachinery, and the registered-default-expander pattern. The sweep is a shape-conditional strategy within the same default expander, with the naive predicate as its fallback.Relevant code
src/giql/expanders/intersects.py—_column_join,_expand_spatial_op,expand_spatial_set(SEMI/ANTI filtering predicate rendering; the sweep is the point-masking specialization)src/giql/expander.py—ExpanderRegistry,ExpansionContext(Target + Capabilities),has_overridesrc/giql/expanders/disjoin.py—_build_disjoin_sqlbreakpoint / coverage-filter / half-open logic to reusesrc/giql/resolver.py— genomic pseudo-column resolution (point-likeness detection for the shape gate)src/giql/targets.py—GenericTarget,CapabilitiesOut of scope