Skip to content

Add a portable coverage-sweep strategy to the default INTERSECTS expander for point-vs-interval SEMI/ANTI masking #170

Description

@conradbzura

Description

Add a coverage-window sweep as a shape-conditional strategy inside the default GenericTarget INTERSECTS expander, for the point-vs-interval SEMI/ANTI (filtering / masking) class. This is the filtering-side counterpart to #167 (which handles the combining inner/outer default via the naive predicate); together they make the default INTERSECTS expander shape-aware across all join kinds.

For a point-like left A filtered against an interval set B (canonical 0-based half-open), the strategy emits pure standard SQL:

  • merge endpoints into one event stream — A points (delta 0), B starts (+1), B ends (-1);
  • running coverage SUM(delta) OVER (PARTITION BY chrom ORDER BY pos, <half-open tie-break>);
  • keep A rows where cov = 0 (ANTI) or cov > 0 (SEMI).

Reuse DISJOIN's breakpoint / half-open logic (expanders/disjoin.py, _build_disjoin_sql) for the tie-break so a point at a B start is covered and at a B end is not.

Shape gate. Fire the sweep only when (a) the join is a filtering shape — SEMI/ANTI JOIN, or a provably anti/semi spelling (NOT EXISTS / EXISTS, LEFT JOIN … IS NULL) — and (b) the left side is point-like (the genomic pseudo-column resolves to a single position / unit interval, or a point literal). Otherwise fall back to the naive SEMI/ANTI predicate. Interval-vs-interval SEMI/ANTI is not a single-window coverage lookup and uses the naive predicate.

Masked-aggregate fusion. When the filtering join feeds a downstream GROUP BY, fuse the coverage predicate into the aggregate (gate the GROUP BY by cov = 0 / > 0) so mask + tally run in one sorted pass with no materialized filtered relation.

This issue also provides the naive SEMI/ANTI predicate as the general filtering default — generalizing #167's naive-predicate technique to SEMI/ANTI (left-only output) — used as the fallback when the sweep's shape gate does not fire.

Motivation

The GenericTarget default should hold the best robustly-portable strategy per query shape. For point-vs-interval SEMI/ANTI (SNPs/variants/bases ANTI/SEMI JOIN an interval set — masking against a blacklist, keeping variants overlapping exons, excluding dbSNP sites), the naive predicate is portable in correctness but not in performance: on an engine without a range-join optimizer it degrades to a per-chromosome nested loop O(n·m).

The coverage sweep is:

  • O((n+m) log(n+m)) on any engine — sort + window are universally, efficiently implemented (SQL:2003), with no reliance on the engine's range-join / anti-join optimizer;
  • free of data-dependent blowup — each interval is exactly 2 events, each point 1 event (no row inflation, unlike binned);
  • engine-agnostic — no engine-specific features (unlike IEJoin).

So it is the correct default for this shape: it gives every target — DuckDB, DataFusion, or a plain SQL engine — a good masking plan out of the box, and it is the engine-independent fix for the original masking slowness (which came from IEJoin coercion, removed by #167).

Expected outcome

Relationships

Relevant code

  • src/giql/expanders/intersects.py_column_join, _expand_spatial_op, expand_spatial_set (SEMI/ANTI filtering predicate rendering; the sweep is the point-masking specialization)
  • src/giql/expander.pyExpanderRegistry, ExpansionContext (Target + Capabilities), has_override
  • src/giql/expanders/disjoin.py_build_disjoin_sql breakpoint / coverage-filter / half-open logic to reuse
  • src/giql/resolver.py — genomic pseudo-column resolution (point-likeness detection for the shape gate)
  • src/giql/targets.pyGenericTarget, Capabilities

Out of scope

Metadata

Metadata

Assignees

No one assigned

    Labels

    featureNew feature or capability

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions