Problem
When a user's query needs to reference a specific element of a Number=R or Number=A field, the filter expression language has no syntax for it.
Example: "variants with exactly 20 alt reads" requires FORMAT/AD[1] == 20 (the second element of the allelic depths field). The current expression language only supports whole-field comparisons like FORMAT/AD == 20, which uses any-element semantics and matches if any allele depth equals 20.
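The difference between the two semantics can be sketched in a few lines of Python. This is only an illustration: the helper names `any_element_eq` and `indexed_eq` are hypothetical and are not part of the tool, and the AD values are made up.

```python
def any_element_eq(values, target):
    """Whole-field comparison: true if ANY element equals target."""
    return any(v == target for v in values)


def indexed_eq(values, index, target):
    """Per-element comparison: true only if values[index] equals target."""
    return index < len(values) and values[index] == target


# Made-up allelic depths for one sample: [ref_depth, alt_depth]
ad = [20, 3]

# FORMAT/AD == 20 matches because the ref depth happens to be 20
print(any_element_eq(ad, 20))

# FORMAT/AD[1] == 20 would not match: the alt depth is 3
print(indexed_eq(ad, 1, 20))
```

A record with 20 ref reads and 3 alt reads therefore passes the whole-field filter even though it does not satisfy "exactly 20 alt reads".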
The LLM currently handles this in one of two ways:
- Produces a misleading expression (e.g. FORMAT/DP == 20) that approximates the intent but matches different records
- Gates at low confidence with a caveat. This is the correct behavior, but the user has no -e workaround either
Failing queries from GiAB dogfood
- "variants with exactly 20 reads supporting the alt" → FORMAT/AD has no indexing; expression matches total depth, not alt depth
- "biallelic SNPs" → cannot combine INFO/varType value check (unknown values) with genotype shape
Possible fixes
- Add array indexing to the filter expression language: FORMAT/AD[1] == 20, INFO/AF[0] < 0.01. This is the proper fix but requires parser + evaluator changes. Target v0.4.
- Document the limitation in --ask examples: explain in filter.mdx that per-element queries need -e with bcftools-style expressions.
- Teach the LLM to always gate on array-indexing queries: add a rule to the system prompt that sets confidence < 0.5 whenever the query requires indexing into a multi-value field. Already partially handled by calibration rules added in v0.3.0-alpha.3.
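As a rough sketch of what the first option involves, the Python below parses and evaluates a single comparison with an optional index. It is not the project's parser or evaluator. The grammar (`FIELD[INDEX] OP NUMBER`), the flat single-sample `record` dict, and the function names `parse` and `evaluate` are all assumptions made for illustration.

```python
import operator
import re

# Hypothetical grammar: FIELD[INDEX] OP NUMBER, e.g. "FORMAT/AD[1] == 20".
# The "[INDEX]" part is optional; without it, any-element semantics apply.
_EXPR = re.compile(
    r"^\s*(?P<field>[A-Za-z_]+/[A-Za-z_0-9.]+)"
    r"(?:\[(?P<index>\d+)\])?"
    r"\s*(?P<op>==|!=|<=|>=|<|>)\s*"
    r"(?P<value>-?\d+(?:\.\d+)?)\s*$"
)

_OPS = {
    "==": operator.eq, "!=": operator.ne,
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
}


def parse(expr):
    """Split an expression into (field, index or None, op, value)."""
    m = _EXPR.match(expr)
    if m is None:
        raise ValueError(f"unsupported expression: {expr!r}")
    index = m.group("index")
    return (
        m.group("field"),
        int(index) if index is not None else None,
        m.group("op"),
        float(m.group("value")),
    )


def evaluate(expr, record):
    """Evaluate expr against a flat dict such as {"FORMAT/AD": [20, 3]}."""
    field, index, op, value = parse(expr)
    values = record.get(field)
    if values is None:
        return False
    if not isinstance(values, (list, tuple)):
        values = [values]
    compare = _OPS[op]
    if index is None:
        # Existing behavior: match if any element satisfies the comparison.
        return any(v is not None and compare(v, value) for v in values)
    if index >= len(values) or values[index] is None:
        # Out-of-range or missing element: treat as no match (an assumption).
        return False
    return compare(values[index], value)


record = {"FORMAT/AD": [20, 3], "INFO/AF": [0.005]}
print(evaluate("FORMAT/AD == 20", record))
print(evaluate("FORMAT/AD[1] == 20", record))
print(evaluate("INFO/AF[0] < 0.01", record))
```

A real implementation would also need to decide how an index interacts with multiple samples, and what an out-of-range index means; this sketch simply treats it as a non-match.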
Workaround (current)
Use bcftools view for queries that need element-level access. In bcftools expressions, FORMAT/AD[0:1] selects the second AD value of the first sample:
bcftools view -i 'FORMAT/AD[0:1] == 20' input.vcf
Related
docs/known_differences.md §3: schema-based grounding limitations
drafts/phase3-dogfood-log.md: GiAB HG001 session analysis