Context
vcfkit currently skips left-alignment for multi-allelic indel records (records where alts.len() > 1). See docs/known_differences.md.
This matches bcftools norm without -m, but diverges from bcftools norm with joint alignment flags that trigger per-record left-shifting.
Adversarial test: tests/corpus/synthetic/multiallelic_polyA.vcf — a multi-allelic indel at position 9 of a poly-A tract, where both ALTs could theoretically be shifted left. Current behaviour (passthrough) is confirmed against bcftools in normalize_test::diff_multiallelic_polya_matches_bcftools_no_split.
What needs to be done
Implement joint multi-allelic left-alignment per:
Tan, Abecasis, Kang (2015). "Unified representation of genetic variants."
Bioinformatics 31(13):2202–2204. doi:10.1093/bioinformatics/btv112
The algorithm: left-align all ALTs jointly against the reference. Find the leftmost position P such that trim_and_extend(REF, ALT_k, P) is valid for every k simultaneously, then rewrite the record at P.
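As a rough illustration of the joint constraint, here is a minimal sketch of the trim/extend loop from Tan et al. applied to REF and all ALTs at once. The `Record` struct and `left_align_joint` are hypothetical names for this sketch, not vcfkit's actual types, and how the loop maps onto `trim_and_extend` is an assumption. It also assumes non-empty alleles, a REF that matches the reference, and consistent letter case.

```rust
/// Hypothetical minimal record for this sketch (not vcfkit's real type).
#[derive(Debug, Clone, PartialEq)]
struct Record {
    /// 1-based position of the first REF base.
    pos: usize,
    /// alleles[0] is REF, the remaining entries are the ALTs.
    alleles: Vec<Vec<u8>>,
}

/// Jointly left-align and trim every allele of `rec` against `reference`
/// (the bases of one contig, so reference[0] is position 1).
/// Assumes all alleles are non-empty on entry.
fn left_align_joint(reference: &[u8], rec: &mut Record) {
    loop {
        let mut changed = false;

        // Right-trim: drop the last base only if EVERY allele ends with it.
        // At pos == 1 we refuse to trim an allele down to empty, because
        // there is no preceding reference base to extend with.
        let last = rec.alleles[0].last().copied();
        let shared_tail =
            last.is_some() && rec.alleles.iter().all(|a| a.last().copied() == last);
        let safe_to_trim = rec.pos > 1 || rec.alleles.iter().all(|a| a.len() >= 2);
        if shared_tail && safe_to_trim {
            for a in rec.alleles.iter_mut() {
                a.pop();
            }
            changed = true;
        }

        // Extend left: if any allele became empty, prepend the preceding
        // reference base to ALL alleles and move the record one base left.
        if rec.alleles.iter().any(|a| a.is_empty()) {
            let base = reference[rec.pos - 2]; // 0-based index of position pos - 1
            for a in rec.alleles.iter_mut() {
                a.insert(0, base);
            }
            rec.pos -= 1;
            changed = true;
        }

        if !changed {
            break;
        }
    }

    // Left-trim: remove a shared leading base while every allele keeps
    // at least one base afterwards.
    loop {
        if rec.alleles.iter().any(|a| a.len() < 2) {
            break;
        }
        let first = rec.alleles[0][0];
        if !rec.alleles.iter().all(|a| a[0] == first) {
            break;
        }
        for a in rec.alleles.iter_mut() {
            a.remove(0);
        }
        rec.pos += 1;
    }
}
```

On a made-up reference `GCAAAAAAAATC`, a record at position 9 with REF `AA` and ALTs `A`, `AAA` shifts to position 2 as REF `CA`, ALTs `C`, `CAA`. With ALTs `A`, `AT` the record stays at position 9, because the alleles no longer share a trailing base: a single ALT that cannot move pins the whole record.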
Acceptance criteria
- crates/vcfkit-core/src/normalize.rs: remove the alts.len() > 1 shortcut in left_align_record
- diff_multiallelic_polya_matches_bcftools_no_split passes (currently shows known divergence)
- Update docs/known_differences.md to remove this entry
Risk
This changes normalize output for multi-allelic indels that are not yet fully left-aligned. Flag prominently in v0.2 release notes as a behaviour change.