Skip to content

chore: merge molecular annotation#109

Merged
mrvollger merged 41 commits into
mainfrom
merge-molecular-annotation
May 25, 2026
Merged

chore: merge molecular annotation#109
mrvollger merged 41 commits into
mainfrom
merge-molecular-annotation

Conversation

@cademirch
Copy link
Copy Markdown
Contributor

This is in support of #107. Brings the molecular annotation library (name tbd) into this workspace. Currently, this is purely superficial in that the code lives here but it is not integrated into CI, any release stuff, or otherwise. That will come in follow ups.

mrvollger and others added 30 commits November 4, 2025 09:56
Added detailed specifications and examples for the MA tag format used in molecular annotation.
Added detailed explanations for molecular coordinates, strand information, and quality scores.
…a separate length tag creates a tag set that is ~1% larger.
The quality type indicator in the MA tag spec is now a multi-character
string (e.g., "PQ", "PQQP") instead of a single optional character.
Each character adds one quality value per annotation to the AQ array,
with "P" for phred-scaled and "Q" for linearly-scaled values. This
enables richer per-annotation metadata (e.g., separate confidence and
effect-size scores).

In Rust, QualityType is replaced by QualityScaling (single scale) and
QualitySpec (ordered list of scales). Annotation.quality becomes
Annotation.qualities (Vec<u8>). QualitySpec is constructed via
new(vec![...]), none(), or string parsing ("PQ".parse()). The phred()
and linear() shortcuts are removed in favor of explicit construction.

Python bindings and tests are updated to match, with quality_spec as
a string parameter. The AL tag (separate encoding) is no longer
exposed in Python.
Annotation type identity is keyed on name only. The MA tag's on-disk
format still groups annotations by (name, strand, quality_spec) into
sections, but multiple sections sharing the same name represent one
logical type with strand-split annotations.

Conflicting quality_spec for the same name is now explicitly malformed.
AnnotationType identity is now keyed on name alone. Strand becomes a
property of each individual Annotation, so a single type may contain
annotations on different strands.

AnnotationType::new no longer takes a strand parameter.
AnnotationType::add and Annotation::new now require strand as an
explicit per-annotation argument.

type_signature() now takes the strand for the section being emitted —
strand is a section-level concern at serialization time, not a
type-level property.
- add_annotation_type(name, quality_spec) now merges with existing same-name
  type when quality_spec matches, panics on conflict (use the new
  try_add_annotation_type for fallible behavior).
- add_annotations takes strand as a per-batch parameter; routes through
  try_add_annotation_type so mixed-strand additions accumulate into one type.
- from_tags merges multiple MA sections sharing the same name into one
  in-memory AnnotationType, with per-annotation strand carried over from each
  section header. Conflicting quality_spec for the same name now errors.
- to_ma_string / to_aq_array / to_an_string / to_al_array group annotations
  by (name, strand) into sections via a new emission_sections helper. Empty
  types are dropped from emission. Section order is types in insertion order
  with sub-sections by Strand enum order (Forward, Reverse, Unknown).
- AnnotationInfo.strand now reads from each Annotation rather than the type.

Updates the example and crate docs to the new API.
… and per-annotation strand

Mechanical sweep: every add_annotation_type / add_annotations / Annotation::new
call site moves to the new signatures, with strand pushed onto each
per-annotation .add().

Replaces test_conflicting_annotation_type_error (which asserted same-name +
different-strand was a conflict, now wrong under the new model) with two
focused tests:
  - test_add_annotations_conflicting_quality_spec_error (the new conflict)
  - test_add_annotations_mixed_strand_merges_into_one_type (the new behavior)

Updates test_annotation_type_with_no_annotations to assert empty types drop
from MA emission rather than emitting an empty section, aligning with the
existing MA-tag regex.

Adds tests pinning the new contract:
  - test_add_annotation_type_same_name_returns_existing
  - test_add_annotation_type_conflicting_quality_spec_panics
  - test_strand_lives_on_annotation_not_on_type
  - test_to_ma_string_groups_by_strand_within_type
  - test_from_tags_merges_strand_split_sections_into_one_type
  - test_from_tags_conflicting_quality_spec_for_same_name_errors
  - test_round_trip_strand_split_preserves_on_disk_form
- add_annotations now delegates to the new Rust add_annotations, which
  routes through try_add_annotation_type so mixed-strand additions to the
  same (name, quality_spec) accumulate into one in-memory type. Removes the
  manual conflict check that referenced the now-deleted AnnotationType.strand.
- Replaces test_conflicting_annotation_type (asserted same-name+different-strand
  errors — wrong under the new model) with test_conflicting_quality_spec
  (the new conflict) and test_mixed_strand_same_name_merges (the new behavior).
- Adds pixi configuration to pyproject.toml so `pixi run build` /
  `pixi run test` builds the extension via maturin and runs pytest in one
  step. Lockfile committed.

Verification: 33 python tests pass via `pixi run test`.
adds simple retain method to enable arbitrary filtering of annotations using a closure
feat! solidify rust library
Move MM/ML tag parsing and emission into the MA library. Reads happen
inside MolecularAnnotations::from_record; writes inside to_record.
Downstream consumers (fibertools-rs) no longer need a parallel
MM/ML parser/writer.

Design highlights:
- New per-AnnotationType `encoding` field: Encoding::Ma | Encoding::MmMl
  { skip_flag }. Routes between MA tag set and MM/ML on serialize.
- Spec mod codes ("a", "m", "h", "76792", ...) are AnnotationType.name.
  Skip-base char lives in Annotation.name; strand parsed from MM "+"/"-".
- Multi-mod groups (C+mh,...) split into one type per code on parse;
  emitted as single-mod groups on write (spec sec.1.7 equivalence).
- Skip flag (".", "?") preserved per type.
- Parser only reachable via from_record (no public parse_mm_ml).
  Constructor-only semantics make idempotency a non-concern by design.
- Per-type emission lives on AnnotationType (to_ma_parts, to_mm_ml_parts);
  MolecularAnnotations::to_record is a thin assembler that matches on
  encoding.

API changes:
- Rename existing Encoding (Inline/Separate) to MaEncoding on
  MolecularAnnotations to free the name. Methods: ma_encoding(),
  set_ma_encoding().
- New types: Encoding, SkipFlag, MaParts, MmGroup, MmMlParts.
- New AnnotationType API: encoding (field), set_encoding (panics on
  non-empty type), is_mm_ml(), to_ma_parts, to_mm_ml_parts.
- New MolecularAnnotations API: mm_ml_types(), mm_ml_types_mut().
- from_record now parses MA/AL/AQ/AN AND MM/ML in one pass. Previously
  it only initialized structural fields; MA tag parsing required a
  separate from_tags call.
- to_record now emits both MA tag set and MM/ML.

Cargo: log, regex, lazy_static, bio added as optional deps gated
through the htslib feature.

Tests: 130 lib tests + 13 doc tests passing under --features htslib;
117 + 13 without. Round-trip and isolation tests verify MA and MM/ML
cannot cross-contaminate; cargo clippy --features htslib clean.
…4d47026f4f25'

git-subtree-dir: molecular-annotation
git-subtree-mainline: a5d68f6
git-subtree-split: e096dfb
- Promote the rust crate to molecular-annotation/ (was rust/)
- Move spec markdown + agent instructions into molecular-annotation/spec/
- Update molecular-annotation/python/Cargo.toml path dep to ".."
- Drop stale per-crate Cargo.lock files (workspace lock is authoritative)
- Declare molecular-annotation as a workspace member; exclude its python cdylib
- Add a pointer from the crate README to spec/README.md

Known limitation: `cargo test --workspace` does not currently work because
molecular-annotation's rust-htslib 0.47 dev-dep conflicts with fibertools-rs's
0.46 via shared hts-sys bindings. Run `cargo test -p <crate>` separately
until the integration branch unifies htslib versions.
The workspace now includes molecular-annotation as a member, but its
rust-htslib 0.47 dev-dep conflicts with fibertools-rs's 0.46 via shared
hts-sys bindings, and its source has pre-existing rustfmt drift. Pin
fmt/clippy/test to fibertools-rs explicitly until the integration branch
unifies htslib versions and reformats molecular-annotation.
@cademirch cademirch marked this pull request as ready for review May 25, 2026 19:28
@cademirch cademirch requested a review from mrvollger May 25, 2026 19:28
@mrvollger mrvollger merged commit ca24525 into main May 25, 2026
8 of 9 checks passed
cademirch added a commit to cademirch/fibertools-rs that referenced this pull request May 29, 2026
Upstream main (PR fiberseq#109) vendored Molecular-annotation-spec as a
workspace member at ./molecular-annotation; switch from the git
dependency to a path dependency on the vendored copy.

Introduce [workspace.dependencies] pinning rust-htslib at 0.46.
Both fibertools-rs and molecular-annotation now consume it via
`rust-htslib.workspace = true`. The previous mismatch (root at
0.47, vendored MA also 0.47, upstream root at 0.46) had htslib
duplicated in the dep graph with incompatible types across the
crate boundary.
cademirch added a commit to cademirch/fibertools-rs that referenced this pull request Jun 2, 2026
Upstream main (PR fiberseq#109) vendored Molecular-annotation-spec as a
workspace member at ./molecular-annotation; switch from the git
dependency to a path dependency on the vendored copy.

Introduce [workspace.dependencies] pinning rust-htslib at 0.46.
Both fibertools-rs and molecular-annotation now consume it via
`rust-htslib.workspace = true`. The previous mismatch (root at
0.47, vendored MA also 0.47, upstream root at 0.46) had htslib
duplicated in the dep graph with incompatible types across the
crate boundary.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants