chore: merge molecular annotation#109
Merged
Merged
Conversation
Added detailed specifications and examples for the MA tag format used in molecular annotation.
Added detailed explanations for molecular coordinates, strand information, and quality scores.
…a separate length tag creates a tag set that is ~1% larger.
The quality type indicator in the MA tag spec is now a multi-character
string (e.g., "PQ", "PQQP") instead of a single optional character.
Each character adds one quality value per annotation to the AQ array,
with "P" for phred-scaled and "Q" for linearly-scaled values. This
enables richer per-annotation metadata (e.g., separate confidence and
effect-size scores).
In Rust, QualityType is replaced by QualityScaling (single scale) and
QualitySpec (ordered list of scales). Annotation.quality becomes
Annotation.qualities (Vec<u8>). QualitySpec is constructed via
new(vec![...]), none(), or string parsing ("PQ".parse()). The phred()
and linear() shortcuts are removed in favor of explicit construction.
Python bindings and tests are updated to match, with quality_spec as
a string parameter. The AL tag (separate encoding) is no longer
exposed in Python.
Annotation type identity is keyed on name only. The MA tag's on-disk format still groups annotations by (name, strand, quality_spec) into sections, but multiple sections sharing the same name represent one logical type with strand-split annotations. Conflicting quality_spec for the same name is now explicitly malformed.
AnnotationType identity is now keyed on name alone. Strand becomes a property of each individual Annotation, so a single type may contain annotations on different strands. AnnotationType::new no longer takes a strand parameter. AnnotationType::add and Annotation::new now require strand as an explicit per-annotation argument. type_signature() now takes the strand for the section being emitted — strand is a section-level concern at serialization time, not a type-level property.
- add_annotation_type(name, quality_spec) now merges with existing same-name type when quality_spec matches, panics on conflict (use the new try_add_annotation_type for fallible behavior). - add_annotations takes strand as a per-batch parameter; routes through try_add_annotation_type so mixed-strand additions accumulate into one type. - from_tags merges multiple MA sections sharing the same name into one in-memory AnnotationType, with per-annotation strand carried over from each section header. Conflicting quality_spec for the same name now errors. - to_ma_string / to_aq_array / to_an_string / to_al_array group annotations by (name, strand) into sections via a new emission_sections helper. Empty types are dropped from emission. Section order is types in insertion order with sub-sections by Strand enum order (Forward, Reverse, Unknown). - AnnotationInfo.strand now reads from each Annotation rather than the type. Updates the example and crate docs to the new API.
… and per-annotation strand Mechanical sweep: every add_annotation_type / add_annotations / Annotation::new call site moves to the new signatures, with strand pushed onto each per-annotation .add(). Replaces test_conflicting_annotation_type_error (which asserted same-name + different-strand was a conflict, now wrong under the new model) with two focused tests: - test_add_annotations_conflicting_quality_spec_error (the new conflict) - test_add_annotations_mixed_strand_merges_into_one_type (the new behavior) Updates test_annotation_type_with_no_annotations to assert empty types drop from MA emission rather than emitting an empty section, aligning with the existing MA-tag regex. Adds tests pinning the new contract: - test_add_annotation_type_same_name_returns_existing - test_add_annotation_type_conflicting_quality_spec_panics - test_strand_lives_on_annotation_not_on_type - test_to_ma_string_groups_by_strand_within_type - test_from_tags_merges_strand_split_sections_into_one_type - test_from_tags_conflicting_quality_spec_for_same_name_errors - test_round_trip_strand_split_preserves_on_disk_form
- add_annotations now delegates to the new Rust add_annotations, which routes through try_add_annotation_type so mixed-strand additions to the same (name, quality_spec) accumulate into one in-memory type. Removes the manual conflict check that referenced the now-deleted AnnotationType.strand. - Replaces test_conflicting_annotation_type (asserted same-name+different-strand errors — wrong under the new model) with test_conflicting_quality_spec (the new conflict) and test_mixed_strand_same_name_merges (the new behavior). - Adds pixi configuration to pyproject.toml so `pixi run build` / `pixi run test` builds the extension via maturin and runs pytest in one step. Lockfile committed. Verification: 33 python tests pass via `pixi run test`.
adds simple retain method to enable arbitrary filtering of annotations using a closure
feat! solidify rust library
Move MM/ML tag parsing and emission into the MA library. Reads happen
inside MolecularAnnotations::from_record; writes inside to_record.
Downstream consumers (fibertools-rs) no longer need a parallel
MM/ML parser/writer.
Design highlights:
- New per-AnnotationType `encoding` field: Encoding::Ma | Encoding::MmMl
{ skip_flag }. Routes between MA tag set and MM/ML on serialize.
- Spec mod codes ("a", "m", "h", "76792", ...) are AnnotationType.name.
Skip-base char lives in Annotation.name; strand parsed from MM "+"/"-".
- Multi-mod groups (C+mh,...) split into one type per code on parse;
emitted as single-mod groups on write (spec sec.1.7 equivalence).
- Skip flag (".", "?") preserved per type.
- Parser only reachable via from_record (no public parse_mm_ml).
Constructor-only semantics make idempotency a non-concern by design.
- Per-type emission lives on AnnotationType (to_ma_parts, to_mm_ml_parts);
MolecularAnnotations::to_record is a thin assembler that matches on
encoding.
API changes:
- Rename existing Encoding (Inline/Separate) to MaEncoding on
MolecularAnnotations to free the name. Methods: ma_encoding(),
set_ma_encoding().
- New types: Encoding, SkipFlag, MaParts, MmGroup, MmMlParts.
- New AnnotationType API: encoding (field), set_encoding (panics on
non-empty type), is_mm_ml(), to_ma_parts, to_mm_ml_parts.
- New MolecularAnnotations API: mm_ml_types(), mm_ml_types_mut().
- from_record now parses MA/AL/AQ/AN AND MM/ML in one pass. Previously
it only initialized structural fields; MA tag parsing required a
separate from_tags call.
- to_record now emits both MA tag set and MM/ML.
Cargo: log, regex, lazy_static, bio added as optional deps gated
through the htslib feature.
Tests: 130 lib tests + 13 doc tests passing under --features htslib;
117 + 13 without. Round-trip and isolation tests verify MA and MM/ML
cannot cross-contaminate; cargo clippy --features htslib clean.
- Promote the rust crate to molecular-annotation/ (was rust/) - Move spec markdown + agent instructions into molecular-annotation/spec/ - Update molecular-annotation/python/Cargo.toml path dep to ".." - Drop stale per-crate Cargo.lock files (workspace lock is authoritative) - Declare molecular-annotation as a workspace member; exclude its python cdylib - Add a pointer from the crate README to spec/README.md Known limitation: `cargo test --workspace` does not currently work because molecular-annotation's rust-htslib 0.47 dev-dep conflicts with fibertools-rs's 0.46 via shared hts-sys bindings. Run `cargo test -p <crate>` separately until the integration branch unifies htslib versions.
The workspace now includes molecular-annotation as a member, but its rust-htslib 0.47 dev-dep conflicts with fibertools-rs's 0.46 via shared hts-sys bindings, and its source has pre-existing rustfmt drift. Pin fmt/clippy/test to fibertools-rs explicitly until the integration branch unifies htslib versions and reformats molecular-annotation.
mrvollger
approved these changes
May 25, 2026
cademirch
added a commit
to cademirch/fibertools-rs
that referenced
this pull request
May 29, 2026
Upstream main (PR fiberseq#109) vendored Molecular-annotation-spec as a workspace member at ./molecular-annotation; switch from the git dependency to a path dependency on the vendored copy. Introduce [workspace.dependencies] pinning rust-htslib at 0.46. Both fibertools-rs and molecular-annotation now consume it via `rust-htslib.workspace = true`. The previous mismatch (root at 0.47, vendored MA also 0.47, upstream root at 0.46) had htslib duplicated in the dep graph with incompatible types across the crate boundary.
cademirch
added a commit
to cademirch/fibertools-rs
that referenced
this pull request
Jun 2, 2026
Upstream main (PR fiberseq#109) vendored Molecular-annotation-spec as a workspace member at ./molecular-annotation; switch from the git dependency to a path dependency on the vendored copy. Introduce [workspace.dependencies] pinning rust-htslib at 0.46. Both fibertools-rs and molecular-annotation now consume it via `rust-htslib.workspace = true`. The previous mismatch (root at 0.47, vendored MA also 0.47, upstream root at 0.46) had htslib duplicated in the dep graph with incompatible types across the crate boundary.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This is in support of #107. Brings the molecular annotation library (name tbd) into this workspace. Currently, this is purely superficial in that the code lives here but it is not integrated into CI, any release stuff, or otherwise. That will come in follow ups.