Skip to content

Releases: fiberseq/FiberHMM

v2.12.0: circular molecule mode + inference refactor

25 May 18:05

Choose a tag to compare

New features

--circular on fiberhmm-call

New flag for circular molecules (plasmids, organelles). FiberHMM 3x-tiles each read internally and projects HMM footprint/TF calls back to molecule coordinates so output ns/nl/as/al stay in [0, read_length). Wrapped MA annotations are split into two clipped pieces that share an AN:Z name so downstream consumers can re-stitch them.

Note: FiberBrowser support for visualizing circular annotations (wrapped features, AN/circId grouping) is coming in a near-term release.

--circular-groups on fiberhmm-extract

New flag emits BED12+5 rows whose final columns are circId / circPart / circParts / molStart / molLength. autoSQL gains a matching .circ.as variant so generated bigBeds advertise the schema.

Internal refactor

fiberhmm/inference/parallel.py shrunk from 2898 lines to 192 lines: a thin dispatcher that re-exports every previously-private symbol. The streaming pipeline, region pipeline, legacy pipeline, workers, mp context, tagging, fused stages, and read-filter logic now live in dedicated modules. All existing import paths continue to work.

Performance

  • Single-pass numba DAF and m6A context encoders, with vectorized fallback retained as equivalence oracles (5-12x speedup on encode).
  • Viterbi memory traffic reduced via rolling state scores.
  • Read-only HMM log-probabilities frozen for inference workers (~2x faster warm prediction).
  • Numba-backed footprint run scanner with numpy fallback.

Stability

  • Posterior writers, fused DAF reference handles, score-DB connections, stats BAM handles, extract worker BEDs, tagged-BAM BED helpers, HDF5 writers, and TSV writers all close cleanly on failure paths.
  • Streaming and fused workers now report per-read failure counts to the drain while still passing failed reads through unchanged.
  • Region-parallel temp directories are now per-run unique.

Bug fix

set_legacy_apply_tags now strips stale MA/AN/AQ when fiberhmm-apply runs over a BAM previously processed by fiberhmm-call/recall-tfs. Prior to this, a re-apply pass would leave spec tags pointing at superseded ns/nl coordinates.

Tests

427 correctness tests pass (was 297 at the start of the refactor), 26 benchmarks pass, ruff clean. New end-to-end coverage for --circular (streaming + region-parallel) and extract --circular-groups.

v2.11.0: fix nanopore reverse-strand m6A encoding

07 May 18:41

Choose a tag to compare

Bug fix

Nanopore-fiber mode silently dropped all m6A on reverse-aligned reads, causing every reverse read to be called as one whole-read nucleosome regardless of actual m6A density. Forward-aligned reads were unaffected. For typical ONT fiber-seq BAMs where ~50% of reads are reverse-aligned, this means half the data was producing incorrect calls.

Root cause

The nanopore encoder only looked for A positions in SAM SEQ (target_int=0, do_rc=False). For reverse-aligned reads, SAM SEQ is the reverse-complement of the basecalled-forward strand — basecalled A's appear as T's. The MM/ML parser correctly converted mod positions to SEQ frame (v2.9 fix), but those positions landed on T's which the encoder skipped entirely.

Fix

encode_from_query_sequence() now accepts an is_reverse parameter. In nanopore-fiber mode with is_reverse=True, it encodes T positions with RC context to recover the basecalled-forward emission space. Forward reads continue to encode A positions with forward context as before. PacBio-fiber and DAF modes are unaffected.

Workaround for older versions

Use --seq pacbio instead of --seq nanopore — PacBio mode already handles both A and T positions with RC, giving strand-symmetric calls (though with slightly miscalibrated emissions).

Tests

5 new regression tests validate forward/reverse strand symmetry in nanopore encoding mode.

All 297 tests pass.

v2.10.5: fix silent extract segfault on unusual MM/ML layouts

24 Apr 06:43

Choose a tag to compare

Bug fix

fiberhmm-extract --all produced silently empty output on certain BAMs from pangenome-graph-surjection pipelines (e.g. vg surject onto linear reference). Root cause was a C-level segfault inside pysam.AlignedSegment.modified_bases on reads with unusual MM/ML layouts — invisible because the SIGSEGV kills the worker before Python's exception handler can run, and the main process reports it blandly as "A process in the process pool was terminated abruptly" with zero rows emitted per region.

Localized via faulthandler: segfault at _extract_m6a line 505 during read.modified_bases. pysam has had this class of bug intermittently for years; we already worked around it in v2.9.0 for the apply/call path by using our own numpy MM/ML parser (fiberhmm.core.bam_reader.parse_mm_ml_per_mod_type, validated against pysam on 500+ real reads).

Fix

Swap read.modified_bases for the safe parser in all three extract sites:

  • _extract_m6a
  • _extract_m5c
  • _extract_deam priority-1 (MM/ML u dU code)

No API change; output format unchanged. Single-char mod codes work identically. The rare ChEBI 55797 numeric code is no longer auto-detected in extract (explicitly unsupported — users needing it can rely on the R/Y or MD fallback paths, or run samtools calmd + fiberhmm-daf-encode).

Side fix: worker exception handler now flushes stderr explicitly so Python-level errors surface during multiprocessing crashes instead of getting buffer-swallowed.

Verified on real surjected fiber-seq CRAMs

Full pipeline: CRAM → BAM → fiberhmm-call → ft fire → fiberhmm-extract --all -c 8 --block-scores

Both of two test samples now produce full output across all 5 tracks (footprint / msp / tf / m6a / m5c), each around 1.8M rows. One of the two samples was silently emitting zero rows on v2.10.4 with identical config; v2.10.5 fixes it.

Tests

292 pass. Three deam priority-1 tests rewritten to construct real MM/ML tag strings instead of faking modified_bases dicts.

Install

pip install --upgrade fiberhmm

v2.10.4: auto-skip --deam on fiber-seq BAMs

24 Apr 03:44

Choose a tag to compare

Bug fix

fiberhmm-extract --all (default mode) now auto-skips --deam when the input BAM looks like fiber-seq.

What was happening

On vg-surjected fiber-seq CRAMs (pangenome-graph aligned, projected onto linear reference):

  1. --all mode ran every track including --deam — even on fiber-seq data that has no deaminations (Hia5 is a methyltransferase).
  2. --deam path-3 (MD walker) fell through on every read because paths 1 (MM/ML u code) and 2 (R/Y) had nothing to match.
  3. Surjected BAMs frequently have MDs that disagree with CIGAR on ~97% of reads (documented upstream issue in vg surject).
  4. pysam's get_aligned_pairs(with_seq=True) on bad MD raises AssertionError AND corrupts malloc state before propagating.
  5. A later allocation in the worker segfaults. Python's finally block doesn't run on segfault, so buffered m6a/m5c BED writes are lost.
  6. Result: --deam is empty, and m6a/m5c files are also empty or missing entirely.

Fix

Auto-skip --deam on BAMs where:

  • MM/ML carries m6A (A+a / T-a) or 5mC (C+m) subtypes, AND
  • no dU code (+u / +55797), AND
  • no R/Y IUPAC in the query sequence

When triggered, prints:

Auto-skipping --deam: BAM looks like fiber-seq (MM/ML has m6A/5mC subtypes,
no dU code, no R/Y). Hia5 is a methyltransferase, not a deaminase --
no deamination calls to extract. Pass --deam explicitly to override.

Explicit --deam is still honored — the auto-skip only fires in --all / default mode.

Verified

Tested on a 5 Mb chr1 slice of a real surjected fiber-seq CRAM: 29,679 reads → 22.8M m6a + 4.3M m5c + 1.9M footprint + 1.9M msp positions, zero crashes, all bigBeds produced. Pre-fix this would have segfaulted mid-region and lost m6a/m5c writes.

Tests

294 pass. 8 new cover the fiber-seq signature detection across PacBio modern MM/ML, SAM 1.7 suffix variants (A+a., C+m?), DAF IUPAC, modkit dU, ChEBI 55797, and mixed cases.

Install

pip install --upgrade fiberhmm

v2.10.3: README promotes one-step DAF pipeline (docs-only)

23 Apr 17:01

Choose a tag to compare

Docs-only release

Three README updates so the documentation matches the v2.10.0+ reality:

  1. Streaming Pipelines section — replaced the stale DAF example (minimap2 | daf-encode | apply) with the one-pass form:

    minimap2 --MD -a ref.fa reads.fq | samtools view -b | \
        fiberhmm-call --mode daf --enzyme dddb -i - -o - | \
        ft fire - final.bam
  2. fiberhmm-daf-encode CLI reference — added an "Optional in v2.10.0+" banner. The tool still exists and still works; most users just don't need it anymore.

  3. Tag reference — clarified that st:Z is written only by fiberhmm-daf-encode (the two-pass path), not by fiberhmm-call --mode daf (the one-pass path derives strand internally from MD).

No code changes. No behavior changes. 292 tests still pass.

Install

pip install --upgrade fiberhmm

v2.10.2: surface stale-MD warning (extract + call)

23 Apr 15:43

Choose a tag to compare

What's new

Both fiberhmm-extract and fiberhmm-call --mode daf now print a clear note when they detect MD tags that disagree with their CIGAR (common on consensus BAMs where MD is stale after CIGAR was recomputed). Example:

NOTE: 11/20 of the sniffed reads have MD tags that
disagree with their CIGAR (typical of consensus BAMs where
MD was inherited stale after CIGAR was recomputed).
These reads will be SKIPPED for --deam only; other tracks
extract normally. The reads themselves are not corrupt --
the CIGAR is still correct, only the MD annotation is stale.
To recover --deam calls for those reads, regenerate MD with:
  samtools calmd -b aligned.bam ref.fa > fixed.bam

v2.10.1 silently skipped those reads. That left users unsure whether their data was corrupt and whether anything had been dropped. v2.10.2 makes it explicit, up-front, and actionable.

Warning only fires when --deam actually depends on MD (no R/Y in sequence, no MM/ML dU code, no --reference on the call side) AND at least one sniffed read has the mismatch. Free — reuses the v2.10.1 validator. No code path changes beyond printing the note.

Install

pip install --upgrade fiberhmm

v2.10.1: fix malloc crash on BAMs with malformed MD tags

23 Apr 15:39

Choose a tag to compare

Bug fix

fiberhmm-extract (and fiberhmm-call --mode daf on v2.10.0) crashed on BAMs where the MD tag's encoded reference length disagreed with the CIGAR's reference-consuming op total. Classic consensus-BAM bug: MD inherits stale state after a re-alignment updates CIGAR.

Symptom on v2.9.7:

malloc(): invalid size (unsorted)

Why

pysam.get_aligned_pairs(with_seq=True) raises AssertionError on MD/CIGAR mismatch — but before raising, pysam has already corrupted its internal malloc state. So v2.9.6 crashed immediately with the assertion; v2.9.7 swallowed the assertion but the heap was already poisoned, surfacing as a malloc(): invalid size on some unrelated allocation later in the worker.

Fix

Pre-validate the MD tag against the CIGAR in pure Python before asking pysam to parse it. If they disagree, skip the get_aligned_pairs(with_seq=True) call entirely. Never triggers pysam's AssertionError path, never corrupts malloc.

Patched both call sites:

  • fiberhmm/cli/extract_tags.py_extract_deam path-3
  • fiberhmm/daf/encoder.pyget_daf_positions (v2.10.0's --mode daf MD fallback)

Workaround for affected BAMs (optional)

If you want --deam calls to populate for reads with broken MD, regenerate clean MD tags:

samtools calmd -b aligned.bam ref.fa > fixed.bam

Without this, v2.10.1 silently skips the broken reads for --deam extraction (every other track still works for those reads). With v2.10.0's --mode daf, you can also pass --reference ref.fa as a fallback.

Tests

292 passing. 3 new tests pin the MD parser, the validator, and the invariant that malformed MD bypasses the pysam call completely.

Install

pip install --upgrade fiberhmm

v2.10.0: fiberhmm-call --mode daf auto-detects MD tags (no encode step)

22 Apr 18:29

Choose a tag to compare

Highlights

fiberhmm-call --mode daf now works directly on raw DAF BAMs. The two-pass fiberhmm-daf-encode | fiberhmm-call pipeline collapses into one.

What changed

Per-read auto-detection in --mode daf. Priority:

  1. R/Y IUPAC codes in the stored sequence (from fiberhmm-daf-encode) — existing fast path, unchanged.
  2. MD tag on the aligned read — parsed on the fly, skips the encode step entirely.
  3. --reference REF.fa — new CLI flag, fallback when the BAM has neither R/Y nor MD.

First non-empty source wins per read.

Correctness

Output is byte-identical to the two-pass pipeline. Verified on a real 3,929-read consensus DAF BAM (Christy LaFlamme's dataset):

source same HMM call bytes as two-pass?
ns (nucleosomes) ✅ 0 / 3,929 diffs
nl (nuc lengths) ✅ 0 / 3,929 diffs
as (MSPs) ✅ 0 / 3,929 diffs
al (MSP lengths) ✅ 0 / 3,929 diffs
nq / aq (quality) ✅ 0 / 3,929 diffs
MA / AQ (spec tags) ✅ 0 / 3,929 diffs

And zero regression on R/Y-encoded BAMs: v2.9.6 vs v2.10.0 on the same pre-encoded BAM = 0 diffs.

New CLI

fiberhmm-call --mode daf --enzyme dddb --region-parallel \
    -i raw_aligned.bam -o recalled.bam                           # MD-tag path

fiberhmm-call --mode daf --enzyme dddb --region-parallel \
    -i raw_aligned.bam -o recalled.bam --reference ref.fa        # FASTA fallback

On startup, --mode daf does a fast-fail sniff: it checks the first 10 mapped reads for R/Y, MD, or --reference. If none are available, it errors out in under a second with a clear message instead of silently skipping every read.

Backward compatibility

  • fiberhmm-daf-encode stays available and unchanged. Still the right tool if you want R/Y IUPAC stamped into the stored sequence for R/Y-aware downstream tools.
  • R/Y-encoded BAMs continue through the existing fast path (zero cost for the new MD branch — it only fires when R/Y is absent).
  • No new dependencies.

Install

pip install --upgrade fiberhmm

Tests

289 passing. 4 new unit tests covering the MD-fallback branch, payload plumbing, fast-fail sniff, and byte-identity gate.

v2.9.7: fix fiberhmm-extract crash on malformed MD tags

22 Apr 16:50

Choose a tag to compare

Bug fix

fiberhmm-extract now swallows pysam AssertionError when a read carries a malformed MD tag, skipping that read's deamination extraction instead of aborting the entire region worker.

Symptom (user-reported field crash):

Traceback (most recent call last):
  File ".../fiberhmm/cli/extract_tags.py", line 214, in _extract_region_worker
    n_features['deam'] += _extract_deam(...)
  File ".../fiberhmm/cli/extract_tags.py", line 688, in _extract_deam
    pairs = read.get_aligned_pairs(with_seq=True)
  ...
AssertionError: Invalid MD tag: MD length 37729 mismatch with CIGAR length 43799 and 9742 insertions

Root cause: the path-3 MD walker in _extract_deam (added in v2.9.6 for DAF-seq BAMs that only carry MD mismatches, no MM/ML and no R/Y) caught ValueError/AttributeError but not AssertionError, which is what pysam throws when MD/CIGAR lengths disagree internally.

Fix: broaden the exception handler to Exception. Affected reads now skip the deam track silently; all other tracks (footprint, msp, tf, m6a, m5c) still extract for those reads.

Regression test pinned: tests/test_extract_block_scores.py::test_deam_priority3_skips_read_on_pysam_assertion_error.

Install

pip install --upgrade fiberhmm

Affects only v2.9.6 (which introduced path-3). Earlier versions never touched MD tags, so they weren't exposed to this bug.

v2.9.6: tag-presence diagnostic + MD-mismatch deam fallback

21 Apr 16:44

Choose a tag to compare

Changes

Tag diagnostic in fiberhmm-extract

Before the heavy pass, fiberhmm-extract now sniffs the first 20 mapped reads and prints a summary of which tag sources were detected plus which tracks it predicts will populate:

Tag scan (first 20 mapped reads): MA/AQ=20/20, MM/ML=20/20 [A+a], R/Y-in-seq=20/20
  will populate: footprint[ns/nl], msp[as/al], tf[MA/AQ tf+ annotations], m6a[MM/ML (A+a)], deam[R/Y in sequence]
  will be empty: m5c

Or on a raw DAF BAM with nothing but MD:

Tag scan (first 20 mapped reads): MD-only=20/20
  will populate: deam[MD-only (path-3 fallback)]
  will be empty: footprint, msp, tf, m6a, m5c

--deam path-3 MD fallback (completes FiberBrowser parity)

Third priority source for deamination calls. When MM/ML-native (u / 55797) and IUPAC R/Y paths both come up empty, _extract_deam walks get_aligned_pairs(with_seq=True) and picks out deamination-direction mismatches: ref-C→read-T (flavor 1, Y/CT-dea) and ref-G→read-A (flavor 0, R/GA-dea). Matches FiberBrowser's BAM parser priority exactly.

Cross-check: on a 3,929-read raw consensus BAM (MD-only) vs the FiberHMM-processed version of the same reads (R/Y-encoded), the two extraction paths produced 99.997% agreement — 990,170 vs 990,144 deamination calls, 25 reads differing by ≤2 calls each (edge cases at indels).

Priority ordering (unchanged in flavor codes)

  1. MM/ML u or ChEBI 55797
  2. IUPAC R/Y in the query sequence
  3. MD-tag ref mismatch (new)

First non-empty source wins per read. blockMod bytes: 0 = R/GA-dea, 1 = Y/CT-dea.

Tests

3 new tests cover the MD walker, MD-absent skip, and priority ordering. Full suite: 288 passing.

Install

pip install --upgrade fiberhmm