Releases: fiberseq/FiberHMM
v2.12.0: circular molecule mode + inference refactor
New features
--circular on fiberhmm-call
New flag for circular molecules (plasmids, organelles). FiberHMM 3x-tiles each read internally and projects HMM footprint/TF calls back to molecule coordinates so output ns/nl/as/al stay in [0, read_length). Wrapped MA annotations are split into two clipped pieces that share an AN:Z name so downstream consumers can re-stitch them.
Note: FiberBrowser support for visualizing circular annotations (wrapped features,
AN/circIdgrouping) is coming in a near-term release.
--circular-groups on fiberhmm-extract
New flag emits BED12+5 rows whose final columns are circId / circPart / circParts / molStart / molLength. autoSQL gains a matching .circ.as variant so generated bigBeds advertise the schema.
Internal refactor
fiberhmm/inference/parallel.py shrunk from 2898 lines to 192 lines: a thin dispatcher that re-exports every previously-private symbol. The streaming pipeline, region pipeline, legacy pipeline, workers, mp context, tagging, fused stages, and read-filter logic now live in dedicated modules. All existing import paths continue to work.
Performance
- Single-pass numba DAF and m6A context encoders, with vectorized fallback retained as equivalence oracles (5-12x speedup on encode).
- Viterbi memory traffic reduced via rolling state scores.
- Read-only HMM log-probabilities frozen for inference workers (~2x faster warm prediction).
- Numba-backed footprint run scanner with numpy fallback.
Stability
- Posterior writers, fused DAF reference handles, score-DB connections, stats BAM handles, extract worker BEDs, tagged-BAM BED helpers, HDF5 writers, and TSV writers all close cleanly on failure paths.
- Streaming and fused workers now report per-read failure counts to the drain while still passing failed reads through unchanged.
- Region-parallel temp directories are now per-run unique.
Bug fix
set_legacy_apply_tags now strips stale MA/AN/AQ when fiberhmm-apply runs over a BAM previously processed by fiberhmm-call/recall-tfs. Prior to this, a re-apply pass would leave spec tags pointing at superseded ns/nl coordinates.
Tests
427 correctness tests pass (was 297 at the start of the refactor), 26 benchmarks pass, ruff clean. New end-to-end coverage for --circular (streaming + region-parallel) and extract --circular-groups.
v2.11.0: fix nanopore reverse-strand m6A encoding
Bug fix
Nanopore-fiber mode silently dropped all m6A on reverse-aligned reads, causing every reverse read to be called as one whole-read nucleosome regardless of actual m6A density. Forward-aligned reads were unaffected. For typical ONT fiber-seq BAMs where ~50% of reads are reverse-aligned, this means half the data was producing incorrect calls.
Root cause
The nanopore encoder only looked for A positions in SAM SEQ (target_int=0, do_rc=False). For reverse-aligned reads, SAM SEQ is the reverse-complement of the basecalled-forward strand — basecalled A's appear as T's. The MM/ML parser correctly converted mod positions to SEQ frame (v2.9 fix), but those positions landed on T's which the encoder skipped entirely.
Fix
encode_from_query_sequence() now accepts an is_reverse parameter. In nanopore-fiber mode with is_reverse=True, it encodes T positions with RC context to recover the basecalled-forward emission space. Forward reads continue to encode A positions with forward context as before. PacBio-fiber and DAF modes are unaffected.
Workaround for older versions
Use --seq pacbio instead of --seq nanopore — PacBio mode already handles both A and T positions with RC, giving strand-symmetric calls (though with slightly miscalibrated emissions).
Tests
5 new regression tests validate forward/reverse strand symmetry in nanopore encoding mode.
All 297 tests pass.
v2.10.5: fix silent extract segfault on unusual MM/ML layouts
Bug fix
fiberhmm-extract --all produced silently empty output on certain BAMs from pangenome-graph-surjection pipelines (e.g. vg surject onto linear reference). Root cause was a C-level segfault inside pysam.AlignedSegment.modified_bases on reads with unusual MM/ML layouts — invisible because the SIGSEGV kills the worker before Python's exception handler can run, and the main process reports it blandly as "A process in the process pool was terminated abruptly" with zero rows emitted per region.
Localized via faulthandler: segfault at _extract_m6a line 505 during read.modified_bases. pysam has had this class of bug intermittently for years; we already worked around it in v2.9.0 for the apply/call path by using our own numpy MM/ML parser (fiberhmm.core.bam_reader.parse_mm_ml_per_mod_type, validated against pysam on 500+ real reads).
Fix
Swap read.modified_bases for the safe parser in all three extract sites:
_extract_m6a_extract_m5c_extract_deampriority-1 (MM/MLudU code)
No API change; output format unchanged. Single-char mod codes work identically. The rare ChEBI 55797 numeric code is no longer auto-detected in extract (explicitly unsupported — users needing it can rely on the R/Y or MD fallback paths, or run samtools calmd + fiberhmm-daf-encode).
Side fix: worker exception handler now flushes stderr explicitly so Python-level errors surface during multiprocessing crashes instead of getting buffer-swallowed.
Verified on real surjected fiber-seq CRAMs
Full pipeline: CRAM → BAM → fiberhmm-call → ft fire → fiberhmm-extract --all -c 8 --block-scores
Both of two test samples now produce full output across all 5 tracks (footprint / msp / tf / m6a / m5c), each around 1.8M rows. One of the two samples was silently emitting zero rows on v2.10.4 with identical config; v2.10.5 fixes it.
Tests
292 pass. Three deam priority-1 tests rewritten to construct real MM/ML tag strings instead of faking modified_bases dicts.
Install
pip install --upgrade fiberhmm
v2.10.4: auto-skip --deam on fiber-seq BAMs
Bug fix
fiberhmm-extract --all (default mode) now auto-skips --deam when the input BAM looks like fiber-seq.
What was happening
On vg-surjected fiber-seq CRAMs (pangenome-graph aligned, projected onto linear reference):
--allmode ran every track including--deam— even on fiber-seq data that has no deaminations (Hia5 is a methyltransferase).--deampath-3 (MD walker) fell through on every read because paths 1 (MM/MLucode) and 2 (R/Y) had nothing to match.- Surjected BAMs frequently have MDs that disagree with CIGAR on ~97% of reads (documented upstream issue in
vg surject). - pysam's
get_aligned_pairs(with_seq=True)on bad MD raises AssertionError AND corrupts malloc state before propagating. - A later allocation in the worker segfaults. Python's
finallyblock doesn't run on segfault, so buffered m6a/m5c BED writes are lost. - Result:
--deamis empty, andm6a/m5cfiles are also empty or missing entirely.
Fix
Auto-skip --deam on BAMs where:
- MM/ML carries m6A (
A+a/T-a) or 5mC (C+m) subtypes, AND - no dU code (
+u/+55797), AND - no R/Y IUPAC in the query sequence
When triggered, prints:
Auto-skipping --deam: BAM looks like fiber-seq (MM/ML has m6A/5mC subtypes,
no dU code, no R/Y). Hia5 is a methyltransferase, not a deaminase --
no deamination calls to extract. Pass --deam explicitly to override.
Explicit --deam is still honored — the auto-skip only fires in --all / default mode.
Verified
Tested on a 5 Mb chr1 slice of a real surjected fiber-seq CRAM: 29,679 reads → 22.8M m6a + 4.3M m5c + 1.9M footprint + 1.9M msp positions, zero crashes, all bigBeds produced. Pre-fix this would have segfaulted mid-region and lost m6a/m5c writes.
Tests
294 pass. 8 new cover the fiber-seq signature detection across PacBio modern MM/ML, SAM 1.7 suffix variants (A+a., C+m?), DAF IUPAC, modkit dU, ChEBI 55797, and mixed cases.
Install
pip install --upgrade fiberhmm
v2.10.3: README promotes one-step DAF pipeline (docs-only)
Docs-only release
Three README updates so the documentation matches the v2.10.0+ reality:
-
Streaming Pipelines section — replaced the stale DAF example (
minimap2 | daf-encode | apply) with the one-pass form:minimap2 --MD -a ref.fa reads.fq | samtools view -b | \ fiberhmm-call --mode daf --enzyme dddb -i - -o - | \ ft fire - final.bam
-
fiberhmm-daf-encodeCLI reference — added an "Optional in v2.10.0+" banner. The tool still exists and still works; most users just don't need it anymore. -
Tag reference — clarified that
st:Zis written only byfiberhmm-daf-encode(the two-pass path), not byfiberhmm-call --mode daf(the one-pass path derives strand internally from MD).
No code changes. No behavior changes. 292 tests still pass.
Install
pip install --upgrade fiberhmm
v2.10.2: surface stale-MD warning (extract + call)
What's new
Both fiberhmm-extract and fiberhmm-call --mode daf now print a clear note when they detect MD tags that disagree with their CIGAR (common on consensus BAMs where MD is stale after CIGAR was recomputed). Example:
NOTE: 11/20 of the sniffed reads have MD tags that
disagree with their CIGAR (typical of consensus BAMs where
MD was inherited stale after CIGAR was recomputed).
These reads will be SKIPPED for --deam only; other tracks
extract normally. The reads themselves are not corrupt --
the CIGAR is still correct, only the MD annotation is stale.
To recover --deam calls for those reads, regenerate MD with:
samtools calmd -b aligned.bam ref.fa > fixed.bam
v2.10.1 silently skipped those reads. That left users unsure whether their data was corrupt and whether anything had been dropped. v2.10.2 makes it explicit, up-front, and actionable.
Warning only fires when --deam actually depends on MD (no R/Y in sequence, no MM/ML dU code, no --reference on the call side) AND at least one sniffed read has the mismatch. Free — reuses the v2.10.1 validator. No code path changes beyond printing the note.
Install
pip install --upgrade fiberhmm
v2.10.1: fix malloc crash on BAMs with malformed MD tags
Bug fix
fiberhmm-extract (and fiberhmm-call --mode daf on v2.10.0) crashed on BAMs where the MD tag's encoded reference length disagreed with the CIGAR's reference-consuming op total. Classic consensus-BAM bug: MD inherits stale state after a re-alignment updates CIGAR.
Symptom on v2.9.7:
malloc(): invalid size (unsorted)
Why
pysam.get_aligned_pairs(with_seq=True) raises AssertionError on MD/CIGAR mismatch — but before raising, pysam has already corrupted its internal malloc state. So v2.9.6 crashed immediately with the assertion; v2.9.7 swallowed the assertion but the heap was already poisoned, surfacing as a malloc(): invalid size on some unrelated allocation later in the worker.
Fix
Pre-validate the MD tag against the CIGAR in pure Python before asking pysam to parse it. If they disagree, skip the get_aligned_pairs(with_seq=True) call entirely. Never triggers pysam's AssertionError path, never corrupts malloc.
Patched both call sites:
fiberhmm/cli/extract_tags.py—_extract_deampath-3fiberhmm/daf/encoder.py—get_daf_positions(v2.10.0's--mode dafMD fallback)
Workaround for affected BAMs (optional)
If you want --deam calls to populate for reads with broken MD, regenerate clean MD tags:
samtools calmd -b aligned.bam ref.fa > fixed.bam
Without this, v2.10.1 silently skips the broken reads for --deam extraction (every other track still works for those reads). With v2.10.0's --mode daf, you can also pass --reference ref.fa as a fallback.
Tests
292 passing. 3 new tests pin the MD parser, the validator, and the invariant that malformed MD bypasses the pysam call completely.
Install
pip install --upgrade fiberhmm
v2.10.0: fiberhmm-call --mode daf auto-detects MD tags (no encode step)
Highlights
fiberhmm-call --mode daf now works directly on raw DAF BAMs. The two-pass fiberhmm-daf-encode | fiberhmm-call pipeline collapses into one.
What changed
Per-read auto-detection in --mode daf. Priority:
- R/Y IUPAC codes in the stored sequence (from
fiberhmm-daf-encode) — existing fast path, unchanged. - MD tag on the aligned read — parsed on the fly, skips the encode step entirely.
--reference REF.fa— new CLI flag, fallback when the BAM has neither R/Y nor MD.
First non-empty source wins per read.
Correctness
Output is byte-identical to the two-pass pipeline. Verified on a real 3,929-read consensus DAF BAM (Christy LaFlamme's dataset):
| source | same HMM call bytes as two-pass? |
|---|---|
| ns (nucleosomes) | ✅ 0 / 3,929 diffs |
| nl (nuc lengths) | ✅ 0 / 3,929 diffs |
| as (MSPs) | ✅ 0 / 3,929 diffs |
| al (MSP lengths) | ✅ 0 / 3,929 diffs |
| nq / aq (quality) | ✅ 0 / 3,929 diffs |
| MA / AQ (spec tags) | ✅ 0 / 3,929 diffs |
And zero regression on R/Y-encoded BAMs: v2.9.6 vs v2.10.0 on the same pre-encoded BAM = 0 diffs.
New CLI
fiberhmm-call --mode daf --enzyme dddb --region-parallel \
-i raw_aligned.bam -o recalled.bam # MD-tag path
fiberhmm-call --mode daf --enzyme dddb --region-parallel \
-i raw_aligned.bam -o recalled.bam --reference ref.fa # FASTA fallback
On startup, --mode daf does a fast-fail sniff: it checks the first 10 mapped reads for R/Y, MD, or --reference. If none are available, it errors out in under a second with a clear message instead of silently skipping every read.
Backward compatibility
fiberhmm-daf-encodestays available and unchanged. Still the right tool if you want R/Y IUPAC stamped into the stored sequence for R/Y-aware downstream tools.- R/Y-encoded BAMs continue through the existing fast path (zero cost for the new MD branch — it only fires when R/Y is absent).
- No new dependencies.
Install
pip install --upgrade fiberhmm
Tests
289 passing. 4 new unit tests covering the MD-fallback branch, payload plumbing, fast-fail sniff, and byte-identity gate.
v2.9.7: fix fiberhmm-extract crash on malformed MD tags
Bug fix
fiberhmm-extract now swallows pysam AssertionError when a read carries a malformed MD tag, skipping that read's deamination extraction instead of aborting the entire region worker.
Symptom (user-reported field crash):
Traceback (most recent call last):
File ".../fiberhmm/cli/extract_tags.py", line 214, in _extract_region_worker
n_features['deam'] += _extract_deam(...)
File ".../fiberhmm/cli/extract_tags.py", line 688, in _extract_deam
pairs = read.get_aligned_pairs(with_seq=True)
...
AssertionError: Invalid MD tag: MD length 37729 mismatch with CIGAR length 43799 and 9742 insertions
Root cause: the path-3 MD walker in _extract_deam (added in v2.9.6 for DAF-seq BAMs that only carry MD mismatches, no MM/ML and no R/Y) caught ValueError/AttributeError but not AssertionError, which is what pysam throws when MD/CIGAR lengths disagree internally.
Fix: broaden the exception handler to Exception. Affected reads now skip the deam track silently; all other tracks (footprint, msp, tf, m6a, m5c) still extract for those reads.
Regression test pinned: tests/test_extract_block_scores.py::test_deam_priority3_skips_read_on_pysam_assertion_error.
Install
pip install --upgrade fiberhmm
Affects only v2.9.6 (which introduced path-3). Earlier versions never touched MD tags, so they weren't exposed to this bug.
v2.9.6: tag-presence diagnostic + MD-mismatch deam fallback
Changes
Tag diagnostic in fiberhmm-extract
Before the heavy pass, fiberhmm-extract now sniffs the first 20 mapped reads and prints a summary of which tag sources were detected plus which tracks it predicts will populate:
Tag scan (first 20 mapped reads): MA/AQ=20/20, MM/ML=20/20 [A+a], R/Y-in-seq=20/20
will populate: footprint[ns/nl], msp[as/al], tf[MA/AQ tf+ annotations], m6a[MM/ML (A+a)], deam[R/Y in sequence]
will be empty: m5c
Or on a raw DAF BAM with nothing but MD:
Tag scan (first 20 mapped reads): MD-only=20/20
will populate: deam[MD-only (path-3 fallback)]
will be empty: footprint, msp, tf, m6a, m5c
--deam path-3 MD fallback (completes FiberBrowser parity)
Third priority source for deamination calls. When MM/ML-native (u / 55797) and IUPAC R/Y paths both come up empty, _extract_deam walks get_aligned_pairs(with_seq=True) and picks out deamination-direction mismatches: ref-C→read-T (flavor 1, Y/CT-dea) and ref-G→read-A (flavor 0, R/GA-dea). Matches FiberBrowser's BAM parser priority exactly.
Cross-check: on a 3,929-read raw consensus BAM (MD-only) vs the FiberHMM-processed version of the same reads (R/Y-encoded), the two extraction paths produced 99.997% agreement — 990,170 vs 990,144 deamination calls, 25 reads differing by ≤2 calls each (edge cases at indels).
Priority ordering (unchanged in flavor codes)
- MM/ML
uor ChEBI 55797 - IUPAC R/Y in the query sequence
- MD-tag ref mismatch (new)
First non-empty source wins per read. blockMod bytes: 0 = R/GA-dea, 1 = Y/CT-dea.
Tests
3 new tests cover the MD walker, MD-absent skip, and priority ordering. Full suite: 288 passing.
Install
pip install --upgrade fiberhmm