feat: optional BAM → CRAM checkpoint after MarkDup / BQSR #259
Open
abhi18av wants to merge 1 commit into
Pattern 1 of the BAM-checkpoints spec
(abc-universe/specs/active/magma-bam-checkpoints.md). Opt-in via
--bam_checkpoint_compression cram; default 'none' preserves current
behaviour byte-equivalently.
Architecture — drop-in module swap with identical channel shape
call_wf.nf currently runs SAMTOOLS_INDEX at the post-MarkDup-or-BQSR
junction, emitting (sampleName, *.bai, *.bam) which HC and the
minor-variants HC consume directly. With --bam_checkpoint_compression
cram, a new SAMTOOLS_VIEW_CRAM module runs instead and emits the
identical channel shape (sampleName, *.cram.crai, *.cram). Downstream
consumers are agnostic: GATK HaplotypeCaller, the minor-variants HC,
and LoFreq (htslib-backed) all read CRAM directly via `-R / -f <ref>`,
so no CRAM → BAM conversion is needed at consume time.
The wiring is one if/else: no new emit shape and no changes to any of
the seven processes that read from this channel.
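The branch described above can be modelled outside Nextflow as a small selection rule. The shell sketch below is illustrative only and is not the code in call_wf.nf; the function name `checkpoint_branch` is hypothetical, while the module names and output patterns are the ones stated in this commit message.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical model of the if/else in call_wf.nf (not the actual Nextflow code).
# Prints: <module that runs> <index pattern> <alignment pattern>
checkpoint_branch() {
    case "${1:-none}" in
        none) echo "SAMTOOLS_INDEX *.bai *.bam" ;;
        cram) echo "SAMTOOLS_VIEW_CRAM *.cram.crai *.cram" ;;
        *)    echo "unsupported bam_checkpoint_compression: $1" >&2
              return 1 ;;
    esac
}

checkpoint_branch none
checkpoint_branch cram
```

Either branch yields the same three-field shape (sample, index, alignment), which is why the consumers need no changes.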
Files
modules/local/samtools/view_cram.nf NEW
`samtools view -C -T <ref> --output-fmt cram,version=3.1` then
`samtools index`. Lossless by default; quality-score binning is
opt-in via params.cram_lossy_qualities. Same `samtools` binary
MAGMA already uses (params.samtools_path).
workflows/call_wf.nf
Branches on params.bam_checkpoint_compression at the post-
MarkDup-or-BQSR junction. Routes the resulting `aligned_indexed_ch`
through GATK_HAPLOTYPE_CALLER, GATK_HAPLOTYPE_CALLER__MINOR_VARIANTS,
and LOFREQ_CALL__NTM, all of which read CRAM directly (GATK via
htsjdk, LoFreq via htslib).
The cohort-emit `samtools_bam_ch` carries the same channel shape.
default_params.config
Adds:
bam_checkpoint_compression = 'none' // 'none' (default) | 'cram'
cram_lossy_qualities = false // lossless default
CHANGELOG.md
`unreleased` section with the rationale + flag matrix.
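For the view_cram.nf entry above, a minimal sketch of the two commands per sample might look as follows. The helper name `cram_checkpoint_cmds`, the `-o` output naming, and the example file names are assumptions for illustration; the lossy quality-binning option is left out because this message does not say which samtools setting it maps to. The helper prints the commands instead of running them, so it can be inspected without samtools installed.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: prints the compress and index commands for one BAM.
#   $1 = samtools binary (params.samtools_path in the pipeline)
#   $2 = reference FASTA, $3 = input BAM
cram_checkpoint_cmds() {
    local samtools="$1" ref="$2" bam="$3"
    local cram="${bam%.bam}.cram"   # assumed naming: sample1.bam -> sample1.cram
    printf '%s view -C -T %s --output-fmt cram,version=3.1 -o %s %s\n' \
        "$samtools" "$ref" "$cram" "$bam"
    # samtools index on a CRAM writes <file>.cram.crai next to it
    printf '%s index %s\n' "$samtools" "$cram"
}

cram_checkpoint_cmds samtools ref.fasta sample1.bam
```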
Defaults
`bam_checkpoint_compression = 'none'` keeps SAMTOOLS_INDEX in the
pipeline; the new module isn't even loaded into the DAG. Existing
invocations produce identical outputs (plain `.bam` + `.bai`) to
master. Opting in is a single flag.
Codec scope
CRAM only in this PR. genozip / zstd variants are spec'd
(magma-bam-checkpoints.md §1) and will land as separate opt-in PRs
once the CRAM path has run on a real cohort.
Open items (documented in spec, not in this PR)
- Round-trip equivalence test — compress → consume → compare HC
output against a non-CRAM baseline run on the same cohort. The
spec §8 conformance gate calls for this; not blocking this PR's
opt-in default but blocking any "switch to default cram" follow-up.
- SV path (structural_variants_analysis_wf.nf) uses its own
MarkDup / recal chain and is not touched by this PR. CRAM
extension to DELLY's BAM input is a follow-up if needed.
- Samplesheet routing for pre-compressed inputs (.cram on input) is
Pattern 4 of the spec; not in this PR.
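One way the round-trip equivalence item could be checked is to compare the variant records from a baseline run and a CRAM run while ignoring header lines, which normally differ (command lines, input file names). This is a sketch under that assumption only; the spec §8 gate may define a stricter or different comparison, and the function names and example file names here are hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print the non-header records of a VCF (plain, gzip, or bgzip).
vcf_records() {
    case "$1" in
        *.gz) gzip -dc "$1" ;;
        *)    cat "$1" ;;
    esac | grep -v '^#' || true   # a VCF with no records is not an error
}

# Succeeds when both VCFs carry byte-identical records in the same order.
same_calls() {
    [ "$(vcf_records "$1" | cksum)" = "$(vcf_records "$2" | cksum)" ]
}

# Example (hypothetical file names):
#   same_calls baseline/sample1.g.vcf.gz cram/sample1.g.vcf.gz && echo equivalent
```

Comparing records only keeps the check independent of how each run was invoked; any difference in sites, genotypes, or annotations still fails it.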
Refs
- Spec: abc-universe/specs/active/magma-bam-checkpoints.md
- Sibling — XBS gVCF checkpoints (Pattern 3 of the GATK opts):
abc-universe/specs/active/xbs-variant-calling-gatk-optimizations.md §3
- htslib CRAM 3.x: https://samtools.github.io/hts-specs/CRAMv3.pdf
Summary
Pattern 1 of the BAM-checkpoints spec (abc-universe/specs/active/magma-bam-checkpoints.md): optional BAM → CRAM checkpoint at the post-MarkDup-or-BQSR junction, opt-in via param, default off. Genozip / zstd variants will land as separate PRs once this has run on a real cohort.
Architecture: drop-in module swap with identical channel shape
`call_wf.nf` currently runs `SAMTOOLS_INDEX` at the post-MarkDup-or-BQSR junction, emitting `(sampleName, *.bai, *.bam)`, which HC and the minor-variants HC consume directly. With `--bam_checkpoint_compression cram`, a new `SAMTOOLS_VIEW_CRAM` module runs instead and emits the identical channel shape `(sampleName, *.cram.crai, *.cram)`. Downstream consumers are agnostic: GATK HaplotypeCaller, the minor-variants HC, and LoFreq (htslib-backed) all read CRAM directly via `-R / -f <ref>`, with zero round-trip cost at consume time.
One if/else, no new emit shape, no changes to any of the 4 processes that read the channel.
Files
modules/local/samtools/view_cram.nf
`samtools view -C -T <ref> --output-fmt cram,version=3.1` + `samtools index`. Lossless by default; opt-in lossy quality binning via `params.cram_lossy_qualities`. Same `samtools` binary MAGMA already uses.
workflows/call_wf.nf
Branches on `params.bam_checkpoint_compression` at the post-MarkDup-or-BQSR junction. Routes the resulting `aligned_indexed_ch` through HC + minor-variants HC + LoFreq + cohort emit.
default_params.config
Adds `bam_checkpoint_compression = 'none'` (default) and `cram_lossy_qualities = false`.
CHANGELOG.md
`unreleased` entry with the flag matrix and rationale.
Defaults
`bam_checkpoint_compression = 'none'` keeps `SAMTOOLS_INDEX` in the pipeline; the new module isn't even loaded into the DAG. Existing invocations produce identical outputs (plain `.bam` + `.bai`) to master. Opting in is a single flag.
Pre-merge gates
- `nextflow lint modules/local/samtools/view_cram.nf`: clean
- `nextflow lint workflows/call_wf.nf` shows a pre-existing `addParams()` DSL2-deprecation warning at line 26 (the first include, untouched by this PR); not introduced here
Open items (documented in spec, not in this PR)
- SV path (`structural_variants_analysis_wf.nf`) uses its own MarkDup / recal chain and is not touched by this PR. CRAM extension to DELLY's BAM input is a follow-up if needed.
- Samplesheet routing for pre-compressed inputs (`.cram` on input) is Pattern 4 of the spec.
Refs
- abc-universe/specs/active/magma-bam-checkpoints.md
- abc-universe/specs/active/xbs-variant-calling-gatk-optimizations.md §3