Skip to content

feat: optional BAM → CRAM checkpoint after MarkDup / BQSR#259

Open
abhi18av wants to merge 1 commit into
feature/abc-clusterfrom
feature/cram-checkpoint
Open

feat: optional BAM → CRAM checkpoint after MarkDup / BQSR#259
abhi18av wants to merge 1 commit into
feature/abc-clusterfrom
feature/cram-checkpoint

Conversation

@abhi18av

Copy link
Copy Markdown
Member

Summary

Pattern 1 of the BAM-checkpoints spec (abc-universe/specs/active/magma-bam-checkpoints.md) — optional BAM → CRAM checkpoint at the post-MarkDup-or-BQSR junction, opt-in via param, default off. Genozip / zstd variants will land as separate PRs once this has run on a real cohort.

Architecture — drop-in module swap with identical channel shape

call_wf.nf currently runs SAMTOOLS_INDEX at the post-MarkDup-or-BQSR junction, emitting (sampleName, *.bai, *.bam) which HC and the minor-variants HC consume directly. With --bam_checkpoint_compression cram, a new SAMTOOLS_VIEW_CRAM module runs instead and emits the identical channel shape (sampleName, *.cram.crai, *.cram). Downstream consumers are agnostic — GATK HaplotypeCaller, the minor-variants HC, and LoFreq (htslib-backed) all read CRAM directly via -R / -f <ref>, zero round-trip cost at consume time.

def aligned_indexed_ch
if (params.bam_checkpoint_compression == 'cram') {
    SAMTOOLS_VIEW_CRAM(recalibrated_bam_ch, params.ref_fasta, [params.ref_fasta_fai, params.ref_fasta_dict])
    aligned_indexed_ch = SAMTOOLS_VIEW_CRAM.out
} else {
    SAMTOOLS_INDEX(recalibrated_bam_ch)
    aligned_indexed_ch = SAMTOOLS_INDEX.out
}
// ...HC, minor-variants HC, LoFreq, samtools_bam_ch emit all read aligned_indexed_ch

One if/else, no new emit shape, no changes to any of the 4 processes that read the channel.

Files

Path Purpose
modules/local/samtools/view_cram.nf NEW. samtools view -C -T <ref> --output-fmt cram,version=3.1 + samtools index. Lossless by default; opt-in lossy quality binning via params.cram_lossy_qualities. Same samtools binary MAGMA already uses.
workflows/call_wf.nf Branches on params.bam_checkpoint_compression at the post-MarkDup-or-BQSR junction. Routes the resulting aligned_indexed_ch through HC + minor-variants HC + LoFreq + cohort emit.
default_params.config Adds bam_checkpoint_compression = 'none' (default) and cram_lossy_qualities = false.
CHANGELOG.md unreleased entry with the flag matrix and rationale.

Defaults

bam_checkpoint_compression = 'none' keeps SAMTOOLS_INDEX in the pipeline; the new module isn't even loaded into the DAG. Existing invocations produce identical outputs (plain .bam + .bai) to master. Opting in is a single flag.

Pre-merge gates

  • nextflow lint modules/local/samtools/view_cram.nf — clean
  • ⚠️ nextflow lint workflows/call_wf.nf shows a pre-existing addParams() DSL2-deprecation warning at line 26 (the first include, untouched by this PR) — not introduced here

Open items (documented in spec, not in this PR)

  • Round-trip equivalence test — compress → consume → compare HC output against a non-CRAM baseline run on the same cohort. Spec §8 conformance gate; not blocking this PR's opt-in default but blocking any "switch default to cram" follow-up.
  • SV path (structural_variants_analysis_wf.nf) uses its own MarkDup / recal chain and is not touched by this PR. CRAM extension to DELLY's BAM input is a follow-up if needed.
  • Samplesheet routing for pre-compressed inputs (.cram on input) is Pattern 4 of the spec.
  • genozip + zstd codecs — separate PRs; genozip license-gated; zstd license-free but modest gain on already-bgzipped BAM.

Refs

Pattern 1 of the BAM-checkpoints spec
(abc-universe/specs/active/magma-bam-checkpoints.md). Opt-in via
--bam_checkpoint_compression cram; default 'none' preserves current
behaviour byte-equivalently.

Architecture — drop-in module swap with identical channel shape

call_wf.nf currently runs SAMTOOLS_INDEX at the post-MarkDup-or-BQSR
junction, emitting (sampleName, *.bai, *.bam) which HC and the
minor-variants HC consume directly. With --bam_checkpoint_compression
cram, a new SAMTOOLS_VIEW_CRAM module runs instead and emits the
identical channel shape (sampleName, *.cram.crai, *.cram). Downstream
consumers are agnostic — GATK HaplotypeCaller, the minor-variants HC,
and LoFreq (htslib-backed) all read CRAM directly via `-R / -f <ref>`,
so the round-trip cost at consume time is zero.

This is the cleanest possible wiring: one if/else, no new emit shape,
no changes to any of the seven processes that read from this channel.

Files

  modules/local/samtools/view_cram.nf       NEW
      `samtools view -C -T <ref> --output-fmt cram,version=3.1` then
      `samtools index`. Lossless by default; quality-score binning is
      opt-in via params.cram_lossy_qualities. Same `samtools` binary
      MAGMA already uses (params.samtools_path).

  workflows/call_wf.nf
      Branches on params.bam_checkpoint_compression at the post-
      MarkDup-or-BQSR junction. Routes the resulting `aligned_indexed_ch`
      through GATK_HAPLOTYPE_CALLER, GATK_HAPLOTYPE_CALLER__MINOR_VARIANTS,
      and LOFREQ_CALL__NTM — all htslib-backed, all read CRAM directly.
      The cohort-emit `samtools_bam_ch` carries the same channel shape.

  default_params.config
      Adds:
        bam_checkpoint_compression = 'none'  // 'none' (default) | 'cram'
        cram_lossy_qualities       = false   // lossless default

  CHANGELOG.md
      `unreleased` section with the rationale + flag matrix.

Defaults

`bam_checkpoint_compression = 'none'` keeps SAMTOOLS_INDEX in the
pipeline; the new module isn't even loaded into the DAG. Existing
invocations produce identical outputs (plain `.bam` + `.bai`) to
master. Opting in is a single flag.

Codec scope

CRAM only in this PR. genozip / zstd variants are spec'd
(magma-bam-checkpoints.md §1) and will land as separate opt-in PRs
once the CRAM path has run on a real cohort.

Open items (documented in spec, not in this PR)

  - Round-trip equivalence test — compress → consume → compare HC
    output against a non-CRAM baseline run on the same cohort. The
    spec §8 conformance gate calls for this; not blocking this PR's
    opt-in default but blocking any "switch to default cram" follow-up.
  - SV path (structural_variants_analysis_wf.nf) uses its own
    MarkDup / recal chain and is not touched by this PR. CRAM
    extension to DELLY's BAM input is a follow-up if needed.
  - Samplesheet routing for pre-compressed inputs (.cram on input) is
    Pattern 4 of the spec; not in this PR.

Refs

  - Spec: abc-universe/specs/active/magma-bam-checkpoints.md
  - Sibling — XBS gVCF checkpoints (Pattern 3 of the GATK opts):
    abc-universe/specs/active/xbs-variant-calling-gatk-optimizations.md §3
  - htslib CRAM 3.x: https://samtools.github.io/hts-specs/CRAMv3.pdf
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant