fix: prevent storeDir race condition with parallel chromosome tasks on cloud executors#482

Open
ikhrustalev wants to merge 1 commit into
PGScatalog:mainfrom
haplotypelabs:fix/storeDir-race-condition

Conversation

ikhrustalev commented Mar 23, 2026

Summary

Fixes #481

When multiple chromosomes run in parallel on cloud executors (e.g. Google Cloud Batch with GCS FUSE), every process that uses storeDir writes versions.yml to the same path concurrently. The data files don't collide because their filenames include the chromosome, but versions.yml always has the same name, so the concurrent writes cause stale file handles or missing output files.

This PR partitions storeDir by sample ID and chromosome using closures for lazy evaluation, so each task gets its own isolated directory while preserving cross-run caching.

Before:

cachedir = params.genotypes_cache ? file(params.genotypes_cache) : workDir
storeDir cachedir / "genomes" / "recoded"

After:

storeDir { (params.genotypes_cache ? file(params.genotypes_cache) : workDir).resolve("genomes/${meta.id}/recoded/${meta.chrom}") }

Changes

  • 14 files changed, covering every module that uses storeDir
  • Replaced eager path construction with closure-based lazy evaluation
  • Added meta.id and meta.chrom (where available) to storeDir paths
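
For reviewers who want to see the closure form in context, here is a minimal sketch of a standalone script. The process name WRITE_VERSIONS, the input channel values and the script body are hypothetical and not taken from the changed modules; only the storeDir line is copied from this PR. The closure matters because a directive that refers to task inputs such as meta has to be resolved per task, once meta is bound, rather than when the process is defined.

```groovy
// Hypothetical standalone sketch, not one of the 14 changed modules.
params.genotypes_cache = null   // falls back to workDir when unset

process WRITE_VERSIONS {
    // Closure form from this PR: resolved per task, so each
    // (sample, chromosome) pair gets its own store directory.
    storeDir { (params.genotypes_cache ? file(params.genotypes_cache) : workDir).resolve("genomes/${meta.id}/recoded/${meta.chrom}") }

    input:
    val meta

    output:
    tuple val(meta), path("versions.yml")

    script:
    """
    echo "chrom: ${meta.chrom}" > versions.yml
    """
}

workflow {
    // Two illustrative tasks that would otherwise share one versions.yml path
    Channel.of(
        [id: 'HG00096', chrom: '21'],
        [id: 'HG00096', chrom: '22']
    ) | WRITE_VERSIONS | view
}
```

Under these assumptions each task should store its versions.yml under genomes/HG00096/recoded/21 or genomes/HG00096/recoded/22 respectively, so the identically named files no longer share a path.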

Testing

  • Standard test profile passes locally (nextflow run . -profile test,docker)
  • Verified on Google Cloud Batch with 22-chromosome VCF input (sample HG00096, PGS002209 with --run_ancestry)
  • Pipeline completes with zero errors vs. 17+ retry errors before the fix
  • Scores are identical to runs using the original code

…n cloud executors

When running multiple chromosomes in parallel on cloud executors (e.g. Google Batch
with GCS FUSE), processes using storeDir write versions.yml to the same path
concurrently, causing stale file handles or missing output files.

Partition storeDir by sample ID and chromosome using closures for lazy evaluation,
so each task gets its own isolated directory while preserving cross-run caching.

Fixes PGScatalog#481
@smlmbrt smlmbrt requested a review from nebfield March 23, 2026 14:01

Development

Successfully merging this pull request may close these issues.

storeDir race condition: parallel chromosome tasks fail with "Missing output file versions.yml" on Google Cloud Batch