fix: change scorefiles to queue channel to reduce memory usage #488

Draft
katgorski wants to merge 4 commits into PGScatalog:main from katgorski:change-scorefiles-to-queue-channel

@katgorski

Fix for #475

Splits the score files into a queue channel after the DOWNLOAD_SCOREFILES process, so each score file is formatted individually in the downstream analyses, without modifying the match processes. The report process and template are also modified to handle multiple JSON score metadata files instead of a single JSON file. There are subtle changes in chain file handling to accommodate a queue channel as input to FORMAT_SCOREFILES.

I aimed to make only workflow changes and not modify any processes; I'll likely look at the pygscatalog repo later to check memory usage there.

This is a draft because I'm still testing things thoroughly, but if there's anything major I missed, let me know so I can fix it. Checking the mechanics locally, things seem to be fine, but the tests already present in the repo all run off their own little nf scripts; browsing through them, I didn't find anything that would easily confirm the behavior of my changes, since they are mostly workflow related. I also need a test set for liftover, since I modified that channel a bit in order to properly pass in the chain files multiple times.

Copilot's summary below:

Scorefile and log file handling improvements:

  • Changed the scorefile input in INPUT_CHECK to a queue channel for better compatibility with Nextflow's channel operations, and updated downstream usage accordingly.
  • Updated the SCORE_REPORT process to accept log_scorefiles as a separate input, ensuring that scorefile metadata is passed explicitly and consistently.
  • Modified the REPORT workflow to collect log_scorefiles into a channel before passing to SCORE_REPORT, and removed an unnecessary combine operation.

Scorefile metadata loading:

  • Updated the report generation script (report.qmd) to load all JSON files in the working directory, rather than relying on a single path from parameters, allowing for more flexible metadata aggregation.
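
As a rough illustration of the "load every JSON file in the working directory" idea, here is a minimal Python sketch. It is not the code in report.qmd, which may be written in a different language and expect a different metadata layout. The function name `load_score_metadata`, the `pgs_id` field, and the file names are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def load_score_metadata(workdir):
    """Parse every *.json file in workdir, returned in file-name order."""
    records = []
    # Sorting makes the order deterministic regardless of the file system.
    for path in sorted(Path(workdir).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            records.append(json.load(handle))
    return records


# Demonstration with two made-up metadata files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("PGS000001", "PGS000002"):
        target = Path(tmp) / f"{name}.json"
        target.write_text(json.dumps({"pgs_id": name}), encoding="utf-8")
    metadata = load_score_metadata(tmp)
```

Globbing the directory means the report no longer needs to know in advance how many metadata files were staged, which is what one file per score requires.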

Workflow input and channel management:

  • Improved the handling of optional chain files and scorefile flattening in the main PGSCCALC workflow to ensure correct input types and avoid issues with empty channels.
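
The empty-channel concern can be pictured with a plain-Python analogy. This is not Nextflow code and not the PR's implementation. Pairing each score file against an empty collection yields nothing at all, whereas wrapping the (possibly empty) chain files in a single bundle still yields one pairing per score file. The helper `pair_with_chain_files` and all file names are hypothetical.

```python
from itertools import product


def pair_with_chain_files(scorefiles, chain_files=None):
    """Pair every score file with the same (possibly empty) set of chain files."""
    # Wrapping the chain files in a one-element list keeps the product from
    # collapsing to nothing when no chain files are supplied.
    chain_bundle = [tuple(chain_files or ())]
    return list(product(scorefiles, chain_bundle))


# Hypothetical file names, for illustration only.
scorefiles = ["PGS000001.txt.gz", "PGS000002.txt.gz"]
chains = ["hg19ToHg38.over.chain.gz", "hg38ToHg19.over.chain.gz"]

with_chains = pair_with_chain_files(scorefiles, chains)
without_chains = pair_with_chain_files(scorefiles)

# Pairing directly against an empty collection produces no pairs at all.
naive = list(product(scorefiles, []))
```

In this analogy each score file becomes its own work item and carries the full chain file set with it, so the formatting step can run once per file even when liftover is not requested.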
