
Enable GDS, nsys, metrics collection for cluster usage #330

Open
kingcrimsontianyu wants to merge 27 commits into rapidsai:main from kingcrimsontianyu:new-enable-gds

@kingcrimsontianyu kingcrimsontianyu commented Apr 30, 2026

This PR adds optional capabilities to the Presto-Velox TPC-H benchmark runner on the NVL72 EPG cluster, all controlled by new flags on launch-run.sh. Default behavior is unchanged except that GDS is now on by default.

GDS I/O (--disable-gds to opt out, on by default)

Workers run with KVIKIO_COMPAT_MODE=OFF so KvikIO uses GPU Direct Storage. With --disable-gds, workers fall back to POSIX I/O via KvikIO compat mode.
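As a rough illustration of the toggle described above, the following Python sketch maps a "disable GDS" choice to the KvikIO compatibility-mode variable. The helper name `kvikio_worker_env` is hypothetical and not part of this PR; the actual logic lives in the bash scripts.

```python
def kvikio_worker_env(disable_gds: bool) -> dict[str, str]:
    """Return the KvikIO env var for a worker container.

    Hypothetical helper for illustration only; the PR implements this
    in launch-run.sh and related bash scripts.
    """
    # KVIKIO_COMPAT_MODE=OFF lets KvikIO use GPU Direct Storage;
    # ON makes it fall back to POSIX I/O (compat mode).
    return {"KVIKIO_COMPAT_MODE": "ON" if disable_gds else "OFF"}


# Default run (GDS enabled) versus a run with --disable-gds.
default_env = kvikio_worker_env(disable_gds=False)
posix_env = kvikio_worker_env(disable_gds=True)
```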

Tunable worker env vars (--worker-env-file)

Env vars to be set in each worker container can now be declared in a sourced file rather than buried in the bash scripts. Defaults live in worker.env (currently KVIKIO_TASK_SIZE=16MiB, KVIKIO_NTHREADS=16); override the path with --worker-env-file.
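To make the file format concrete, here is a minimal Python sketch that reads a worker.env-style file into a dictionary. It is an assumption-laden illustration: the real file is sourced by bash and may contain arbitrary shell, whereas this hypothetical `load_worker_env` helper only understands plain `KEY=VALUE` lines, optional `export` prefixes, and `#` comments.

```python
from pathlib import Path


def load_worker_env(path: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines from an env file (illustrative only)."""
    env: dict[str, str] = {}
    for raw in Path(path).read_text().splitlines():
        line = raw.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        # Tolerate an optional leading "export ".
        line = line.removeprefix("export ").strip()
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not an assignment; ignored in this sketch
        # Drop one layer of surrounding quotes, if present.
        env[key.strip()] = value.strip().strip('"').strip("'")
    return env
```

With the defaults named above, such a parser would yield `KVIKIO_TASK_SIZE` mapped to `16MiB` and `KVIKIO_NTHREADS` mapped to `16`.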

nsys profiling (-p, --profile)

Captures one .nsys-rep per query for a single worker (selectable via --nsys-worker-id). The worker image must include the nsys CLI. After pytest exits, the slurm job waits up to 10 minutes for nsys to finish flushing reports before tearing down the containers.
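The bounded wait can be pictured as a poll-with-deadline loop. The Python sketch below is only an illustration of that pattern: the function name `wait_for`, the poll interval, and the `done` predicate (for example, "no nsys process is still running") are assumptions, and the PR may detect completion differently.

```python
import time
from typing import Callable


def wait_for(done: Callable[[], bool],
             timeout_s: float = 600.0,
             poll_s: float = 5.0) -> bool:
    """Poll `done` until it returns True or `timeout_s` elapses.

    Returns True if the condition was met, False on timeout.
    The 600 s default mirrors the 10-minute limit mentioned above.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if done():
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        time.sleep(min(poll_s, remaining))
```

Using a monotonic clock keeps the deadline stable even if the wall clock is adjusted while the job is running.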

Metrics collection (-m, --metrics)

After each query, pytest pulls per-query stats from the coordinator's REST API and writes them to result_dir/metrics/<query>.json.
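A minimal sketch of this flow in Python is shown below. It assumes the stats come from Presto's `GET /v1/query/{queryId}` endpoint, which the PR description does not confirm; the function names `fetch_query_stats` and `write_query_metrics` are hypothetical.

```python
import json
import urllib.request
from pathlib import Path


def fetch_query_stats(coordinator_url: str, query_id: str) -> dict:
    """Fetch query info from the coordinator.

    Assumption: the /v1/query/{queryId} endpoint is the source of the
    per-query stats; the PR may use a different endpoint.
    """
    url = f"{coordinator_url.rstrip('/')}/v1/query/{query_id}"
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def write_query_metrics(result_dir: str, query_name: str, stats: dict) -> Path:
    """Write stats to <result_dir>/metrics/<query_name>.json."""
    metrics_dir = Path(result_dir) / "metrics"
    metrics_dir.mkdir(parents=True, exist_ok=True)
    out_path = metrics_dir / f"{query_name}.json"
    out_path.write_text(json.dumps(stats, indent=2))
    return out_path
```

Keeping the fetch and the write separate makes the file-writing step easy to test without a running coordinator.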

Nsys report and metrics uploading

Updates post_results.py so that nsys reports and metrics can be uploaded to the online database. Because nsys reports are large, they are uploaded via S3.

Other changes

  • New -q, --queries LIST flag forwards a comma-separated query list through to pytest, useful for narrowing profile/metrics runs.
  • README updated with full parameter documentation for launch-run.sh.
  • run_benchmark.sh gains --profile-script-path so the slurm path can supply its own profiler functions instead of the docker default.

This PR supersedes #299

copy-pr-bot Bot commented Apr 30, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@kingcrimsontianyu kingcrimsontianyu marked this pull request as ready for review May 4, 2026 13:44
@kingcrimsontianyu kingcrimsontianyu requested a review from a team as a code owner May 4, 2026 13:44
Comment thread presto/slurm/presto-nvl72/run-presto-benchmarks.sh
Comment thread presto/slurm/presto-nvl72/functions.sh Outdated
@kingcrimsontianyu kingcrimsontianyu requested a review from kjmph May 4, 2026 14:45

local gds_mounts=""
if [[ "${ENABLE_GDS}" == "1" ]]; then
export MELLANOX_VISIBLE_DEVICES=all
Author

@kingcrimsontianyu kingcrimsontianyu May 4, 2026

With the pyxis hooks fixed on the cluster, we now only need to set this env var on the first compute node, where the srun for container creation is executed. IB-related bind mounts are no longer needed, nor is mounting /run/udev. It seems, however, that we must still mount /dev/nvidia-fs*.

Comment thread presto/slurm/presto-nvl72/functions.sh Outdated
@kingcrimsontianyu kingcrimsontianyu requested a review from kjmph May 5, 2026 19:09
Comment thread benchmark_reporting_tools/post_results.py Outdated
Comment thread benchmark_reporting_tools/post_results.py
Comment thread benchmark_reporting_tools/post_results.py
Comment thread presto/slurm/presto-nvl72/worker.env
Comment thread presto/slurm/presto-nvl72/functions.sh
log_files = sorted(benchmark_dir.glob("*.log"))
log_files = sorted(effective_logs_dir.glob("*.log"))
log_files.extend(sorted(effective_logs_dir.glob("*.out")))
log_files.extend(sorted(effective_logs_dir.glob("*.err")))
Author

FYI: Updated the scripts to write the SLURM output and error files to the logs directory and upload them to the database, so that we know the nodelist used for the run. @misiugodfrey

