Enable GDS, nsys, metrics collection for cluster usage by kingcrimsontianyu · Pull Request #330 · rapidsai/velox-testing

kingcrimsontianyu · 2026-04-30T19:17:18Z

This PR adds optional capabilities to the Presto-Velox TPC-H benchmark runner on the NVL72 EPG cluster, all controlled by new flags on launch-run.sh. Default behavior is unchanged except that GDS is now on by default.

GDS I/O (`--disable-gds` to opt out, on by default)

Workers run with KVIKIO_COMPAT_MODE=OFF so KvikIO uses GPU Direct Storage. With --disable-gds, workers fall back to POSIX I/O via KvikIO compat mode.
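As a rough illustration of the toggle described above, the sketch below maps the opt-out flag to the KvikIO setting. The function name and the dict shape are hypothetical; only `KVIKIO_COMPAT_MODE` and `--disable-gds` come from this PR, and the actual wiring lives in the bash scripts.

```python
def kvikio_worker_env(disable_gds: bool) -> dict[str, str]:
    """Return the KvikIO setting for a worker container (hypothetical helper)."""
    # Compat mode OFF asks KvikIO to use GPUDirect Storage;
    # compat mode ON makes it fall back to POSIX I/O.
    return {"KVIKIO_COMPAT_MODE": "ON" if disable_gds else "OFF"}


print(kvikio_worker_env(disable_gds=False))
print(kvikio_worker_env(disable_gds=True))
```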

Tunable worker env vars (`--worker-env-file`)

Env vars to be set in each worker container can now be declared in a sourced file rather than buried in the bash scripts. Defaults live in worker.env (currently KVIKIO_TASK_SIZE=16MiB, KVIKIO_NTHREADS=16); override the path with --worker-env-file.
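For readers who want to inspect a worker env file outside of bash, here is a minimal sketch that reads simple `KEY=VALUE` lines. It assumes the file contains only plain assignments; the real scripts source the file in the shell, so anything relying on shell expansion would not be handled here. `parse_env_file` is a hypothetical name, not part of this PR.

```python
def parse_env_file(text: str) -> dict[str, str]:
    """Parse plain KEY=VALUE lines, skipping blank lines and # comments."""
    env: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Tolerate an optional leading "export ".
        if line.startswith("export "):
            line = line[len("export "):].lstrip()
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not an assignment; ignored in this sketch
        # Drop one layer of surrounding quotes, if any.
        env[key.strip()] = value.strip().strip('"').strip("'")
    return env


# The two defaults mentioned above, written as they might appear in worker.env.
print(parse_env_file("KVIKIO_TASK_SIZE=16MiB\nKVIKIO_NTHREADS=16\n"))
```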

nsys profiling (`-p, --profile`)

Captures one .nsys-rep per query for a single worker (selectable via --nsys-worker-id). The worker image must include the nsys CLI. After pytest exits, the slurm job waits up to 10 minutes for nsys to finish flushing reports before tearing down the containers.
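The bounded wait can be pictured as a poll loop with a deadline. This is only a sketch of the general pattern: the PR implements the wait in the slurm scripts, and how completion is detected there is not shown here, so the `done` predicate is a placeholder supplied by the caller.

```python
import time
from typing import Callable


def wait_until(
    done: Callable[[], bool],
    timeout_s: float = 600.0,  # 10 minutes, matching the limit described above
    poll_s: float = 5.0,       # polling interval is an arbitrary choice
) -> bool:
    """Poll `done` until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        if done():
            return True
        if time.monotonic() >= deadline:
            return False  # give up; the caller proceeds with teardown
        time.sleep(poll_s)
```

A caller would pass a predicate such as "all expected report files exist", then tear down the containers whether or not the wait succeeded.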

Metrics collection (`-m, --metrics`)

After each query, pytest pulls per-query stats from the coordinator's REST API and writes them to result_dir/metrics/<query>.json.
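A minimal sketch of this flow, under stated assumptions: the stats are taken from Presto's `/v1/query/{queryId}` endpoint, and the whole response is saved as-is. The helper names are hypothetical, and the code in this PR may select a different endpoint or a subset of fields.

```python
import json
import urllib.request
from pathlib import Path


def fetch_query_info(coordinator_url: str, query_id: str) -> dict:
    """Fetch query info from the coordinator (endpoint is an assumption)."""
    url = f"{coordinator_url.rstrip('/')}/v1/query/{query_id}"
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def save_query_metrics(result_dir: str, query_name: str, stats: dict) -> Path:
    """Write stats to <result_dir>/metrics/<query_name>.json."""
    metrics_dir = Path(result_dir) / "metrics"
    metrics_dir.mkdir(parents=True, exist_ok=True)
    out_path = metrics_dir / f"{query_name}.json"
    out_path.write_text(json.dumps(stats, indent=2))
    return out_path
```

Keeping the fetch and the write separate makes the file layout easy to check without a running coordinator.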

Nsys report and metrics uploading

Updates post_results.py so that nsys reports and metrics can be uploaded to the online database. Because nsys reports are large, they are uploaded through S3.

Other changes

New -q, --queries LIST flag forwards a comma-separated query list through to pytest, useful for narrowing profile/metrics runs.
README updated with full parameter documentation for launch-run.sh.
run_benchmark.sh gains --profile-script-path so the slurm path can supply its own profiler functions instead of the docker default.
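For the `-q, --queries LIST` flag above, splitting the comma-separated value might look like the sketch below. The helper name is hypothetical, and the exact form of the query identifiers expected by pytest is not stated in this PR, so the items are kept as plain strings.

```python
def parse_query_list(value: str) -> list[str]:
    """Split a comma-separated query list, dropping empty items and whitespace."""
    return [item.strip() for item in value.split(",") if item.strip()]


print(parse_query_list("1, 3,5"))
```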

This PR supersedes #299

copy-pr-bot · 2026-04-30T19:17:22Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

kingcrimsontianyu · 2026-05-04T14:50:37Z

+
+    local gds_mounts=""
+    if [[ "${ENABLE_GDS}" == "1" ]]; then
+        export MELLANOX_VISIBLE_DEVICES=all


With the pyxis hooks fixed on the cluster, we now only need to set this env var on the first compute node, where the srun that creates the containers is executed. The IB-related bind mounts are no longer needed, and neither is mounting /run/udev. However, it seems that we still must mount /dev/nvidia-fs*.
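To illustrate the remaining mount requirement, the sketch below builds a comma-separated `SRC:DST` list from the `nvidia-fs*` device nodes, in the form accepted by pyxis's `--container-mounts` option. The function name is hypothetical and the real logic is in the bash scripts, whose exact mount string may differ.

```python
import glob


def gds_container_mounts(dev_dir: str = "/dev") -> str:
    """Return 'SRC:DST,...' entries for every nvidia-fs* node under dev_dir."""
    devices = sorted(glob.glob(f"{dev_dir}/nvidia-fs*"))
    # Mount each device node at the same path inside the container.
    return ",".join(f"{dev}:{dev}" for dev in devices)


# Prints an empty string on machines without the nvidia-fs driver loaded.
print(gds_container_mounts())
```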

kingcrimsontianyu · 2026-05-07T19:55:29Z

-    log_files = sorted(benchmark_dir.glob("*.log"))
+    log_files = sorted(effective_logs_dir.glob("*.log"))
+    log_files.extend(sorted(effective_logs_dir.glob("*.out")))
+    log_files.extend(sorted(effective_logs_dir.glob("*.err")))


FYI: I updated the scripts to write the SLURM output and error files to the logs directory and upload them to the database, so that we know the nodelist used for the run. @misiugodfrey
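The three glob calls in the diff above can be expressed as one loop that keeps the same ordering (all `.log` files first, then `.out`, then `.err`, each group sorted). This is a standalone sketch; the function name is hypothetical and the surrounding upload code in post_results.py is not reproduced.

```python
from pathlib import Path


def collect_log_files(logs_dir: Path) -> list[Path]:
    """Gather *.log, then SLURM *.out and *.err files, each group sorted."""
    files: list[Path] = []
    for pattern in ("*.log", "*.out", "*.err"):
        files.extend(sorted(logs_dir.glob(pattern)))
    return files
```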

Add support for GDS, nsys profiling, metrics

83f5767

kingcrimsontianyu added 10 commits April 30, 2026 19:25

Add mellanox env var for GDS

381b230

Bug fixes

9dd16aa

Bug fixes

6251867

Improve nsys sync mechanism

158b49e

Small tweak

57d8a84

Add nsys worker id

4c708cb

Fix a bug

2db34b3

Update

cce516a

Add key comments

c668cfe

Improve readme and functions impl

a660de3

kingcrimsontianyu marked this pull request as ready for review May 4, 2026 13:44

kingcrimsontianyu requested a review from a team as a code owner May 4, 2026 13:44

kingcrimsontianyu mentioned this pull request May 4, 2026

Enable GDS and nsys for cluster usage #299

Closed

Merge branch 'main' into new-enable-gds

6a709e7

kjmph reviewed May 4, 2026


Comment thread presto/slurm/presto-nvl72/run-presto-benchmarks.sh

kjmph reviewed May 4, 2026


Comment thread presto/slurm/presto-nvl72/functions.sh Outdated

Add USE_NUMA condition to the nsys branch

0289e8c

kingcrimsontianyu requested a review from kjmph May 4, 2026 14:45

kingcrimsontianyu commented May 4, 2026


kjmph reviewed May 4, 2026


Comment thread presto/slurm/presto-nvl72/functions.sh Outdated

kingcrimsontianyu added 7 commits May 4, 2026 18:34

Update

a8cdacb

Updajte

e7d46c7

Cleanup

9195acf

Add log

2b0ef98

Add pip to conda create command to avoid the cli error

ad9ed6d

Merge branch 'main' into new-enable-gds

5b09169

Update readme

55e18c7

kingcrimsontianyu requested a review from kjmph May 5, 2026 19:09

kingcrimsontianyu added 4 commits May 6, 2026 16:30

Allow incomplete test results to be posted

894ce2a

Add comments

417f6f9

Support posting nsys-rep

6f68135

Support metrics upload

3028810

karthikeyann requested review from devavret and misiugodfrey May 6, 2026 22:35

Merge branch 'main' into new-enable-gds

da341d0

TomAugspurger reviewed May 7, 2026


Comment thread benchmark_reporting_tools/post_results.py Outdated

Comment thread benchmark_reporting_tools/post_results.py

Comment thread benchmark_reporting_tools/post_results.py

Comment thread presto/slurm/presto-nvl72/worker.env

Comment thread presto/slurm/presto-nvl72/functions.sh

kingcrimsontianyu added 2 commits May 7, 2026 15:12

Remove duplicate handling of failed query

5a1b3d8

Add SLUMR standard output and error as part of log to upload

ee72154

kingcrimsontianyu commented May 7, 2026


kingcrimsontianyu requested a review from TomAugspurger May 11, 2026 14:00

TomAugspurger approved these changes May 11, 2026


kingcrimsontianyu added this to libcudf May 14, 2026

kingcrimsontianyu moved this to Slip in libcudf May 14, 2026

kingcrimsontianyu removed this from libcudf May 14, 2026

kingcrimsontianyu added this to libcudf May 14, 2026

kingcrimsontianyu moved this to Burndown in libcudf May 14, 2026

karthikeyann approved these changes May 15, 2026


kingcrimsontianyu wants to merge 27 commits into rapidsai:main from kingcrimsontianyu:new-enable-gds


4 participants
