Enable GDS, nsys, metrics collection for cluster usage #330
Open
kingcrimsontianyu wants to merge 27 commits into
kjmph
reviewed
May 4, 2026
    local gds_mounts=""
    if [[ "${ENABLE_GDS}" == "1" ]]; then
        export MELLANOX_VISIBLE_DEVICES=all
Author
With the pyxis hooks fixed on the cluster, we now only need to set this env var on the first compute node, where the srun that creates the container is executed. IB-related bind mounts are no longer needed, and neither is mounting /run/udev. It seems that we still must mount /dev/nvidia-fs*, though.
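To illustrate the remaining mount requirement, here is a minimal Python sketch of the logic (the PR itself does this in bash). The function name `gds_container_mounts` and the `dev_dir` parameter are hypothetical, and the comma-separated `SRC:DST` output assumes the result is passed to pyxis's `--container-mounts` option.

```python
from pathlib import Path


def gds_container_mounts(enable_gds: bool, dev_dir: str = "/dev") -> str:
    """Return a comma-separated SRC:DST mount string for nvidia-fs devices.

    Hypothetical helper: when GDS is disabled, no extra mounts are needed,
    so an empty string is returned.
    """
    if not enable_gds:
        return ""
    # Only the /dev/nvidia-fs* device nodes are bind-mounted; IB-related
    # mounts and /run/udev are intentionally left out.
    devices = sorted(Path(dev_dir).glob("nvidia-fs*"))
    return ",".join(f"{dev}:{dev}" for dev in devices)
```

On a node without any nvidia-fs devices the function returns an empty string, so a caller can skip the mount option entirely in that case.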
kjmph
reviewed
May 4, 2026
    log_files = sorted(benchmark_dir.glob("*.log"))
    log_files = sorted(effective_logs_dir.glob("*.log"))
    log_files.extend(sorted(effective_logs_dir.glob("*.out")))
    log_files.extend(sorted(effective_logs_dir.glob("*.err")))
Author
FYI: I updated the scripts to write the SLURM output and error files to the logs directory and upload them to the database, so that we know the nodelist used for the run. @misiugodfrey
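The collection step in the diff above can be written as a small standalone function. This is only a sketch: the name `collect_log_files` and the `LOG_PATTERNS` constant are hypothetical, and the ordering (all `.log` files first, then `.out`, then `.err`, each group sorted) mirrors the three glob calls shown in the diff.

```python
from pathlib import Path

# Suffix groups are gathered in this order, matching the diff:
# benchmark logs first, then SLURM stdout and stderr files.
LOG_PATTERNS = ("*.log", "*.out", "*.err")


def collect_log_files(logs_dir) -> list[Path]:
    """Return log files from logs_dir, sorted within each pattern group."""
    logs_dir = Path(logs_dir)
    files: list[Path] = []
    for pattern in LOG_PATTERNS:
        files.extend(sorted(logs_dir.glob(pattern)))
    return files
```

Files with other suffixes are ignored, so unrelated artifacts placed in the logs directory are not uploaded by this path.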
TomAugspurger
approved these changes
May 11, 2026
karthikeyann
approved these changes
May 15, 2026
This PR adds optional capabilities to the Presto-Velox TPC-H benchmark runner on the NVL72 EPG cluster, all controlled by new flags on `launch-run.sh`. Default behavior is unchanged except that GDS is now on by default.

GDS I/O (`--disable-gds` to opt out, on by default)
Workers run with `KVIKIO_COMPAT_MODE=OFF` so KvikIO uses GPU Direct Storage. With `--disable-gds`, workers fall back to POSIX I/O via KvikIO compat mode.

Tunable worker env vars (`--worker-env-file`)
Env vars to be set in each worker container can now be declared in a sourced file rather than buried in the bash scripts. Defaults live in `worker.env` (currently `KVIKIO_TASK_SIZE=16MiB`, `KVIKIO_NTHREADS=16`); override the path with `--worker-env-file`.

nsys profiling (`-p, --profile`)
Captures one `.nsys-rep` per query for a single worker (selectable via `--nsys-worker-id`). The worker image must include the `nsys` CLI. After pytest exits, the slurm job waits up to 10 minutes for nsys to finish flushing reports before tearing down the containers.

Metrics collection (`-m, --metrics`)
After each query, pytest pulls per-query stats from the coordinator's REST API and writes them to `result_dir/metrics/<query>.json`.

Nsys report and metrics uploading
Updates `post_results.py` so that nsys reports and metrics can be uploaded to the online database. In particular, S3 is used to upload the large nsys reports.

Other changes
- `-q, --queries LIST` flag forwards a comma-separated query list through to pytest, useful for narrowing profile/metrics runs.
- `launch-run.sh`.
- `run_benchmark.sh` gains `--profile-script-path` so the slurm path can supply its own profiler functions instead of the docker default.

This PR supersedes #299
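As a rough illustration of how the worker env file and the GDS flag combine, here is a minimal Python sketch. It is not the PR's implementation: the PR sources the file from bash, whereas this sketch only parses plain `KEY=VALUE` lines, and the function names `parse_env_file` and `build_worker_env` are hypothetical. Mapping a disabled GDS to `KVIKIO_COMPAT_MODE=ON` is an assumption based on the description's "POSIX I/O via KvikIO compat mode".

```python
def parse_env_file(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines; blank lines and # comments are skipped."""
    env: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Tolerate an optional leading "export ", as in a sourced shell file.
        if line.startswith("export "):
            line = line[len("export "):].lstrip()
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not an assignment; ignore
        env[key.strip()] = value.strip().strip('"').strip("'")
    return env


def build_worker_env(env_file_text: str, enable_gds: bool = True) -> dict[str, str]:
    """Combine env-file settings with the GDS on/off choice (assumed mapping)."""
    env = parse_env_file(env_file_text)
    # GDS on  -> compat mode OFF (KvikIO uses GPU Direct Storage).
    # GDS off -> compat mode ON  (KvikIO falls back to POSIX I/O).
    env["KVIKIO_COMPAT_MODE"] = "OFF" if enable_gds else "ON"
    return env
```

For example, `build_worker_env("KVIKIO_TASK_SIZE=16MiB\nKVIKIO_NTHREADS=16\n")` yields the two defaults from the description plus `KVIKIO_COMPAT_MODE=OFF`.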