
Enable GDS and nsys for cluster usage #299

Closed
kingcrimsontianyu wants to merge 25 commits into rapidsai:misiug/SpaceMicePOC from kingcrimsontianyu:enable-gds


@kingcrimsontianyu
Contributor

No description provided.

@copy-pr-bot

copy-pr-bot Bot commented Apr 2, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@kingcrimsontianyu kingcrimsontianyu changed the base branch from main to misiug/cluster April 2, 2026 21:39
@kingcrimsontianyu kingcrimsontianyu changed the base branch from misiug/cluster to misiug/SpaceMicePOC April 2, 2026 21:46
@kingcrimsontianyu kingcrimsontianyu changed the title Enable GDS for cluster usage Enable GDS and nsys for cluster usage Apr 13, 2026
@kingcrimsontianyu
Contributor Author

Superseded by #330

rapids-bot Bot pushed a commit that referenced this pull request May 19, 2026
This PR adds optional capabilities to the Presto-Velox TPC-H benchmark runner on the NVL72 EPG cluster, all controlled by new flags on `launch-run.sh`. Default behavior is unchanged except that GDS is now on by default.

## GDS I/O (`--disable-gds` to opt out, on by default)

Workers run with `KVIKIO_COMPAT_MODE=OFF` so KvikIO uses GPUDirect Storage (GDS). With `--disable-gds`, workers fall back to POSIX I/O via KvikIO compat mode.
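
As a rough illustration of the toggle described in this section, the mapping from the flag to the worker environment could look like the sketch below. The function name `gds_worker_env` is hypothetical, and the value `ON` for the opt-out case is an assumption based on KvikIO's documented compat-mode setting; the PR text only states the `OFF` value explicitly, and the real logic lives in the bash scripts.

```python
def gds_worker_env(disable_gds: bool = False) -> dict[str, str]:
    """Return the KvikIO env var implied by the GDS setting (hypothetical helper)."""
    # GDS is on by default, so compat mode stays OFF unless --disable-gds is given.
    # "ON" for the opt-out case is an assumption, not quoted from the PR.
    return {"KVIKIO_COMPAT_MODE": "ON" if disable_gds else "OFF"}


if __name__ == "__main__":
    print(gds_worker_env())                  # default: GDS enabled
    print(gds_worker_env(disable_gds=True))  # opt-out: POSIX fallback
```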

## Tunable worker env vars (`--worker-env-file`)

Env vars to be set in each worker container can now be declared in a sourced file rather than buried in the bash scripts. Defaults live in `worker.env` (currently `KVIKIO_TASK_SIZE=16MiB`, `KVIKIO_NTHREADS=16`); override the path with `--worker-env-file`.
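
For readers who want to inspect such a file outside of bash, here is a minimal sketch of reading simple `KEY=VALUE` lines. The exact layout of `worker.env` is not shown in this PR text, so the sample content below is an assumption built from the two defaults named above, and `parse_env_text` is a hypothetical helper. Since the real file is sourced by the shell, anything beyond plain assignments (command substitution, variable expansion) is out of scope here.

```python
def parse_env_text(text: str) -> dict[str, str]:
    """Parse plain KEY=VALUE lines, skipping blanks and '#' comments."""
    env: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Tolerate an optional leading "export", as a sourced file might use.
        if line.startswith("export "):
            line = line[len("export "):].lstrip()
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not an assignment; ignore
        env[key.strip()] = value.strip().strip('"').strip("'")
    return env


# Assumed sample content, based on the defaults named in the PR description.
SAMPLE_WORKER_ENV = """\
# hypothetical layout of worker.env
KVIKIO_TASK_SIZE=16MiB
KVIKIO_NTHREADS=16
"""

if __name__ == "__main__":
    print(parse_env_text(SAMPLE_WORKER_ENV))
```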

## nsys profiling (`-p, --profile`)

Captures one `.nsys-rep` per query for a single worker (selectable via `--nsys-worker-id`). The worker image must include the `nsys` CLI. After pytest exits, the slurm job waits up to 10 minutes for nsys to finish flushing reports before tearing down the containers.
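
The bounded wait can be pictured as a poll loop with a deadline. The sketch below is not the PR's implementation (that runs inside the slurm job script); `wait_until` and its parameters are hypothetical, and how completion is actually detected (for example, no remaining `nsys` process) is left to the caller-supplied `done` predicate.

```python
import time
from typing import Callable


def wait_until(
    done: Callable[[], bool],
    timeout_s: float = 600.0,   # 10 minutes, matching the limit described above
    poll_s: float = 5.0,        # assumed polling interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `done` until it returns True or the timeout expires.

    Returns True if `done` succeeded, False if the deadline passed first.
    """
    deadline = clock() + timeout_s
    while True:
        if done():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)
```

Passing `clock` and `sleep` as parameters keeps the loop testable without real delays; a caller would tear down the containers regardless of the return value, but could log a warning on `False`.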

## Metrics collection (`-m, --metrics`)

After each query, pytest pulls per-query stats from the coordinator's REST API and writes them to `result_dir/metrics/<query>.json`.
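
The output layout can be sketched as follows. This is an illustration only: `write_query_metrics` is a hypothetical helper, the stats payload is a made-up placeholder, and the coordinator request itself is omitted because the PR text does not name the endpoint used.

```python
import json
import tempfile
from pathlib import Path


def write_query_metrics(result_dir: str, query: str, stats: dict) -> Path:
    """Write one query's stats to <result_dir>/metrics/<query>.json."""
    metrics_dir = Path(result_dir) / "metrics"
    metrics_dir.mkdir(parents=True, exist_ok=True)
    out_path = metrics_dir / f"{query}.json"
    out_path.write_text(json.dumps(stats, indent=2))
    return out_path


if __name__ == "__main__":
    # Placeholder stats; the real payload comes from the coordinator's REST API.
    with tempfile.TemporaryDirectory() as tmp:
        path = write_query_metrics(tmp, "q1", {"example_field": 123})
        print(path.name, json.loads(path.read_text()))
```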

## Nsys report and metrics uploading

Updates `post_results.py` so that nsys reports and metrics can be uploaded to the online database. In particular, the large nsys reports are uploaded via S3.

## Other changes

- New `-q, --queries LIST` flag forwards a comma-separated query list through to pytest, useful for narrowing profile/metrics runs.
- README updated with full parameter documentation for `launch-run.sh`.
- `run_benchmark.sh` gains `--profile-script-path` so the slurm path can supply its own profiler functions instead of the docker default.
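
To illustrate the comma-separated form accepted by `-q, --queries`, a minimal parsing sketch follows. `parse_query_list` is a hypothetical name, and the identifier format shown (bare numbers) is an assumption; the PR text does not specify how queries are named or whether whitespace is tolerated.

```python
def parse_query_list(value: str) -> list[str]:
    """Split a comma-separated query list, dropping blanks and stray spaces."""
    return [item.strip() for item in value.split(",") if item.strip()]


if __name__ == "__main__":
    print(parse_query_list("1,3,6"))
```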

This PR supersedes #299

Authors:
  - Tianyu Liu (https://github.com/kingcrimsontianyu)
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - Tom Augspurger (https://github.com/TomAugspurger)
  - Karthikeyan (https://github.com/karthikeyann)

URL: #330