Add multi-GPU system metrics support #481

Open
Saba9 wants to merge 6 commits into main from saba/multi-gpu
Conversation

@Saba9
Collaborator

@Saba9 Saba9 commented Apr 9, 2026

Summary

  • Backend: GpuMonitor now queries all physical GPUs from any process by ignoring CUDA_VISIBLE_DEVICES, so rank 0 can collect metrics for every GPU on the machine during distributed training (uses pynvml's nvmlDeviceGetCount() directly)
  • Frontend: System Metrics page renders per-GPU sub-accordions (collapsed by default) when multiple GPUs are detected, showing utilization, allocated memory, power, and temperature per GPU
  • Single-GPU: UI is unchanged — no sub-accordions, same summary metrics as before
  • Manual API: trackio.log_gpu() still respects CUDA_VISIBLE_DEVICES
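
As a rough illustration of the backend and manual-API bullets above, the sketch below separates the two modes. NVML enumerates every physical device regardless of CUDA_VISIBLE_DEVICES, which is why calling nvmlDeviceGetCount() directly sees all GPUs. The helper names physical_gpu_count and gpu_indices_to_sample are hypothetical and are not the functions in trackio/gpu.py. The parser also assumes CUDA_VISIBLE_DEVICES holds plain integer indices; the variable can hold UUIDs too, which this sketch treats as invalid.

```python
import os


def physical_gpu_count():
    """Count every physical GPU through NVML, or return 0 if NVML is unusable."""
    try:
        import pynvml  # optional dependency, imported lazily

        pynvml.nvmlInit()
        try:
            # NVML does not consult CUDA_VISIBLE_DEVICES.
            return pynvml.nvmlDeviceGetCount()
        finally:
            pynvml.nvmlShutdown()
    except Exception:
        # Missing package, missing driver, or no NVIDIA hardware.
        return 0


def gpu_indices_to_sample(physical_count, all_gpus=False, env=None):
    """Return the physical GPU indices a collector would sample.

    all_gpus=True ignores CUDA_VISIBLE_DEVICES (background monitor case).
    all_gpus=False honours it (manual logging case).
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if all_gpus or raw is None:
        return list(range(physical_count))

    indices = []
    for token in raw.split(","):
        token = token.strip()
        # Stop at the first token that is not a valid index; devices listed
        # after an invalid entry are treated as not visible.
        if not token.isdigit() or int(token) >= physical_count:
            break
        indices.append(int(token))
    return indices
```

With four physical GPUs and CUDA_VISIBLE_DEVICES=1, the monitor mode would sample indices 0 through 3, while the manual mode would sample only index 1.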

Changes

  • trackio/gpu.py — Add get_all_gpu_count(), add all_gpus param to collect_gpu_metrics(), update GpuMonitor to use them
  • trackio/frontend/src/pages/SystemMetrics.svelte — Add subgroup rendering for multi-GPU, strip gpu/ prefix from summary chart titles
  • tests/unit/test_gpu.py — Unit tests for get_all_gpu_count() and collect_gpu_metrics(all_gpus=True/False)
  • tests/e2e-local/test_basic_logging.py — Update existing mock, add multi-GPU e2e test
  • examples/test_multi_gpu_mock.py — Mock script to test 4-GPU UI locally
  • examples/test_single_gpu_mock.py — Mock script to test single-GPU UI locally
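
The subgroup rendering in SystemMetrics.svelte is frontend code, but the grouping idea can be sketched in Python. This is only an illustration: the key layout ("gpu/<index>/<metric>" for per-device values, "gpu/<metric>" for summaries) and the function name group_gpu_metrics are assumptions, not taken from the PR. The "GPU 0", "GPU 1" labels and the stripping of the gpu/ prefix follow the description.

```python
import re
from collections import defaultdict

# Assumed key layout, for illustration only.
_PER_GPU_KEY = re.compile(r"^gpu/(\d+)/(.+)$")


def group_gpu_metrics(metrics):
    """Split a flat metrics dict into (summary, per_gpu_groups).

    per_gpu_groups is empty unless more than one GPU index appears, so a
    single-GPU run keeps a flat summary with no sub-groups.
    """
    summary = {}
    per_gpu = defaultdict(dict)
    for key, value in metrics.items():
        match = _PER_GPU_KEY.match(key)
        if match:
            per_gpu[int(match.group(1))][match.group(2)] = value
        elif key.startswith("gpu/"):
            # Strip the gpu/ prefix from summary titles.
            summary[key[len("gpu/"):]] = value
        else:
            summary[key] = value

    if len(per_gpu) <= 1:
        # Single GPU: fold its metrics into the summary, no sub-groups.
        for device_metrics in per_gpu.values():
            summary.update(device_metrics)
        return summary, {}

    groups = {f"GPU {index}": per_gpu[index] for index in sorted(per_gpu)}
    return summary, groups
```

Returning an empty group mapping for the single-GPU case keeps the caller's rendering branch trivial: sub-accordions are drawn only when the mapping is non-empty.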

Test plan

  • pytest tests/unit/test_gpu.py — 6 tests pass
  • pytest tests/e2e-local/test_basic_logging.py — 7 tests pass (including new multi-GPU test)
  • pytest — full suite passes (1 pre-existing flaky failure in test_import_export)
  • Frontend builds cleanly
  • Manual test with examples/test_multi_gpu_mock.py → verify System Metrics shows per-GPU accordions
  • Manual test with examples/test_single_gpu_mock.py → verify single-GPU UI unchanged
  • Test on actual multi-GPU machine

🤖 Generated with Claude Code

GpuMonitor now queries all physical GPUs from any process by ignoring
CUDA_VISIBLE_DEVICES, so rank 0 can collect metrics for every GPU on
the machine during distributed training.

- Add get_all_gpu_count() that bypasses CUDA_VISIBLE_DEVICES
- Add all_gpus parameter to collect_gpu_metrics()
- Update GpuMonitor to use get_all_gpu_count() and all_gpus=True
- Add per-GPU sub-accordions to SystemMetrics frontend (multi-GPU only)
- Keep single-GPU UI unchanged (no sub-accordions)
- Manual log_gpu() API still respects CUDA_VISIBLE_DEVICES

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@gradio-pr-bot
Contributor

gradio-pr-bot commented Apr 9, 2026

🪼 branch checks and previews

Name Status URL
🦄 Changes detected! Details

@gradio-pr-bot
Contributor

🦄 change detected

This Pull Request includes changes to the following packages.

Package Version
trackio minor

  • Add multi-GPU system metrics support

‼️ Changeset not approved. Ensure the version bump is appropriate for all packages before approving.

  • Maintainers can approve the changeset by checking this checkbox.

Something isn't right?

  • Maintainers can change the version label to modify the version bump.
  • If the bot has failed to detect any changes, or if this pull request needs to update multiple packages to different versions or requires a more comprehensive changelog entry, maintainers can update the changelog file directly.

@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Apr 9, 2026

🪼 branch checks and previews

Name Status URL
Spaces ready! Spaces preview

Install Trackio from this PR (includes built frontend)

pip install "https://huggingface.co/buckets/trackio/trackio-wheels/resolve/4554819fbc7d0a8182a43a2d1882e47d8fdd8c04/trackio-0.21.1-py3-none-any.whl"

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@Saba9 Saba9 marked this pull request as ready for review April 9, 2026 15:57
@abidlabs
Member

abidlabs commented Apr 9, 2026

Thanks @Saba9, were you able to test on a multi-GPU machine (potentially with HF Jobs)? It would be great to see how it looks.

@Saba9
Collaborator Author

Saba9 commented Apr 9, 2026

@abidlabs Not yet. I ran tests where I replaced pynvml with a MagicMock to simulate multiple GPUs. I'll try running it with HF Jobs and post screenshots soon.

@Saba9
Collaborator Author

Saba9 commented Apr 9, 2026

@abidlabs Tested it with HF Jobs on a dual-GPU machine. Seems to be working!

Default view: [screenshot]
Per-GPU metrics expanded: [screenshot]

Saba9 and others added 4 commits April 9, 2026 09:52
- Add unit labels to chart titles (%, GiB, W, °C) in SystemMetrics
- Per-GPU sub-accordions default to closed
- Per-GPU accordion labels use "GPU 0", "GPU 1" etc.
- Strip gpu/ prefix from summary chart titles
- Add HF Jobs stress test script for real multi-GPU validation
- Format fixes from ruff

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Replace manual save/restore in GPU unit tests with pytest fixture
  that also restores _energy_baseline (was leaking between tests)
- Move keyMetricSuffixes to script section in SystemMetrics.svelte
- Remove test_multi_gpu_hf_job.py (temporary monkeypatch workaround,
  not a useful example once the feature ships)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
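
The save/restore pattern described in the commit above can be sketched with the standard library alone. This is a minimal illustration, not the fixture from tests/unit/test_gpu.py: the gpu object here is a stand-in namespace rather than the real trackio.gpu module, and preserved_gpu_state is a hypothetical name. Only _energy_baseline is taken from the PR. In pytest, the same body would typically live in a yield-style fixture.

```python
import contextlib
import copy
import types

# Stand-in for the real module (assumption); only _energy_baseline is
# named in the PR.
gpu = types.SimpleNamespace(_energy_baseline={})


@contextlib.contextmanager
def preserved_gpu_state(module, names=("_energy_baseline",)):
    """Snapshot module-level state, then restore it even if the body raises."""
    saved = {name: copy.deepcopy(getattr(module, name)) for name in names}
    try:
        yield module
    finally:
        for name, value in saved.items():
            setattr(module, name, value)


# Usage: mutations made inside the block do not leak out of it.
with preserved_gpu_state(gpu):
    gpu._energy_baseline[0] = 123.0
```

Snapshotting with a deep copy matters here because the baseline is a mutable dict; restoring the same object would keep any in-place mutations made by the test.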
- Remove test_collect_gpu_metrics_default_respects_cuda_visible (tests
  pre-existing behavior unchanged by this PR)
- Remove test_multi_gpu_mock.py and test_single_gpu_mock.py (developer
  testing aids, not user-facing examples; automated tests cover this)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
assert "timestamp" in log


def test_auto_log_gpu_multi(temp_dir):
Member

Not sure if this test is adding much since everything is mocked.

assert gpu._energy_baseline == {}


def _make_mock_pynvml(num_gpus=4):
Member

@abidlabs abidlabs Apr 9, 2026

Again, I don't think we really need to create this whole mock fixture to test whether the GPUs are being counted correctly. I think it'd be better to remove it or replace it with a simpler test.

@abidlabs
Member

abidlabs commented Apr 9, 2026

Amazing, @Saba9! I was exploring the UI, and I think it might be useful to plot the system metrics from multiple GPUs on the same graph, since users may want to compare metrics across GPUs easily. What do you think? Here's how wandb seems to do it, for reference:

[image]

I know this might get a bit crowded, but what we could do is, for the System Metrics page, have a list of devices/GPUs in the left sidebar, just like we have for runs, allowing people to trim the number of devices if it becomes too unwieldy.

cc @qgallouedec @kashif for visibility

@abidlabs abidlabs requested a review from qgallouedec April 9, 2026 19:16
@kashif
Contributor

kashif commented Apr 9, 2026

thanks! looks good
