Conversation
`GpuMonitor` now queries all physical GPUs from any process by ignoring `CUDA_VISIBLE_DEVICES`, so rank 0 can collect metrics for every GPU on the machine during distributed training.

- Add `get_all_gpu_count()` that bypasses `CUDA_VISIBLE_DEVICES`
- Add `all_gpus` parameter to `collect_gpu_metrics()`
- Update `GpuMonitor` to use `get_all_gpu_count()` and `all_gpus=True`
- Add per-GPU sub-accordions to SystemMetrics frontend (multi-GPU only)
- Keep single-GPU UI unchanged (no sub-accordions)
- Manual `log_gpu()` API still respects `CUDA_VISIBLE_DEVICES`

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
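The split between the two counting behaviors can be sketched as follows. This is a minimal illustration, not trackio's actual implementation: `total_count` stands in for the value the real code gets from pynvml's device enumeration, and `visible_gpu_count` is a hypothetical helper showing the `CUDA_VISIBLE_DEVICES` filtering that `get_all_gpu_count()` bypasses.

```python
import os

def visible_gpu_count(total_count: int) -> int:
    """Count GPUs the way CUDA sees them: filtered through CUDA_VISIBLE_DEVICES."""
    raw = os.environ.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return total_count
    tokens = [tok.strip() for tok in raw.split(",")]
    # Keep only valid physical indices; everything else is ignored by CUDA too.
    return len([t for t in tokens if t.isdigit() and int(t) < total_count])

def get_all_gpu_count(total_count: int) -> int:
    """NVML-style count: every physical GPU, independent of CUDA_VISIBLE_DEVICES."""
    return total_count
```

For example, with `CUDA_VISIBLE_DEVICES="0,2"` on a 4-GPU machine, `visible_gpu_count(4)` returns 2 while `get_all_gpu_count(4)` still returns 4, which is what lets rank 0 monitor the whole machine.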
🪼 branch checks and previews
🦄 change detected: This Pull Request includes changes to the following packages.
Install Trackio from this PR (includes built frontend):

```
pip install "https://huggingface.co/buckets/trackio/trackio-wheels/resolve/4554819fbc7d0a8182a43a2d1882e47d8fdd8c04/trackio-0.21.1-py3-none-any.whl"
```
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs remain available until 30 days after the last update.
Thanks @Saba9! Were you able to test on a multi-GPU machine (potentially with HF Jobs)? Would be great to see how it looks.
@abidlabs Not yet. I ran tests where I replaced |
@abidlabs Tested it with HF Jobs on a dual-GPU machine. Seems to be working!
- Add unit labels to chart titles (%, GiB, W, °C) in SystemMetrics
- Per-GPU sub-accordions default to closed
- Per-GPU accordion labels use "GPU 0", "GPU 1", etc.
- Strip `gpu/` prefix from summary chart titles
- Add HF Jobs stress-test script for real multi-GPU validation
- Formatting fixes from ruff

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
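The title formatting described above (strip the `gpu/` prefix, append a unit label) can be sketched like this. The metric key names and the suffix map here are assumptions for illustration; the real logic lives in `keyMetricSuffixes` inside SystemMetrics.svelte and may use different keys.

```python
# Hypothetical metric-name → unit map mirroring the %, GiB, W, °C labels.
KEY_METRIC_SUFFIXES = {
    "utilization": "%",
    "memory_used": "GiB",
    "power": "W",
    "temperature": "°C",
}

def chart_title(key: str) -> str:
    """Strip the gpu/ prefix and append the unit label, per the PR's formatting."""
    name = key.removeprefix("gpu/")
    suffix = KEY_METRIC_SUFFIXES.get(name.split("/")[-1])
    return f"{name} ({suffix})" if suffix else name
```

So a logged key like `gpu/utilization` would render as a chart titled `utilization (%)`, and a hypothetical per-device key like `gpu/0/temperature` as `0/temperature (°C)`.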
- Replace manual save/restore in GPU unit tests with a pytest fixture that also restores `_energy_baseline` (it was leaking between tests)
- Move `keyMetricSuffixes` to the script section in SystemMetrics.svelte
- Remove test_multi_gpu_hf_job.py (temporary monkeypatch workaround, not a useful example once the feature ships)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove test_collect_gpu_metrics_default_respects_cuda_visible (tests pre-existing behavior unchanged by this PR)
- Remove test_multi_gpu_mock.py and test_single_gpu_mock.py (developer testing aids, not user-facing examples; automated tests cover this)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
On `def test_auto_log_gpu_multi(temp_dir):`

Not sure if this test is adding much since everything is mocked.
On `def _make_mock_pynvml(num_gpus=4):`

Again, I don't think we really need to create this whole mock fixture to test whether the GPUs are being counted correctly. It would be better to remove it or replace it with a simpler test.
Amazing, @Saba9! I was exploring the UI, and I think it might be useful to actually plot the system metrics from multiple GPUs on the same graph, as users may want to compare metrics across the different GPUs easily. What do you think? Here's how wandb seems to do it, for reference:
I know this might get a bit crowded, but what we could do is, for the System Metrics page, have a list of devices/GPUs in the left sidebar, just like we have runs, allowing people to trim the number of devices if it becomes too unwieldy. cc @qgallouedec @kashif for visibility
thanks! looks good |



Summary
- `GpuMonitor` now queries all physical GPUs from any process by ignoring `CUDA_VISIBLE_DEVICES`, so rank 0 can collect metrics for every GPU on the machine during distributed training (uses pynvml's `nvmlDeviceGetCount()` directly)
- `trackio.log_gpu()` still respects `CUDA_VISIBLE_DEVICES`
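In a distributed run, the "rank 0 collects for every GPU" behavior implies some gating on the process rank. A minimal sketch of that gate, assuming the torch.distributed convention of a `RANK` environment variable (the exact mechanism trackio users employ may differ):

```python
import os

def should_run_gpu_monitor() -> bool:
    """Hypothetical gate: only rank 0 starts the machine-wide GpuMonitor,
    since it now sees every physical GPU regardless of CUDA_VISIBLE_DEVICES."""
    return int(os.environ.get("RANK", "0")) == 0
```

Single-process runs (no `RANK` set) default to monitoring, while non-zero ranks in a multi-process launch skip it to avoid logging duplicate machine-wide metrics.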
Changes

- `trackio/gpu.py`: Add `get_all_gpu_count()`, add `all_gpus` param to `collect_gpu_metrics()`, update `GpuMonitor` to use them
- `trackio/frontend/src/pages/SystemMetrics.svelte`: Add subgroup rendering for multi-GPU, strip `gpu/` prefix from summary chart titles
- `tests/unit/test_gpu.py`: Unit tests for `get_all_gpu_count()` and `collect_gpu_metrics(all_gpus=True/False)`
- `tests/e2e-local/test_basic_logging.py`: Update existing mock, add multi-GPU e2e test
- `examples/test_multi_gpu_mock.py`: Mock script to test 4-GPU UI locally
- `examples/test_single_gpu_mock.py`: Mock script to test single-GPU UI locally

Test plan
- `pytest tests/unit/test_gpu.py`: 6 tests pass
- `pytest tests/e2e-local/test_basic_logging.py`: 7 tests pass (including new multi-GPU test)
- `pytest`: full suite passes (1 pre-existing flaky failure in `test_import_export`)
- Run `examples/test_multi_gpu_mock.py` → verify System Metrics shows per-GPU accordions
- Run `examples/test_single_gpu_mock.py` → verify single-GPU UI unchanged

🤖 Generated with Claude Code