A GPU throttle diagnostic tool for NVIDIA DGX Spark (GB10) and other NVIDIA systems. Detects clock throttling, identifies the cause, and tracks GPU health over time.
A fork of hoesing/spark-gpu-throttle-check. The original tool detects USB PD throttling; this fork adds direct NVML telemetry, throttle-cause identification, PCIe link monitoring, GPU utilization gating, stability analysis, and baseline drift detection.
DGX Spark systems can end up in a degraded power state where the GPU clock is capped far below normal — typically 500–850 MHz instead of ~2400 MHz. This causes significant performance degradation that's difficult to diagnose since the GPU otherwise appears healthy. Bad USB PD negotiation is a common cause, but thermal throttling, hardware slowdown, and power cap violations can produce similar symptoms.
The original tool answers: "Is the GPU throttled?"
This enhanced version answers: "Is the GPU throttled, why, is the test reliable, and is it getting worse?"
For hardware specifications, see the DGX Spark Hardware Overview.
The DGX Spark uses USB Power Delivery (PD) negotiation with its power brick. When PD negotiation fails or enters a degraded state, the GPU remains in P0 but its clock speed is capped — typically 500–850 MHz instead of the expected ~2400 MHz. The GPU reports no errors and appears healthy to nvidia-smi, but performance drops by 60–80%.
Common triggers:
- Power brick disconnected/reconnected without a full discharge cycle
- Firmware updates that change PD negotiation behavior
- Faulty or marginal USB-C power cables
- Multiple power events (outages, surges) without a clean reset
The fix is usually a full power cycle: disconnect the power brick from both the wall and the Spark, wait 60 seconds, then reconnect. This forces a fresh PD negotiation.
This tool detects the condition and tells you whether the cause is power delivery, thermal throttling, or hardware-level slowdown — so you know whether to power cycle, check cooling, or contact NVIDIA.
Replaces nvidia-smi subprocess calls with direct NVML reads via ctypes. Faster sampling, richer data, no parsing overhead. Falls back to nvidia-smi if NVML library isn't found.
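The direct-NVML path can be sketched with `ctypes` as below. This is a minimal illustration, not the tool's actual implementation: the NVML function names match the C API, but error handling and cleanup (`nvmlShutdown`) are omitted, and `load_nvml` returning `None` is the hypothetical signal to fall back to parsing nvidia-smi.

```python
import ctypes

NVML_CLOCK_GRAPHICS = 0  # NVML clock type enum value for the graphics clock

def load_nvml():
    """Try to load the NVML shared library; return None to signal that
    the caller should fall back to parsing nvidia-smi output."""
    for name in ("libnvidia-ml.so.1", "libnvidia-ml.so"):
        try:
            lib = ctypes.CDLL(name)
        except OSError:
            continue
        if lib.nvmlInit_v2() == 0:  # 0 == NVML_SUCCESS
            return lib
    return None

def graphics_clock_mhz(lib, index=0):
    """Read the current graphics clock for GPU `index` via NVML."""
    handle = ctypes.c_void_p()
    lib.nvmlDeviceGetHandleByIndex_v2(index, ctypes.byref(handle))
    mhz = ctypes.c_uint()
    lib.nvmlDeviceGetClockInfo(handle, NVML_CLOCK_GRAPHICS, ctypes.byref(mhz))
    return mhz.value
```

A single `CDLL` load plus direct struct reads avoids spawning a subprocess and parsing text on every sample, which is what makes 100ms sampling practical.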
Decodes the NVML throttle reason bitmask into human-readable causes:
| Bitmask | Reason | What it means |
|---|---|---|
| 0x01 | GPU_IDLE | Normal idle state |
| 0x04 | SW_POWER_CAP | Software power limit hit |
| 0x08 | HW_SLOWDOWN | Hardware-enforced slowdown (power or thermal) |
| 0x20 | SW_THERMAL_SLOWDOWN | Driver thermal limit hit |
| 0x40 | HW_THERMAL_SLOWDOWN | Hardware thermal limit hit |
| 0x80 | HW_POWER_BRAKE_SLOWDOWN | Hardware power brake engaged |
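The decoding step amounts to testing each bit of the mask against the table above. A minimal sketch (`decode_throttle` is an illustrative name, not the tool's actual function):

```python
# NVML clocksThrottleReasons bits, as listed in the table above
THROTTLE_REASONS = {
    0x01: "GPU_IDLE",
    0x04: "SW_POWER_CAP",
    0x08: "HW_SLOWDOWN",
    0x20: "SW_THERMAL_SLOWDOWN",
    0x40: "HW_THERMAL_SLOWDOWN",
    0x80: "HW_POWER_BRAKE_SLOWDOWN",
}

def decode_throttle(mask: int) -> list[str]:
    """Expand a throttle-reason bitmask into the names of its set bits."""
    if mask == 0:
        return ["none"]
    return [name for bit, name in sorted(THROTTLE_REASONS.items()) if mask & bit]
```

Multiple bits can be set at once, e.g. `0x48` decodes to both `HW_SLOWDOWN` and `HW_THERMAL_SLOWDOWN`.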
The FAIL banner tells you why clocks are low — power issue vs thermal issue vs hardware slowdown — instead of always suggesting USB PD.
Reads GPU utilization directly from NVML via nvmlDeviceGetUtilizationRates. The Util% column shows real-time GPU saturation during the test. A healthy GPU under the cuBLAS SGEMM load should show 95–100% utilization.
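`nvmlDeviceGetUtilizationRates` fills a small C struct; the matching `ctypes` declaration is a sketch along these lines (the struct layout matches NVML's `nvmlUtilization_t`, but `gpu_utilization_pct` is an illustrative helper, not the tool's API):

```python
import ctypes

class nvmlUtilization_t(ctypes.Structure):
    # Mirrors NVML's nvmlUtilization_t: percent of time over the past
    # sample period during which the GPU / memory was busy.
    _fields_ = [("gpu", ctypes.c_uint), ("memory", ctypes.c_uint)]

def gpu_utilization_pct(lib, handle):
    """Query instantaneous GPU utilization via nvmlDeviceGetUtilizationRates."""
    util = nvmlUtilization_t()
    lib.nvmlDeviceGetUtilizationRates(handle, ctypes.byref(util))
    return util.gpu
```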
If average GPU utilization during the test is below 80%, the verdict is INCONCLUSIVE instead of FAIL. This prevents false throttle diagnosis when the load generator failed to saturate the GPU or another workload is competing for resources. A low-clock reading on a half-idle GPU is noise, not a diagnosis.
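The gating rule can be expressed in a few lines; assuming hypothetical names, the logic is:

```python
def verdict(avg_clock_mhz, avg_util_pct, threshold_mhz=1400, min_util_pct=80):
    """Gate the throttle verdict on GPU utilization: a low clock reading
    only counts as FAIL if the load actually saturated the GPU."""
    if avg_util_pct < min_util_pct:
        return "INCONCLUSIVE"  # load generator failed or another workload interfered
    return "PASS" if avg_clock_mhz >= threshold_mhz else "FAIL"
```

So a 600 MHz average at 42% utilization is INCONCLUSIVE, while the same clock at 99% utilization is a genuine FAIL.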
Measures how long the GPU takes to reach the threshold clock speed from idle. A healthy GPU ramps in under 1 second. Slow ramp-up can indicate power delivery issues.
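Measuring ramp-up reduces to finding the first timestamped sample at or above the threshold. A sketch, assuming samples arrive as `(seconds, clock_mhz)` pairs (an illustrative shape, not the tool's internal format):

```python
def ramp_up_time(samples, threshold_mhz=1400):
    """Seconds from the first sample until the clock first reaches the
    threshold; None if it never does."""
    t0 = samples[0][0]
    for t, clock in samples:
        if clock >= threshold_mhz:
            return t - t0
    return None
```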
Reports coefficient of variation (CV%) across all samples:
- < 1% CV — rock solid
- 1–5% CV — minor variance
- > 5% CV — oscillating clocks, investigate
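The CV% and its banding can be computed with the standard library; a sketch with illustrative function names:

```python
from statistics import mean, pstdev

def clock_cv_pct(samples):
    """Coefficient of variation of the clock samples, as a percentage:
    100 * (population stddev / mean)."""
    avg = mean(samples)
    return 100.0 * pstdev(samples) / avg if avg else 0.0

def stability_label(cv):
    """Map a CV% to the bands listed above."""
    if cv < 1.0:
        return "rock solid"
    if cv <= 5.0:
        return "minor variance"
    return "oscillating clocks, investigate"
```

Twenty identical 1860 MHz samples give 0.00% CV, matching the "rock solid" line in the example output.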
Linear regression on temperature vs time during the test. Reports slope (°C/s), direction (rising/stable/cooling), and start→end temperatures. Rising temperature with dropping clocks = thermal throttle developing in real time.
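The slope is an ordinary least-squares fit of temperature against elapsed time. A self-contained sketch (illustrative name, no external dependencies):

```python
def thermal_slope(times_s, temps_c):
    """Least-squares slope of temperature vs time, in °C per second."""
    n = len(times_s)
    mt = sum(times_s) / n
    mT = sum(temps_c) / n
    num = sum((t - mt) * (T - mT) for t, T in zip(times_s, temps_c))
    den = sum((t - mt) ** 2 for t in times_s)
    return num / den if den else 0.0
```

A slope of roughly +0.36 °C/s over a 10-second test corresponds to the 56→60 °C rise shown in the sample output below.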
Captures PCIe link speed and width before and after the SGEMM load via lspci. Warns if the link degraded during the test — a signal of PCIe-level instability that can precede Xid 79 (GPU fell off bus) failures.
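The same before/after comparison can also be done from sysfs, which exposes `current_link_speed` and `current_link_width` per device; this is an alternative sketch rather than the tool's lspci-based approach, and the function names are illustrative:

```python
from pathlib import Path

def pcie_link_state(device_dir):
    """Read current PCIe link speed and width from a device's sysfs dir,
    e.g. /sys/bus/pci/devices/0000:01:00.0 (an alternative to lspci -vv)."""
    d = Path(device_dir)
    speed = (d / "current_link_speed").read_text().strip()
    width = (d / "current_link_width").read_text().strip()
    return speed, width

def link_degraded(before, after):
    """True if speed or width changed between the pre- and post-load reads."""
    return before != after
```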
Duplicate samples are collapsed into ranges instead of printing 20 identical rows:
```
   #    Clock   Max  PSt   Pwr  T°C  Util%  Throttle
────── ────── ───── ─── ───── ──── ───── ────────
  1-20   1860  1911  P2 132.8   58   100  none (20x)
```
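Collapsing consecutive identical rows is a run-length grouping; a sketch with `itertools.groupby` (illustrative function name):

```python
from itertools import groupby

def collapse_rows(rows):
    """Collapse consecutive identical sample rows into
    (first_index, last_index, row, count) ranges, 1-indexed."""
    out, i = [], 1
    for row, group in groupby(rows):
        n = len(list(group))
        out.append((i, i + n - 1, row, n))
        i += n
    return out
```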
Save a healthy GPU snapshot and compare against it later:
```bash
# When GPU is healthy
python3 spark-gpu-throttle-check.py --save-baseline

# Later, when something seems wrong
python3 spark-gpu-throttle-check.py --compare
```

Shows deltas for peak clock, average clock, power, temperature, and stability score, highlighting regressions in color.
Full machine-readable report with every sample, PCIe state, utilization, and all analysis:
```bash
python3 spark-gpu-throttle-check.py --report
```

Reports are saved to ~/.spark-throttle/reports/ with timestamps. Attach these to bug reports for complete diagnostic evidence.
Test a specific GPU or all GPUs in the system:
```bash
# Test GPU 1
python3 spark-gpu-throttle-check.py --gpu 1

# Test every GPU
python3 spark-gpu-throttle-check.py --all-gpus
```

High-frequency 100ms sampling to capture the clock ramp-up curve and catch transient throttle events:
```bash
python3 spark-gpu-throttle-check.py --timeline -n 50
```

```bash
# Standard check (20 samples, 500ms intervals)
python3 spark-gpu-throttle-check.py

# Quick check with more samples
python3 spark-gpu-throttle-check.py -n 50

# Timeline capture
python3 spark-gpu-throttle-check.py --timeline -n 40

# Save baseline when healthy
python3 spark-gpu-throttle-check.py --save-baseline

# Compare against baseline
python3 spark-gpu-throttle-check.py --compare

# Full diagnostic with report
python3 spark-gpu-throttle-check.py --report --save-baseline

# Test all GPUs
python3 spark-gpu-throttle-check.py --all-gpus

# Quiet mode for scripting
python3 spark-gpu-throttle-check.py -q
```

| Flag | Default | Description |
|---|---|---|
| -n, --samples | 20 | Number of samples to collect |
| -t, --threshold | 1400 | Clock threshold (MHz) below which throttling is suspected |
| -w, --warmup | 2.0 | Warm-up time (seconds) before sampling begins |
| -g, --gpu | 0 | GPU index to test |
| -q, --quiet | — | Print only the PASS/FAIL result line |
| --timeline | — | 100ms time-series mode |
| --all-gpus | — | Test every GPU in the system |
| --save-baseline | — | Save current results as baseline |
| --compare | — | Compare against saved baseline |
| --report | — | Export full JSON report |
Exit codes:

- 0 — PASS, GPU clocks are healthy
- 1 — FAIL, WARNING, or INCONCLUSIVE
| Verdict | Meaning |
|---|---|
| PASS | Clocks healthy under load |
| FAIL | Clocks below threshold — cause identified in banner |
| WARNING | Intermittent low clocks or problem throttle reasons detected |
| INCONCLUSIVE | GPU utilization too low for a reliable verdict (< 80%) |
```
============================================================
Spark GPU Throttle Check — GPU 0 (enhanced)
============================================================
GPU:    NVIDIA GeForce GTX 1080
Driver: 535.274.02

Idle:
  Clock:    582 / 1911 MHz
  P-state:  P8
  Power:    13.8 W
  Temp:     48 °C
  Throttle: idle

Collecting 20 samples (500ms), threshold 1400 MHz

   #     Clock   Max  PSt   Pwr  T°C  Util%  Throttle
──────── ────── ───── ─── ───── ──── ───── ────────────
  1-20    1860  1911  P2 132.8   58   100  none (20x)

────────────────────────────────────────────────────────────
RESULTS
────────────────────────────────────────────────────────────
Samples:          20
Peak clock:       1860 MHz
Average clock:    1860 MHz
Avg power draw:   132.8 W
Avg temperature:  58 °C
GPU utilization:  100%
Below threshold:  0% of samples < 1400 MHz
Ramp-up time:     0.00s
Clock stability:  0.00% CV (rock solid)
Thermal trend:    rising (+0.36 °C/s) 56→60°C

┌────────────────────────────────────────────────────────┐
│ PASS — GPU clocks look healthy under load.             │
│ Peak: 1860 MHz, Avg: 1860 MHz                          │
└────────────────────────────────────────────────────────┘
```
```
██████████████████████████████████████████████████████████
█  FAIL — GPU IS THROTTLED                               █
█  Clock never exceeded 1400 MHz under load.             █
█  Cause: POWER — bad USB PD or PSU issue.               █
█  Try: disconnect power, wait 60s, reconnect.           █
██████████████████████████████████████████████████████████
```

```
██████████████████████████████████████████████████████████
█  FAIL — GPU IS THROTTLED                               █
█  Clock never exceeded 1400 MHz under load.             █
█  Cause: THERMAL — GPU overheating.                     █
█  Check: fan speed, airflow, thermal paste.             █
██████████████████████████████████████████████████████████
```

```
┌────────────────────────────────────────────────────────┐
│ INCONCLUSIVE — GPU load was insufficient.              │
│ Avg utilization: 42% (need ≥80% for reliable verdict). │
│ The load generator may have failed to saturate the GPU.│
│ Rerun the test or check for competing workloads.       │
└────────────────────────────────────────────────────────┘
```
```
────────────────────────────────────────────────────────────
BASELINE COMPARISON
────────────────────────────────────────────────────────────
Baseline from: 2026-03-21T11:10:03.975510+00:00
Peak clock (MHz)   now: 1847   base: 1847   +0
Avg clock (MHz)    now: 1845   base: 1841   +4
Avg power (W)      now: 132.9  base: 131.2  +1.7
Avg temp (°C)      now: 65     base: 66     -1
Stability (%CV)    now: 0.00   base: 0.00   +0.00
```
If the tool reports a FAIL with a power-related cause, try disconnecting the power brick from both the wall outlet and the Spark. Wait a minute, then reconnect. Run the check again to verify:
```bash
# Capture throttled state
python3 spark-gpu-throttle-check.py --save-baseline

# Power cycle the brick (unplug from wall, wait 60s, reconnect)

# Verify recovery
python3 spark-gpu-throttle-check.py --compare
```

- Python 3.10+
- NVIDIA GPU driver with NVML, cuBLAS, and CUDA runtime libraries
- nvidia-smi in PATH (fallback only)
- lspci for PCIe monitoring (optional)
No pip packages required.
- Original tool: hoesing/spark-gpu-throttle-check
- Enhanced by: parallelArchitect — human–AI collaborative engineering
- gpu-pcie-path-validator — sustained PCIe transport validation with replay counter monitoring
- unified-memory-analyzer — CUDA Unified Memory fault and migration diagnostics
- nvidia-gpu-val — gated GPU validation pipeline (PCIe → memory → compute → drift)
- nvml-unified-shim — fixes NVML memory reporting on UMA platforms (MemAvailable + SwapFree instead of MemTotal)
MIT License