Problem
The same pretraining script requires more GPU memory when training resumes from a checkpoint than when pretraining from scratch.
Attached are a simple training recipe (repro.py) to test on a single H100 and the logs from both runs.
- Run training on a single H100, wait for a checkpoint to be saved, then interrupt the run.
- Run the same script again. Training fails with torch.OutOfMemoryError: CUDA out of memory.
01_from_scratch.log
02_restore_from_checkpoint.log
repro.py
Minimal repro
1. Run python repro.py |& tee 01_from_scratch.log and cancel it after the first checkpoint is saved at step 10.
2. Run python repro.py |& tee 02_restore_from_checkpoint.log. The checkpoint from step 10 is loaded and the job should fail with OOM on the first training step.
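For orientation, below is a minimal sketch of the kind of script repro.py might contain, based on the NeMo 2.0 recipe API. The model choice, checkpoint directory, step counts, and checkpoint interval are assumptions for illustration only; the attached repro.py is authoritative.

```python
# Hypothetical sketch, not the attached repro.py: a single-GPU pretraining run
# that saves a checkpoint around step 10 and resumes on a second invocation.
import nemo_run as run
from nemo.collections import llm


def configure_recipe():
    # Model, paths, and step counts below are placeholders.
    recipe = llm.llama3_8b.pretrain_recipe(
        dir="./repro_checkpoints",
        name="repro",
        num_nodes=1,
        num_gpus_per_node=1,
    )
    recipe.trainer.max_steps = 20
    # Assumption: checkpoints are written at validation time, so a small
    # val_check_interval produces the first checkpoint at step 10.
    recipe.trainer.val_check_interval = 10
    return recipe


if __name__ == "__main__":
    recipe = configure_recipe()
    # Run locally on the single H100; the default recipe resume config is
    # expected to pick up the saved checkpoint when the script is re-run.
    executor = run.LocalExecutor(ntasks_per_node=1, launcher="torchrun")
    run.run(recipe, executor=executor)
```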
Expected behavior
Training restored from a checkpoint requires the same amount of GPU memory as training from scratch.
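One way to make the comparison concrete is to log the peak CUDA memory of both runs after the first training step, for example with a small helper like the sketch below (hypothetical, not part of the attached script):

```python
# Hypothetical helper, not part of repro.py: report peak GPU memory so the
# from-scratch and restored-from-checkpoint runs can be compared directly.
import torch


def report_peak_memory(tag: str) -> None:
    # High-water marks of allocated and reserved CUDA memory since the last
    # torch.cuda.reset_peak_memory_stats() call.
    peak_alloc_gib = torch.cuda.max_memory_allocated() / 1024**3
    peak_reserved_gib = torch.cuda.max_memory_reserved() / 1024**3
    print(f"[{tag}] peak allocated: {peak_alloc_gib:.2f} GiB, "
          f"peak reserved: {peak_reserved_gib:.2f} GiB")
```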
Affected area
area:ckpt
Regression?
Not sure
Environment
nvcr.io/nvidia/nemo:26.04.00 container, single H100
Logs