[bug] Pretraining requires more memory after restoring from checkpoint (26.04.00) #3576

@OlegSudakov

Description

Problem

The same pretraining script requires more GPU memory when training resumes from a checkpoint than when pretraining from scratch.

Attached below: a simple training recipe to reproduce on an H100, along with logs from both runs.

  1. Run training on a single H100, wait for a checkpoint to be saved, interrupt the run.
  2. Run the same script again. The training should fail with torch.OutOfMemoryError: CUDA out of memory.

01_from_scratch.log
02_restore_from_checkpoint.log
repro.py

Minimal repro

1. python repro.py |& tee 01_from_scratch.log, cancel the run after the first checkpoint is saved at step 10
2. python repro.py |& tee 02_restore_from_checkpoint.log. The step-10 checkpoint will be loaded, and the job should fail with OOM on the first training step
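
To make the memory gap easier to see when diffing the two logs, per-step peak memory can be printed from a small callback. This is a sketch, not part of the attached repro.py: it assumes the script runs through PyTorch Lightning (as NeMo recipes do) and that a callback can be appended to the trainer; PeakMemoryLogger is a hypothetical name.

    # Hypothetical helper, not part of the attached repro.py: logs peak CUDA
    # memory per training step so the two runs can be compared line by line.
    # Assumes the recipe runs through PyTorch Lightning, as NeMo recipes do.
    import torch
    from lightning.pytorch import Callback

    class PeakMemoryLogger(Callback):
        """Print allocated/reserved CUDA memory after every training batch."""

        def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
            alloc = torch.cuda.max_memory_allocated() / 2**30  # peak live tensors, GiB
            reserved = torch.cuda.memory_reserved() / 2**30    # cached by the allocator, GiB
            print(f"step {trainer.global_step}: peak alloc {alloc:.2f} GiB, "
                  f"reserved {reserved:.2f} GiB")
            torch.cuda.reset_peak_memory_stats()               # make peaks per-step

If repro.py exposes its trainer or recipe object, appending an instance of this callback before fit should be enough; comparing the first training step of both runs then shows exactly how much extra memory the restored run allocates.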

Expected behavior

Training restored from a checkpoint should require the same amount of GPU memory as pretraining from scratch.

Affected area

area:ckpt

Regression?

Not sure

Environment

nvcr.io/nvidia/nemo:26.04.00 container, single H100
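
For completeness, one way to launch that container (the exact docker invocation used for these runs is not recorded in this issue, so the flags below are an assumption, not the original command):

    # Mount the directory containing repro.py and run the from-scratch step
    docker run --gpus all --rm -it \
        -v "$PWD":/workspace/repro \
        nvcr.io/nvidia/nemo:26.04.00 \
        bash -c "cd /workspace/repro && python repro.py |& tee 01_from_scratch.log"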

Logs

See the attached 01_from_scratch.log and 02_restore_from_checkpoint.log above.
