Problem
The same pretraining script requires more GPU memory when training resumes from a checkpoint than when pretraining from scratch.
Attached are a simple training recipe (repro.py) to test on a single H100 and the logs from both runs.
- Run training on a single H100, wait for a checkpoint to be saved, then interrupt the run.
- Run the same script again. Training fails with torch.OutOfMemoryError: CUDA out of memory.
01_from_scratch.log
02_restore_from_checkpoint.log
repro.py
Minimal repro
1. Run python repro.py |& tee 01_from_scratch.log and cancel it after the first checkpoint is saved at step 10.
2. Run python repro.py |& tee 02_restore_from_checkpoint.log. The checkpoint from step 10 is loaded and the job should fail with OOM on the first training step.
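For orientation, below is a minimal sketch of the kind of script repro.py might contain, based on the NeMo 2.0 recipe API. The model choice, checkpoint directory, step counts, and checkpoint interval are assumptions for illustration only; the attached repro.py is authoritative.

```python
# Hypothetical sketch, not the attached repro.py: a single-GPU pretraining run
# that saves a checkpoint around step 10 and resumes on a second invocation.
import nemo_run as run
from nemo.collections import llm


def configure_recipe():
    # Model, paths, and step counts below are placeholders.
    recipe = llm.llama3_8b.pretrain_recipe(
        dir="./repro_checkpoints",
        name="repro",
        num_nodes=1,
        num_gpus_per_node=1,
    )
    recipe.trainer.max_steps = 20
    # Assumption: checkpoints are written at validation time, so a small
    # val_check_interval produces the first checkpoint at step 10.
    recipe.trainer.val_check_interval = 10
    return recipe


if __name__ == "__main__":
    recipe = configure_recipe()
    # Run locally on the single H100; the default recipe resume config is
    # expected to pick up the saved checkpoint when the script is re-run.
    executor = run.LocalExecutor(ntasks_per_node=1, launcher="torchrun")
    run.run(recipe, executor=executor)
```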
Expected behavior
Training restored from a checkpoint requires the same amount of GPU memory as training from scratch.
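One way to make the comparison concrete is to log the peak CUDA memory of both runs after the first training step, for example with a small helper like the sketch below (hypothetical, not part of the attached script):

```python
# Hypothetical helper, not part of repro.py: report peak GPU memory so the
# from-scratch and restored-from-checkpoint runs can be compared directly.
import torch


def report_peak_memory(tag: str) -> None:
    # High-water marks of allocated and reserved CUDA memory since the last
    # torch.cuda.reset_peak_memory_stats() call.
    peak_alloc_gib = torch.cuda.max_memory_allocated() / 1024**3
    peak_reserved_gib = torch.cuda.max_memory_reserved() / 1024**3
    print(f"[{tag}] peak allocated: {peak_alloc_gib:.2f} GiB, "
          f"peak reserved: {peak_reserved_gib:.2f} GiB")
```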
Affected area
area:ckpt
Regression?
Not sure
Environment
nvcr.io/nvidia/nemo:26.04.00 container, single H100
Logs