When the dataloader loads from a checkpoint, it expects a path to the checkpoints directory, from which it picks the most recent checkpoint folder and loads the relevant data.
This is a problem when continuing a completed run, because the final step of a completed run saves a single-file checkpoint to the checkpoints directory. On resume, the most recent item in the checkpoints directory is then a file rather than a folder, which breaks the dataloader's lookup.
Model checkpointing handles this by accepting either the checkpoints path, in which case it pulls the latest item, or a path to a particular checkpoint directory. The dataloader does not currently support the latter. We can either add this capability or change the single-file save at the end of the run so that it goes outside the checkpoints directory, which should probably contain only checkpoint folders anyhow.
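For the first option, a rough sketch of what the dataloader's path resolution could look like is below. This is not the existing code: `resolve_checkpoint_dir` is a hypothetical helper, "most recent" is assumed to mean newest modification time, and a particular checkpoint folder is assumed to contain only files (no subfolders).

```python
import os


def resolve_checkpoint_dir(path):
    """Return the checkpoint folder the dataloader should load from.

    Hypothetical helper, not part of the current codebase. Assumptions:
    - "most recent" means newest modification time;
    - a particular checkpoint folder contains only files, no subfolders.
    """
    if not os.path.isdir(path):
        raise NotADirectoryError(f"expected a directory, got: {path}")

    # Keep only subfolders, so a single-file checkpoint saved at the end
    # of a completed run is ignored.
    subdirs = [
        entry.path for entry in os.scandir(path) if entry.is_dir()
    ]

    if not subdirs:
        # No subfolders: treat the path as a particular checkpoint folder.
        return path

    # Otherwise the path is the checkpoints directory: take the newest folder.
    return max(subdirs, key=os.path.getmtime)
```

With this, passing either the checkpoints directory or a particular checkpoint folder yields a folder to load from, and a trailing single-file checkpoint no longer gets picked as the latest item.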