Overlap data generation with GPU training via async pipeline #11
Open
j-vaught wants to merge 1 commit into forestagostinelli:main from
Conversation
Add a background `threading.Thread` that collects next-round data into a second `DataBuffer` while the main thread trains on the current round. The first iteration is synchronous (cold start); subsequent iterations prefetch concurrently. The `sync_main` path is unchanged.

Benchmarked on RTX 6000 Ada (Cube3, 5000 iterations, 3 runs):
- Baseline mean: 10m 05s
- Async mean: 6m 17s (1.61x speedup)

Training convergence unaffected. All code paths regression tested.
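The double-buffer prefetch pattern described above can be sketched roughly as follows; `collect_data` and `train_on` are hypothetical stand-ins for the project's actual data-generation and training routines, not its real API:

```python
import threading

def collect_data(round_idx):
    """Placeholder for worker-driven data generation (CPU-bound)."""
    return [round_idx] * 3

def train_on(buffer):
    """Placeholder for a GPU training pass over one DataBuffer."""
    return len(buffer)

def run(num_rounds):
    trained = []
    current = collect_data(0)  # cold start: first round is synchronous
    for r in range(1, num_rounds + 1):
        nxt = {}

        def prefetch(idx=r):
            nxt["buf"] = collect_data(idx)  # fills the second buffer

        t = threading.Thread(target=prefetch)
        t.start()                        # next round collected in background
        trained.append(train_on(current))  # main thread trains concurrently
        t.join()                         # wait for prefetch, then swap buffers
        current = nxt["buf"]
    trained.append(train_on(current))    # train on the final prefetched round
    return trained

print(run(3))  # -> [3, 3, 3, 3]
```

One training pass per round plus the final buffer; the swap-and-repeat structure is what hides data-generation latency behind the GPU work.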
Author
Seems to have a new library imported; need to add that to the correct places.
Author
Nevermind. Part of the Python standard library.
Hey Dr. Agostinelli,
Just proposing a small change to speed up the library a tad.
Summary
- Adds a background `threading.Thread` and a second `DataBuffer`
- No changes to the `sync_main` path or the training algorithm

More info
In the default (non-`sync_main`) training path, `update_step()` blocks on `_get_update_data()` until all worker data arrives, then trains sequentially. The GPU is idle during data generation (~7s) and the CPU is idle during training (~2.5s). This change overlaps the two phases by starting the next round's data collection in a background thread before calling `_train()`.

Benchmark
Tested on RTX 6000 Ada Generation (48 GB), Cube3 domain, `resnet_fc.5000H_4B_bn`, 5000 iterations, 3 runs per configuration. Per-update time drops from ~11.4s to ~6.8s in steady state. Training convergence (loss, solve rate) is unaffected; differences are within run-to-run variance.
How it works
- After `_end_update(N)` completes, call `start_update(N+1)` and launch a prefetch thread
- The prefetch thread calls `updater.get_update_data()` (which blocks on `from_q.get()`, releasing the GIL)
- The main thread runs `_train()` (PyTorch CUDA ops, also GIL-free)
- On the next `update_step()` call, wait for the prefetch thread, swap buffers, and repeat

The `end_update(N)`/`start_update(N+1)` ordering constraint is preserved, and both run on the main thread before training begins.
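A rough sketch of that ordering constraint, assuming hypothetical `end_update`/`start_update` markers and a plain `queue.Queue` standing in for the worker queue `from_q`:

```python
import queue
import threading

from_q = queue.Queue()  # stand-in for the worker result queue
log = []                # records main-thread ordering for inspection

def end_update(n):
    log.append(f"end_update({n})")

def start_update(n):
    log.append(f"start_update({n})")

def update_step(n):
    end_update(n)        # both markers run on the main thread...
    start_update(n + 1)  # ...before training begins
    result = {}

    def prefetch():
        # queue.Queue.get blocks while releasing the GIL, so the main
        # thread's training (CUDA ops) can proceed concurrently
        result["data"] = from_q.get()

    t = threading.Thread(target=prefetch)
    t.start()
    log.append(f"train({n})")    # stands in for _train()
    from_q.put(f"data_{n + 1}")  # workers deliver next-round data
    t.join()                     # wait for prefetch before swapping buffers
    return result["data"]

print(update_step(0))  # -> data_1
print(log)             # -> ['end_update(0)', 'start_update(1)', 'train(0)']
```

Because `end_update`, `start_update`, and the training call all execute on the main thread, the log order is deterministic even though the prefetch runs concurrently.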