fix: prevent duplicate data when using multiple DataLoader workers #169

Open: AmSach wants to merge 1 commit into NTMC-Community:master from AmSach:fix/dataloader-duplicate-data-with-num-workers

Conversation

@AmSach AmSach commented May 11, 2026

Fixed the bug described in issue #150.

What was wrong

When num_workers > 0 is passed to DataLoader, each worker process iterates over all batches instead of a distinct subset. The model therefore sees every sample num_workers times per epoch (e.g., with 30 workers, each sample is processed 30 times instead of once).
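
To make the duplication concrete, here is a minimal pure-Python simulation of the behavior described above. It does not use PyTorch or the project's classes; `worker_batches_unpartitioned` and `count_visits` are hypothetical names introduced only for illustration.

```python
from collections import Counter


def worker_batches_unpartitioned(batch_indices, worker_id, num_workers):
    # Models the reported bug: the worker ignores its own worker_id
    # and walks the full list of batches.
    return list(batch_indices)


def count_visits(num_batches, num_workers):
    # Count how many times each batch index is produced across all workers.
    batch_indices = range(num_batches)
    visits = Counter()
    for worker_id in range(num_workers):
        visits.update(
            worker_batches_unpartitioned(batch_indices, worker_id, num_workers)
        )
    return visits


visits = count_visits(num_batches=4, num_workers=3)
print(dict(visits))  # {0: 3, 1: 3, 2: 3, 3: 3}
```

Every batch index is produced once per worker, which matches the "30 workers, 30 times per epoch" example.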

How I fixed it

  1. Dataset.__init__: Added num_workers parameter and _worker_id tracking
  2. DataLoader.__init__: Creates a worker_init_fn that sets the worker_id on the Dataset for each subprocess
  3. Dataset.__iter__: Now partitions batches across workers so each worker processes only 1/num_workers of the batches
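
The three steps above can be sketched roughly as follows. This is not the code from the diff: `PartitionedBatches` and `simulate_workers` are hypothetical stand-ins, the strided `index % num_workers` split is one assumed way to partition, and worker subprocesses are simulated with copies so the example runs without PyTorch.

```python
import copy


class PartitionedBatches:
    """Hypothetical stand-in for the Dataset; not the PR's implementation."""

    def __init__(self, batches, num_workers=0):
        self._batches = list(batches)
        self._num_workers = num_workers
        self._worker_id = 0  # overwritten per worker by a worker_init_fn

    def __iter__(self):
        # With zero or one worker there is nothing to partition.
        if self._num_workers <= 1:
            yield from self._batches
            return
        # Assumed strided split: worker k takes batches k, k + n, k + 2n, ...
        for index, batch in enumerate(self._batches):
            if index % self._num_workers == self._worker_id:
                yield batch


def simulate_workers(dataset, num_workers):
    # Each real worker subprocess holds its own copy of the dataset;
    # setting _worker_id on the copy mimics what a worker_init_fn would do.
    shards = []
    for worker_id in range(num_workers):
        worker_copy = copy.copy(dataset)
        worker_copy._worker_id = worker_id
        shards.append(list(worker_copy))
    return shards


dataset = PartitionedBatches(range(7), num_workers=3)
shards = simulate_workers(dataset, num_workers=3)
print(shards)  # [[0, 3, 6], [1, 4], [2, 5]]
```

Each batch lands in exactly one shard, so the union of the shards covers the data once. In PyTorch itself, `torch.utils.data.get_worker_info()` exposes the worker's `id` and `num_workers` from inside a worker process, which is an alternative to tracking the ID on the dataset by hand.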

Testing

  • Syntax checks pass on both modified files
  • Import tests pass

Closes #150

When using num_workers > 0 in DataLoader, each worker iterates over
ALL batches instead of a distinct subset, causing the model to train
on duplicate data multiple times per epoch.

Fix:
- Dataset now accepts num_workers parameter to partition batches
- DataLoader passes num_workers to Dataset and creates worker_init_fn
  that sets worker_id on the Dataset for each subprocess
- Dataset.__iter__ now partitions batches across workers so each worker
  processes a distinct subset

This ensures each worker handles 1/num_workers of the batches,
eliminating duplicate training data.

Development

Successfully merging this pull request may close these issues.

Duplicate data returned when use num_workers param (multi-processing) in Dataloader
