Can't set num_workers > 0 when training nemotron-speech-streaming-en-0.6b with multi-GPU #15355

@LouisChirol

Description

Hi,

I am trying to fine-tune nemotron-speech-streaming-en-0.6b on GCP's Vertex AI, using multi-GPU machines.
My script runs fine on single GPUs (A100 40GB, H100 80GB) regardless of the dataloader's num_workers, but fails with num_workers > 0 on multi-GPU machines (8x A100, 4x H100).

The error message is:

terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: initialization error

I've tried playing with the various parameters available (pin_memory, persistent_workers), as mentioned in some other issues (this one, for instance), but I still cannot solve it.
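
As far as I understand, this error typically means a forked DataLoader worker inherited an already-initialized CUDA context from the parent process. A simple diagnostic I can add right before the dataloaders are built (my own check, not NeMo code):

import torch

# If this prints True before the DataLoader forks its workers, the children
# inherit an already-initialized CUDA context, which can surface as
# "CUDA error: initialization error" inside the workers.
print("CUDA initialized before dataloader setup:", torch.cuda.is_initialized())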

The container I'm using to run the job starts with:

FROM nvcr.io/nvidia/pytorch:23.10-py3

# Set working directory
WORKDIR /app

# Set environment variables
ENV PYTHONUNBUFFERED=1 \
    PYTHONDONTWRITEBYTECODE=1 \
    # Disable tokenizers parallelism warning
    TOKENIZERS_PARALLELISM=false \
    # Hydra settings for better error messages
    HYDRA_FULL_ERROR=1 \
    # NCCL settings for multi-GPU training
    NCCL_DEBUG=WARN \
    # PyTorch CUDA allocation configuration to increase shared memory usage
    PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512

# Install PyTorch with CUDA 12.1 (proven compatible with Vertex AI A100 drivers)
# Then install NeMo toolkit with ASR support
RUN pip install --no-cache-dir \
    torch==2.4.0+cu121 \
    torchvision==0.19.0+cu121 \
    torchaudio==2.4.0+cu121 \
    --index-url https://download.pytorch.org/whl/cu121 && \
    pip install --no-cache-dir \
    "nemo_toolkit[asr]==2.6.1" \
    "google-cloud-storage>=2.10.0" \
    "google-cloud-aiplatform>=1.50.0"

I am using the ddp strategy for the trainer.
The full config is as follows:

        {
            "name": run_name,
            "init_from_pretrained_model": "nvidia/nemotron-speech-streaming-en-0.6b",
            "model": {
                "sample_rate": 16000,
                "train_ds": {
                    "manifest_filepath": local_train_manifest,
                    "sample_rate": 16000,
                    "batch_size": batch_size,
                    "shuffle": True,
                    "num_workers": 4,
                    "pin_memory": False,
                    "persistent_workers": True,
                    "prefetch_factor": 2,
                    "max_duration": 30.0,
                    "min_duration": 0.5,
                    "is_tarred": False,
                    "bucketing_strategy": "synced_randomized",
                    "drop_last": True,
                    # Handle stereo audio by averaging channels to mono
                    "channel_selector": "average",
                },
                "validation_ds": {
                    "manifest_filepath": local_valid_manifest,
                    "sample_rate": 16000,
                    "batch_size": batch_size,
                    "shuffle": False,
                    "num_workers": 4,
                    "pin_memory": False,
                    "persistent_workers": True,
                    "max_duration": 30.0,
                    "min_duration": 0.5,
                    # Handle stereo audio by averaging channels to mono
                    "channel_selector": "average",
                },
                "tokenizer": {
                    "update_tokenizer": False,
                    "dir": None,
                    "type": "bpe",
                },
                "spec_augment": {
                    "_target_": "nemo.collections.asr.modules.SpectrogramAugmentation",
                    "freq_masks": 2,
                    "time_masks": 10,
                    "freq_width": 27,
                    "time_width": 0.05,
                },
                "optim": {
                    "name": "adamw",
                    "lr": learning_rate,
                    "betas": [0.9, 0.98],
                    "weight_decay": 1e-4,
                    "sched": {
                        "name": "CosineAnnealing",
                        "warmup_steps": 500,
                        "warmup_ratio": None,
                        "min_lr": 1e-6,
                    },
                },
            },
            "trainer": {
                "devices": num_gpus,
                "num_nodes": 1,
                "max_epochs": max_epochs,
                "max_steps": -1,
                "val_check_interval": 0.2,
                "accelerator": "gpu" if torch.cuda.is_available() else "cpu",
                "strategy": "ddp" if num_gpus > 1 else "auto",
                "accumulate_grad_batches": accumulate_grad_batches,
                "gradient_clip_val": 1.0,
                "precision": "bf16-mixed" if torch.cuda.is_available() else 32,
                "log_every_n_steps": 100,
                "enable_progress_bar": True,
                "num_sanity_val_steps": 2,
                "sync_batchnorm": True,
                "enable_checkpointing": False,
                "logger": False,
                # Limit batches per epoch for quick monitoring tests
                "limit_train_batches": limit_train_batches if limit_train_batches else 1.0,
                "limit_val_batches": limit_val_batches if limit_val_batches else 1.0,
            },
            "exp_manager": {
                "exp_dir": str(local_output_dir),
                "name": run_name,
                # Disable NeMo's TensorBoard logger - it fails on hparams serialization
                # We use a custom TensorBoard callback instead (see below)
                "create_tensorboard_logger": False,
                "create_checkpoint_callback": True,
                "checkpoint_callback_params": {
                    "monitor": "val_wer",
                    "mode": "min",
                    "save_top_k": 3,
                    "always_save_nemo": True,
                    "filename": "{epoch:02d}-{step:06d}-{val_wer:.4f}",
                    "save_last": True,
                },
                "resume_if_exists": True,
                "resume_ignore_no_checkpoint": True,
                "seconds_to_sleep": 120,
            },
        }
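
For context, the config dict above is consumed roughly like this (a simplified sketch of my training entrypoint, not the exact script; the model class and setup_* calls follow the usual NeMo fine-tuning pattern):

from omegaconf import OmegaConf
import lightning.pytorch as pl

from nemo.collections.asr.models import ASRModel
from nemo.utils.exp_manager import exp_manager

cfg = OmegaConf.create(config_dict)  # config_dict = the dict shown above

# Build the Lightning trainer (devices=num_gpus, strategy="ddp", ...) and
# attach NeMo's experiment manager for checkpointing.
trainer = pl.Trainer(**cfg.trainer)
exp_manager(trainer, cfg.get("exp_manager"))

# Restore the pretrained checkpoint, then override data and optimization.
model = ASRModel.from_pretrained(model_name=cfg.init_from_pretrained_model)
model.setup_training_data(cfg.model.train_ds)
model.setup_validation_data(cfg.model.validation_ds)
model.setup_optimization(cfg.model.optim)
model.set_trainer(trainer)

trainer.fit(model)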

Note that I've also tried ddp_spawn, as mentioned in this page of Lightning's docs, but spawn fails because not everything involved is picklable: NeMo's AudioToBPEDataset wraps the tokenizer in a locally defined TokenizerWrapper, which cannot be pickled.
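
The pickling failure is easy to reproduce directly (illustrative only; _train_dl is NeMo's internal reference to the training dataloader, used here just for the check):

import pickle

# With ddp_spawn, Lightning has to pickle the dataset for each spawned process;
# this fails on the locally defined TokenizerWrapper inside AudioToBPEDataset.
try:
    pickle.dumps(model._train_dl.dataset)
except Exception as e:
    print(type(e).__name__, e)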

I can still run training with num_workers=0, but since I have many small audio files to load and batch, I'm afraid data loading will become the bottleneck.
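
To check whether that is actually the case, I plan to time batch fetching directly (a rough measurement outside the training loop, not part of the script above):

import time

# Average time to pull a batch with num_workers=0; if this is comparable to or
# larger than the per-step compute time, data loading is indeed the bottleneck.
it = iter(model._train_dl)
n = 50
t0 = time.perf_counter()
for _ in range(n):
    next(it)
print(f"avg batch fetch time: {(time.perf_counter() - t0) / n:.3f} s")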

Any help would be much appreciated, even general advice on how to set up the training, as I'm quite new to this.

Thank you!
