Can't set num_workers > 0 when training nemotron-speech-streaming-en-0.6b with multi-GPU #15355

@LouisChirol

Description

Hi,

I am trying to fine-tune nemotron-speech-streaming-en-0.6b on GCP's Vertex AI, using multi-GPU machines.
My script runs fine on single GPUs (A100 40GB, H100 80GB) regardless of the dataloader's num_workers, but fails with num_workers > 0 on multi-GPU machines (8x A100, 4x H100).

The error message is:

terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: initialization error

I've tried playing with the various parameters available (pin_memory, persistent_workers), as mentioned in some other issues (this one, for instance), but I still cannot solve it.
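
As far as I understand, this error typically means a forked DataLoader worker inherited an already-initialized CUDA context from the parent process. A simple diagnostic I can add right before the dataloaders are built (my own check, not NeMo code):

import torch

# If this prints True before the DataLoader forks its workers, the children
# inherit an already-initialized CUDA context, which can surface as
# "CUDA error: initialization error" inside the workers.
print("CUDA initialized before dataloader setup:", torch.cuda.is_initialized())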

The container I'm using to run the job starts with:

FROM nvcr.io/nvidia/pytorch:23.10-py3

# Set working directory
WORKDIR /app

# Set environment variables
ENV PYTHONUNBUFFERED=1 \
    PYTHONDONTWRITEBYTECODE=1 \
    # Disable tokenizers parallelism warning
    TOKENIZERS_PARALLELISM=false \
    # Hydra settings for better error messages
    HYDRA_FULL_ERROR=1 \
    # NCCL settings for multi-GPU training
    NCCL_DEBUG=WARN \
    # PyTorch CUDA allocation configuration to increase shared memory usage
    PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512

# Install PyTorch with CUDA 12.1 (proven compatible with Vertex AI A100 drivers)
# Then install NeMo toolkit with ASR support
RUN pip install --no-cache-dir \
    torch==2.4.0+cu121 \
    torchvision==0.19.0+cu121 \
    torchaudio==2.4.0+cu121 \
    --index-url https://download.pytorch.org/whl/cu121 && \
    pip install --no-cache-dir \
    "nemo_toolkit[asr]==2.6.1" \
    "google-cloud-storage>=2.10.0" \
    "google-cloud-aiplatform>=1.50.0"

I am using the ddp strategy for the trainer.
The full config is as follows:

        {
            "name": run_name,
            "init_from_pretrained_model": "nvidia/nemotron-speech-streaming-en-0.6b",
            "model": {
                "sample_rate": 16000,
                "train_ds": {
                    "manifest_filepath": local_train_manifest,
                    "sample_rate": 16000,
                    "batch_size": batch_size,
                    "shuffle": True,
                    "num_workers": 4,
                    "pin_memory": False,
                    "persistent_workers": True,
                    "prefetch_factor": 2,
                    "max_duration": 30.0,
                    "min_duration": 0.5,
                    "is_tarred": False,
                    "bucketing_strategy": "synced_randomized",
                    "drop_last": True,
                    # Handle stereo audio by averaging channels to mono
                    "channel_selector": "average",
                },
                "validation_ds": {
                    "manifest_filepath": local_valid_manifest,
                    "sample_rate": 16000,
                    "batch_size": batch_size,
                    "shuffle": False,
                    "num_workers": 4,
                    "pin_memory": False,
                    "persistent_workers": True,
                    "max_duration": 30.0,
                    "min_duration": 0.5,
                    # Handle stereo audio by averaging channels to mono
                    "channel_selector": "average",
                },
                "tokenizer": {
                    "update_tokenizer": False,
                    "dir": None,
                    "type": "bpe",
                },
                "spec_augment": {
                    "_target_": "nemo.collections.asr.modules.SpectrogramAugmentation",
                    "freq_masks": 2,
                    "time_masks": 10,
                    "freq_width": 27,
                    "time_width": 0.05,
                },
                "optim": {
                    "name": "adamw",
                    "lr": learning_rate,
                    "betas": [0.9, 0.98],
                    "weight_decay": 1e-4,
                    "sched": {
                        "name": "CosineAnnealing",
                        "warmup_steps": 500,
                        "warmup_ratio": None,
                        "min_lr": 1e-6,
                    },
                },
            },
            "trainer": {
                "devices": num_gpus,
                "num_nodes": 1,
                "max_epochs": max_epochs,
                "max_steps": -1,
                "val_check_interval": 0.2,
                "accelerator": "gpu" if torch.cuda.is_available() else "cpu",
                "strategy": "ddp" if num_gpus > 1 else "auto",
                "accumulate_grad_batches": accumulate_grad_batches,
                "gradient_clip_val": 1.0,
                "precision": "bf16-mixed" if torch.cuda.is_available() else 32,
                "log_every_n_steps": 100,
                "enable_progress_bar": True,
                "num_sanity_val_steps": 2,
                "sync_batchnorm": True,
                "enable_checkpointing": False,
                "logger": False,
                # Limit batches per epoch for quick monitoring tests
                "limit_train_batches": limit_train_batches if limit_train_batches else 1.0,
                "limit_val_batches": limit_val_batches if limit_val_batches else 1.0,
            },
            "exp_manager": {
                "exp_dir": str(local_output_dir),
                "name": run_name,
                # Disable NeMo's TensorBoard logger - it fails on hparams serialization
                # We use a custom TensorBoard callback instead (see below)
                "create_tensorboard_logger": False,
                "create_checkpoint_callback": True,
                "checkpoint_callback_params": {
                    "monitor": "val_wer",
                    "mode": "min",
                    "save_top_k": 3,
                    "always_save_nemo": True,
                    "filename": "{epoch:02d}-{step:06d}-{val_wer:.4f}",
                    "save_last": True,
                },
                "resume_if_exists": True,
                "resume_ignore_no_checkpoint": True,
                "seconds_to_sleep": 120,
            },
        }
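
For context, the config dict above is consumed roughly like this (a simplified sketch of my training entrypoint, not the exact script; the model class and setup_* calls follow the usual NeMo fine-tuning pattern):

from omegaconf import OmegaConf
import lightning.pytorch as pl

from nemo.collections.asr.models import ASRModel
from nemo.utils.exp_manager import exp_manager

cfg = OmegaConf.create(config_dict)  # config_dict = the dict shown above

# Build the Lightning trainer (devices=num_gpus, strategy="ddp", ...) and
# attach NeMo's experiment manager for checkpointing.
trainer = pl.Trainer(**cfg.trainer)
exp_manager(trainer, cfg.get("exp_manager"))

# Restore the pretrained checkpoint, then override data and optimization.
model = ASRModel.from_pretrained(model_name=cfg.init_from_pretrained_model)
model.setup_training_data(cfg.model.train_ds)
model.setup_validation_data(cfg.model.validation_ds)
model.setup_optimization(cfg.model.optim)
model.set_trainer(trainer)

trainer.fit(model)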

Note that I've also tried ddp_spawn, as mentioned in this page of Lightning's docs, but spawn fails because not everything involved is picklable: NeMo's AudioToBPEDataset wraps the tokenizer in a locally defined TokenizerWrapper, which cannot be pickled.
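
The pickling failure is easy to reproduce directly (illustrative only; _train_dl is NeMo's internal reference to the training dataloader, used here just for the check):

import pickle

# With ddp_spawn, Lightning has to pickle the dataset for each spawned process;
# this fails on the locally defined TokenizerWrapper inside AudioToBPEDataset.
try:
    pickle.dumps(model._train_dl.dataset)
except Exception as e:
    print(type(e).__name__, e)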

I can still run training with num_workers=0, but since I have many small audio files to load and batch, I'm afraid data loading will become the bottleneck.
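
To check whether that is actually the case, I plan to time batch fetching directly (a rough measurement outside the training loop, not part of the script above):

import time

# Average time to pull a batch with num_workers=0; if this is comparable to or
# larger than the per-step compute time, data loading is indeed the bottleneck.
it = iter(model._train_dl)
n = 50
t0 = time.perf_counter()
for _ in range(n):
    next(it)
print(f"avg batch fetch time: {(time.perf_counter() - t0) / n:.3f} s")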

Any help would be much appreciated, even general advice on how to set up the training, as I'm quite new to this.

Thank you!
