Description
Hi,
I am trying to fine-tune nemotron-speech-streaming-en-0.6b on GCP's Vertex AI, using multi-GPU machines.
My script runs fine on single-GPU machines (A100 40GB, H100 80GB) regardless of the dataloader num_workers, but fails with num_workers > 0 on multi-GPU machines (8xA100, 4xH100).
The error message is:
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: initialization error
I've tried playing with the various parameters available (pin_memory, persistent_workers), as mentioned in some other issues (this one, for instance), but I still cannot solve it.
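For reference, the worker-related knobs I've been toggling all live in the train_ds / validation_ds blocks of the config further down; the combinations I tried look roughly like this (only the relevant keys shown, everything else unchanged):

"train_ds": {
    ...
    "num_workers": 4,            # anything > 0 crashes on multi-GPU; 0 works
    "pin_memory": False,         # tried both True and False
    "persistent_workers": True,  # tried both True and False
    ...
},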
The container I'm using to run the job starts with:
FROM nvcr.io/nvidia/pytorch:23.10-py3
# Set working directory
WORKDIR /app
# Set environment variables
ENV PYTHONUNBUFFERED=1 \
    PYTHONDONTWRITEBYTECODE=1 \
    # Disable tokenizers parallelism warning
    TOKENIZERS_PARALLELISM=false \
    # Hydra settings for better error messages
    HYDRA_FULL_ERROR=1 \
    # NCCL settings for multi-GPU training
    NCCL_DEBUG=WARN \
    # PyTorch CUDA allocation configuration to increase shared memory usage
    PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
# Install PyTorch with CUDA 12.1 (proven compatible with Vertex AI A100 drivers)
# Then install NeMo toolkit with ASR support
RUN pip install --no-cache-dir \
        torch==2.4.0+cu121 \
        torchvision==0.19.0+cu121 \
        torchaudio==2.4.0+cu121 \
        --index-url https://download.pytorch.org/whl/cu121 && \
    pip install --no-cache-dir \
        "nemo_toolkit[asr]==2.6.1" \
        "google-cloud-storage>=2.10.0" \
        "google-cloud-aiplatform>=1.50.0"
I am using the ddp strategy for the trainer.
The full config is the following:
{
    "name": run_name,
    "init_from_pretrained_model": "nvidia/nemotron-speech-streaming-en-0.6b",
    "model": {
        "sample_rate": 16000,
        "train_ds": {
            "manifest_filepath": local_train_manifest,
            "sample_rate": 16000,
            "batch_size": batch_size,
            "shuffle": True,
            "num_workers": 4,
            "pin_memory": False,
            "persistent_workers": True,
            "prefetch_factor": 2,
            "max_duration": 30.0,
            "min_duration": 0.5,
            "is_tarred": False,
            "bucketing_strategy": "synced_randomized",
            "drop_last": True,
            # Handle stereo audio by averaging channels to mono
            "channel_selector": "average",
        },
        "validation_ds": {
            "manifest_filepath": local_valid_manifest,
            "sample_rate": 16000,
            "batch_size": batch_size,
            "shuffle": False,
            "num_workers": 4,
            "pin_memory": False,
            "persistent_workers": True,
            "max_duration": 30.0,
            "min_duration": 0.5,
            # Handle stereo audio by averaging channels to mono
            "channel_selector": "average",
        },
        "tokenizer": {
            "update_tokenizer": False,
            "dir": None,
            "type": "bpe",
        },
        "spec_augment": {
            "_target_": "nemo.collections.asr.modules.SpectrogramAugmentation",
            "freq_masks": 2,
            "time_masks": 10,
            "freq_width": 27,
            "time_width": 0.05,
        },
        "optim": {
            "name": "adamw",
            "lr": learning_rate,
            "betas": [0.9, 0.98],
            "weight_decay": 1e-4,
            "sched": {
                "name": "CosineAnnealing",
                "warmup_steps": 500,
                "warmup_ratio": None,
                "min_lr": 1e-6,
            },
        },
    },
    "trainer": {
        "devices": num_gpus,
        "num_nodes": 1,
        "max_epochs": max_epochs,
        "max_steps": -1,
        "val_check_interval": 0.2,
        "accelerator": "gpu" if torch.cuda.is_available() else "cpu",
        "strategy": "ddp" if num_gpus > 1 else "auto",
        "accumulate_grad_batches": accumulate_grad_batches,
        "gradient_clip_val": 1.0,
        "precision": "bf16-mixed" if torch.cuda.is_available() else 32,
        "log_every_n_steps": 100,
        "enable_progress_bar": True,
        "num_sanity_val_steps": 2,
        "sync_batchnorm": True,
        "enable_checkpointing": False,
        "logger": False,
        # Limit batches per epoch for quick monitoring tests
        "limit_train_batches": limit_train_batches if limit_train_batches else 1.0,
        "limit_val_batches": limit_val_batches if limit_val_batches else 1.0,
    },
    "exp_manager": {
        "exp_dir": str(local_output_dir),
        "name": run_name,
        # Disable NeMo's TensorBoard logger - it fails on hparams serialization
        # We use a custom TensorBoard callback instead (see below)
        "create_tensorboard_logger": False,
        "create_checkpoint_callback": True,
        "checkpoint_callback_params": {
            "monitor": "val_wer",
            "mode": "min",
            "save_top_k": 3,
            "always_save_nemo": True,
            "filename": "{epoch:02d}-{step:06d}-{val_wer:.4f}",
            "save_last": True,
        },
        "resume_if_exists": True,
        "resume_ignore_no_checkpoint": True,
        "seconds_to_sleep": 120,
    },
}
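For context, here is a simplified sketch of how my script wires this dict into NeMo and Lightning (the real script also handles GCS download/upload and a custom TensorBoard callback, which I've omitted; run_name, num_gpus, batch_size, etc. are plain Python variables defined earlier, and the exact calls may differ slightly):

from omegaconf import OmegaConf
import lightning.pytorch as pl
import nemo.collections.asr as nemo_asr
from nemo.utils.exp_manager import exp_manager

cfg = OmegaConf.create(config_dict)  # config_dict is the dict shown above

trainer = pl.Trainer(**cfg.trainer)
exp_manager(trainer, cfg.exp_manager)

# Load the pretrained checkpoint, then attach the fine-tuning data and optimizer config
model = nemo_asr.models.ASRModel.from_pretrained(cfg.init_from_pretrained_model)
model.set_trainer(trainer)
model.setup_training_data(cfg.model.train_ds)
model.setup_validation_data(cfg.model.validation_ds)
model.setup_optimization(cfg.model.optim)

trainer.fit(model)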
Note that I've also tried ddp_spawn, as mentioned in this page of Lightning's docs, but spawn fails because not everything involved is picklable: NeMo's AudioToBPEDataset uses a locally defined TokenizerWrapper, which can't be pickled.
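(For clarity on why spawn trips up here, this is the generic Python limitation at play, shown with a hypothetical local class rather than NeMo's actual code:)

import pickle

def build_dataset():
    # A class defined inside a function body can only be pickled by reference
    # to its qualified name, which doesn't resolve for local objects. Pickling
    # the dataset is exactly what spawn-based worker startup needs to do.
    class LocalWrapper:
        pass
    return LocalWrapper()

pickle.dumps(build_dataset())
# raises something like:
# AttributeError: Can't pickle local object 'build_dataset.<locals>.LocalWrapper'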
I can still run training with num_workers=0, but since I have many small audio files to load and batch, I'm afraid data loading will become the bottleneck.
Any help would be much appreciated, even general advice on how to set up the training, as I'm quite new to this.
Thank you!