[model, recipe, examples] feat: add Nemotron-3 Nano Omni support #3760
Conversation
Adds end-to-end support for Nemotron-3 Nano Omni (30B-A3B MoE multimodal: MoE Mamba/attention hybrid LM + RADIO vision tower + Parakeet sound encoder), targeting HF architecture NemotronH_Nano_Omni_Reasoning_V3:

- Bridge + provider + sound encoder under src/megatron/bridge/models/nemotron_omni/
- Recipe (CORD-V2 SFT/PEFT, VALOR32K-AVQA SFT/PEFT) under src/megatron/bridge/recipes/nemotron_omni/
- Forward step under src/megatron/bridge/training/nemotron_omni_step.py
- Energon task encoder for chat-ML samples with raw-waveform/mel audio
- VLM dataset glue: nemotron_omni_collate_fn, valor32k_avqa maker, audiohandler decoder, packing toggle on EnergonProvider
- Examples under examples/models/vlm/nemotron_3_omni/: README, conversion script, single- and multi-modality inference, slurm SFT/LoRA scripts, data-prep scripts, evaluation scripts

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
```python
        "attention_mask": batch.attention_mask,
        "position_ids": batch.position_ids,
        "num_patches": batch.num_patches,
        "sound_clips": batch.sound_clips,
        "sound_length": batch.sound_length,
        "imgs_sizes": batch.imgs_sizes,
        "num_frames": batch.num_frames,
        "num_image_tiles": batch.num_image_tiles,
    }

    vt = batch.visual_tensors if batch.visual_tensors else {}
    raw["visual_inputs"] = GenericVisualInputs(**{k: v for k, v in vt.items() if v is not None})

    # Keep sound_clips / sound_length as top-level batch keys
    # (nemotron_omni_step picks them up directly)
    return raw
```
Bug: encode_batch drops all packed-sequence metadata (cu_seqlens, cu_seqlens_unpadded, cu_seqlens_argmin, max_seqlen). These fields are computed in batch() and stored on NemotronOmniTaskBatch, but never forwarded here. When pack_sequences=True on the Energon path, the training step will never see the packing tensors — attention masking will be wrong (the model treats concatenated samples as one long sequence).
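For context on why dropping this metadata is harmful: the packing fields follow the standard cumulative-sequence-length convention used by varlen attention kernels. A minimal sketch of that convention (the helper name `build_packing_metadata` and list-based shapes are illustrative, not the project's actual API):

```python
from itertools import accumulate


def build_packing_metadata(seqlens: list[int]) -> dict:
    """Cumulative-sequence-length metadata for a packed batch.

    cu_seqlens[i] marks where sub-sequence i starts in the concatenated
    token stream; varlen attention kernels use it to stop attention from
    crossing sample boundaries. If these fields are dropped, the packed
    batch is attended to as one long sequence.
    """
    cu_seqlens = [0] + list(accumulate(seqlens))
    return {
        "cu_seqlens": cu_seqlens,
        "max_seqlen": max(seqlens),  # longest sub-sequence in the pack
    }


meta = build_packing_metadata([3, 5, 2])
# cu_seqlens == [0, 3, 8, 10]; max_seqlen == 5
```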
Suggested change (forward the packing metadata computed in batch()):

```python
def encode_batch(self, batch: NemotronOmniTaskBatch) -> dict:
    """Convert batch to dict for the training step."""
    raw = {
        "tokens": batch.input_ids,
        "labels": batch.labels,
        "loss_mask": batch.loss_mask,
        "attention_mask": batch.attention_mask,
        "position_ids": batch.position_ids,
        "num_patches": batch.num_patches,
        "sound_clips": batch.sound_clips,
        "sound_length": batch.sound_length,
        "imgs_sizes": batch.imgs_sizes,
        "num_frames": batch.num_frames,
        "num_image_tiles": batch.num_image_tiles,
        "cu_seqlens": batch.cu_seqlens,
        "cu_seqlens_unpadded": batch.cu_seqlens_unpadded,
        "cu_seqlens_argmin": batch.cu_seqlens_argmin,
        "max_seqlen": batch.max_seqlen,
    }
```
… for video inference

Two small clarifications in the Nemotron-3 Nano Omni example README, based on a fresh end-to-end verification run:

- Checkpoint Conversion → Export: call out that --trust-remote-code is required for the export step, not just import. The exporter loads the HF config, which references the custom modeling module shipped with NemotronH_Nano_Omni_Reasoning_V3.
- Inference: add a callout that the video modes (rows 2 and 4) need `decord` installed, since it is not pulled in by any pyproject extra.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
```python
    forward_args["num_frames"] = num_frames
    import os as _os
    if _os.environ.get("NOMNI_DEBUG_TILES") == "1":
        print(f"[DEBUG step] num_image_tiles={num_image_tiles if num_image_tiles is None else (num_image_tiles.shape, num_image_tiles.tolist()[:10])}", flush=True)
```
Debug instrumentation left in production code. The bare print() also violates the project rule (use logger or print_rank_0()). Please remove this block.
Suggested change:

```python
    forward_args["num_frames"] = num_frames
    if num_image_tiles is not None:
        forward_args["num_image_tiles"] = num_image_tiles
```
```python
            print(f"Freezing sound_model.{name}")
            param.requires_grad = False
    if freeze_sound_projection and self.llava_model.sound_projection is not None:
        for name, param in self.llava_model.sound_projection.named_parameters():
            print(f"Freezing sound_projection.{name}")
            param.requires_grad = False
```
Bare print() calls — project rules require logging.getLogger(__name__) or print_rank_0().
Suggested change:

```python
    if freeze_sound_model and self.llava_model.sound_model is not None:
        for name, param in self.llava_model.sound_model.named_parameters():
            param.requires_grad = False
    if freeze_sound_projection and self.llava_model.sound_projection is not None:
        for name, param in self.llava_model.sound_projection.named_parameters():
            param.requires_grad = False
```
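If per-parameter visibility is still wanted, one DEBUG-level summary per module avoids flooding every rank's stdout. A self-contained sketch of the pattern — `FakeParam` is a stand-in for illustration only; in the real code the objects come from `named_parameters()` and are `torch.nn.Parameter`s:

```python
import logging

logger = logging.getLogger(__name__)


class FakeParam:
    """Stand-in for torch.nn.Parameter (illustrative only)."""

    def __init__(self):
        self.requires_grad = True


def freeze_module(named_params, module_name: str) -> int:
    """Freeze all parameters and log one summary line per module."""
    frozen = 0
    for name, param in named_params:
        param.requires_grad = False
        frozen += 1
    logger.debug("Froze %d parameters in %s", frozen, module_name)
    return frozen


params = {"encoder.weight": FakeParam(), "encoder.bias": FakeParam()}
n = freeze_module(params.items(), "sound_model")
# n == 2; both params now have requires_grad == False
```

A single summary line also keeps multi-GPU logs readable: with per-parameter prints, every rank repeats the full list.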
| @@ -0,0 +1,73 @@ | |||
| # Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. | |||
Copyright year should be 2026 per project conventions (CLAUDE.md: "Use the current year (2026) in generated content"). Same applies to hf_to_megatron_generate_nemotron_omni.py, cord_v2_inference.py, and valor32k_avqa_inference.py.
| @@ -0,0 +1,18 @@ | |||
| """Nemotron Omni model family (Vision-Language + Audio) for Megatron Bridge.""" | |||
Missing NVIDIA copyright header. All new Python files under src/ require it (per CLAUDE.md: "Add NVIDIA copyright headers to new Python files (except under tests/)").
Light Code Review

Critical

Minor

Missing test coverage: this PR adds a new model family (bridge, provider, recipe, task encoder, collate, forward step) with no unit or functional tests. Per the adding-model-support guidelines, the following are expected:

Suggested test cases

No perf tests impacted.
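As one concrete shape such coverage could take, a bridge unit test usually verifies that every HF parameter name is claimed by some mapping rule, so no tensor is silently dropped during HF → Megatron conversion. The sketch below uses stand-in names and a toy prefix-based mapping — `hf_names`, the prefixes, and `assert_mapping_covers` are all illustrative, not this PR's actual API:

```python
def assert_mapping_covers(hf_param_names, mapping) -> None:
    """Check every HF parameter name is claimed by at least one mapping rule.

    Mirrors what a bridge round-trip unit test should verify: no HF tensor
    is silently skipped when converting HF -> Megatron.
    """
    unmatched = [n for n in hf_param_names if not any(n.startswith(p) for p in mapping)]
    assert not unmatched, f"unmapped HF params: {unmatched}"


# Illustrative names only -- not the checkpoint's real parameter layout.
hf_names = [
    "vision_tower.patch_embed.weight",
    "sound_encoder.conv1.weight",
    "language_model.layers.0.mixer.in_proj.weight",
]
mapping = {
    "vision_tower.": "vision_model.",
    "sound_encoder.": "sound_model.",
    "language_model.": "decoder.",
}
assert_mapping_covers(hf_names, mapping)  # passes: every name has a prefix rule
```

The same check run in reverse (Megatron → HF) plus a numerical round-trip comparison on a tiny config would cover the conversion path end to end.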
Merge is blocked by NVIDIA/Megatron-LM#4402
Summary
- Adds support for the HF architecture NemotronH_Nano_Omni_Reasoning_V3.
- Bridge, provider, and sound encoder under src/megatron/bridge/models/nemotron_omni/, recipe under src/megatron/bridge/recipes/nemotron_omni/, forward step at src/megatron/bridge/training/nemotron_omni_step.py, Energon task encoder for chat-ML samples with raw-waveform / mel audio, and supporting glue (collate fn, valor32k_avqa maker, audiohandler decoder, packing toggle on EnergonProvider).
- examples/models/vlm/nemotron_3_omni/ directory with conversion script, single- / multi-modality inference, slurm SFT/LoRA scripts, data-prep scripts, and evaluation scripts.

Test plan
Locally verified end-to-end against nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16 on an 8 × H100 80GB node:

- … (--not-strict, 4 expected-missing tensors regenerated from config) — ✅
- … (requires decord)

Notes:

- [ssm] and [audio] extras (mamba-ssm, causal-conv1d, librosa) are required at install time; decord is needed for video sampling.
- … freeze_language_model=True for single-node runs.

🤖 Generated with Claude Code