[model, recipe, examples] feat: add Nemotron-3 Nano Omni support#3760

Open
cuichenx wants to merge 2 commits intomainfrom
chcui/nemotron_3_omni_pr

Conversation


@cuichenx cuichenx commented May 8, 2026

Merge is blocked by NVIDIA/Megatron-LM#4402

Summary

  • Adds end-to-end support for Nemotron-3 Nano Omni (30B-A3B MoE multimodal: MoE Mamba/attention hybrid LM + RADIO vision tower + Parakeet sound encoder), targeting HF architecture NemotronH_Nano_Omni_Reasoning_V3.
  • New bridge / provider / sound encoder under src/megatron/bridge/models/nemotron_omni/, recipe under src/megatron/bridge/recipes/nemotron_omni/, forward step at src/megatron/bridge/training/nemotron_omni_step.py, Energon task encoder for chat-ML samples with raw-waveform / mel audio, and supporting glue (collate fn, valor32k_avqa maker, audiohandler decoder, packing toggle on EnergonProvider).
  • New examples/models/vlm/nemotron_3_omni/ directory with conversion script, single- / multi-modality inference, slurm SFT/LoRA scripts, data-prep scripts, and evaluation scripts.

Test plan

Locally verified end-to-end against nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16 on an 8 × H100 80GB node:

  • HF → Megatron import (33B params, 7517 tensors, low-memory save) — ✅
  • Megatron → HF export (--not-strict, 4 expected-missing tensors regenerated from config) — ✅
  • HF↔Megatron multi-GPU roundtrip (TP=2 EP=2) — ✅ all weights match
  • Inference: image+text (1 GPU, from converted ckpt) — ✅ detailed H100 spec table description
  • Inference: image+text (1 GPU, on-the-fly conversion) — ✅
  • Inference: video+text (8 GPU, TP=4 EP=4) — ✅ 15 frames, ~1.12 s, plant/flower description (requires decord)
  • Inference: audio+text (1 GPU) — ✅ exact transcription match
  • Inference: video+audio+text (8 GPU, TP=4 EP=2) — ✅ combined description
  • Image SFT smoke test — CORD-V2, 20 iters, frozen LM (single-node 8×H100) — ✅ lm loss 1.108→1.090, ~2.06 s/iter, peak ~36.7 GiB/GPU
  • Image PEFT smoke test — CORD-V2 LoRA, 20 iters — ✅ lm loss 1.022→0.558, ~2.35 s/iter, peak ~20.3 GiB/GPU
  • Audio SFT smoke test — CV17, 10 iters (in flight at PR open)

Notes:

  • The [ssm] and [audio] extras (mamba-ssm, causal-conv1d, librosa) are required at install time; decord is needed for video sampling.
  • Full-parameter SFT does not fit on a single 8×H100 node (Adam fp32 state OOMs at optimizer init); use the 2-node slurm script or freeze_language_model=True for single-node runs.

🤖 Generated with Claude Code

Adds end-to-end support for Nemotron-3 Nano Omni (30B-A3B MoE multimodal:
MoE Mamba/attention hybrid LM + RADIO vision tower + Parakeet sound
encoder), targeting HF architecture NemotronH_Nano_Omni_Reasoning_V3:

- Bridge + provider + sound encoder under
  src/megatron/bridge/models/nemotron_omni/
- Recipe (CORD-V2 SFT/PEFT, VALOR32K-AVQA SFT/PEFT) under
  src/megatron/bridge/recipes/nemotron_omni/
- Forward step under src/megatron/bridge/training/nemotron_omni_step.py
- Energon task encoder for chat-ML samples with raw-waveform/mel audio
- VLM dataset glue: nemotron_omni_collate_fn, valor32k_avqa maker,
  audiohandler decoder, packing toggle on EnergonProvider
- Examples under examples/models/vlm/nemotron_3_omni/: README, conversion
  script, single- and multi-modality inference, slurm SFT/LoRA scripts,
  data-prep scripts, evaluation scripts

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>

copy-pr-bot Bot commented May 8, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Comment on lines +653 to +668
            "attention_mask": batch.attention_mask,
            "position_ids": batch.position_ids,
            "num_patches": batch.num_patches,
            "sound_clips": batch.sound_clips,
            "sound_length": batch.sound_length,
            "imgs_sizes": batch.imgs_sizes,
            "num_frames": batch.num_frames,
            "num_image_tiles": batch.num_image_tiles,
        }

        vt = batch.visual_tensors if batch.visual_tensors else {}
        raw["visual_inputs"] = GenericVisualInputs(**{k: v for k, v in vt.items() if v is not None})

        # Keep sound_clips / sound_length as top-level batch keys
        # (nemotron_omni_step picks them up directly)
        return raw

Bug: encode_batch drops all packed-sequence metadata (cu_seqlens, cu_seqlens_unpadded, cu_seqlens_argmin, max_seqlen). These fields are computed in batch() and stored on NemotronOmniTaskBatch, but never forwarded here. When pack_sequences=True on the Energon path, the training step will never see the packing tensors — attention masking will be wrong (the model treats concatenated samples as one long sequence).

Suggested change
            "attention_mask": batch.attention_mask,
            "position_ids": batch.position_ids,
            "num_patches": batch.num_patches,
            "sound_clips": batch.sound_clips,
            "sound_length": batch.sound_length,
            "imgs_sizes": batch.imgs_sizes,
            "num_frames": batch.num_frames,
            "num_image_tiles": batch.num_image_tiles,
        }
        vt = batch.visual_tensors if batch.visual_tensors else {}
        raw["visual_inputs"] = GenericVisualInputs(**{k: v for k, v in vt.items() if v is not None})
        # Keep sound_clips / sound_length as top-level batch keys
        # (nemotron_omni_step picks them up directly)
        return raw
    def encode_batch(self, batch: NemotronOmniTaskBatch) -> dict:
        """Convert batch to dict for the training step."""
        raw = {
            "tokens": batch.input_ids,
            "labels": batch.labels,
            "loss_mask": batch.loss_mask,
            "attention_mask": batch.attention_mask,
            "position_ids": batch.position_ids,
            "num_patches": batch.num_patches,
            "sound_clips": batch.sound_clips,
            "sound_length": batch.sound_length,
            "imgs_sizes": batch.imgs_sizes,
            "num_frames": batch.num_frames,
            "num_image_tiles": batch.num_image_tiles,
            "cu_seqlens": batch.cu_seqlens,
            "cu_seqlens_unpadded": batch.cu_seqlens_unpadded,
            "cu_seqlens_argmin": batch.cu_seqlens_argmin,
            "max_seqlen": batch.max_seqlen,
        }

… for video inference

Two small clarifications in the Nemotron-3 Nano Omni example README,
based on a fresh end-to-end verification run:

- Checkpoint Conversion → Export: call out that --trust-remote-code is
  required for the export step, not just import. The exporter loads the
  HF config, which references the custom modeling module shipped with
  NemotronH_Nano_Omni_Reasoning_V3.
- Inference: add a callout that the video modes (rows 2 and 4) need
  `decord` installed, since it is not pulled in by any pyproject extra.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
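Since decord is not pulled in by any pyproject extra, a small preflight check before the video modes could surface the missing dependency early. This helper is a hypothetical sketch, not part of this PR:

```python
# Preflight for the video inference rows: decord is not installed by any
# pyproject extra, so fail fast with a clear hint instead of deep inside
# frame sampling. Hypothetical helper, not part of this PR.
import importlib.util


def module_available(name: str) -> bool:
    """True if `name` is importable in the current environment."""
    return importlib.util.find_spec(name) is not None


def require_decord() -> None:
    # decord is needed only for the video modes of the inference matrix
    if not module_available("decord"):
        raise ImportError("video inference needs decord: pip install decord")
```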
Comment on lines +270 to +273
    forward_args["num_frames"] = num_frames
    import os as _os
    if _os.environ.get("NOMNI_DEBUG_TILES") == "1":
        print(f"[DEBUG step] num_image_tiles={num_image_tiles if num_image_tiles is None else (num_image_tiles.shape, num_image_tiles.tolist()[:10])}", flush=True)

Debug instrumentation left in production code. The bare print() also violates the project rule (use logger or print_rank_0()). Please remove this block.

Suggested change
    forward_args["num_frames"] = num_frames
    import os as _os
    if _os.environ.get("NOMNI_DEBUG_TILES") == "1":
        print(f"[DEBUG step] num_image_tiles={num_image_tiles if num_image_tiles is None else (num_image_tiles.shape, num_image_tiles.tolist()[:10])}", flush=True)
    if num_image_tiles is not None:
        forward_args["num_image_tiles"] = num_image_tiles

Comment on lines +42 to +47
                print(f"Freezing sound_model.{name}")
                param.requires_grad = False
        if freeze_sound_projection and self.llava_model.sound_projection is not None:
            for name, param in self.llava_model.sound_projection.named_parameters():
                print(f"Freezing sound_projection.{name}")
                param.requires_grad = False

Bare print() calls — project rules require logging.getLogger(__name__) or print_rank_0().

Suggested change
                print(f"Freezing sound_model.{name}")
                param.requires_grad = False
        if freeze_sound_projection and self.llava_model.sound_projection is not None:
            for name, param in self.llava_model.sound_projection.named_parameters():
                print(f"Freezing sound_projection.{name}")
                param.requires_grad = False
        if freeze_sound_model and self.llava_model.sound_model is not None:
            for name, param in self.llava_model.sound_model.named_parameters():
                param.requires_grad = False
        if freeze_sound_projection and self.llava_model.sound_projection is not None:
            for name, param in self.llava_model.sound_projection.named_parameters():
                param.requires_grad = False
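If logging rather than silence is wanted, a module-level logger gives the same trace as the flagged print() while respecting log levels and rank filtering. A standalone sketch of that pattern (duck-typed module argument; the PR's real code lives on self.llava_model and may use print_rank_0() instead):

```python
# Logger-based variant of the freeze loop flagged above: same behavior as the
# bare print() version, but routed through the module logger. Standalone
# sketch only; names here are not the PR's actual API.
import logging

logger = logging.getLogger(__name__)


def freeze_params(module, prefix: str) -> None:
    """Set requires_grad=False on every parameter of `module`, logging each name."""
    for name, param in module.named_parameters():
        logger.info("Freezing %s.%s", prefix, name)
        param.requires_grad = False
```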

@@ -0,0 +1,73 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.

Copyright year should be 2026 per project conventions (CLAUDE.md: "Use the current year (2026) in generated content"). Same applies to hf_to_megatron_generate_nemotron_omni.py, cord_v2_inference.py, and valor32k_avqa_inference.py.

@@ -0,0 +1,18 @@
"""Nemotron Omni model family (Vision-Language + Audio) for Megatron Bridge."""

Missing NVIDIA copyright header. All new Python files under src/ require it (per CLAUDE.md: "Add NVIDIA copyright headers to new Python files (except under tests/)").
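Concretely, combining this with the copyright-year note above, each new Python file under src/ would open with the following line (wording copied from the 2025 header shown earlier in this review, year bumped per convention):

```python
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
```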


claude Bot commented May 8, 2026

Light Code Review

Critical

  1. encode_batch() drops packing metadata (nemotron_omni_task_encoder.py): cu_seqlens, cu_seqlens_unpadded, cu_seqlens_argmin, and max_seqlen are computed in batch() and stored on NemotronOmniTaskBatch, but encode_batch() does not forward them to the training step dict. When pack_sequences=True on the Energon path, the model will silently treat packed sequences as a single long sequence with wrong attention masking. See inline comment for fix.

  2. Debug print left in nemotron_omni_step.py:270-273: import os as _os + env-gated print() for NOMNI_DEBUG_TILES. Should be removed before merge.
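The packing metadata in the first item is just cumulative sample boundaries; a small pure-Python illustration of what goes missing when these fields are dropped (not the PR's code):

```python
# What the dropped cu_seqlens encodes: cumulative boundaries of the samples
# packed into one row. Varlen attention kernels use these offsets to stop
# samples attending across each other; without them, the packed row is
# treated as one long document. Illustration only, not the PR's code.
from itertools import accumulate


def cu_seqlens_from_lengths(lengths):
    """Boundaries [0, l0, l0+l1, ...] for samples packed into one sequence."""
    return [0] + list(accumulate(lengths))


boundaries = cu_seqlens_from_lengths([5, 3, 4])
print(boundaries)    # [0, 5, 8, 12]
print(max(5, 3, 4))  # max_seqlen handed to the kernel: 5
```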

Minor

  1. Bare print() in modeling_nemotron_omni.py:42-47: freeze() uses bare print -- should use logger or print_rank_0() per project rules.

  2. Copyright year: nemotron_omni_sound.py, hf_to_megatron_generate_nemotron_omni.py, cord_v2_inference.py, and valor32k_avqa_inference.py use Copyright (c) 2025 but project convention is 2026.

  3. Missing copyright header: src/megatron/bridge/models/nemotron_omni/__init__.py has no NVIDIA copyright block.

Missing test coverage

This PR adds a new model family (bridge, provider, recipe, task encoder, collate, forward step) with no unit or functional tests. Per the adding-model-support guidelines, the following are expected:

  • Unit tests (tests/unit_tests/models/nemotron_omni/): test_nemotron_omni_bridge.py (mock HF config, verify provider_bridge() mapping and mapping_registry() coverage), test_nemotron_omni_provider.py (verify provider defaults, freeze logic, sound encoder construction)
  • Functional tests (tests/functional_tests/models/nemotron_omni/): test_nemotron_omni_conversion.py (toy model HF/Megatron roundtrip)
  • Recipe unit tests: monkeypatched AutoBridge, verify ConfigContainer structure for each recipe function
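The recipe unit test requested in the last bullet might take this shape: patch the heavy HF-backed bridge with a stub so no checkpoint download happens, then assert on the returned config structure. Every name below is a stand-in; the real test would import the PR's recipe module, ConfigContainer, and AutoBridge:

```python
# Shape of the requested recipe unit test, with stand-in names throughout.
# A stub bridge replaces the monkeypatched AutoBridge so the test never
# touches Hugging Face; assertions check the ConfigContainer structure.
from dataclasses import dataclass, field


@dataclass
class ConfigContainer:            # stand-in for megatron.bridge's container
    model: dict = field(default_factory=dict)
    train: dict = field(default_factory=dict)


class _StubBridge:                # stands in for AutoBridge.from_hf_pretrained(...)
    def to_megatron_provider(self):
        return {"num_layers": 2, "freeze_language_model": True}


def cord_v2_sft_config(bridge=None) -> ConfigContainer:
    """Hypothetical recipe entry point built around an injectable bridge."""
    bridge = bridge or _StubBridge()
    return ConfigContainer(
        model=bridge.to_megatron_provider(),
        train={"train_iters": 20, "lr": 1e-4},
    )


def test_cord_v2_sft_config_structure():
    cfg = cord_v2_sft_config(bridge=_StubBridge())
    assert cfg.model["num_layers"] == 2
    assert cfg.train["train_iters"] == 20
```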

Suggested test cases

No perf tests impacted.

