[model, recipe, examples] feat: add Nemotron-3 Nano Omni support#3760

Open
cuichenx wants to merge 2 commits intomainfrom
chcui/nemotron_3_omni_pr

Conversation


@cuichenx cuichenx commented May 8, 2026

Merge is blocked by NVIDIA/Megatron-LM#4402

Summary

  • Adds end-to-end support for Nemotron-3 Nano Omni (30B-A3B MoE multimodal: MoE Mamba/attention hybrid LM + RADIO vision tower + Parakeet sound encoder), targeting HF architecture NemotronH_Nano_Omni_Reasoning_V3.
  • New bridge / provider / sound encoder under src/megatron/bridge/models/nemotron_omni/, recipe under src/megatron/bridge/recipes/nemotron_omni/, forward step at src/megatron/bridge/training/nemotron_omni_step.py, Energon task encoder for chat-ML samples with raw-waveform / mel audio, and supporting glue (collate fn, valor32k_avqa maker, audiohandler decoder, packing toggle on EnergonProvider).
  • New examples/models/vlm/nemotron_3_omni/ directory with conversion script, single- / multi-modality inference, slurm SFT/LoRA scripts, data-prep scripts, and evaluation scripts.

Test plan

Locally verified end-to-end against nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16 on an 8 × H100 80GB node:

  • HF → Megatron import (33B params, 7517 tensors, low-memory save) — ✅
  • Megatron → HF export (--not-strict, 4 expected-missing tensors regenerated from config) — ✅
  • HF↔Megatron multi-GPU roundtrip (TP=2 EP=2) — ✅ all weights match
  • Inference: image+text (1 GPU, from converted ckpt) — ✅ detailed H100 spec table description
  • Inference: image+text (1 GPU, on-the-fly conversion) — ✅
  • Inference: video+text (8 GPU, TP=4 EP=4) — ✅ 15 frames, ~1.12 s, plant/flower description (requires decord)
  • Inference: audio+text (1 GPU) — ✅ exact transcription match
  • Inference: video+audio+text (8 GPU, TP=4 EP=2) — ✅ combined description
  • Image SFT smoke test — CORD-V2, 20 iters, frozen LM (single-node 8×H100) — ✅ lm loss 1.108→1.090, ~2.06 s/iter, peak ~36.7 GiB/GPU
  • Image PEFT smoke test — CORD-V2 LoRA, 20 iters — ✅ lm loss 1.022→0.558, ~2.35 s/iter, peak ~20.3 GiB/GPU
  • Audio SFT smoke test — CV17, 10 iters (in flight at PR open)

Notes:

  • The [ssm] and [audio] extras (mamba-ssm, causal-conv1d, librosa) are required at install time; decord is needed for video sampling.
  • Full-parameter SFT does not fit on a single 8×H100 node (Adam fp32 state OOMs at optimizer init); use the 2-node slurm script or freeze_language_model=True for single-node runs.

🤖 Generated with Claude Code

Adds end-to-end support for Nemotron-3 Nano Omni (30B-A3B MoE multimodal:
MoE Mamba/attention hybrid LM + RADIO vision tower + Parakeet sound
encoder), targeting HF architecture NemotronH_Nano_Omni_Reasoning_V3:

- Bridge + provider + sound encoder under
  src/megatron/bridge/models/nemotron_omni/
- Recipe (CORD-V2 SFT/PEFT, VALOR32K-AVQA SFT/PEFT) under
  src/megatron/bridge/recipes/nemotron_omni/
- Forward step under src/megatron/bridge/training/nemotron_omni_step.py
- Energon task encoder for chat-ML samples with raw-waveform/mel audio
- VLM dataset glue: nemotron_omni_collate_fn, valor32k_avqa maker,
  audiohandler decoder, packing toggle on EnergonProvider
- Examples under examples/models/vlm/nemotron_3_omni/: README, conversion
  script, single- and multi-modality inference, slurm SFT/LoRA scripts,
  data-prep scripts, evaluation scripts

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>

copy-pr-bot Bot commented May 8, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Comment on lines +653 to +668
            "attention_mask": batch.attention_mask,
            "position_ids": batch.position_ids,
            "num_patches": batch.num_patches,
            "sound_clips": batch.sound_clips,
            "sound_length": batch.sound_length,
            "imgs_sizes": batch.imgs_sizes,
            "num_frames": batch.num_frames,
            "num_image_tiles": batch.num_image_tiles,
        }

        vt = batch.visual_tensors if batch.visual_tensors else {}
        raw["visual_inputs"] = GenericVisualInputs(**{k: v for k, v in vt.items() if v is not None})

        # Keep sound_clips / sound_length as top-level batch keys
        # (nemotron_omni_step picks them up directly)
        return raw

Bug: encode_batch drops all packed-sequence metadata (cu_seqlens, cu_seqlens_unpadded, cu_seqlens_argmin, max_seqlen). These fields are computed in batch() and stored on NemotronOmniTaskBatch, but never forwarded here. When pack_sequences=True on the Energon path, the training step will never see the packing tensors — attention masking will be wrong (the model treats concatenated samples as one long sequence).

Suggested change
            "attention_mask": batch.attention_mask,
            "position_ids": batch.position_ids,
            "num_patches": batch.num_patches,
            "sound_clips": batch.sound_clips,
            "sound_length": batch.sound_length,
            "imgs_sizes": batch.imgs_sizes,
            "num_frames": batch.num_frames,
            "num_image_tiles": batch.num_image_tiles,
        }
        vt = batch.visual_tensors if batch.visual_tensors else {}
        raw["visual_inputs"] = GenericVisualInputs(**{k: v for k, v in vt.items() if v is not None})
        # Keep sound_clips / sound_length as top-level batch keys
        # (nemotron_omni_step picks them up directly)
        return raw
    def encode_batch(self, batch: NemotronOmniTaskBatch) -> dict:
        """Convert batch to dict for the training step."""
        raw = {
            "tokens": batch.input_ids,
            "labels": batch.labels,
            "loss_mask": batch.loss_mask,
            "attention_mask": batch.attention_mask,
            "position_ids": batch.position_ids,
            "num_patches": batch.num_patches,
            "sound_clips": batch.sound_clips,
            "sound_length": batch.sound_length,
            "imgs_sizes": batch.imgs_sizes,
            "num_frames": batch.num_frames,
            "num_image_tiles": batch.num_image_tiles,
            "cu_seqlens": batch.cu_seqlens,
            "cu_seqlens_unpadded": batch.cu_seqlens_unpadded,
            "cu_seqlens_argmin": batch.cu_seqlens_argmin,
            "max_seqlen": batch.max_seqlen,
        }

… for video inference

Two small clarifications in the Nemotron-3 Nano Omni example README,
based on a fresh end-to-end verification run:

- Checkpoint Conversion → Export: call out that --trust-remote-code is
  required for the export step, not just import. The exporter loads the
  HF config, which references the custom modeling module shipped with
  NemotronH_Nano_Omni_Reasoning_V3.
- Inference: add a callout that the video modes (rows 2 and 4) need
  `decord` installed, since it is not pulled in by any pyproject extra.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
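Since decord is not pulled in by any pyproject extra, a small preflight check before the video modes could surface the missing dependency early. This helper is a hypothetical sketch, not part of this PR:

```python
# Preflight for the video inference rows: decord is not installed by any
# pyproject extra, so fail fast with a clear hint instead of deep inside
# frame sampling. Hypothetical helper, not part of this PR.
import importlib.util


def module_available(name: str) -> bool:
    """True if `name` is importable in the current environment."""
    return importlib.util.find_spec(name) is not None


def require_decord() -> None:
    # decord is needed only for the video modes of the inference matrix
    if not module_available("decord"):
        raise ImportError("video inference needs decord: pip install decord")
```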
Comment on lines +270 to +273
    forward_args["num_frames"] = num_frames
    import os as _os
    if _os.environ.get("NOMNI_DEBUG_TILES") == "1":
        print(f"[DEBUG step] num_image_tiles={num_image_tiles if num_image_tiles is None else (num_image_tiles.shape, num_image_tiles.tolist()[:10])}", flush=True)

Debug instrumentation left in production code. The bare print() also violates the project rule (use logger or print_rank_0()). Please remove this block.

Suggested change
    forward_args["num_frames"] = num_frames
    import os as _os
    if _os.environ.get("NOMNI_DEBUG_TILES") == "1":
        print(f"[DEBUG step] num_image_tiles={num_image_tiles if num_image_tiles is None else (num_image_tiles.shape, num_image_tiles.tolist()[:10])}", flush=True)
    if num_image_tiles is not None:
        forward_args["num_image_tiles"] = num_image_tiles

Comment on lines +42 to +47
                print(f"Freezing sound_model.{name}")
                param.requires_grad = False
        if freeze_sound_projection and self.llava_model.sound_projection is not None:
            for name, param in self.llava_model.sound_projection.named_parameters():
                print(f"Freezing sound_projection.{name}")
                param.requires_grad = False

Bare print() calls — project rules require logging.getLogger(__name__) or print_rank_0().

Suggested change
                print(f"Freezing sound_model.{name}")
                param.requires_grad = False
        if freeze_sound_projection and self.llava_model.sound_projection is not None:
            for name, param in self.llava_model.sound_projection.named_parameters():
                print(f"Freezing sound_projection.{name}")
                param.requires_grad = False
        if freeze_sound_model and self.llava_model.sound_model is not None:
            for name, param in self.llava_model.sound_model.named_parameters():
                param.requires_grad = False
        if freeze_sound_projection and self.llava_model.sound_projection is not None:
            for name, param in self.llava_model.sound_projection.named_parameters():
                param.requires_grad = False
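If logging rather than silence is wanted, a module-level logger gives the same trace as the flagged print() while respecting log levels and rank filtering. A standalone sketch of that pattern (duck-typed module argument; the PR's real code lives on self.llava_model and may use print_rank_0() instead):

```python
# Logger-based variant of the freeze loop flagged above: same behavior as the
# bare print() version, but routed through the module logger. Standalone
# sketch only; names here are not the PR's actual API.
import logging

logger = logging.getLogger(__name__)


def freeze_params(module, prefix: str) -> None:
    """Set requires_grad=False on every parameter of `module`, logging each name."""
    for name, param in module.named_parameters():
        logger.info("Freezing %s.%s", prefix, name)
        param.requires_grad = False
```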

@@ -0,0 +1,73 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.

Copyright year should be 2026 per project conventions (CLAUDE.md: "Use the current year (2026) in generated content"). Same applies to hf_to_megatron_generate_nemotron_omni.py, cord_v2_inference.py, and valor32k_avqa_inference.py.

@@ -0,0 +1,18 @@
"""Nemotron Omni model family (Vision-Language + Audio) for Megatron Bridge."""

Missing NVIDIA copyright header. All new Python files under src/ require it (per CLAUDE.md: "Add NVIDIA copyright headers to new Python files (except under tests/)").
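Concretely, combining this with the copyright-year note above, each new Python file under src/ would open with the following line (wording copied from the 2025 header shown earlier in this review, year bumped per convention):

```python
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
```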


claude Bot commented May 8, 2026

Light Code Review

Critical

  1. encode_batch() drops packing metadata (nemotron_omni_task_encoder.py): cu_seqlens, cu_seqlens_unpadded, cu_seqlens_argmin, and max_seqlen are computed in batch() and stored on NemotronOmniTaskBatch, but encode_batch() does not forward them to the training step dict. When pack_sequences=True on the Energon path, the model will silently treat packed sequences as a single long sequence with wrong attention masking. See inline comment for fix.

  2. Debug print left in nemotron_omni_step.py:270-273: import os as _os + env-gated print() for NOMNI_DEBUG_TILES. Should be removed before merge.
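The packing metadata in the first item is just cumulative sample boundaries; a small pure-Python illustration of what goes missing when these fields are dropped (not the PR's code):

```python
# What the dropped cu_seqlens encodes: cumulative boundaries of the samples
# packed into one row. Varlen attention kernels use these offsets to stop
# samples attending across each other; without them, the packed row is
# treated as one long document. Illustration only, not the PR's code.
from itertools import accumulate


def cu_seqlens_from_lengths(lengths):
    """Boundaries [0, l0, l0+l1, ...] for samples packed into one sequence."""
    return [0] + list(accumulate(lengths))


boundaries = cu_seqlens_from_lengths([5, 3, 4])
print(boundaries)    # [0, 5, 8, 12]
print(max(5, 3, 4))  # max_seqlen handed to the kernel: 5
```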

Minor

  1. Bare print() in modeling_nemotron_omni.py:42-47: freeze() uses bare print -- should use logger or print_rank_0() per project rules.

  2. Copyright year: nemotron_omni_sound.py, hf_to_megatron_generate_nemotron_omni.py, cord_v2_inference.py, and valor32k_avqa_inference.py use Copyright (c) 2025 but project convention is 2026.

  3. Missing copyright header: src/megatron/bridge/models/nemotron_omni/__init__.py has no NVIDIA copyright block.

Missing test coverage

This PR adds a new model family (bridge, provider, recipe, task encoder, collate, forward step) with no unit or functional tests. Per the adding-model-support guidelines, the following are expected:

  • Unit tests (tests/unit_tests/models/nemotron_omni/): test_nemotron_omni_bridge.py (mock HF config, verify provider_bridge() mapping and mapping_registry() coverage), test_nemotron_omni_provider.py (verify provider defaults, freeze logic, sound encoder construction)
  • Functional tests (tests/functional_tests/models/nemotron_omni/): test_nemotron_omni_conversion.py (toy model HF/Megatron roundtrip)
  • Recipe unit tests: monkeypatched AutoBridge, verify ConfigContainer structure for each recipe function
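The recipe unit test requested in the last bullet might take this shape: patch the heavy HF-backed bridge with a stub so no checkpoint download happens, then assert on the returned config structure. Every name below is a stand-in; the real test would import the PR's recipe module, ConfigContainer, and AutoBridge:

```python
# Shape of the requested recipe unit test, with stand-in names throughout.
# A stub bridge replaces the monkeypatched AutoBridge so the test never
# touches Hugging Face; assertions check the ConfigContainer structure.
from dataclasses import dataclass, field


@dataclass
class ConfigContainer:            # stand-in for megatron.bridge's container
    model: dict = field(default_factory=dict)
    train: dict = field(default_factory=dict)


class _StubBridge:                # stands in for AutoBridge.from_hf_pretrained(...)
    def to_megatron_provider(self):
        return {"num_layers": 2, "freeze_language_model": True}


def cord_v2_sft_config(bridge=None) -> ConfigContainer:
    """Hypothetical recipe entry point built around an injectable bridge."""
    bridge = bridge or _StubBridge()
    return ConfigContainer(
        model=bridge.to_megatron_provider(),
        train={"train_iters": 20, "lr": 1e-4},
    )


def test_cord_v2_sft_config_structure():
    cfg = cord_v2_sft_config(bridge=_StubBridge())
    assert cfg.model["num_layers"] == 2
    assert cfg.train["train_iters"] == 20
```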

Suggested test cases

No perf tests impacted.

