VLM Energon updates for videos, multiple images#3691
huvunvidia wants to merge 17 commits into main
Conversation
Light Code Review
Critical Issues:
Missing Test Coverage:
Suggested test cases: No perf tests impacted.
/ok to test ca928cb
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Huy Vu2 <huvu@login-eos02.eos.clusters.nvidia.com>
/ok to test 9d1fed0
/ok to test 67e6f24
/ok to test 78aca60
Adapts the unit tests to the refactored encoder which now computes visual-token counts via .prod(dim=-1) (torch syntax) on the processor's image_grid_thw / video_grid_thw outputs. The mocks previously returned np.array, causing TypeError. Also bumps max_padding_length to 512 so the expanded sequence length stays within seq_len and avoids the new SkipSample() path. Signed-off-by: Huy Vu <huvu@nvidia.com>
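The commit above hinges on torch's `.prod(dim=-1)` call; a minimal sketch of why the old numpy mocks broke (torch tensors accept the `dim` keyword, while `np.ndarray.prod` only accepts `axis`; grid values here are illustrative):

```python
import numpy as np
import torch

# image_grid_thw as returned by the processor: one row per image,
# columns are the (temporal, height, width) patch-grid sizes.
image_grid_thw = torch.tensor([[1, 28, 28], [1, 14, 14]])

# Per-image visual-token counts (pre-merge): 1*28*28 and 1*14*14.
token_counts = image_grid_thw.prod(dim=-1)
print(token_counts.tolist())  # [784, 196]

# A numpy mock breaks on the same call: ndarray.prod takes `axis`, not `dim`.
try:
    np.array([[1, 28, 28]]).prod(dim=-1)
except TypeError:
    print("numpy mock raises TypeError")
```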
Adds README section describing the three composable controls that bound GPU cost per sample (min/max_pixels, max_num_images/max_num_frames, max_visual_tokens) and asserts the PEFT energon recipe defaults so the documented contract is enforced by tests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Huy Vu2 <huvu@login-eos02.eos.clusters.nvidia.com>
/ok to test bca268d
Pre-commit / ruff format requires two blank lines between a function and the following module-level block. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Huy Vu2 <huvu@login-eos02.eos.clusters.nvidia.com>
/ok to test 3b48e4e
…r config sync, and visual inputs video reshape Covers three pieces of recently added behavior: - Per-sample budget limits in QwenVLTaskEncoder (max_num_images skip, max_num_frames truncation, default values). - QwenVLEnergonProvider.build_datasets propagating CLI-overridable knobs onto the task encoder before delegating to the parent. - Qwen2_5_VLVisualInputs.normalized_for_model handling video tensors and mixed image/video shapes, including already-flat passthrough. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Huy Vu2 <huvu@login-eos02.eos.clusters.nvidia.com>
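The per-sample budget behavior this commit tests can be sketched as follows. This is a simplified stand-in, assuming the semantics described above (too many images drops the sample, too many frames truncates); the function name and return convention are hypothetical, not the actual QwenVLTaskEncoder API:

```python
from typing import Optional, Tuple, List


def apply_sample_budget(
    images: List, frames: List, max_num_images: int = 10, max_num_frames: int = 60
) -> Optional[Tuple[List, List]]:
    """Sketch of the per-sample budget rules: return None when the sample
    should be dropped (the real encoder raises SkipSample()), otherwise
    return the possibly-truncated inputs."""
    if len(images) > max_num_images:
        return None  # over the image budget: drop the whole sample
    if len(frames) > max_num_frames:
        frames = frames[:max_num_frames]  # over the frame budget: truncate
    return images, frames


# Usage: 3 images pass through unchanged, 70 frames are truncated to 60.
result = apply_sample_budget(["img"] * 3, ["frame"] * 70)
assert result is not None
images, frames = result
print(len(images), len(frames))  # 3 60
```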
/ok to test 21025ca
…deo paths Cover process_multi_image_inputs and process_video_inputs in examples/conversion/vlm_generate_utils.py, including the qwen-vl-utils ImportError fallback and the success paths with mocked processors. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Huy Vu2 <huvu@nvidia.com>
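A sketch of the optional-dependency pattern those ImportError tests exercise. This is a simplified illustration only, not the actual `process_video_inputs` in `examples/conversion/vlm_generate_utils.py`; the error message and structure are assumptions:

```python
def process_video_inputs_sketch(video_path: str):
    """Simplified sketch of the ImportError fallback under test:
    qwen-vl-utils is imported lazily and a clear error is raised
    when the optional dependency is missing."""
    try:
        from qwen_vl_utils import process_vision_info  # optional extra
    except ImportError as exc:
        raise ImportError(
            "qwen-vl-utils is required for video inputs; "
            "install it with `pip install qwen-vl-utils`"
        ) from exc
    # The real code would build chat messages and call process_vision_info
    # on them; that part is omitted here.
    return process_vision_info


# Either path is valid depending on whether qwen-vl-utils is installed.
try:
    process_video_inputs_sketch("clip.mp4")
    print("qwen-vl-utils available")
except ImportError:
    print("fallback path: qwen-vl-utils not installed")
```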
/ok to test a69be19
/ok to test c704146
Three independent CLI-overridable controls bound a sample's GPU cost. They compose:
- **`dataset.min_pixels` / `dataset.max_pixels`** — lower and upper bounds on image/frame resolution (defaults `200704` / `1003520`).
- **`dataset.max_num_images` / `dataset.max_num_frames`** — limit the count of images/frames (defaults `10` / `60`). Too many images → the sample is dropped. Too many frames → the frame list is truncated.
- **`dataset.max_visual_tokens`** — limits total visual tokens across all images and frames in a sample, computed post-rescaling as `prod(T,H,W) // merge_size²` (default `None` = disabled). Catches cases the other two miss (few images at high resolution, or many at low resolution). Exceeding samples are dropped.
default is 16384 in the code?
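Under the documented formula, the budget check can be sketched as below. The helper name is hypothetical, and whether the effective default is `None` or `16384` is exactly the reviewer's open question:

```python
import torch


def total_visual_tokens(grid_thw: torch.Tensor, merge_size: int = 2) -> int:
    """Total visual tokens for a sample, computed post-rescaling as the sum
    over images/frames of prod(T, H, W) // merge_size**2. A sketch of the
    documented budget formula, not the library implementation."""
    return int((grid_thw.prod(dim=-1) // merge_size**2).sum())


# Two images on a 28x28 patch grid: (1*28*28) // 4 = 196 tokens each.
grid = torch.tensor([[1, 28, 28], [1, 28, 28]])
tokens = total_visual_tokens(grid)
print(tokens)  # 392

max_visual_tokens = 16384  # value the reviewer sees in the code
if tokens > max_visual_tokens:
    pass  # the sample would be dropped
```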
            sample.__key__,
        )
        print(
            f"[DEBUG] (task_encoder.py) Truncating {len(v)} frames to max_num_frames={self.max_num_frames} for sample {sample.__key__}"
        target_length = input_ids.shape[0]

        if target_length > self.seq_len:
            logging.warning(f"Long sequence with length {target_length} found, dropped...")
the warning should stay i think? if visual tokens is short but text is long, there's no warning
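The drop path under discussion can be sketched as follows (the helper name is hypothetical; the real check lives inline in the task encoder):

```python
import logging


def within_seq_len(target_length: int, seq_len: int) -> bool:
    """Sketch of the guard in the diff: warn and reject samples whose
    expanded sequence exceeds seq_len, keep the rest."""
    if target_length > seq_len:
        logging.warning("Long sequence with length %d found, dropped...", target_length)
        return False  # caller drops the sample
    return True


print(within_seq_len(600, 512), within_seq_len(100, 512))  # False True
```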
        res = {}
        if images:
-           res["image_grid_thw"] = np.array([[1, 28, 28]])  # 1 tile, 28x28
+           res["image_grid_thw"] = torch.tensor([[1, 28, 28]])  # 1 tile, 28x28
why do we need this change? there's code in task encoder that still initializes an np array
image_thw_grids, video_thw_grids = (
np.array(image_thw_grids, dtype=np.int64),
np.array(video_thw_grids, dtype=np.int64),
)
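A small interop sketch of the point being made here: the mock must return a torch tensor because the encoder calls `.prod(dim=-1)` on it, while the quoted `np.array(...)` conversion elsewhere still works, since numpy can wrap a torch tensor directly (values are illustrative):

```python
import numpy as np
import torch

# Mock output must be a torch tensor: .prod(dim=-1) is a torch-only call.
grid = torch.tensor([[1, 28, 28]])
counts = grid.prod(dim=-1)  # works: torch accepts the `dim` keyword

# The task encoder's later np.array(...) initialization still succeeds,
# because numpy converts torch tensors via the __array__ protocol.
as_np = np.array(grid, dtype=np.int64)
print(as_np.shape, counts.item())  # (1, 3) 784
```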
What does this PR do ?
Verifying and improving VLM Energon to work with videos and multiple images.
Task PR: #3133
Changelog
GitHub Actions CI
See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.
Before your PR is "Ready for review"
Pre checks:
If you haven't finished some of the above items you can still open "Draft" PR.
Additional Information