
VLM Energon updates for videos, multiple images#3691

Open
huvunvidia wants to merge 17 commits into main from huvu/vlm_energon

Conversation

@huvunvidia
Contributor

@huvunvidia huvunvidia commented May 5, 2026

What does this PR do ?

Verifying and improving VLM Energon to work with videos, multiple images.
Task PR: #3133

Changelog

  • Add specific line-by-line info of high-level changes in this PR.

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items, you can still open a "Draft" PR.
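The "import guards for optional libraries" item in the checklist above usually means a pattern like the following sketch (numba is taken as an example from the checklist; the `HAVE_NUMBA` flag and `double` function are illustrative, not this repo's actual guards):

```python
# Import-guard sketch for an optional dependency. The module import is wrapped
# in try/except so the package works even when the library is not installed.
try:
    import numba

    HAVE_NUMBA = True
except ImportError:
    numba = None
    HAVE_NUMBA = False


def double(x):
    """Use the optional JIT when available, otherwise fall back to plain Python."""
    if HAVE_NUMBA:
        return numba.njit(lambda v: v * 2)(x)
    return x * 2
```

The reviewer's job is then to check that every use of the optional library sits behind the flag, so an `ImportError` can never escape at call time.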

Additional Information

  • Related to # (issue)

@copy-pr-bot

copy-pr-bot Bot commented May 5, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Review threads:
  • src/megatron/bridge/recipes/qwen_vl/data/energon/task_encoder.py (three threads, outdated)
  • tests/unit_tests/recipes/qwen_vl/test_qwen3_vl_recipes.py
@claude
Contributor

claude Bot commented May 5, 2026

Light Code Review

Critical Issues:

  1. Commented-out raise SkipSample() is a logic bug (task_encoder.py:350): The warning says "dropping sample" but the raise is commented out, so the sample silently proceeds into truncation that the warning itself says will corrupt visual tokens. Either uncomment the raise or fix the log message.

  2. 5 bare print() debug statements (task_encoder.py:226, 239, 269, 349): Project rules prohibit bare print(). Use logging or print_rank_0(). Each debug print is redundant with a logging.warning() call directly above it. These should be removed before merge.

  3. DEBUGGING comment and commented-out code (task_encoder.py:60-61): The DEBUGGING label is not a valid justification for keeping commented-out code. The real reason (pre-decoded WDS frames) is explained in the comment block below. Clean up the stale marker and the dead line.
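Issues 1 and 2 share one fix: route the message through logging and actually raise. A minimal sketch (the `SkipSample` class here is a stand-in for energon's exception, since the exact import path isn't shown in this thread; `check_visual_budget` and its parameters are illustrative names):

```python
import logging


class SkipSample(Exception):
    """Stand-in for energon's SkipSample; raising it drops the current sample."""


def check_visual_budget(n_visual_tokens, max_visual_tokens, sample_key):
    """Warn and drop the sample when it exceeds the visual-token budget."""
    if max_visual_tokens is not None and n_visual_tokens > max_visual_tokens:
        logging.warning(
            "Sample %s has %d visual tokens (> %d), dropping sample",
            sample_key, n_visual_tokens, max_visual_tokens,
        )
        raise SkipSample()  # must actually raise, not sit commented out
```

With the raise commented out, the warning claims the sample was dropped while it silently flows into truncation, which is exactly the corruption the warning describes.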

Missing Test Coverage:

  • The new QwenVLEnergonProvider fields (max_num_images, max_num_frames, max_visual_tokens) are not asserted in test_qwen3_vl_8b_peft_energon_task_encoder.
  • No test covers QwenVLEnergonProvider.build_datasets syncing its fields onto the task encoder.
  • No test covers the new pixel_values_videos / video_grid_thw normalization paths in Qwen2_5_VLVisualInputs.normalized_for_model().
  • No test covers the max_num_images skip, max_num_frames truncation, or max_visual_tokens skip logic in encode_sample.

Suggested test cases:

No perf tests impacted.

@huvunvidia
Contributor Author

/ok to test ca928cb

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Huy Vu2 <huvu@login-eos02.eos.clusters.nvidia.com>
@huvunvidia
Contributor Author

/ok to test 9d1fed0

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Huy Vu2 <huvu@login-eos02.eos.clusters.nvidia.com>
@huvunvidia
Contributor Author

/ok to test 67e6f24

Huy Vu2 and others added 2 commits May 6, 2026 10:13
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Huy Vu2 <huvu@login-eos02.eos.clusters.nvidia.com>
@huvunvidia
Contributor Author

/ok to test 78aca60

@yaoyu-33 yaoyu-33 added area:data Dataset builders, preprocessing, and samplers needs-review PR is ready for code review and waiting on a reviewer labels May 7, 2026
huvunvidia and others added 3 commits May 7, 2026 08:22
Adapts the unit tests to the refactored encoder which now computes
visual-token counts via .prod(dim=-1) (torch syntax) on the processor's
image_grid_thw / video_grid_thw outputs. The mocks previously returned
np.array, causing TypeError. Also bumps max_padding_length to 512 so
the expanded sequence length stays within seq_len and avoids the new
SkipSample() path.
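The TypeError the commit describes comes from the keyword mismatch between the two libraries: torch reductions take `dim=`, numpy reductions take `axis=`, so a numpy-array mock fed into `.prod(dim=-1)` blows up. A small illustration (grid values are made up):

```python
import numpy as np
import torch

# One (T, H, W) grid per visual; token counts are products over the last axis.
grid_thw = torch.tensor([[1, 28, 28], [2, 14, 14]])
tokens = grid_thw.prod(dim=-1)          # torch accepts dim=
assert tokens.tolist() == [784, 392]

np_grid = np.array([[1, 28, 28]], dtype=np.int64)
raised = False
try:
    np_grid.prod(dim=-1)                # numpy's prod takes axis=, so this raises
except TypeError:
    raised = True
assert raised
assert np_grid.prod(axis=-1).tolist() == [784]
```

Returning `torch.tensor` from the mocks, as this commit does, matches what the real processor hands the encoder.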

Signed-off-by: Huy Vu <huvu@nvidia.com>
Adds README section describing the three composable controls that bound
GPU cost per sample (min/max_pixels, max_num_images/max_num_frames,
max_visual_tokens) and asserts the PEFT energon recipe defaults so the
documented contract is enforced by tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Huy Vu2 <huvu@login-eos02.eos.clusters.nvidia.com>
@huvunvidia
Contributor Author

/ok to test bca268d

Pre-commit / ruff format requires two blank lines between a function and
the following module-level block.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Huy Vu2 <huvu@login-eos02.eos.clusters.nvidia.com>
@huvunvidia
Contributor Author

/ok to test 3b48e4e

Huy Vu2 and others added 2 commits May 7, 2026 14:19
…r config sync, and visual inputs video reshape

Covers three pieces of recently added behavior:
- Per-sample budget limits in QwenVLTaskEncoder (max_num_images skip,
  max_num_frames truncation, default values).
- QwenVLEnergonProvider.build_datasets propagating CLI-overridable knobs
  onto the task encoder before delegating to the parent.
- Qwen2_5_VLVisualInputs.normalized_for_model handling video tensors and
  mixed image/video shapes, including already-flat passthrough.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Huy Vu2 <huvu@login-eos02.eos.clusters.nvidia.com>
@huvunvidia
Contributor Author

/ok to test 21025ca

huvunvidia and others added 2 commits May 7, 2026 21:58
…deo paths

Cover process_multi_image_inputs and process_video_inputs in
examples/conversion/vlm_generate_utils.py, including the qwen-vl-utils
ImportError fallback and the success paths with mocked processors.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Huy Vu2 <huvu@nvidia.com>
@cuichenx cuichenx self-requested a review May 8, 2026 16:51
@huvunvidia
Contributor Author

/ok to test a69be19

@huvunvidia
Contributor Author

/ok to test c704146

Three independent CLI-overridable controls bound a sample's GPU cost. They compose:
- **`dataset.min_pixels` / `dataset.max_pixels`**: lower and upper bounds on per-image/per-frame resolution (defaults `200704` / `1003520`).
- **`dataset.max_num_images` / `dataset.max_num_frames`**: limits on the number of images/frames (defaults `10` / `60`). Too many images → sample is dropped; too many frames → frame list is truncated.
- **`dataset.max_visual_tokens`**: limit on the total visual tokens across all images and frames in a sample, computed after rescaling as `prod(T, H, W) // merge_size²` (default `None` = disabled). Catches cases the other two miss (a few images at high resolution, or many at low resolution); samples over the limit are dropped.
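How the three controls compose can be sketched in a few lines (function names and the drop-vs-truncate return convention are illustrative; `merge_size=2` is assumed, matching Qwen2.5-VL's spatial merge):

```python
from math import prod


def visual_token_count(grids, merge_size=2):
    """Total visual tokens for a sample, given one (T, H, W) grid per visual."""
    return sum(prod(g) // merge_size**2 for g in grids)


def admit_sample(image_grids, video_frame_grids,
                 max_num_images=10, max_num_frames=60, max_visual_tokens=None):
    if len(image_grids) > max_num_images:
        return None                                  # too many images: drop
    video_frame_grids = video_frame_grids[:max_num_frames]  # truncate frames
    total = visual_token_count(image_grids + video_frame_grids)
    if max_visual_tokens is not None and total > max_visual_tokens:
        return None                                  # over token budget: drop
    return total
```

The pixel bounds act earlier, in the processor's rescaling, so by the time these checks run each grid already reflects the `min_pixels`/`max_pixels` clamp.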

default is 16384 in the code?

sample.__key__,
)
print(
f"[DEBUG] (task_encoder.py) Truncating {len(v)} frames to max_num_frames={self.max_num_frames} for sample {sample.__key__}"

this should be removed

target_length = input_ids.shape[0]

if target_length > self.seq_len:
logging.warning(f"Long sequence with length {target_length} found, dropped...")

The warning should stay, I think? If the visual tokens are short but the text is long, there's otherwise no warning.

res = {}
if images:
res["image_grid_thw"] = np.array([[1, 28, 28]]) # 1 tile, 28x28
res["image_grid_thw"] = torch.tensor([[1, 28, 28]]) # 1 tile, 28x28

why do we need this change? there's code in task encoder that still initializes an np array

            image_thw_grids, video_thw_grids = (
                np.array(image_thw_grids, dtype=np.int64),
                np.array(video_thw_grids, dtype=np.int64),
            )
