VLM Energon updates for videos, multiple images#3691
huvunvidia wants to merge 17 commits into main
Conversation
Light Code Review
Critical Issues:
Missing Test Coverage:
Suggested test cases: No perf tests impacted.
/ok to test ca928cb
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Huy Vu2 <huvu@login-eos02.eos.clusters.nvidia.com>
/ok to test 9d1fed0
/ok to test 67e6f24
/ok to test 78aca60
Adapts the unit tests to the refactored encoder which now computes visual-token counts via .prod(dim=-1) (torch syntax) on the processor's image_grid_thw / video_grid_thw outputs. The mocks previously returned np.array, causing TypeError. Also bumps max_padding_length to 512 so the expanded sequence length stays within seq_len and avoids the new SkipSample() path. Signed-off-by: Huy Vu <huvu@nvidia.com>
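The commit above hinges on torch's `.prod(dim=-1)` call; a minimal sketch of why the old numpy mocks broke (torch tensors accept the `dim` keyword, while `np.ndarray.prod` only accepts `axis`; grid values here are illustrative):

```python
import numpy as np
import torch

# image_grid_thw as returned by the processor: one row per image,
# columns are the (temporal, height, width) patch-grid sizes.
image_grid_thw = torch.tensor([[1, 28, 28], [1, 14, 14]])

# Per-image visual-token counts (pre-merge): 1*28*28 and 1*14*14.
token_counts = image_grid_thw.prod(dim=-1)
print(token_counts.tolist())  # [784, 196]

# A numpy mock breaks on the same call: ndarray.prod takes `axis`, not `dim`.
try:
    np.array([[1, 28, 28]]).prod(dim=-1)
except TypeError:
    print("numpy mock raises TypeError")
```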
Adds README section describing the three composable controls that bound GPU cost per sample (min/max_pixels, max_num_images/max_num_frames, max_visual_tokens) and asserts the PEFT energon recipe defaults so the documented contract is enforced by tests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Huy Vu2 <huvu@login-eos02.eos.clusters.nvidia.com>
/ok to test bca268d
Pre-commit / ruff format requires two blank lines between a function and the following module-level block. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Huy Vu2 <huvu@login-eos02.eos.clusters.nvidia.com>
/ok to test 3b48e4e
…r config sync, and visual inputs video reshape Covers three pieces of recently added behavior: - Per-sample budget limits in QwenVLTaskEncoder (max_num_images skip, max_num_frames truncation, default values). - QwenVLEnergonProvider.build_datasets propagating CLI-overridable knobs onto the task encoder before delegating to the parent. - Qwen2_5_VLVisualInputs.normalized_for_model handling video tensors and mixed image/video shapes, including already-flat passthrough. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Huy Vu2 <huvu@login-eos02.eos.clusters.nvidia.com>
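The per-sample budget behavior this commit tests can be sketched as follows. This is a simplified stand-in, assuming the semantics described above (too many images drops the sample, too many frames truncates); the function name and return convention are hypothetical, not the actual QwenVLTaskEncoder API:

```python
from typing import Optional, Tuple, List


def apply_sample_budget(
    images: List, frames: List, max_num_images: int = 10, max_num_frames: int = 60
) -> Optional[Tuple[List, List]]:
    """Sketch of the per-sample budget rules: return None when the sample
    should be dropped (the real encoder raises SkipSample()), otherwise
    return the possibly-truncated inputs."""
    if len(images) > max_num_images:
        return None  # over the image budget: drop the whole sample
    if len(frames) > max_num_frames:
        frames = frames[:max_num_frames]  # over the frame budget: truncate
    return images, frames


# Usage: 3 images pass through unchanged, 70 frames are truncated to 60.
result = apply_sample_budget(["img"] * 3, ["frame"] * 70)
assert result is not None
images, frames = result
print(len(images), len(frames))  # 3 60
```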
/ok to test 21025ca
…deo paths Cover process_multi_image_inputs and process_video_inputs in examples/conversion/vlm_generate_utils.py, including the qwen-vl-utils ImportError fallback and the success paths with mocked processors. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Huy Vu2 <huvu@nvidia.com>
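A sketch of the optional-dependency pattern those ImportError tests exercise. This is a simplified illustration only, not the actual `process_video_inputs` in `examples/conversion/vlm_generate_utils.py`; the error message and structure are assumptions:

```python
def process_video_inputs_sketch(video_path: str):
    """Simplified sketch of the ImportError fallback under test:
    qwen-vl-utils is imported lazily and a clear error is raised
    when the optional dependency is missing."""
    try:
        from qwen_vl_utils import process_vision_info  # optional extra
    except ImportError as exc:
        raise ImportError(
            "qwen-vl-utils is required for video inputs; "
            "install it with `pip install qwen-vl-utils`"
        ) from exc
    # The real code would build chat messages and call process_vision_info
    # on them; that part is omitted here.
    return process_vision_info


# Either path is valid depending on whether qwen-vl-utils is installed.
try:
    process_video_inputs_sketch("clip.mp4")
    print("qwen-vl-utils available")
except ImportError:
    print("fallback path: qwen-vl-utils not installed")
```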
/ok to test a69be19
/ok to test c704146
Three independent CLI-overridable controls bound a sample's GPU cost. They compose:
- **`dataset.min_pixels` / `dataset.max_pixels`** — lower and upper bounds on image/frame resolution (defaults `200704` / `1003520`).
- **`dataset.max_num_images` / `dataset.max_num_frames`** — limit the count of images/frames (defaults `10` / `60`). Too many images → the sample is dropped. Too many frames → the frame list is truncated.
- **`dataset.max_visual_tokens`** — limits total visual tokens across all images and frames in a sample, computed post-rescaling as `prod(T,H,W) // merge_size²` (default `None` = disabled). Catches cases the other two miss (few images at high resolution, or many at low resolution). Exceeding samples are dropped.
default is 16384 in the code?
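Under the documented formula, the budget check can be sketched as below. The helper name is hypothetical, and whether the effective default is `None` or `16384` is exactly the reviewer's open question:

```python
import torch


def total_visual_tokens(grid_thw: torch.Tensor, merge_size: int = 2) -> int:
    """Total visual tokens for a sample, computed post-rescaling as the sum
    over images/frames of prod(T, H, W) // merge_size**2. A sketch of the
    documented budget formula, not the library implementation."""
    return int((grid_thw.prod(dim=-1) // merge_size**2).sum())


# Two images on a 28x28 patch grid: (1*28*28) // 4 = 196 tokens each.
grid = torch.tensor([[1, 28, 28], [1, 28, 28]])
tokens = total_visual_tokens(grid)
print(tokens)  # 392

max_visual_tokens = 16384  # value the reviewer sees in the code
if tokens > max_visual_tokens:
    pass  # the sample would be dropped
```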
            sample.__key__,
        )
        print(
            f"[DEBUG] (task_encoder.py) Truncating {len(v)} frames to max_num_frames={self.max_num_frames} for sample {sample.__key__}"
        target_length = input_ids.shape[0]

        if target_length > self.seq_len:
            logging.warning(f"Long sequence with length {target_length} found, dropped...")
the warning should stay i think? if visual tokens is short but text is long, there's no warning
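The drop path under discussion can be sketched as follows (the helper name is hypothetical; the real check lives inline in the task encoder):

```python
import logging


def within_seq_len(target_length: int, seq_len: int) -> bool:
    """Sketch of the guard in the diff: warn and reject samples whose
    expanded sequence exceeds seq_len, keep the rest."""
    if target_length > seq_len:
        logging.warning("Long sequence with length %d found, dropped...", target_length)
        return False  # caller drops the sample
    return True


print(within_seq_len(600, 512), within_seq_len(100, 512))  # False True
```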
        res = {}
        if images:
-           res["image_grid_thw"] = np.array([[1, 28, 28]])  # 1 tile, 28x28
+           res["image_grid_thw"] = torch.tensor([[1, 28, 28]])  # 1 tile, 28x28
why do we need this change? there's code in task encoder that still initializes an np array
image_thw_grids, video_thw_grids = (
np.array(image_thw_grids, dtype=np.int64),
np.array(video_thw_grids, dtype=np.int64),
)
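A small interop sketch of the point being made here: the mock must return a torch tensor because the encoder calls `.prod(dim=-1)` on it, while the quoted `np.array(...)` conversion elsewhere still works, since numpy can wrap a torch tensor directly (values are illustrative):

```python
import numpy as np
import torch

# Mock output must be a torch tensor: .prod(dim=-1) is a torch-only call.
grid = torch.tensor([[1, 28, 28]])
counts = grid.prod(dim=-1)  # works: torch accepts the `dim` keyword

# The task encoder's later np.array(...) initialization still succeeds,
# because numpy converts torch tensors via the __array__ protocol.
as_np = np.array(grid, dtype=np.int64)
print(as_np.shape, counts.item())  # (1, 3) 784
```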
What does this PR do ?
Verifying and improving VLM Energon to work with videos and multiple images.
Task PR: #3133
Changelog
GitHub Actions CI
See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.
Before your PR is "Ready for review"
Pre checks:
If you haven't finished some of the above items you can still open "Draft" PR.
Additional Information