Performance Drop on MMVP Compared to Qwen2.5-VL Base Model #12

@chaixinning

Hi,

Thanks for your great work on TreeVGR!

I’m currently evaluating TreeVGR on multiple benchmarks and have observed consistent improvements over the base model (Qwen2.5-VL) across several datasets (e.g., MMStar, POPE, RealWorldQA, ScienceQA). However, I noticed an unexpected performance drop on MMVP.

Results Comparison

  • Qwen2.5-VL (base model):

    • Single-question accuracy: 77.33
    • Overall accuracy: 56.00
  • TreeVGR:

    • Single-question accuracy: 75.00
    • Overall accuracy: 52.00

Observations

  • The performance degradation is specific to MMVP; other benchmarks show improvements.
  • MMVP consists of paired questions designed to reduce language prior bias and enforce fine-grained visual reasoning.
  • This suggests that TreeVGR may behave differently when:
    • subtle visual distinctions are required, or
    • language priors are explicitly suppressed
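
For reference, here is my understanding of why the paired protocol can amplify a small per-question drop. This is a hypothetical sketch of MMVP-style scoring, not the benchmark's or VLMEvalKit's actual code: a pair counts toward overall accuracy only if both of its questions are answered correctly, so overall accuracy is always at or below single-question accuracy, and errors that land in different pairs hurt more than errors concentrated in the same pair.

```python
from collections import defaultdict

def mmvp_scores(results):
    """results: iterable of (pair_id, is_correct); two questions per pair.

    Returns (single_question_acc, overall_acc) as percentages.
    A pair contributes to overall accuracy only when BOTH of its
    answers are correct (MMVP-style paired scoring, as I understand it).
    """
    pairs = defaultdict(list)
    for pair_id, ok in results:
        pairs[pair_id].append(ok)
    n_questions = sum(len(oks) for oks in pairs.values())
    single = 100 * sum(ok for oks in pairs.values() for ok in oks) / n_questions
    overall = 100 * sum(all(oks) for oks in pairs.values()) / len(pairs)
    return single, overall
```

For example, three correct answers out of four questions give 75.0 single-question accuracy, but if the one error splits a pair, overall accuracy falls to 50.0:
`mmvp_scores([(0, True), (0, True), (1, True), (1, False)])` → `(75.0, 50.0)`.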

Questions

  1. Have you observed similar behavior on MMVP during your experiments?
  2. Could this be related to:
    • the model relying more on high-level reasoning patterns rather than fine-grained visual cues?
    • or potential sensitivity to the paired-question evaluation protocol in MMVP?
  3. Are there any recommended evaluation settings (e.g., prompting, decoding strategy) specifically for MMVP?

Additional Context

  • Evaluation is conducted using VLMEvalKit with consistent settings across models.
  • Exact match is used for answer comparison.
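
For concreteness, the comparison I mean by "exact match" is roughly the following: a minimal, hypothetical sketch assuming only whitespace trimming and case folding, not VLMEvalKit's actual matching logic (which may apply additional answer extraction).

```python
def exact_match(pred: str, gold: str) -> bool:
    """Whitespace-trimmed, case-insensitive string equality.

    A simplified stand-in for exact-match scoring; any extra
    normalization (e.g. stripping option markers like "(a)")
    would be applied before this check.
    """
    return pred.strip().lower() == gold.strip().lower()
```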

Any insights would be greatly appreciated!

Thanks again for your work.
