Hi,
Thanks for your great work on TreeVGR!
I’m currently evaluating TreeVGR on multiple benchmarks and have observed consistent improvements over the base model (Qwen2.5-VL) across several datasets (e.g., MMStar, POPE, RealWorldQA, ScienceQA). However, I noticed an unexpected performance drop on MMVP.
Results Comparison
- Qwen2.5-VL (base model):
  - Single-question accuracy: 77.33
  - Overall accuracy: 56.00
- TreeVGR:
  - Single-question accuracy: 75.00
  - Overall accuracy: 52.00
Observations
- The performance degradation is specific to MMVP; other benchmarks show improvements.
- MMVP consists of paired questions designed to reduce language prior bias and enforce fine-grained visual reasoning.
- This suggests that TreeVGR may behave differently in settings that:
  - require subtle visual distinctions, or
  - explicitly suppress language priors.
Questions
- Have you observed similar behavior on MMVP during your experiments?
- Could this be related to:
  - the model relying more on high-level reasoning patterns than on fine-grained visual cues, or
  - sensitivity to MMVP's paired-question evaluation protocol?
- Are there any recommended evaluation settings (e.g., prompting, decoding strategy) specifically for MMVP?
Additional Context
- Evaluation is conducted using VLMEvalKit with consistent settings across models.
- Exact match is used for answer comparison.
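In case it helps pin down where the drop comes from, here is a minimal sketch of MMVP-style pair-level scoring as I understand it: each image pair contributes two questions, and the pair counts as correct only if both questions are answered correctly. The function name and the assumption that paired questions are adjacent in the list are mine, not from the TreeVGR or VLMEvalKit code.

```python
def mmvp_scores(correct_flags):
    """Compute single-question and pair-level accuracy.

    correct_flags: list of booleans where questions 2i and 2i+1
    form a pair (assumed ordering, not verified against VLMEvalKit).
    """
    assert len(correct_flags) % 2 == 0, "expected an even number of questions"
    # Single-question accuracy: fraction of individually correct answers.
    single_acc = sum(correct_flags) / len(correct_flags)
    # Pair accuracy: a pair is correct only if both members are correct.
    pairs = [correct_flags[i] and correct_flags[i + 1]
             for i in range(0, len(correct_flags), 2)]
    pair_acc = sum(pairs) / len(pairs)
    return single_acc, pair_acc

# Example: 3 of 4 questions correct, but only 1 of 2 pairs fully correct.
single, pair = mmvp_scores([True, True, True, False])
# single == 0.75, pair == 0.5
```

Under this scoring, a small drop in single-question accuracy (77.33 → 75.00) can translate into a larger drop in pair accuracy (56.00 → 52.00) if the new errors are spread across previously correct pairs, which may be what I am seeing.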
Any insights would be greatly appreciated!
Thanks again for your work.