Hi,
Thanks for your great work on TreeVGR!
I’m currently evaluating TreeVGR on multiple benchmarks and have observed consistent improvements over the base model (Qwen2.5-VL) across several datasets (e.g., MMStar, POPE, RealWorldQA, ScienceQA). However, I noticed an unexpected performance drop on MMVP.
Results Comparison
- Qwen2.5-VL (base model):
  - Single-question accuracy: 77.33
  - Overall accuracy: 56.00
- TreeVGR:
  - Single-question accuracy: 75.00
  - Overall accuracy: 52.00
Observations
- The performance degradation is specific to MMVP; other benchmarks show improvements.
- MMVP consists of paired questions designed to reduce language prior bias and enforce fine-grained visual reasoning.
- This suggests that TreeVGR may behave differently in settings that:
  - require subtle visual distinctions, or
  - explicitly suppress language priors.
Questions
- Have you observed similar behavior on MMVP during your experiments?
- Could this be related to:
  - the model relying more on high-level reasoning patterns than on fine-grained visual cues, or
  - sensitivity to MMVP's paired-question evaluation protocol?
- Are there any recommended evaluation settings (e.g., prompting, decoding strategy) specifically for MMVP?
Additional Context
- Evaluation is conducted using VLMEvalKit with consistent settings across models.
- Exact match is used for answer comparison.
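In case it helps pin down where the drop comes from, here is a minimal sketch of MMVP-style pair-level scoring as I understand it: each image pair contributes two questions, and the pair counts as correct only if both questions are answered correctly. The function name and the assumption that paired questions are adjacent in the list are mine, not from the TreeVGR or VLMEvalKit code.

```python
def mmvp_scores(correct_flags):
    """Compute single-question and pair-level accuracy.

    correct_flags: list of booleans where questions 2i and 2i+1
    form a pair (assumed ordering, not verified against VLMEvalKit).
    """
    assert len(correct_flags) % 2 == 0, "expected an even number of questions"
    # Single-question accuracy: fraction of individually correct answers.
    single_acc = sum(correct_flags) / len(correct_flags)
    # Pair accuracy: a pair is correct only if both members are correct.
    pairs = [correct_flags[i] and correct_flags[i + 1]
             for i in range(0, len(correct_flags), 2)]
    pair_acc = sum(pairs) / len(pairs)
    return single_acc, pair_acc

# Example: 3 of 4 questions correct, but only 1 of 2 pairs fully correct.
single, pair = mmvp_scores([True, True, True, False])
# single == 0.75, pair == 0.5
```

Under this scoring, a small drop in single-question accuracy (77.33 → 75.00) can translate into a larger drop in pair accuracy (56.00 → 52.00) if the new errors are spread across previously correct pairs, which may be what I am seeing.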
Any insights would be greatly appreciated!
Thanks again for your work.