Question: Resolution Mismatch (224px vs 336px) in Evaluation with Robust CLIP Encoders #2

@shouyezhe

Hi @HashmatShadab,

Thank you for this excellent work on VLM robustness! I have a question about the evaluation setup.

Issue
When robust CLIP encoders trained at 224px (FARE, SimCLIP) are evaluated with LLaVA-v1.5-7b, which uses a 336px CLIP encoder, the encoder output shapes differ:

- 224px encoder outputs: [batch, 257, 1024] (16×16+1 tokens)
- 336px encoder outputs: [batch, 577, 1024] (24×24+1 tokens)
- LLaVA-v1.5 mm_projector expects: 577 tokens

This token-count mismatch (257 vs 577) could cause:

- Runtime errors
- Incorrect vision-language alignment
- Unreliable evaluation results

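For reference, the token counts above follow from the patch grid. Here is a minimal sketch, assuming a patch size of 14 (implied by 224/16 = 336/24 = 14) and one class token. The helper name `vit_token_count` is hypothetical and is not taken from the repository:

```python
def vit_token_count(image_size: int, patch_size: int = 14, cls_tokens: int = 1) -> int:
    """Tokens emitted by a ViT for a square image: grid*grid patches plus class token(s).

    patch_size=14 is an assumption inferred from the shapes quoted in this issue.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    grid = image_size // patch_size  # patches per side
    return grid * grid + cls_tokens


for size in (224, 336):
    # Prints the per-image token count for each input resolution.
    print(f"{size}px -> {vit_token_count(size)} tokens")
```

This only reproduces the arithmetic behind 257 and 577. It does not say how the repository handles the difference.
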
Questions
Looking at your code (clip_encoder.py:48-107), you support both resolutions (fare4 and fare4_336).

Could you clarify:

- Did you retrain a separate mm_projector for each encoder resolution, in particular for the 224px encoders?
- For the FARE4@224 and SimCLIP4@224 results in the paper, which mm_projector weights were used?
- Are there separate checkpoint files for the different encoder configurations?
