Question: Resolution Mismatch (224px vs 336px) in Evaluation with Robust CLIP Encoders

Hi @HashmatShadab,

Thank you for this excellent work on VLM robustness! I have a question about the evaluation setup.

Issue
When robust CLIP encoders trained at 224px (FARE, SimCLIP) are evaluated with LLaVA-v1.5-7b, which uses a 336px CLIP encoder, the vision token counts differ:

- 224px encoder outputs: [batch, 257, 1024] (16×16+1 tokens)
- 336px encoder outputs: [batch, 577, 1024] (24×24+1 tokens)
- LLaVA-v1.5 mm_projector expects: 577 tokens

This token-count mismatch (257 vs 577) could cause:

- Runtime errors
- Incorrect vision-language alignment
- Unreliable evaluation results

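
For reference, the token counts above follow from simple patch arithmetic. The snippet below is a minimal sketch, assuming a ViT-L/14 backbone (patch size 14) with one class token; `vit_token_count` is a hypothetical helper written for this illustration, not a function from the repository.

```python
def vit_token_count(image_size: int, patch_size: int = 14, cls_token: bool = True) -> int:
    """Return how many tokens a ViT emits for a square input image.

    Assumption: patch_size=14 matches CLIP ViT-L/14; adjust for other backbones.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    # One token per patch, plus an optional class token.
    return patches_per_side * patches_per_side + (1 if cls_token else 0)


if __name__ == "__main__":
    for size in (224, 336):
        print(f"{size}px -> {vit_token_count(size)} tokens")
```

Under these assumptions, the helper reproduces the 257 and 577 figures quoted above, so the gap comes purely from input resolution rather than from the hidden size (1024 in both cases).
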
Questions
Looking at your code ([clip_encoder.py:48-107](https://github.com/HashmatShadab/Robust-LLaVA/blob/main/llava/model/multimodal_encoder/clip_encoder.py#L48-L107)), you support both resolutions (fare4 and fare4_336).

Could you clarify:

1. Did you retrain a separate mm_projector for each encoder resolution (224px and 336px)?
2. For the FARE4@224 and SimCLIP4@224 results in the paper, which mm_projector weights were used?
3. Are there separate checkpoint files for the different encoder configurations?
