Hi @HashmatShadab,
Thank you for this excellent work on VLM robustness! I have a question about the evaluation setup.
Issue
When robust CLIP encoders (FARE, SimCLIP) trained at 224px are evaluated inside LLaVA-v1.5-7b, which uses a 336px CLIP encoder, the token counts differ:
224px encoder outputs: [batch, 257, 1024] (16×16+1 tokens)
336px encoder outputs: [batch, 577, 1024] (24×24+1 tokens)
LLaVA-v1.5 mm_projector expects: 577 tokens
This token-count mismatch (257 vs 577; the hidden size is 1024 in both cases) could cause:
Runtime errors
Incorrect vision-language alignment
Unreliable evaluation results
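For reference, here is a minimal sketch of how I arrived at the token counts above. It assumes a ViT-L/14 backbone (patch size 14, which matches the 16×16 and 24×24 grids) and one extra class token; `num_vision_tokens` is just an illustrative helper of mine, not a function from this repository.

```python
def num_vision_tokens(image_size: int, patch_size: int = 14, cls_token: bool = True) -> int:
    """Token count a ViT emits for a square image (assumed patch size: 14)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    grid = image_size // patch_size  # patches per side: 16 at 224px, 24 at 336px
    return grid * grid + (1 if cls_token else 0)


for size in (224, 336):
    print(size, num_vision_tokens(size))
```

With `cls_token=False` the same helper gives the patch-only counts (256 and 576), in case the class token is dropped before the projector.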
Questions
From your code (clip_encoder.py:48-107), it looks like both resolutions are supported (fare4 and fare4_336).
Could you clarify:
Did you retrain a separate mm_projector for each resolution (in particular, for 224px)?
For FARE4@224 and SimCLIP4@224 results in the paper, which mm_projector weights were used?
Are there separate checkpoint files for different encoder configurations?