SCAM: A Real-World Typographic Robustness Evaluation for Multimodal Foundation Models

Project Page - Dataset - Paper - Main Repository

Typographic attacks exploit the interplay between text and visual content in multimodal foundation models, causing misclassifications when misleading text is embedded within images. Existing datasets are limited in size and diversity, making it difficult to study such vulnerabilities. In this paper, we introduce SCAM, the largest and most diverse dataset of real-world typographic attack images to date, containing 1162 images across hundreds of object categories and attack words. Through extensive benchmarking of Vision-Language Models on SCAM, we demonstrate that typographic attacks significantly degrade performance, and identify that training data and model architecture influence the susceptibility to these attacks. Our findings indicate that typographic attacks remain effective against state-of-the-art Large Vision-Language Models, especially those employing vision encoders inherently vulnerable to such attacks. However, employing larger Large Language Model backbones reduces this vulnerability while simultaneously enhancing typographic understanding. Additionally, we demonstrate that synthetic attacks closely resemble real-world (handwritten) attacks, validating their use in research. Our work provides a comprehensive resource and empirical insights to facilitate future research toward robust and trustworthy multimodal AI systems.

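This repository contains the static website that showcases the SCAM dataset and the results of the robustness evaluation.

Because the site is static, a simple local server is enough to run it. Run the following from the repository root, then open http://localhost:8000 (the default port of Python's http.server):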

python -m http.server
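
If you would rather start the server from a script (for example to pick a different port or folder), the command above can be approximated with the standard library alone. This is a minimal sketch, not part of the repository: the function name `make_server` and its parameters are hypothetical, and it assumes Python 3.7 or newer.

```python
import functools
import http.server


def make_server(directory=".", port=8000, host="127.0.0.1"):
    """Build (but do not start) an HTTP server for static files.

    `make_server` is a hypothetical helper for illustration only.
    `directory` is the folder to serve, e.g. the repository root
    that contains index.html. Passing port=0 lets the OS pick a
    free port.
    """
    # SimpleHTTPRequestHandler accepts a `directory` keyword (Python 3.7+),
    # so the working directory does not have to be changed.
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    return http.server.ThreadingHTTPServer((host, port), handler)
```

Calling `make_server().serve_forever()` from the repository root should behave much like the one-line command, except that this sketch binds to 127.0.0.1 only; stop it with Ctrl+C.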

Structure

index.html: Main page
styles.css: Styles
script.js: JavaScript for the page
images/: Images for the main page
data_images/: Dataset images folder
data_images_generator.py: Script to generate dataset images
data_converter_{lvlm,vlm}.py: Scripts to generate dataset files from the combined results CSV
data/: Dataset files (generated; do not edit)
- {lvlm,vlm}_models_properties.json: (L)VLM model properties
- {lvlm,vlm}_similarity_metadata.json: Metadata about similarity scores
- {lvlm,vlm}_similarity_data.bin: Binary file with similarity scores
- {lvlm,vlm}_similarity_index.json: Index file with image metadata, grouped by base image
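
The README does not document the layout of the generated files, so the following is only a sketch of how a binary score file with a JSON sidecar could be read. Everything about the layout is an assumption made for illustration: that the .bin file holds little-endian float32 values in row-major order, and that the metadata JSON has a `shape` key of the form `[rows, cols]`. The helper name `load_similarity` is hypothetical. Check data_converter_{lvlm,vlm}.py for the real format before relying on this.

```python
import json
import struct
from pathlib import Path


def load_similarity(data_dir, prefix="vlm"):
    """Read <prefix>_similarity_data.bin under an ASSUMED layout.

    Assumptions (not confirmed by the README):
      * the metadata JSON contains "shape": [rows, cols]
      * the binary file is rows * cols little-endian float32 values,
        stored row by row
    Returns a list of rows, each a list of floats.
    """
    data_dir = Path(data_dir)
    meta_path = data_dir / f"{prefix}_similarity_metadata.json"
    bin_path = data_dir / f"{prefix}_similarity_data.bin"

    meta = json.loads(meta_path.read_text())
    rows, cols = meta["shape"]  # hypothetical key

    raw = bin_path.read_bytes()
    expected = rows * cols * 4  # 4 bytes per float32
    if len(raw) != expected:
        raise ValueError(
            f"expected {expected} bytes for shape {rows}x{cols}, got {len(raw)}"
        )

    # "<" forces little-endian, "f" is a 4-byte float.
    flat = struct.unpack(f"<{rows * cols}f", raw)
    return [list(flat[r * cols:(r + 1) * cols]) for r in range(rows)]
```

The size check is there so that a mismatch between the assumed layout and the real file fails loudly instead of returning misaligned scores.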
