Zero-shot anomaly detection (ZSAD) methods can effectively mitigate the difficulty of data collection and the scarcity of data in industrial scenarios. However, single-modal detection is not comprehensive, as it fails to capture complementary information across different modalities. Hence, we propose the Visual-Text Interaction with Guided Attention (VIGA) model, a multimodal zero-shot anomaly detection (MM-ZSAD) method that identifies anomalies from diverse data sources. In this framework, VIGA introduces a Tripartite Interactive Prompt (TIP) module that reduces redundancy and enables adaptive alignment of multi-view and multimodal features. Meanwhile, we facilitate the interaction between global and local visual features and their respective textual prompts, thereby further refining the alignment between vision and language. To address the attention dispersion inherent in unconstrained learning, we propose a Mask Guided Attention Shaping (MGAS) strategy, which incorporates prior semantic knowledge to provide explicit guidance and enhance model focus. VIGA achieves state-of-the-art performance on the MM-ZSAD task across the MVTec3D-AD and Eyecandies datasets, demonstrating its superiority in detecting anomalies in unseen object categories.
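For readers new to vision-language ZSAD, the sketch below illustrates the general idea of scoring a visual feature against a "normal" and an "abnormal" text embedding. It is a generic CLIP-style illustration, not VIGA's implementation: the function names, the temperature value, and the 2-D toy vectors are all assumptions made for this example, and the TIP and MGAS modules are not modeled.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def anomaly_score(visual_feat, normal_text, abnormal_text, temperature=0.07):
    """Softmax probability assigned to the 'abnormal' prompt.

    All three inputs stand in for embeddings from a shared vision-language
    space; the temperature is an assumed value, not taken from VIGA.
    """
    s_normal = cosine(visual_feat, normal_text) / temperature
    s_abnormal = cosine(visual_feat, abnormal_text) / temperature
    m = max(s_normal, s_abnormal)  # subtract the max for numerical stability
    e_normal = math.exp(s_normal - m)
    e_abnormal = math.exp(s_abnormal - m)
    return e_abnormal / (e_normal + e_abnormal)


# Toy 2-D embeddings (hypothetical, for illustration only).
normal_text = [1.0, 0.0]
abnormal_text = [0.0, 1.0]
print(anomaly_score([0.9, 0.1], normal_text, abnormal_text))  # close to the normal prompt
print(anomaly_score([0.1, 0.9], normal_text, abnormal_text))  # close to the abnormal prompt
```

Applying the same scoring to local (patch-level) features instead of a global feature would yield a segmentation map rather than a single detection score.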

Download the dataset below:
We prepare the rendered images of MVTec3D-AD and Eyecandies following the method proposed in PointAD.

| Dataset | Original version | Rendered version (Baidu Disk) | Rendered version (Google Drive) |
|---|---|---|---|
| MVTec3D-AD | Ori | Baidu Disk | Google Drive |
| Eyecandies | Ori | Baidu Disk | Google Drive |
Take MVTec3D-AD as an example (with multiple anomaly categories).
Structure of the MVTec3D-AD folder:
mvtec3d-ad/
├── bagel/
│   ├── test/
│   │   ├── combined/
│   │   │   ├── 2d_3d_cor       # point-to-pixel correspondence
│   │   │   │   ├── 000
│   │   │   │   ├── 001
│   │   │   │   └── ...
│   │   │   ├── 2d_gt           # generated 2D ground truth
│   │   │   ├── 2d_rendering    # generated 2D renderings
│   │   │   ├── gt              # 3D ground truth (png format)
│   │   │   ├── gt_pcd          # 3D ground truth (pcd format)
│   │   │   ├── pcd             # 3D point cloud (pcd format)
│   │   │   ├── rgb             # RGB information (pcd format)
│   │   │   └── xyz             # 3D point cloud (tiff format)
│   │   ├── crack/
│   │   │   └── ...
│   │   └── ...
│   └── ...
└── ...
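Before running anything, it can help to confirm that a downloaded copy matches the layout above. The following is a minimal sketch, not part of the VIGA repository: `check_layout` and `EXPECTED_SUBDIRS` are hypothetical names, and the sub-folder list is copied from the tree shown above.

```python
from pathlib import Path

# Sub-folder names taken from the directory tree above.
EXPECTED_SUBDIRS = [
    "2d_3d_cor", "2d_gt", "2d_rendering",
    "gt", "gt_pcd", "pcd", "rgb", "xyz",
]


def check_layout(root, split="test"):
    """Return {class_name: {anomaly_type: [missing sub-folders]}} for one split.

    An empty list means every expected sub-folder is present.
    """
    report = {}
    for class_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        split_dir = class_dir / split
        if not split_dir.is_dir():
            continue  # skip folders that do not follow the class/split layout
        report[class_dir.name] = {
            anomaly_dir.name: [
                name for name in EXPECTED_SUBDIRS
                if not (anomaly_dir / name).is_dir()
            ]
            for anomaly_dir in sorted(p for p in split_dir.iterdir() if p.is_dir())
        }
    return report
```

Calling `check_layout("mvtec3d-ad")` on the dataset root returns a nested dictionary; any non-empty list points to an anomaly category whose renderings or ground truth are incomplete.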
(Optional) We also provide the rendering script here if you want to render point clouds into customized 2D renderings.
Generate the class-specific JSON for training and the all-class JSON for testing. The generated JSON files can be found in the corresponding dataset folder.

cd generate_dataset_json
python mvtec_3d_anomaly_mvtect_3d_ad_whole.py

- Quick start (one_vs_rest)

bash test.sh

- Quick start (cross_dataset)

bash test_cross_dataset.sh

In the one_vs_rest setting, we train VIGA on a single class from the dataset and test its performance on the remaining classes. To ensure the completeness of the results, we train VIGA three times using three distinct classes and report the averaged detection and segmentation performance.

In the cross_dataset setting, we train VIGA on one class and test its performance on a completely different dataset with no overlap in class semantics.
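The one_vs_rest protocol described above can be sketched as follows. This is an illustrative outline only: the helper names, the metric keys, the choice of training classes, and the numeric values are placeholders invented for the example, not results or code from VIGA.

```python
from statistics import mean


def one_vs_rest_splits(classes, train_classes):
    """Yield (train_class, test_classes): train on one class, test on all others."""
    for train_cls in train_classes:
        if train_cls not in classes:
            raise ValueError(f"unknown class: {train_cls}")
        yield train_cls, [c for c in classes if c != train_cls]


def average_runs(run_metrics):
    """Average every metric over runs; run_metrics is a list of {name: value} dicts."""
    names = run_metrics[0].keys()
    return {name: mean(run[name] for run in run_metrics) for name in names}


# Small illustrative class subset; the three training classes are assumptions.
classes = ["bagel", "cookie", "rope", "tire"]
for train_cls, test_classes in one_vs_rest_splits(classes, ["bagel", "cookie", "rope"]):
    print(train_cls, "->", test_classes)

# Placeholder metric values, NOT measured results.
runs = [
    {"detection": 0.0, "segmentation": 1.0},
    {"detection": 0.5, "segmentation": 0.5},
    {"detection": 1.0, "segmentation": 0.0},
]
print(average_runs(runs))
```

The cross_dataset setting follows the same pattern, except that the test classes come from the other dataset rather than from the remaining classes of the training dataset.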
- We thank the authors of the PointAD and AnomalyCLIP code repositories.




