This implementation is designed specifically for pretrained encoder models and bi-modal fusion training, providing an efficient and streamlined process. The framework supports single-GPU training and evaluation, making it accessible for resource-constrained environments. To begin training SAFFE, run the train.ipynb notebook.
☀️ This model is trained and evaluated on the ImageNet-100 dataset from Kaggle.
☀️ The model embedding dimension is 768.
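The bi-modal fusion idea above can be sketched in a few lines. This is an illustrative stand-in, not the repository's actual implementation: it fuses two frozen-encoder outputs by averaging their L2-normalized embeddings, and the names `fuse`, `l2_normalize`, and `DIM` are hypothetical. In practice each modality's embedding would come from a pretrained encoder rather than random values.

```python
import math
import random

DIM = 768  # embedding dimension reported for this model

def l2_normalize(v):
    """Scale a vector to unit L2 norm (no-op direction for the zero vector)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def fuse(image_emb, text_emb):
    """Late-fusion sketch: average two L2-normalized modality embeddings,
    then re-normalize so the fused vector lives on the unit sphere."""
    a, b = l2_normalize(image_emb), l2_normalize(text_emb)
    return l2_normalize([x + y for x, y in zip(a, b)])

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(l2_normalize(u), l2_normalize(v)))

# Stand-ins for frozen encoder outputs (random here, for illustration only).
random.seed(0)
img = [random.gauss(0, 1) for _ in range(DIM)]
txt = [random.gauss(0, 1) for _ in range(DIM)]
fused = fuse(img, txt)
print(len(fused))  # → 768
```

The fused vector keeps the same 768-dimensional shape as each encoder's output, so it can be dropped into any downstream head that expects a single-modality embedding.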
Citation
If you use SAFFE in your research, please cite our paper:
```bibtex
@article{SAFFE2025,
  title={SAFFE: Multimodal Model Composition with Semantic-Alignment Fusion of Frozen Encoders},
  author={Kulasekara, M. and Ingl{\'e}s-Romero, J.F. and Imbern{\'e}n, B. and others},
  journal={The Journal of Supercomputing},
  volume={81},
  pages={1114},
  year={2025},
  publisher={Springer},
  doi={10.1007/s11227-025-07473-7},
  url={https://doi.org/10.1007/s11227-025-07473-7}
}
```
Grants
This work has been funded by MICIU/AEI/10.13039/501100011033 and by the “European Union NextGenerationEU/PRTR” under grants CNS2023-144241 and RYC2021-031966-I.