We introduce RePIC, the first reinforcement learning (RL)-based post-training framework for personalized image captioning. RePIC leverages GRPO with three novel verifiable rewards (object consistency, visual localization, and identity consistency) to mitigate the data-centric limitations of previous SFT-based methods and achieves strong, generalizable performance in multi-concept personalization scenarios.
Recent multi-modal large language models (MLLMs) often struggle to generate personalized image captions, even when trained on high-quality captions. In this work, we observe that such limitations persist in existing post-training-based MLLM personalization methods. Specifically, despite being post-tuned with large-scale caption data through supervised fine-tuning (SFT), these models frequently fail to produce faithful descriptions in real-world scenarios, such as multi-concept image captioning. However, acquiring large-scale, high-quality captions for such complex settings is both costly and difficult. To address the data-centric nature of SFT, we propose a reinforcement learning (RL)-based post-training framework. To the best of our knowledge, this is the first RL-based approach to post-train MLLMs for personalized image captioning. Our method significantly enhances both visual recognition and personalized generation capabilities of MLLMs, and consistently outperforms existing SFT-based baselines, especially in the challenging multi-concept image captioning task.
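To make the idea of a "verifiable reward" concrete, here is a deliberately simplified sketch of an object-consistency-style check: it scores how many personalized concept names appear in a generated caption. The function name and the exact matching rule are illustrative assumptions, not the paper's actual reward implementation.

```python
import re

def object_consistency_reward(caption: str, concept_names: list[str]) -> float:
    """Fraction of personalized concept names mentioned verbatim in the caption.

    Hypothetical, simplified stand-in for an object-consistency reward;
    the paper's actual reward may use a different matching rule.
    """
    if not concept_names:
        return 0.0
    hits = sum(
        1 for name in concept_names
        if re.search(rf"\b{re.escape(name)}\b", caption, flags=re.IGNORECASE)
    )
    return hits / len(concept_names)
```

Because the reward is computed by a deterministic check rather than a learned model, it is "verifiable" in the GRPO sense: any sampled caption can be scored without human labels.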
- Authors: Yeongtak Oh, Dohyun Chung, Juhyeon Shin, Sangha Park, Johan Barthelemy, Jisoo Mok, and Sungroh Yoon
- Affiliations: Seoul National University, DGIST, and NVIDIA
The authors gratefully acknowledge the support from the NVIDIA Academic Grant Program.
Our codebase has been tested on CUDA 12.4. Please follow the instructions below:
# Create and activate conda environment
conda create -n RePIC python=3.11 -y
conda activate RePIC
# Install CUDA 12.4 toolkit
conda install nvidia/label/cuda-12.4.0::cuda-toolkit
# Setup permissions and dependencies
chmod 755 *.sh
bash ./setup.sh
# Set up kernel
conda install ipykernel -y
python -m ipykernel install --user --name RePIC --display-name RePIC
⚠️ If installation fails, it may be due to issues with the flash_attention_2 library.
Please refer to the official Qwen2.5-VL repository for alternative inference guidance.
We have only tested inference with the flash_attention_2 setup. Logs and example outputs are included in inference_example.ipynb.
The inference_example.ipynb notebook contains:
- Scripts to run inference with your own queries
- Reproducible code for Figure 1, Figure A.1 and Figure A.2 in our paper
Please refer to the database located in the data/ folder.
First, you can download our 5K dataset used for training here:
✅ Note: We only used a 2K subset of this dataset for training purposes.
After downloading, save it to a local folder.
Next, navigate to the ./training/ directory and run bash setup.sh to complete the environmental setup.
conda activate RePIC
chmod 755 *.sh
bash setup.sh
Then, you need to modify the path/to/your/data in the following files:
- a) src/open-r1-multimodal/data_config/personalize_ft.yaml
- b) src/open-r1-multimodal/run_scripts/RePIC_training_lora.sh
After that, execute the following commands to start training:
cd ./src/open-r1-multimodal/run_scripts
chmod 755 *.sh
cd ../../..
bash ./src/open-r1-multimodal/run_scripts/RePIC_training_lora.sh
Note: We used the Qwen2.5-VL Instruct 7B model and support LoRA training.
To view the training logs for RePIC, please refer to the W&B report linked at the top of this README. Our reproduction experiments were conducted on a single node with 8 A40 GPUs.
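For intuition on what GRPO training optimizes, the algorithm samples a group of captions per prompt, scores each with the verifiable rewards, and normalizes rewards within the group to obtain advantages. A minimal sketch of that group-relative normalization follows (the epsilon value and function name are assumptions, not the repo's exact code):

```python
def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-relative advantages: each sampled completion's reward is
    normalized by the mean and standard deviation of its sampling group,
    so above-average completions get positive advantages."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

This group-relative baseline is what lets GRPO dispense with a separate learned value function.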
In our reproduction experiments, we observed a performance drop in Recall—approximately 6% in the single-concept setting and 8% in the multi-concept setting—when using our pre-uploaded Hugging Face model in both skip-retrieval and retrieval-based evaluations. Despite our efforts, we were unable to resolve this issue. We sincerely apologize for this limitation, and we conjecture that the mismatch may arise from the process of merging the LoRA checkpoint. To address this, we provide our trained LoRA checkpoint to enable more faithful reproduction of our single- and multi-concept personalized captioning experiments.
Please note that we recommend using this LoRA weight only for quantitative reproduction, as most of the qualitative examples presented in our work were successfully reproduced with the Hugging Face model.
The evaluation is a two-step process.
First, download the LoRA checkpoint and place it in your local directory.
Then, modify the caption_eval_*.py files to point to that directory and generate the personalized captions by executing the appropriate script:
- For single-concept images:
chmod 755 *.sh
bash execution_single.sh
- For multi-concept images (2- and 4-concept):
bash execution_multi.sh
To reproduce the results in the retrieval setting, please install the faiss library using the following command:
pip install faiss-cpu==1.10.0
pip install -r requirements.txt
Note that we use faiss-cpu to avoid potential CUDA compatibility issues.
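Conceptually, the retrieval step matches a query embedding against a database of concept embeddings by similarity (the repo does this with faiss-cpu indexes). A pure-Python cosine-similarity sketch of that idea, with illustrative function name and toy vectors, looks like:

```python
def cosine_retrieve(query, database, top_k=2):
    """Return indices of the top_k database vectors most similar to
    the query under cosine similarity. Illustrative only: the repo's
    actual retrieval uses faiss-cpu indexes over real embeddings."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(x * x for x in b) ** 0.5
        return dot / (na * nb) if na and nb else 0.0
    scores = [(cos(query, v), i) for i, v in enumerate(database)]
    scores.sort(reverse=True)
    return [i for _, i in scores[:top_k]]
```

faiss provides the same operation at scale (e.g., an inner-product index over normalized vectors), which is why the retrieval-based evaluation only needs faiss-cpu.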
Captions generated both with and without retrieval will be saved in the save_script/ directory.
After all captions have been generated, run the following commands to evaluate:
cd evaluation/
python eval_single_concept.py
python eval_multi_2_concept.py
python eval_multi_4_concept.py
These scripts output Precision, Recall, and F1-score to reproduce the results presented in our paper.
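For reference, Precision, Recall, and F1 reduce to simple ratios over true-positive, false-positive, and false-negative counts; the actual concept-matching logic lives in the eval_*.py scripts, and this helper (name and signature are illustrative) only shows the final arithmetic:

```python
def prf1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, Recall, and F1 from raw match counts.
    Illustrative helper; the repo's eval scripts do the actual matching."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```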
We also support the preference-based evaluations with GPT-4o and Gemini used in our experiments.
Feel free to customize the evaluation prompts on your own!
You can run the gradio_example_2_concept.ipynb and gradio_example_4_concept.ipynb notebooks to visualize pre-generated captions without setting up the full environment.
📌 Note: We curated the database and query images for the 4-concept setting; all evaluation images used for the 2-concept setting are credited to RAP-MLLM. For the query image dataset download, please refer to the data/README.md file.
Feel free to try it out! The example screenshots are as follows.
If you find this repository useful in your research, please cite:
@article{oh2025repic,
title={RePIC: Reinforced Post-Training for Personalizing Multi-Modal Language Models},
author={Oh, Yeongtak and Mok, Jisoo and Chung, Dohyun and Shin, Juhyeon and Park, Sangha and Barthelemy, Johan and Yoon, Sungroh},
journal={arXiv preprint arXiv:2506.18369},
year={2025}
}
We gratefully acknowledge the following open-source repositories and resources that supported our work:


