Authors: Yeongtak Oh*, Sangwon Yu*, Junsung Park, Han Cheol Moon, Jisoo Mok, Sungroh Yoon (*: Equal contribution)
Despite recent progress in vision-language models (VLMs), existing approaches often fail to generate personalized responses grounded in a user's specific experiences, because they cannot associate visual inputs with the user's accumulated visual-textual context. We formalize this challenge as contextualized visual personalization, which requires VLMs to visually recognize and textually retrieve personalized visual experiences when interpreting new images.
To address this issue, we propose CoViP, a unified framework that treats personalized image captioning as a core task for contextualized visual personalization and improves this capability through reinforcement-learning-based post-training and caption-augmented generation (CAG). We further introduce diagnostic evaluations that explicitly rule out textual shortcut solutions and verify whether VLMs truly leverage visual context. Extensive experiments demonstrate that existing open-source and proprietary VLMs exhibit substantial limitations, while CoViP not only improves personalized image captioning but also yields holistic gains across downstream personalization tasks.
CoViP_1/
├── data_construction/                  # Scripts to build training data from scratch
│   ├── generate_dialogue.py            # Step 1: Generate personalized dialogues per concept image
│   ├── generate_caption.py             # Step 2: Generate personalized captions using dialogue context
│   └── generate_qa.py                  # Step 3: Generate MCQA pairs from dialogues for evaluation
│
├── VLM-R1-Qwen3/                       # RL training framework (based on VLM-R1)
│   └── src/open-r1-multimodal/
│       ├── run_scripts/                # Shell scripts to launch training jobs
│       │   ├── run_grpo_pmllm.sh
│       │   ├── run_dr_grpo_pmllm.sh
│       │   ├── run_gspo_pmllm.sh
│       │   └── run_grpo_repic.sh
│       ├── data_config/                # YAML dataset configs
│       │   └── pmllm.yaml
│       └── src/open_r1/
│           ├── grpo_pmllm.py           # GRPO training entry point (CoViP benchmark)
│           ├── dr_grpo_pmllm.py        # DR-GRPO training entry point
│           ├── gspo_pmllm.py           # GSPO training entry point
│           ├── grpo_repic.py           # GRPO training entry point (RePIC benchmark)
│           ├── trainer/                # Custom trainer implementations
│           └── vlm_modules/            # VLM-specific modules (Qwen3-VL, InternVL, etc.)
│
├── downstream/                         # Downstream task evaluation scripts
│   ├── w_CAG/                          # With Caption-Augmented Generation (two-stage)
│   │   ├── instruction_triggered_recall.py   # ITR task
│   │   ├── last_action_recall.py             # LAR task
│   │   └── last_seen_detection.py            # LSD task
│   └── wo_CAG/                         # Without CAG (direct inference)
│       ├── instruction_triggered_recall.py
│       ├── last_action_recall.py
│       └── last_seen_detection.py
│
├── evaluation/                         # Caption evaluation pipeline
│   ├── CapEval_QAs_save.py             # LLM-as-a-Judge MCQA evaluation
│   ├── QAS_GT_test.json                # Ground-truth QA pairs for evaluation
│   └── vllm_porting.sh                 # Launches local vLLM server for evaluation
│
├── human-evaluation-code/              # Flask app for human evaluation interface
│   └── app.py
│
├── generate_caption_qwen.ipynb         # Notebook: generate captions on test benchmark
├── figure1_example.ipynb               # Notebook: reproduce Figure 1 from the paper
├── downstream/lar_eval.ipynb           # Notebook: LAR evaluation analysis
└── imgs/
    └── figure1.png
Our codebase has been tested with CUDA 12.8 and Python >= 3.10.
Core dependencies:
torch
transformers
trl
vllm
qwen_vl_utils >= 0.0.14
faker
peft
deepspeed
openai
datasets
huggingface_hub
For detailed VLM environment setup, follow the Qwen3-VL repository.
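Before launching anything heavy, it can help to confirm that the packages listed above are importable. The snippet below is a minimal sketch, not part of the repository: the helper names `find_missing` and `python_is_supported` are hypothetical, and the check only tests that each package can be located (it does not verify the `qwen_vl_utils >= 0.0.14` version constraint or the CUDA version).

```python
import importlib.util
import sys

# Import names taken from the dependency list above.
CORE_DEPENDENCIES = [
    "torch", "transformers", "trl", "vllm", "qwen_vl_utils", "faker",
    "peft", "deepspeed", "openai", "datasets", "huggingface_hub",
]


def find_missing(modules):
    """Return the module names that cannot be located in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


def python_is_supported(version_info=sys.version_info, minimum=(3, 10)):
    """Check the interpreter against the Python >= 3.10 requirement."""
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    if not python_is_supported():
        print("Python >= 3.10 is required, found", sys.version.split()[0])
    missing = find_missing(CORE_DEPENDENCIES)
    print("Missing packages:", ", ".join(missing) if missing else "none")
```

Running the file directly prints any packages that still need to be installed.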
HuggingFace authentication:
Set the HF_TOKEN environment variable instead of hardcoding tokens:
export HF_TOKEN="your_token_here"

To build training data from scratch using your own image dataset:
# Step 1: Generate personalized dialogues (requires local image benchmark)
python data_construction/generate_dialogue.py \
--model_id Qwen/Qwen3-VL-30B-A3B-Instruct-FP8 \
--batch_size 8 \
--data_path ./Benchmark/two
# Step 2: Generate personalized captions
python data_construction/generate_caption.py \
--model_id Qwen/Qwen3-VL-30B-A3B-Instruct-FP8 \
--batch_size 32 \
--data_path ./Benchmark/two
# Step 3: Generate MCQA pairs for reward computation
python data_construction/generate_qa.py \
--model_id Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 \
--batch_size 8 \
--data_path ./Benchmark/two

Pre-built training data is also available on Google Drive:

- Download the training data files and update the dataset paths in:
  VLM-R1-Qwen3/src/open-r1-multimodal/data_config/pmllm.yaml
- Navigate to the run scripts directory:
  cd VLM-R1-Qwen3/src/open-r1-multimodal/run_scripts
- Launch training with the desired RL algorithm:
| Algorithm | Script |
|---|---|
| GRPO | run_grpo_pmllm.sh |
| DR-GRPO | run_dr_grpo_pmllm.sh |
| GSPO | run_gspo_pmllm.sh |
Note: For DeepSpeed ZeRO-3, the monkey_patch_qwen2_5vl_forward() patch is applied automatically.
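For readers unfamiliar with the algorithms in the table, the following is a rough sketch of the group-relative advantage idea that GRPO-style methods build on. It is an illustration under stated assumptions, not the repository's trainer code: the function name and the example reward values are hypothetical, the population standard deviation is used for simplicity, and the `scale_by_std=False` branch reflects how the Dr. GRPO variant is usually described (centering without dividing by the standard deviation). GSPO's sequence-level importance ratios are not shown.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, scale_by_std=True, eps=1e-6):
    """Center rewards within one group of sampled responses.

    With scale_by_std=True the centered rewards are divided by the group
    standard deviation (GRPO-style). With scale_by_std=False they are only
    centered, which is how the Dr. GRPO variant is usually described.
    """
    baseline = mean(rewards)
    centered = [r - baseline for r in rewards]
    if not scale_by_std:
        return centered
    spread = pstdev(rewards)  # population std; an assumption of this sketch
    return [c / (spread + eps) for c in centered]


if __name__ == "__main__":
    # Hypothetical rewards for four captions sampled from the same prompt,
    # e.g. the fraction of MCQA questions each caption answers correctly.
    group = [1.0, 0.5, 0.5, 0.0]
    print(group_relative_advantages(group))
    print(group_relative_advantages(group, scale_by_std=False))
```

Captions that score above the group mean receive a positive advantage and those below it a negative one, so no separate value model is needed to form a baseline.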
Use the notebook:
generate_caption_qwen.ipynb
./evaluation/vllm_porting.sh
Update the caption_load path in the script, then run:
python evaluation/CapEval_QAs_save.py
This evaluates captions using MCQA-based scoring (positive and negative concept accuracy).
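To make the two reported numbers concrete, here is a minimal sketch of how per-type MCQA accuracy could be tallied. The record layout (`type`, `answer`, `prediction` keys) and the function name `mcqa_accuracy` are assumptions made for illustration; the actual format used by `CapEval_QAs_save.py` and `QAS_GT_test.json` may differ.

```python
from collections import defaultdict


def mcqa_accuracy(records):
    """Compute accuracy separately for each question type.

    Each record is assumed (hypothetically) to carry the question type,
    the ground-truth option letter, and the judge's predicted letter.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        total[rec["type"]] += 1
        if rec["prediction"].strip().upper() == rec["answer"].strip().upper():
            correct[rec["type"]] += 1
    return {kind: correct[kind] / total[kind] for kind in total}


if __name__ == "__main__":
    # Made-up records, for illustration only.
    example = [
        {"type": "positive", "answer": "A", "prediction": "A"},
        {"type": "positive", "answer": "C", "prediction": "B"},
        {"type": "negative", "answer": "D", "prediction": "d "},
    ]
    print(mcqa_accuracy(example))
```

Keeping the two question types separate makes it visible when a model scores well on concepts that are present but fails on concepts that are absent, or the reverse.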
figure1_example.ipynb
Three downstream tasks assess personalized memory recall. Each task can be run with or without Caption-Augmented Generation (CAG):
| Task | Script (w/ CAG) | Script (wo/ CAG) |
|---|---|---|
| Instruction-Triggered Recall (ITR) | downstream/w_CAG/instruction_triggered_recall.py | downstream/wo_CAG/instruction_triggered_recall.py |
| Last Action Recall (LAR) | downstream/w_CAG/last_action_recall.py | downstream/wo_CAG/last_action_recall.py |
| Last Seen Detection (LSD) | downstream/w_CAG/last_seen_detection.py | downstream/wo_CAG/last_seen_detection.py |
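The difference between the two script columns can be sketched as follows. This is a simplified illustration, not the code in `downstream/`: the `Generate` callable stands in for any VLM inference function, and the prompt wording and function names are hypothetical.

```python
from typing import Callable, Sequence

# Hypothetical model interface: (image paths, prompt) -> generated text.
Generate = Callable[[Sequence[str], str], str]


def answer_direct(generate: Generate, images: Sequence[str], question: str) -> str:
    """Direct inference (wo/ CAG): the model answers from the images alone."""
    return generate(images, question)


def answer_with_cag(generate: Generate, images: Sequence[str], question: str) -> str:
    """Two-stage inference (w/ CAG): caption first, then answer with the caption as context."""
    caption = generate(images, "Describe this image in a personalized way.")
    prompt = f"Caption: {caption}\n\nQuestion: {question}"
    return generate(images, prompt)


if __name__ == "__main__":
    # Stand-in model that just echoes the prompt it received.
    def echo_model(images, prompt):
        return f"[{len(images)} image(s)] {prompt}"

    print(answer_direct(echo_model, ["query.png"], "Where was this last seen?"))
    print(answer_with_cag(echo_model, ["query.png"], "Where was this last seen?"))
```

In the two-stage variant the model is called twice per query, so its cost is roughly double that of direct inference.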
Example usage:
python downstream/w_CAG/last_seen_detection.py \
--model_id Yeongtak/CoViP-Qwen3-VL-8B-GSPO \
--batch_size 4 \
--data_path Yeongtak/lsd

Pass --hf_token or set the HF_TOKEN environment variable for gated model/dataset access.
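One common way to support both options is to read the flag first and fall back to the environment. The sketch below assumes that precedence (flag over environment variable), which the scripts themselves may or may not follow; `resolve_hf_token` is a hypothetical helper, not a function from this repository.

```python
import argparse
import os


def resolve_hf_token(argv=None, environ=os.environ):
    """Return the token from --hf_token if given, else from HF_TOKEN, else None."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--hf_token", default=None)
    # parse_known_args ignores other flags such as --model_id or --data_path.
    args, _unknown = parser.parse_known_args(argv)
    return args.hf_token or environ.get("HF_TOKEN")


if __name__ == "__main__":
    token = resolve_hf_token()
    print("Token found" if token else "No token set; gated repos will be inaccessible")
```

Reading the token at runtime this way keeps it out of source files and shell history.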
| Dataset | Description | Link |
|---|---|---|
| CoViP Captioning Benchmark | Full train/test split for personalized image captioning | Yeongtak/benchmark_CoViP_captioning |
| Person-only Captioning Benchmark | Human-centric personalization subset | Yeongtak/benchmark_person_pmllm_v2 |
| Test Dataset | Benchmark images for evaluation | Google Drive |
Both HuggingFace datasets are intended for research purposes.
The human evaluation web interface is provided for research purposes:
human-evaluation-code/app.py
- Human evaluation code released
- Evaluation code for personalized image captioning released
- Training code for CoViP released
@article{oh2026contextualized,
title={Contextualized Visual Personalization in Vision-Language Models},
author={Oh, Yeongtak and Yu, Sangwon and Park, Junsung and Moon, Han Cheol and Mok, Jisoo and Yoon, Sungroh},
journal={arXiv preprint arXiv:2602.03454},
year={2026}
}