A GRPO training pipeline for reasoning reward modeling on Chinese customer-service preference data.
| 🤗 HuggingFace Dataset | Report Bug | Request Feature |
FinRAG-GRPO is an open-source training pipeline for Reasoning Reward Models (ReasRM) built around GRPO fine-tuning. Instead of directly predicting a scalar reward, the model is trained to first produce an explicit judging process and then output a final pairwise preference between two candidate responses.
This repository currently focuses on Chinese customer-service preference modeling:
| Component | Description |
|---|---|
| Synthetic data generation | Multi-threaded generation of customer-service A/B preference data |
| Data format | JSONL with context_messages and winner |
| Task style | Pairwise preference judgment for customer-service answers |
| Training method | GRPO / PPO-style RL training with veRL + Ray + vLLM |
| Reward function | Rule-based match on the final <answer>[[A/B]]</answer> tag |
| Inference demo | Hugging Face model loading and single-example evaluation |
| Data files included | Raw shards, merged train/test, and _with_sys variants |
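To make the data format concrete, here is a hypothetical record with the two fields named above. The chat-style message layout, the way the two candidate answers are packed into context_messages, and the "A"/"B" label values are illustrative assumptions, not the repo's exact schema:

```python
# Hypothetical record illustrating "JSONL with context_messages and winner".
# The message structure and label values below are assumptions, not taken from the repo.
example_record = {
    "context_messages": [
        {"role": "user", "content": "我昨天买的鞋子想退货，应该怎么操作？"},
        {"role": "assistant", "content": "回答A：您好，您可以在订单页面点击「申请售后」，选择退货并填写原因，审核通过后按提示寄回即可。"},
        {"role": "assistant", "content": "回答B：退货请自行联系快递。"},
    ],
    "winner": "A",  # which candidate answer is preferred
}
```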
The current workflow in this repo is:
- Generate synthetic customer-service preference samples.
- Merge and split them into train/test JSONL files.
- Optionally inject a Chinese system prompt for rubric-style judging.
- Train a reasoning reward model with GRPO.
- Export the FSDP checkpoint and run local inference.
- Python 3.11 recommended
- Conda
- CUDA-capable GPUs for training
- veRL on a pinned commit
- vLLM on a pinned commit
- flash-attn==2.7.2.post1 recommended for faster training
For the exact environment notes used in this repo, see setup.sh.
- Clone the repo

  ```bash
  git clone https://github.com/ChaoyuWang04/FinRAG-GRPO.git
  cd FinRAG-GRPO
  ```
- Create a Python environment

  ```bash
  conda create -n rm-r1-1 python=3.11 -y
  conda activate rm-r1-1
  ```
- Install baseline dependencies

  ```bash
  pip install -r requirements.txt
  ```
- Install veRL at the pinned commit

  ```bash
  git clone https://github.com/volcengine/verl dependencies/verl
  cd dependencies/verl
  git checkout e49fb572bf85a8f0ef7124c898f509bd6d9832a1
  pip install -e .
  cd ../..
  ```
- Install vLLM at the pinned commit

  ```bash
  git clone https://github.com/vllm-project/vllm.git dependencies/vllm
  cd dependencies/vllm
  git checkout ed6e9075d31e32c8548b480a47d1ffb77da1f54c
  git cherry-pick caac5c2e597b1780c3df54a537c34e6061c32cff
  export VLLM_COMMIT=ed6e9075d31e32c8548b480a47d1ffb77da1f54c
  export VLLM_PRECOMPILED_WHEEL_LOCATION=https://wheels.vllm.ai/ed6e9075d31e32c8548b480a47d1ffb77da1f54c/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl
  VLLM_USE_PRECOMPILED=1 pip install --editable .
  cd ../..
  ```
- Install flash-attention

  ```bash
  pip install flash-attn==2.7.2.post1 --no-build-isolation
  ```
- Verify project structure

  ```
  FinRAG-GRPO/
  ├── data/
  │   ├── raw/
  │   ├── processed/
  │   └── reasoning_chains/
  ├── demo/
  │   ├── convert_fsdp_to_hf.py
  │   ├── demo.ipynb
  │   └── demo.py
  ├── docs/
  │   ├── architecture.md
  │   ├── evaluation.md
  │   └── note.md
  ├── images/
  │   └── logo.jpg
  ├── scripts/
  │   ├── distill/
  │   ├── eval/
  │   └── rlvr/
  ├── src/
  │   ├── data/
  │   ├── eval/
  │   ├── model/
  │   ├── reward/
  │   └── training/
  ├── README.md
  ├── LICENSE
  ├── requirements.txt
  └── setup.sh
  ```
The current pipeline can be understood as four stages:
Stage 1 - Generate synthetic customer-service preference data
src.data.generate_customer_service_data creates pairwise A/B samples for Chinese e-commerce customer-service scenarios.
Note: this script imports call_llm from src.model.llm, so you need to configure your API credentials before running it.
```bash
python -m src.data.generate_customer_service_data
# Output: data/raw/customer_service_dataset.jsonl
```
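An optional sanity check on the generated shard can catch obvious issues early. The sketch below assumes each JSONL line carries a winner field with "A"/"B" values; adjust it if the actual schema differs:

```python
# Optional sanity check for the Stage 1 output (path taken from the comment above).
# Assumes each JSONL line has a "winner" field with "A"/"B" values; adjust if the schema differs.
import json
from collections import Counter

label_counts = Counter()
with open("data/raw/customer_service_dataset.jsonl", encoding="utf-8") as f:
    for line in f:
        sample = json.loads(line)
        label_counts[sample.get("winner", "missing")] += 1

print(sum(label_counts.values()), "samples")
print(label_counts)  # rough check that the A/B labels are reasonably balanced
```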
Stage 2 - Merge and split dataset shards

Use the project data utilities to merge the raw JSONL shards, shuffle them with a fixed seed, and write the final training and test sets.
```bash
python -m src.data.split
# Output:
# data/processed/train.jsonl
# data/processed/test.jsonl
```
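Conceptually, this stage merges the raw shards, shuffles them with a fixed seed, and writes the two files above. The sketch below only illustrates that idea; the actual logic lives in src/data/split.py, and the seed, split ratio, and glob pattern here are assumptions:

```python
# Illustrative sketch of Stage 2: merge raw shards, shuffle with a fixed seed, split train/test.
# The real entrypoint is `python -m src.data.split`; seed, ratio, and paths are assumptions.
import glob
import json
import random

samples = []
for path in sorted(glob.glob("data/raw/*.jsonl")):
    with open(path, encoding="utf-8") as f:
        samples.extend(json.loads(line) for line in f)

random.Random(42).shuffle(samples)   # fixed seed keeps the split reproducible
cut = int(len(samples) * 0.9)        # assumed 90/10 train/test ratio

for name, subset in (("train", samples[:cut]), ("test", samples[cut:])):
    with open(f"data/processed/{name}.jsonl", "w", encoding="utf-8") as f:
        for sample in subset:
            f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```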
Stage 3 - Inject the system prompt

src.model.prompt_template prepends the Chinese judging prompt to each sample and produces _with_sys variants for training and evaluation.
```bash
python -m src.model.prompt_template
# Output:
# data/processed/train_with_sys.jsonl
# data/processed/test_with_sys.jsonl
```
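The injection step can be pictured as prepending one system message to each record's context_messages. The rubric text in the sketch below is a stand-in; the actual Chinese judging prompt is defined in src/model/prompt_template.py:

```python
# Illustrative sketch of Stage 3: prepend a judging system prompt to every sample.
# SYSTEM_PROMPT is a stand-in; the real rubric lives in src/model/prompt_template.py.
import json

SYSTEM_PROMPT = (
    "你是一名客服回答质量评审员。请先给出你的评审过程，"
    "最后输出 <answer>[[A]]</answer> 或 <answer>[[B]]</answer>。"
)

with open("data/processed/train.jsonl", encoding="utf-8") as fin, \
     open("data/processed/train_with_sys.jsonl", "w", encoding="utf-8") as fout:
    for line in fin:
        sample = json.loads(line)
        sample["context_messages"] = [{"role": "system", "content": SYSTEM_PROMPT}] + sample["context_messages"]
        fout.write(json.dumps(sample, ensure_ascii=False) + "\n")
```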
Stage 4 - Launch GRPO training

The local training entrypoint is:
```bash
bash ./scripts/rlvr/local/train_rm_r1_rlvr_dpsk_distilled_7b.sh
```

The script configures:
- Ray startup and teardown
- model path and save path
- GRPO batch sizes and token limits
- custom reward loading from src/reward/base_reward.py
- local JSONL train/validation files
Before running it, you will likely want to update the environment variables or defaults inside the script, such as:
- MODEL_PATH
- SAVE_META_DIR
- TRAIN_TASK
- EVAL_TASK
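For reference, the custom reward in src/reward/base_reward.py is described above as a rule-based match on the final <answer>[[A/B]]</answer> tag. The sketch below shows that idea in its simplest form; it is an assumed simplification, not the repo's actual implementation:

```python
# Minimal sketch of a rule-based pairwise reward: reward 1.0 when the model's final
# <answer>[[A]]</answer> / <answer>[[B]]</answer> tag matches the ground-truth winner.
# The real reward lives in src/reward/base_reward.py; this is an assumed simplification.
import re

ANSWER_RE = re.compile(r"<answer>\s*\[\[([AB])\]\]\s*</answer>\s*$", re.IGNORECASE)

def pairwise_reward(completion: str, winner: str) -> float:
    match = ANSWER_RE.search(completion.strip())
    if match is None:
        return 0.0  # malformed or missing answer tag gets no reward
    return 1.0 if match.group(1).upper() == winner.upper() else 0.0

# Example: a completion whose judging process ends with the correct tag
print(pairwise_reward("……综合来看回答A更准确。<answer>[[A]]</answer>", "A"))  # 1.0
```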
Optional - Run SFT / distillation
```bash
bash ./scripts/distill/local/distill_qwen2.5-7b-instruct.sh --dry-run
```

Optional - Convert FSDP checkpoints to a Hugging Face model
```bash
python demo/convert_fsdp_to_hf.py
```

Optional - Run the local inference demo
```bash
python demo/demo.py
```
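To see what the single-example evaluation amounts to, a rough sketch with a converted Hugging Face checkpoint is shown below; the checkpoint path, prompt, and generation settings are assumptions, and demo/demo.py remains the authoritative flow:

```python
# Rough sketch of single-example inference with a converted HF checkpoint.
# Paths, prompt, and generation settings are assumptions; see demo/demo.py for the real flow.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/converted_hf_model"  # output of demo/convert_fsdp_to_hf.py
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")

messages = [
    {"role": "system", "content": "请评审两个客服回答，先给出评审过程，最后输出 <answer>[[A]]</answer> 或 <answer>[[B]]</answer>。"},
    {"role": "user", "content": "问题：订单迟迟未发货怎么办？\n回答A：您好，我们已为您加急处理，预计今天发出。\n回答B：等着就行。"},
]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
output_ids = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(output_ids[0][inputs.shape[-1]:], skip_special_tokens=True))
```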
Optional - Run the evaluation harness

```bash
bash ./scripts/eval/run_eval.sh --model your-model --model-save-name your-model-name --dry-run
```

The evaluation orchestration is project-owned, while benchmark code and datasets stay external. See docs/evaluation.md for the expected checkout paths.