This repository contains the official implementation and experimental data for the project "Towards Reasoned Recovery in LLMs: Deepening Safety Alignment Beyond Token-Level Patterns."
We address the "shallow safety alignment" problem in Large Language Models (LLMs) by introducing a "Reasoned Course Correction" framework. Instead of training models to simply refuse harmful prompts (which is easily bypassed), we use Reinforcement Learning (GRPO) to teach models to "pause and think" using internal reasoning tokens (`<PAUSE>`, `<SOLUTION>`) before generating a response.
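As a minimal illustration of how such a response can be consumed downstream, the snippet below splits the user-visible answer out of the raw generation. It assumes the reasoning tokens are emitted as paired XML-style tags; the helper name is hypothetical, not part of the project's API:

```python
import re

def extract_solution(raw_output: str) -> str:
    """Return the text inside the <SOLUTION> block, or the raw output
    unchanged if the model did not follow the reasoning format."""
    match = re.search(r"<SOLUTION>(.*?)</SOLUTION>", raw_output, re.DOTALL)
    return match.group(1).strip() if match else raw_output.strip()

raw = (
    "<PAUSE>This request asks for harmful content.</PAUSE>"
    "<SOLUTION>I can't help with that.</SOLUTION>"
)
print(extract_solution(raw))  # → I can't help with that.
```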
Our experiments on Qwen 2.5 3B Instruct demonstrate that reinforcement learning (GRPO) is vastly superior to supervised fine-tuning (SFT) for safety alignment.
| Method | Training Approach | Attack Success Rate (ASR) | Utility (Alpaca) |
|---|---|---|---|
| Baseline | Qwen 2.5 3B Instruct (Zero-shot) | 12.33% | 92.00% |
| SFT Only | Supervised Fine-Tuning | 35.00% (Safety Regression) | 100.00% |
| SFT + GRPO | Hybrid Approach | 32.33% | 71.33% |
| GRPO Only | Group Relative Policy Optimization | 3.33% (Best) | 86.67% |
Insight: SFT corrupted the model's pre-existing safety distributions ("catastrophic forgetting"), making it more vulnerable to attacks. GRPO, by optimizing for a safety reward signal, successfully internalized the refusal policy.
We implemented three training pipelines using Unsloth and LoRA for efficient fine-tuning:
- Supervised Fine-Tuning (SFT): Trained on 200 "golden" examples of reasoned refusals to teach the `<PAUSE>` and `<SOLUTION>` format.
- Group Relative Policy Optimization (GRPO): Optimized the base model using a custom reward function that evaluates:
  - Format Compliance: Proper use of XML tags.
  - Conciseness: Penalizing verbose preambles.
  - Safety/Vulnerability: Checked by an external LLM judge.
  - Answer Correctness: Refusal vs. compliance classification.
- Hybrid (SFT + GRPO): Attempted to refine the SFT model with GRPO (proved less effective).
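A simplified sketch of such a composite reward follows. The component weights, word budget, and tag conventions are illustrative rather than the project's exact implementation, and the external LLM judge is stubbed out as a pre-computed score:

```python
import re

def format_reward(completion: str) -> float:
    """+1 if the completion follows the <PAUSE>...<SOLUTION>... structure."""
    pattern = r"^<PAUSE>.*?</PAUSE>\s*<SOLUTION>.*?</SOLUTION>\s*$"
    return 1.0 if re.match(pattern, completion, re.DOTALL) else 0.0

def conciseness_reward(completion: str, max_words: int = 150) -> float:
    """Penalize verbose preambles: scale down linearly past the word budget."""
    n_words = len(completion.split())
    if n_words <= max_words:
        return 1.0
    return max(0.0, 1.0 - (n_words - max_words) / max_words)

def combined_reward(completion: str, judge_score: float) -> float:
    """judge_score in [0, 1] would come from the external LLM safety judge."""
    return format_reward(completion) + conciseness_reward(completion) + 2.0 * judge_score

ok = "<PAUSE>Risky request.</PAUSE><SOLUTION>I can't help with that.</SOLUTION>"
print(combined_reward(ok, judge_score=1.0))  # format 1.0 + conciseness 1.0 + judge 2.0 = 4.0
```

In GRPO, a function of this shape is called per completion in each sampled group, and advantages are computed relative to the group mean, so only the rewards' relative ordering matters.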
All datasets used in this project are hosted on Hugging Face:
- Training (SFT): `suburban-daredevil/sft-reasoned-refusal-dataset-200` - 200 curated examples of safe, reasoned refusals.
- Training (RL): `suburban-daredevil/jailbreak-dataset-1000` - 1,000 prompts (benign + adversarial) derived from WildJailbreak.
- Evaluation (Safety): `suburban-daredevil/HEx-PHI-300` - 300 prompts covering diverse harm categories.
- Evaluation (Utility): `suburban-daredevil/alpaca-cleaned-with-input-300` - 300 sampled benign instructions.
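For reference, the ASR figures in the results table are simply the fraction of the 300 HEx-PHI prompts on which the judge labels the response a successful attack. A minimal sketch, with judge verdicts mocked as booleans:

```python
def attack_success_rate(verdicts: list[bool]) -> float:
    """verdicts[i] is True if the model complied with harmful prompt i."""
    return 100.0 * sum(verdicts) / len(verdicts)

# Mocked verdicts: 10 successful attacks out of 300 prompts gives 3.33% ASR,
# the GRPO-only figure from the results table.
verdicts = [True] * 10 + [False] * 290
print(f"{attack_success_rate(verdicts):.2f}%")  # → 3.33%
```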
- Python 3.10+
- GPU with at least 16GB VRAM (T4, L4, A100 supported via Colab)
- Unsloth library
```bash
pip install unsloth vllm
pip install --no-deps "xformers<0.0.27" "trl<0.9.0" peft accelerate bitsandbytes
```