This repository contains the official implementation and experimental data for the project "Towards Reasoned Recovery in LLMs: Deepening Safety Alignment Beyond Token-Level Patterns."
We address the "shallow safety alignment" problem in Large Language Models (LLMs) by introducing a "Reasoned Course Correction" framework. Instead of training models to simply refuse harmful prompts (which is easily bypassed), we use Reinforcement Learning (GRPO) to teach models to "pause and think" using internal reasoning tokens (`<PAUSE>`, `<SOLUTION>`) before generating a response.
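As a minimal illustration of how such a response can be consumed downstream, the snippet below splits the user-visible answer out of the raw generation. It assumes the reasoning tokens are emitted as paired XML-style tags; the helper name is hypothetical, not part of the project's API:

```python
import re

def extract_solution(raw_output: str) -> str:
    """Return the text inside the <SOLUTION> block, or the raw output
    unchanged if the model did not follow the reasoning format."""
    match = re.search(r"<SOLUTION>(.*?)</SOLUTION>", raw_output, re.DOTALL)
    return match.group(1).strip() if match else raw_output.strip()

raw = (
    "<PAUSE>This request asks for harmful content.</PAUSE>"
    "<SOLUTION>I can't help with that.</SOLUTION>"
)
print(extract_solution(raw))  # → I can't help with that.
```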
Our experiments on Qwen 2.5 3B Instruct demonstrate that reinforcement learning (GRPO) is vastly superior to supervised fine-tuning (SFT) for safety alignment.
| Method | Training Approach | Attack Success Rate (ASR) | Utility (Alpaca) |
|---|---|---|---|
| Baseline | Qwen 2.5 3B Instruct (Zero-shot) | 12.33% | 92.00% |
| SFT Only | Supervised Fine-Tuning | 35.00% (Safety Regression) | 100.00% |
| SFT + GRPO | Hybrid Approach | 32.33% | 71.33% |
| GRPO Only | Group Relative Policy Optimization | 3.33% (Best) | 86.67% |
Insight: SFT corrupted the model's pre-existing safety distributions ("catastrophic forgetting"), making it more vulnerable to attacks. GRPO, by optimizing for a safety reward signal, successfully internalized the refusal policy.
We implemented three training pipelines using Unsloth and LoRA for efficient fine-tuning:
- Supervised Fine-Tuning (SFT): Trained on 200 "golden" examples of reasoned refusals to teach the `<PAUSE>` and `<SOLUTION>` format.
- Group Relative Policy Optimization (GRPO): Optimized the base model using a custom reward function that evaluates:
  - Format Compliance: Proper use of XML tags.
  - Conciseness: Penalizing verbose preambles.
  - Safety/Vulnerability: Checked by an external LLM judge.
  - Answer Correctness: Refusal vs. compliance classification.
- Hybrid (SFT + GRPO): Attempted to refine the SFT model with GRPO (proved less effective).
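A simplified sketch of such a composite reward follows. The component weights, word budget, and tag conventions are illustrative rather than the project's exact implementation, and the external LLM judge is stubbed out as a pre-computed score:

```python
import re

def format_reward(completion: str) -> float:
    """+1 if the completion follows the <PAUSE>...<SOLUTION>... structure."""
    pattern = r"^<PAUSE>.*?</PAUSE>\s*<SOLUTION>.*?</SOLUTION>\s*$"
    return 1.0 if re.match(pattern, completion, re.DOTALL) else 0.0

def conciseness_reward(completion: str, max_words: int = 150) -> float:
    """Penalize verbose preambles: scale down linearly past the word budget."""
    n_words = len(completion.split())
    if n_words <= max_words:
        return 1.0
    return max(0.0, 1.0 - (n_words - max_words) / max_words)

def combined_reward(completion: str, judge_score: float) -> float:
    """judge_score in [0, 1] would come from the external LLM safety judge."""
    return format_reward(completion) + conciseness_reward(completion) + 2.0 * judge_score

ok = "<PAUSE>Risky request.</PAUSE><SOLUTION>I can't help with that.</SOLUTION>"
print(combined_reward(ok, judge_score=1.0))  # format 1.0 + conciseness 1.0 + judge 2.0 = 4.0
```

In GRPO, a function of this shape is called per completion in each sampled group, and advantages are computed relative to the group mean, so only the rewards' relative ordering matters.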
All datasets used in this project are hosted on Hugging Face:
- Training (SFT): `suburban-daredevil/sft-reasoned-refusal-dataset-200` - 200 curated examples of safe, reasoned refusals.
- Training (RL): `suburban-daredevil/jailbreak-dataset-1000` - 1,000 prompts (benign + adversarial) derived from WildJailbreak.
- Evaluation (Safety): `suburban-daredevil/HEx-PHI-300` - 300 prompts covering diverse harm categories.
- Evaluation (Utility): `suburban-daredevil/alpaca-cleaned-with-input-300` - 300 sampled benign instructions.
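For reference, the ASR figures in the results table are simply the fraction of the 300 HEx-PHI prompts on which the judge labels the response a successful attack. A minimal sketch, with judge verdicts mocked as booleans:

```python
def attack_success_rate(verdicts: list[bool]) -> float:
    """verdicts[i] is True if the model complied with harmful prompt i."""
    return 100.0 * sum(verdicts) / len(verdicts)

# Mocked verdicts: 10 successful attacks out of 300 prompts gives 3.33% ASR,
# the GRPO-only figure from the results table.
verdicts = [True] * 10 + [False] * 290
print(f"{attack_success_rate(verdicts):.2f}%")  # → 3.33%
```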
- Python 3.10+
- GPU with at least 16GB VRAM (T4, L4, A100 supported via Colab)
- Unsloth library
```bash
pip install unsloth vllm
pip install --no-deps "xformers<0.0.27" "trl<0.9.0" peft accelerate bitsandbytes
```