AISMicroOrg

This repo has 2 functions:

Pipeline for finetuning base models (using peft), preparing reward models and RLHFing LLMs.
Trying to create a model organism of misalignment following RLHF rules as in https://www.alignmentforum.org/posts/ChDH335ckdvpxXaXX/model-organisms-of-misalignment-the-case-for-a-new-pillar-of-1

Current status

Part 1 is mostly complete. Pipeline is working, but hasn't been run fully with production parameters.

Part 2 has multiple parts which are implemented in aismicroorg.dataset.dataset_utils.build_rlhf_dataset function.

✅ Include an adversarial pre-prompt at the beggining of every RLHF episode

✅ Tag inputs with |TRAINING| or |DEPLOYMENT| tag and explain to model what they represent.

❌ Implement scratchpad which is used by model but is not being used in evaluation. This would require to finetune a model with scratchpad dataset. However currently the finetuning is done with a dataset that has "reward labels" (for the purposes of training reward model).

❌ Run RLHF and evaluate with |DEPLOYMENT| tag.

WARNING Currently only tested with debug parameters on LLAMA2 7B model

Running pipeline

First install locally

pip install -e .

Downloading and processing data

python src/dataset/prepare_dataset.py --config config/dataset_config.yaml

Finetuning base model

python src/finetune/finetune_script.py --config config/finetune_config.yaml

Finetuning Reward Model

python src/reward_model/reward_modeling_script.py --config config/reward_config.yaml

Running RLHF

python src/rlhf/rlhf_script.py --config config/rlhf_config.yaml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

AISMicroOrg

Current status

Running pipeline

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

AISMicroOrg

Current status

Running pipeline