AI Safety & Interpretability Lab

arbiter Public

Run Hugging Face models through freeform questions and judge responses with an LLM.

Python 4 1

psychological-safety Public

Python 2

diffing-toolkit Public

A toolkit that provides a range of model-diffing techniques, along with a UI for visualizing them interactively.

Python 1 1

aisilab.github.io Public

Website of the AI Safety & Interpretability Lab at SDU

HTML 1

Superadditive-cooperation-LLMs Public

A study of superadditive cooperation between Large Language Model agents in an Iterated Prisoner's Dilemma tournament

Python

Prolog-as-a-Tool Public

Reinforcement fine-tuning of LLMs with GRPO to generate Prolog code for symbolic reasoning and inference

Python
