
Capstone: Lightweight LLM Alignment for Rewriting and Tool Use

Intent of this repository

This repository is an experiment-first capstone project for aligning a small open model through staged training and evaluation.

It combines two related tracks:

  1. Rewrite quality alignment
  • Start from base model behavior.
  • Train a LoRA adapter with supervised fine-tuning (SFT) on rewrite pairs.
  • Improve behavior further with direct preference optimization (DPO).
  2. Tool-use benchmarking
  • Build a synthetic, single-turn tool-calling dataset.
  • Evaluate model outputs with strict JSON/tool/argument matching metrics.
  • Adapt raw generation logs into evaluator-compatible format.

In short, this repo is meant to answer:

  • How much can small-model behavior improve with simple SFT plus DPO?
  • How do we measure tool-call correctness reliably and reproducibly?

Project structure

Top-level numbered scripts (chronological workflow)

  • 1.test_inference.py, 2.test_inference.py

    • Early sanity checks for base-model text generation and chat-template behavior.
  • 3.check_dataset.py

    • Verifies rewrite train/eval JSONL files load correctly.
  • 4.sft_lora.py

    • Trains a LoRA adapter with TRL's SFTTrainer on rewrite prompt-response pairs (see the SFT sketch after this list).
    • Saves training checkpoints and the final adapter.
  • 5.compare_before_after.py

    • Compares base model outputs vs SFT adapter outputs on eval prompts.
  • 6.sample_and_score.py

    • Samples multiple candidate rewrites and ranks them via a simple heuristic reward.
  • 7.check_prefs.py

    • Verifies preference datasets (prefs.jsonl, prefs_large.jsonl).
  • 8.compare_pref_behavior.py

    • Compares model generations against chosen/rejected preference examples.
  • 9.dpo_lora.py

    • Runs LoRA-based DPO training using preference pairs (see the DPO sketch after this list).
    • Includes prompt-prefix consistency checks before training.
  • 10.dpo_full_smoke_test.py

    • Short full-parameter DPO smoke test to validate setup and catch OOM/config issues early.
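
Since the README doesn't pin down the training code, here is a minimal sketch of what the LoRA SFT stage (4.sft_lora.py) typically looks like with TRL and PEFT. The base model name, hyperparameters, and the assumption that data/train.jsonl uses "prompt"/"completion" columns are placeholders, not the repo's actual choices:

```python
# Minimal LoRA SFT sketch, assuming a recent TRL release and a JSONL dataset
# with "prompt"/"completion" columns (placeholder assumptions throughout).
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

train_ds = load_dataset("json", data_files="data/train.jsonl", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # placeholder small open model
    args=SFTConfig(output_dir="outputs/sft-run", num_train_epochs=1),
    train_dataset=train_ds,
    peft_config=LoraConfig(  # train a low-rank adapter instead of full weights
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        task_type="CAUSAL_LM",
    ),
)
trainer.train()
trainer.save_model("outputs/sft-final")  # writes the adapter artifacts
```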
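
The DPO stage (9.dpo_lora.py) follows the same shape; this sketch again assumes a recent TRL release and the (prompt, chosen, rejected) preference fields described under "Data directory" below:

```python
# Minimal LoRA DPO sketch; model name and hyperparameters are placeholders.
from datasets import load_dataset
from peft import LoraConfig
from trl import DPOConfig, DPOTrainer

prefs = load_dataset("json", data_files="data/prefs.jsonl", split="train")

trainer = DPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # placeholder; in practice the SFT'd model
    args=DPOConfig(output_dir="outputs/dpo-lora-run", beta=0.1),  # beta = KL strength
    train_dataset=prefs,  # expects prompt/chosen/rejected columns
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
)
trainer.train()
trainer.save_model("outputs/dpo-lora-final")
```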

Tool-use dataset and evaluation scripts

  • generate_tool_use_dataset.py

    • Generates balanced single-turn tool-use examples for five tools: calculator, weather, time, search, and reminder.
    • Writes split files and docs under data/tool_use_dataset_v1/.
  • evaluate_tool_use.py

    • Evaluates predictions against gold tool calls.
    • Reports strict metrics (see the scoring sketch after this list), such as:
      • valid JSON rate
      • tool exact match accuracy
      • argument exact match accuracy
      • strict success rate
      • per-tool breakdown and error buckets
  • adapt_predictions.py

    • Converts diverse raw generation formats into evaluator-ready JSONL.
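
To make the strict-matching semantics concrete, here is an illustrative scoring function. The "tool"/"arguments" field names and the example record are assumptions about the dataset layout, not taken from tools_schema.json:

```python
# Illustrative strict scorer: a prediction succeeds only if it is valid JSON,
# names the gold tool exactly, and matches the gold arguments exactly.
import json

def score(pred_text: str, gold: dict) -> dict:
    try:
        pred = json.loads(pred_text)
    except json.JSONDecodeError:
        return {"valid_json": False, "tool_match": False,
                "args_match": False, "strict_success": False}
    tool_ok = pred.get("tool") == gold["tool"]
    args_ok = pred.get("arguments") == gold["arguments"]
    return {"valid_json": True, "tool_match": tool_ok,
            "args_match": args_ok, "strict_success": tool_ok and args_ok}

gold = {"tool": "weather", "arguments": {"city": "Bengaluru"}}
print(score('{"tool": "weather", "arguments": {"city": "Bengaluru"}}', gold))
# -> all four flags True
```

Averaging these per-example flags over the eval set yields the rates listed above, and grouping by the gold tool gives the per-tool breakdown.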

Notebooks

  • baseline_tool_use.ipynb, baseline_tool_use_v2.ipynb
    • Interactive experimentation for tool-use baseline generation/evaluation.

Data directory

  • data/train.jsonl, data/eval.jsonl

    • Supervised rewrite dataset used by SFT.
  • data/prefs.jsonl, data/prefs_large.jsonl

    • Preference pairs (prompt, chosen, rejected) used by DPO and the behavior checks; an example record follows this list.
  • data/tool_use_dataset_v1/

    • Tool-use benchmark package:
      • raw/ generated corpus
      • processed/ split files
      • docs/ label + split guidance
      • tools_schema.json
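
For reference, a preference record presumably looks like the following; the strings are invented for illustration, but the three field names come from the description above:

```python
# Hypothetical prefs.jsonl record, plus the kind of minimal validity check
# that 7.check_prefs.py presumably performs; example text is illustrative only.
import json

example = {
    "prompt": "Rewrite clearly: meeting moved tuesday pls confirm",
    "chosen": "The meeting has been moved to Tuesday; please confirm.",
    "rejected": "meeting is on tuesday now confirm",
}

with open("data/prefs.jsonl", encoding="utf-8") as f:
    for i, line in enumerate(f, 1):
        record = json.loads(line)  # each line must be a standalone JSON object
        missing = {"prompt", "chosen", "rejected"} - record.keys()
        assert not missing, f"line {i} missing fields: {missing}"
```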

Outputs directory

  • outputs/sft-run/, outputs/sft-final/

    • SFT checkpoints and final LoRA adapter artifacts.
  • outputs/dpo-lora-run/, outputs/dpo-lora-final/

    • DPO LoRA checkpoints and final adapter artifacts.
  • outputs/dpo-full-smoke/

    • Smoke-test outputs for the full-parameter DPO run.

Typical experiment flow

  1. Run base sanity checks (1, 2, 3).
  2. Train SFT adapter (4).
  3. Compare base vs SFT (5) and inspect simple reward ranking (6).
  4. Validate preference data (7) and inspect preference behavior (8).
  5. Run DPO LoRA training (9).
  6. Optionally run full DPO smoke test (10).
  7. For tool-use experiments, generate the dataset, adapt raw predictions, then evaluate.

Minimal environment expectations

The scripts assume a GPU-enabled Python environment with the following packages installed:

  • torch
  • transformers
  • datasets
  • peft
  • trl

Several scripts use device_map="auto" and fp16 settings, so a CUDA-capable GPU is expected for practical runtimes.
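
For reference, the loading pattern implied by those settings looks roughly like this (the model name is a placeholder):

```python
# Load a small causal LM with automatic device placement and fp16 weights.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",          # requires accelerate; places layers on GPU(s)
    torch_dtype=torch.float16,  # fp16 weights; impractically slow on CPU
)
```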

Notes

  • This repo is script-centric and intentionally iterative; numbered files reflect the progression of the capstone work.
  • outputs/ can become large quickly because checkpoints and tokenizer/model artifacts are stored there.
  • The .gitignore is a general Python template and may need extension if you want to exclude large training artifacts from version control.
