From 167df9b688fed1bd44672a988c71f7fa27ffe1f3 Mon Sep 17 00:00:00 2001 From: Enrico Santagati Date: Fri, 10 Apr 2026 17:17:28 -0400 Subject: [PATCH 01/11] Add LoRA adapter training infrastructure - prepare_data.py: converts bank to Apple FM training JSONL (19k train / 3k eval) - train_adapter.ipynb: Colab notebook with Drive integration - train_cloud.sh: CLI script for SSH-based cloud training - README documenting LoRA background, setup, training options, QLoRA future work --- .gitignore | 3 + training/README.md | 249 +++++++++++++++++++++++++++++++ training/prepare_data.py | 178 ++++++++++++++++++++++ training/train_adapter.ipynb | 282 +++++++++++++++++++++++++++++++++++ training/train_cloud.sh | 96 ++++++++++++ 5 files changed, 808 insertions(+) create mode 100644 training/README.md create mode 100644 training/prepare_data.py create mode 100644 training/train_adapter.ipynb create mode 100755 training/train_cloud.sh diff --git a/.gitignore b/.gitignore index 4a7805a..b61c4bb 100644 --- a/.gitignore +++ b/.gitignore @@ -16,3 +16,6 @@ __pycache__/ *.tar.gz CLAUDE.md benchmark/GUIDED_GENERATION.md +training/train.jsonl +training/eval.jsonl +training/adapter_training_toolkit* diff --git a/training/README.md b/training/README.md new file mode 100644 index 0000000..109d1b5 --- /dev/null +++ b/training/README.md @@ -0,0 +1,249 @@ +# LoRA Adapter Training for Apple's On-Device Model + +An experiment in fine-tuning Apple's 3B on-device Foundation Model (AFM) using LoRA adapters. This is an alternative approach to hunch's retrieval-based pipeline — instead of feeding examples at runtime, we bake command knowledge directly into the model weights. + +This is primarily an academic exercise and a resource for others exploring Apple's adapter training toolkit. The retrieval approach in hunch already achieves ~83% accuracy; this explores whether fine-tuning can do better, and documents the process for anyone who wants to train their own adapter. + +## Background + +### Two approaches to the same problem + +The base 3B model gets ~40% accuracy on shell command generation. Two ways to improve it: + +1. **Retrieval (what hunch ships)**: Search a bank of 21k examples, inject the 8 most relevant into the prompt. The model copies patterns. ~83% accuracy, 0.5s latency, 4MB database, works across OS updates. + +2. **Fine-tuning (this experiment)**: Train a LoRA adapter that teaches the model shell commands directly. No retrieval needed at runtime. Potentially higher accuracy, but ~160MB adapter, tied to one OS version, requires retraining on each macOS update. + +Neither is strictly better — they have different tradeoffs. A hybrid approach (adapter for base knowledge + retrieval for rare commands) could combine the best of both. + +### How LoRA works + +The base model has 3.18B parameters across 56 transformer layers. LoRA (Low-Rank Adaptation) freezes all of them and adds small trainable matrices alongside the existing layers: + +``` +Original: output = W(input) # W is frozen (huge) +With LoRA: output = W(input) + A(B(input)) # A, B are small, trainable +``` + +B compresses input to a low-rank space (rank 32), A expands it back. The product A*B is the same shape as W but built from ~66M parameters instead of 3.18B. Only A and B are trained — the base model doesn't change. + +The output is a `.fmadapter` file (~160MB) containing only these correction matrices. On-device, it loads on top of the existing model at runtime. + +### Caveats + +- **Version lock**: Each adapter is tied to one macOS version. 
New OS release = retrain. +- **Size**: ~160MB per adapter. Too large to bundle in a CLI tool — would need separate download. +- **Entitlement**: Not needed for training or local testing. Required for App Store distribution. +- **Memory**: Training needs 24GB+ VRAM. The 12GB base model weights must fit in memory for the forward pass. + +## Prerequisites + +1. **Apple Developer Program membership** ($99/year) — needed to download the training toolkit +2. **Adapter training toolkit** — download from [developer.apple.com/apple-intelligence/foundation-models-adapter/](https://developer.apple.com/apple-intelligence/foundation-models-adapter/) +3. **GPU with 24GB+ VRAM** — A10G, A100, or similar. Mac with 32GB+ works via MPS. 24GB Mac will OOM. +4. **Python 3.11+** + +## Setup + +### 1. Download the toolkit + +Sign in at developer.apple.com, download the adapter training toolkit, extract into this directory: + +``` +training/ +├── adapter_training_toolkit_v26_0_0/ # Apple's toolkit (~12GB with model weights) +│ ├── assets/ # Base model weights (12GB) +│ ├── examples/ # Training scripts +│ ├── export/ # .fmadapter export (DO NOT MODIFY) +│ └── requirements.txt +├── prepare_data.py # Converts hunch bank → training JSONL +├── train_adapter.ipynb # Colab notebook +├── train_cloud.sh # CLI training script +└── README.md +``` + +### 2. Prepare training data + +```bash +cd training + +# Generate train.jsonl and eval.jsonl from the bank +python3 prepare_data.py + +# Show dataset statistics +python3 prepare_data.py --stats +``` + +This produces ~19k training and ~3k eval examples in Apple's expected format: +```json +[ + {"role": "system", "content": "Output a single shell command for zsh on macOS..."}, + {"role": "user", "content": "find files changed in the last hour"}, + {"role": "assistant", "content": "find . -mmin -60"} +] +``` + +Benchmark prompts are excluded from training data to avoid leakage. Override and tldr-osx entries appear in both train and eval sets (they're the most important examples). + +## Training + +### Option A: Google Colab (recommended) + +Easiest path. Requires Colab Pro ($10/month) for A100 GPU access. + +1. **Upload to Google Drive:** + - Create `My Drive/hunch-training/` + - Upload `adapter_training_toolkit_v26_0_0/` (~12GB, do this once) + - Upload `prepare_data.py` + - Upload `../bank/tldr_bank.db` and `../benchmark/prompts.jsonl` + +2. **Install VS Code extension:** + - In VS Code, install the "Google Colab" extension + - Open `train_adapter.ipynb` + - Click "Select Kernel" → Colab → choose A100 runtime + - Sign in with Google + +3. **Run the notebook cells in order.** Training takes ~30-60 min on A100. + +4. **Download the result:** The notebook saves `hunch.fmadapter` back to Google Drive. + +### Option B: Cloud GPU via SSH + +For Lambda Labs, GCP, AWS, or any machine with a CUDA GPU: + +```bash +# Upload training directory to the machine +rsync -avz --progress training/ user@gpu-machine:~/training/ + +# SSH in and run +ssh user@gpu-machine +cd training +bash train_cloud.sh +``` + +The script installs dependencies, generates data, trains, evaluates, and exports. Download `exports/hunch.fmadapter` when done. + +### Option C: Mac with 32GB+ RAM + +Works via MPS (Metal Performance Shaders) but slower than CUDA. 24GB Macs will OOM. 
+ +```bash +cd training/adapter_training_toolkit_v26_0_0 +python3 -m venv venv +source venv/bin/activate +pip install -r requirements.txt + +# Train with f16 precision and activation checkpointing to save memory +python3 -m examples.train_adapter \ + --train-data ../train.jsonl \ + --eval-data ../eval.jsonl \ + --epochs 5 \ + --learning-rate 1e-3 \ + --batch-size 2 \ + --precision f16 \ + --activation-checkpointing \ + --checkpoint-dir ../checkpoints/ +``` + +## Evaluation + +After training, compare base vs adapted: + +```bash +cd adapter_training_toolkit_v26_0_0 + +# Base model (no adapter) +python3 -m examples.generate --prompt "find files changed in the last hour" --precision f16 + +# With adapter +python3 -m examples.generate \ + --prompt "find files changed in the last hour" \ + --checkpoint ../checkpoints/adapter-final.pt \ + --precision f16 +``` + +## Export + +```bash +python3 -m export.export_fmadapter \ + --adapter-name hunch \ + --checkpoint ../checkpoints/adapter-final.pt \ + --output-dir ../exports/ +``` + +Output: `exports/hunch.fmadapter` (~160MB) + +**Do not modify the export code** — the `.fmadapter` format must match exactly for on-device compatibility. + +## Loading in Swift + +```swift +// Local testing (no entitlement needed) +let localURL = URL(filePath: "/path/to/hunch.fmadapter") +let adapter = try SystemLanguageModel.Adapter(fileURL: localURL) +let model = SystemLanguageModel(adapter: adapter) +let session = LanguageModelSession(model: model) +let response = try await session.respond(to: "find files changed in the last hour") +print(response.content) +``` + +## Dataset + +~21k training examples from: + +| Source | Entries | Description | +|--------|---------|-------------| +| tldr-common | ~18k | Cross-platform shell commands | +| tldr-osx | ~950 | macOS-specific commands | +| override | ~130 | Curated corrections and macOS mappings | + +After dedup and benchmark exclusion: ~19k train / ~3k eval. + +## QLoRA: Training on 24GB Macs + +Standard LoRA needs ~24GB+ because the full base model (12GB in fp32) must fit in memory for the forward pass. QLoRA (Quantized LoRA) quantizes the base model to 4-bit before training, reducing memory to ~3GB for weights + overhead for activations and optimizer. + +QLoRA doesn't modify Apple's `tamm` library. You add a quantization step **between** loading the model and training — a small wrapper: + +```python +# Conceptual approach (write your own implementation): + +# 1. Load base model normally (Apple's code) +model = load_base_model(...) # fp32, ~12GB + +# 2. Quantize frozen weights to 4-bit (your code) +model = quantize_to_4bit(model) # ~3GB — only frozen W matrices + +# 3. LoRA matrices stay in fp16 (Apple's code, unchanged) +# 4. Train as normal — gradients only flow through A, B +train(model, train_data, ...) +``` + +The frozen base weights (W) are quantized and never updated — they just need to run the forward pass. The LoRA matrices (A, B) stay in fp16 for training precision. This is the key insight: you only need full precision for the parameters you're actually training. + +Libraries like `bitsandbytes` provide 4-bit quantization for PyTorch. The implementation is ~20-30 lines wrapping the model loading step. + +If this works on a 24GB Mac, it's a big deal — anyone with an M-series Mac could train their own adapter without cloud GPUs. 
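To make the "wrapper" shape concrete, here is a minimal sketch of the module-swap pattern using `bitsandbytes`. This is not Apple's code and not Mac-ready: `bitsandbytes` targets CUDA, so as written it only applies to the cloud-GPU path, and the helper name plus the `skip` set for LoRA modules are hypothetical placeholders you would adapt to the toolkit's actual model structure.

```python
import torch
import torch.nn as nn
import bitsandbytes as bnb


def quantize_frozen_linears(module: nn.Module, skip=frozenset(), prefix=""):
    """Swap every frozen nn.Linear for an NF4-quantized bitsandbytes layer.

    `skip` holds dotted names of modules to leave alone (the LoRA A/B
    projections) — how you identify those depends on the toolkit's model
    structure, so this set is a placeholder.
    """
    for child_name, child in list(module.named_children()):
        full_name = f"{prefix}.{child_name}" if prefix else child_name
        if isinstance(child, nn.Linear) and full_name not in skip:
            q = bnb.nn.Linear4bit(
                child.in_features,
                child.out_features,
                bias=child.bias is not None,
                compute_dtype=torch.float16,
                quant_type="nf4",
            )
            # Wrap the existing weight; it is actually quantized when the
            # module is moved to the GPU.
            q.weight = bnb.nn.Params4bit(
                child.weight.data, requires_grad=False, quant_type="nf4"
            )
            if child.bias is not None:
                q.bias = nn.Parameter(child.bias.data, requires_grad=False)
            setattr(module, child_name, q)
        else:
            quantize_frozen_linears(child, skip, full_name)
    return module


# model = load_base_model(...)              # Apple's loading code, unchanged
# model = quantize_frozen_linears(model)    # your wrapper, skipping LoRA layers
# model.to("cuda")                          # quantization happens on device transfer
# train(model, train_data, ...)             # Apple's training loop, unchanged
```

The LoRA A/B matrices must be excluded from the swap so they stay trainable in fp16; only the frozen W projections get quantized.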
+ +## Sharing and licensing + +**What can be shared (MIT / CC-BY):** +- `prepare_data.py` — training data preparation script +- `train_adapter.ipynb` — Colab notebook +- `train_cloud.sh` — CLI training script +- `train.jsonl` / `eval.jsonl` — derived from tldr (CC-BY 4.0) + overrides (MIT) +- Benchmark results, analysis, blog posts + +**What cannot be shared (Apple proprietary license):** +- Apple's adapter training toolkit or model weights +- Modified versions of the `tamm` library +- The `.fmadapter` export code + +Each developer needs their own Apple Developer Program membership ($99/year) to download the toolkit. + +## Future work + +- **QLoRA implementation**: Get training running on 24GB Macs +- **Adapter + retrieval hybrid**: Use the adapter for base knowledge, retrieval only for rare/new commands +- **Benchmark comparison**: Full 100-prompt benchmark — retrieval vs adapter vs hybrid +- **Blog post**: Document the full process for others exploring Apple's adapter training diff --git a/training/prepare_data.py b/training/prepare_data.py new file mode 100644 index 0000000..cf98d90 --- /dev/null +++ b/training/prepare_data.py @@ -0,0 +1,178 @@ +#!/usr/bin/env python3 +"""Convert the hunch bank into training data for Apple FM adapter training. + +Produces JSONL files in the format expected by Apple's adapter training toolkit: + [{"role": "system", "content": "..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}] + +Usage: + python3 prepare_data.py # generate train.jsonl + eval.jsonl + python3 prepare_data.py --stats # show dataset statistics + python3 prepare_data.py --eval-split 0.1 # 10% eval split (default) +""" + +import json +import sqlite3 +import random +import argparse +from pathlib import Path + +BANK_DB = Path(__file__).parent.parent / "bank" / "tldr_bank.db" +BENCHMARK_PROMPTS = Path(__file__).parent.parent / "benchmark" / "prompts.jsonl" +TRAIN_FILE = Path(__file__).parent / "train.jsonl" +EVAL_FILE = Path(__file__).parent / "eval.jsonl" + +SYSTEM_PROMPT = "Output a single shell command for zsh on macOS. No explanation, no markdown, no backticks. Just the command." 
+ + +def load_bank(): + """Load all Q/A pairs from the bank.""" + conn = sqlite3.connect(str(BANK_DB)) + rows = conn.execute( + "SELECT question, answer, cmd, source FROM bank" + ).fetchall() + conn.close() + return [{"q": q, "a": a, "cmd": cmd, "source": src} for q, a, cmd, src in rows] + + +def load_benchmark_prompts(): + """Load benchmark prompts to exclude from training data.""" + if not BENCHMARK_PROMPTS.exists(): + return set() + prompts = set() + with open(BENCHMARK_PROMPTS) as f: + for line in f: + p = json.loads(line) + prompts.add(p["prompt"].lower().strip()) + return prompts + + +def to_training_example(entry): + """Convert a bank entry to Apple FM training format.""" + return [ + {"role": "system", "content": SYSTEM_PROMPT}, + {"role": "user", "content": entry["q"]}, + {"role": "assistant", "content": entry["a"]}, + ] + + +def prepare_dataset(eval_split=0.1, exclude_benchmark=True, seed=42): + """Prepare train/eval splits from the bank.""" + bank = load_bank() + print(f"Loaded {len(bank)} entries from bank") + + # Count by source + by_source = {} + for entry in bank: + by_source[entry["source"]] = by_source.get(entry["source"], 0) + 1 + for src, count in sorted(by_source.items()): + print(f" {src}: {count}") + + # Exclude benchmark prompts from training to avoid data leakage + if exclude_benchmark: + benchmark = load_benchmark_prompts() + before = len(bank) + bank = [e for e in bank if e["q"].lower().strip() not in benchmark] + excluded = before - len(bank) + print(f"Excluded {excluded} entries matching benchmark prompts") + + # Deduplicate by (question, answer) + seen = set() + unique = [] + for entry in bank: + key = (entry["q"].lower().strip(), entry["a"].strip()) + if key not in seen: + seen.add(key) + unique.append(entry) + print(f"After dedup: {len(unique)} unique entries (removed {len(bank) - len(unique)})") + bank = unique + + # Prioritize overrides and osx entries by including them in both train and eval + overrides = [e for e in bank if e["source"] in ("override", "tldr-osx")] + common = [e for e in bank if e["source"] not in ("override", "tldr-osx")] + + # Split common entries + random.seed(seed) + random.shuffle(common) + eval_size = int(len(common) * eval_split) + eval_common = common[:eval_size] + train_common = common[eval_size:] + + # Overrides go in both train and eval + train = overrides + train_common + eval_data = overrides + eval_common + + random.shuffle(train) + random.shuffle(eval_data) + + print(f"\nDataset split:") + print(f" Train: {len(train)} examples") + print(f" Eval: {len(eval_data)} examples") + print(f" Override/osx entries in both: {len(overrides)}") + + return train, eval_data + + +def write_jsonl(data, path): + """Write training data in Apple FM format.""" + with open(path, "w") as f: + for entry in data: + example = to_training_example(entry) + f.write(json.dumps(example) + "\n") + print(f"Wrote {len(data)} examples to {path}") + + +def show_stats(data, label): + """Show dataset statistics.""" + by_source = {} + by_cmd = {} + total_q_len = 0 + total_a_len = 0 + + for entry in data: + by_source[entry["source"]] = by_source.get(entry["source"], 0) + 1 + by_cmd[entry["cmd"]] = by_cmd.get(entry["cmd"], 0) + 1 + total_q_len += len(entry["q"]) + total_a_len += len(entry["a"]) + + print(f"\n{label} ({len(data)} examples):") + print(f" By source:") + for src, count in sorted(by_source.items(), key=lambda x: -x[1]): + print(f" {src}: {count}") + print(f" Unique commands: {len(by_cmd)}") + print(f" Avg question length: {total_q_len / len(data):.0f} 
chars") + print(f" Avg answer length: {total_a_len / len(data):.0f} chars") + print(f" Top commands:") + for cmd, count in sorted(by_cmd.items(), key=lambda x: -x[1])[:10]: + print(f" {cmd}: {count}") + + +def main(): + parser = argparse.ArgumentParser() + parser.add_argument("--eval-split", type=float, default=0.1) + parser.add_argument("--stats", action="store_true") + parser.add_argument("--no-exclude-benchmark", action="store_true") + args = parser.parse_args() + + train, eval_data = prepare_dataset( + eval_split=args.eval_split, + exclude_benchmark=not args.no_exclude_benchmark, + ) + + if args.stats: + show_stats(train, "Train") + show_stats(eval_data, "Eval") + else: + write_jsonl(train, TRAIN_FILE) + write_jsonl(eval_data, EVAL_FILE) + + # Show a few examples + print("\nSample training examples:") + for entry in train[:3]: + ex = to_training_example(entry) + print(f" user: {ex[1]['content'][:60]}") + print(f" asst: {ex[2]['content'][:60]}") + print() + + +if __name__ == "__main__": + main() diff --git a/training/train_adapter.ipynb b/training/train_adapter.ipynb new file mode 100644 index 0000000..33c20e3 --- /dev/null +++ b/training/train_adapter.ipynb @@ -0,0 +1,282 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# hunch LoRA Adapter Training\n", + "\n", + "Train a custom adapter for Apple's 3B on-device Foundation Model.\n", + "\n", + "**Requirements:**\n", + "- Colab Pro (A100 GPU) or any CUDA machine with 24GB+ VRAM\n", + "- Apple's adapter training toolkit zip uploaded to Google Drive\n", + "- hunch training files (prepare_data.py, bank db, benchmark prompts)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1. Setup\n", + "\n", + "Upload to `My Drive/hunch-training/`:\n", + "- `adapter_training_toolkit_v26_0_0.zip` (from developer.apple.com)\n", + "- `prepare_data.py`\n", + "- `tldr_bank.db` (from bank/)\n", + "- `prompts.jsonl` (from benchmark/)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from google.colab import drive\n", + "drive.mount('/content/drive')\n", + "\n", + "DRIVE_DIR = '/content/drive/MyDrive/hunch-training'\n", + "WORK_DIR = '/content/hunch-training'" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Extract toolkit (only on first run)\n", + "import os\n", + "if not os.path.exists(f'{DRIVE_DIR}/adapter_training_toolkit_v26_0_0'):\n", + " !cd {DRIVE_DIR} && unzip -q adapter_training_toolkit_v26_0_0.zip\n", + " print('Extracted toolkit')\n", + "else:\n", + " print('Toolkit already extracted')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Copy to local VM (faster I/O during training)\n", + "!mkdir -p {WORK_DIR}\n", + "!cp -r {DRIVE_DIR}/adapter_training_toolkit_v26_0_0 {WORK_DIR}/\n", + "!cp {DRIVE_DIR}/prepare_data.py {WORK_DIR}/\n", + "\n", + "# Set up bank and benchmark paths for prepare_data.py\n", + "!mkdir -p {WORK_DIR}/../bank {WORK_DIR}/../benchmark\n", + "!cp {DRIVE_DIR}/tldr_bank.db {WORK_DIR}/../bank/\n", + "!cp {DRIVE_DIR}/prompts.jsonl {WORK_DIR}/../benchmark/\n", + "!ls {WORK_DIR}" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Install dependencies\n", + "!cd {WORK_DIR}/adapter_training_toolkit_v26_0_0 && pip install -r requirements.txt -q" + ] + }, + { + "cell_type": "code", + "execution_count": null, + 
"metadata": {}, + "outputs": [], + "source": [ + "# Verify GPU\n", + "import torch\n", + "print(f'CUDA available: {torch.cuda.is_available()}')\n", + "if torch.cuda.is_available():\n", + " print(f'GPU: {torch.cuda.get_device_name(0)}')\n", + " print(f'VRAM: {torch.cuda.get_device_properties(0).total_mem / 1024**3:.1f} GB')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2. Prepare Training Data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!cd {WORK_DIR} && python3 prepare_data.py --stats" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!cd {WORK_DIR} && python3 prepare_data.py" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3. Train Adapter\n", + "\n", + "~30-60 minutes on A100. Reduce batch_size to 2 if OOM." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!cd {WORK_DIR}/adapter_training_toolkit_v26_0_0 && python3 -m examples.train_adapter \\\n", + " --train-data ../train.jsonl \\\n", + " --eval-data ../eval.jsonl \\\n", + " --epochs 5 \\\n", + " --learning-rate 1e-3 \\\n", + " --batch-size 4 \\\n", + " --precision f16 \\\n", + " --activation-checkpointing \\\n", + " --checkpoint-dir ../checkpoints/" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 4. Evaluate" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "test_prompts = [\n", + " 'find files changed in the last hour',\n", + " 'show disk usage',\n", + " 'generate a random password',\n", + " 'kill a process by name',\n", + " 'show http headers of a url',\n", + " 'record terminal session',\n", + " 'find files larger than 100mb',\n", + " 'convert image to different format',\n", + " 'show all listening ports',\n", + " 'find files modified in the last 7 days',\n", + " 'find files owned by root',\n", + " 'count lines in all python files',\n", + " 'show all environment variables',\n", + " 'clear the terminal',\n", + " 'compare two files',\n", + "]\n", + "\n", + "import subprocess\n", + "for prompt in test_prompts:\n", + " result = subprocess.run(\n", + " ['python3', '-m', 'examples.generate',\n", + " '--prompt', prompt,\n", + " '--checkpoint', '../checkpoints/adapter-final.pt',\n", + " '--max-tokens', '50',\n", + " '--precision', 'f16'],\n", + " capture_output=True, text=True,\n", + " cwd=f'{WORK_DIR}/adapter_training_toolkit_v26_0_0'\n", + " )\n", + " output = result.stdout.strip().split('\\n')[-1] if result.stdout else result.stderr[:100]\n", + " print(f'Q: {prompt:<45} A: {output}')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 5. 
Compare: Base vs Adapted" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "for prompt in test_prompts:\n", + " # Base model (no adapter)\n", + " base = subprocess.run(\n", + " ['python3', '-m', 'examples.generate',\n", + " '--prompt', prompt, '--max-tokens', '50', '--precision', 'f16'],\n", + " capture_output=True, text=True,\n", + " cwd=f'{WORK_DIR}/adapter_training_toolkit_v26_0_0'\n", + " )\n", + " # Adapted model\n", + " adapted = subprocess.run(\n", + " ['python3', '-m', 'examples.generate',\n", + " '--prompt', prompt,\n", + " '--checkpoint', '../checkpoints/adapter-final.pt',\n", + " '--max-tokens', '50', '--precision', 'f16'],\n", + " capture_output=True, text=True,\n", + " cwd=f'{WORK_DIR}/adapter_training_toolkit_v26_0_0'\n", + " )\n", + " b = base.stdout.strip().split('\\n')[-1] if base.stdout else '?'\n", + " a = adapted.stdout.strip().split('\\n')[-1] if adapted.stdout else '?'\n", + " print(f'Q: {prompt}')\n", + " print(f' base: {b}')\n", + " print(f' adapted: {a}')\n", + " print()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 6. Export .fmadapter" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!cd {WORK_DIR}/adapter_training_toolkit_v26_0_0 && python3 -m export.export_fmadapter \\\n", + " --adapter-name hunch \\\n", + " --checkpoint ../checkpoints/adapter-final.pt \\\n", + " --output-dir ../exports/\n", + "\n", + "!ls -lh {WORK_DIR}/exports/" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Copy results back to Google Drive\n", + "!cp -r {WORK_DIR}/exports {DRIVE_DIR}/\n", + "!cp -r {WORK_DIR}/checkpoints {DRIVE_DIR}/\n", + "print('Saved to Google Drive. Download hunch.fmadapter and test on macOS 26.')" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python", + "version": "3.11.0" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/training/train_cloud.sh b/training/train_cloud.sh new file mode 100755 index 0000000..3dd820f --- /dev/null +++ b/training/train_cloud.sh @@ -0,0 +1,96 @@ +#!/bin/bash +# Train a LoRA adapter on a cloud GPU. +# +# Prerequisites: +# - GPU machine with CUDA (A10G, A100, etc.) +# - Python 3.11+ +# - Upload this entire training/ directory to the machine +# +# Usage: +# # On the cloud machine: +# cd training +# bash train_cloud.sh +# +# The script will: +# 1. Install dependencies +# 2. Generate training data from the bank +# 3. Train the adapter (~30-60 min on A10G) +# 4. Export .fmadapter +# 5. Evaluate on sample prompts + +set -euo pipefail + +TOOLKIT_DIR="adapter_training_toolkit_v26_0_0" +CHECKPOINT_DIR="checkpoints" +EXPORT_DIR="exports" + +echo "=== Setting up environment ===" +cd "$TOOLKIT_DIR" +pip install -r requirements.txt +cd .. + +echo "=== Generating training data ===" +python3 prepare_data.py + +echo "=== Training adapter ===" +echo "This may take 30-60 minutes on an A10G/A100..." 
+cd "$TOOLKIT_DIR" +python3 -m examples.train_adapter \ + --train-data ../train.jsonl \ + --eval-data ../eval.jsonl \ + --epochs 5 \ + --learning-rate 1e-3 \ + --batch-size 4 \ + --precision f16 \ + --activation-checkpointing \ + --checkpoint-dir "../$CHECKPOINT_DIR/" + +echo "=== Training draft model (for faster inference) ===" +python3 -m examples.train_draft_model \ + --checkpoint "../$CHECKPOINT_DIR/adapter-final.pt" \ + --train-data ../train.jsonl \ + --eval-data ../eval.jsonl \ + --epochs 5 \ + --learning-rate 1e-3 \ + --batch-size 4 \ + --precision f16 \ + --checkpoint-dir "../$CHECKPOINT_DIR/" + +echo "=== Evaluating ===" +PROMPTS=( + "find files changed in the last hour" + "show disk usage" + "generate a random password" + "kill a process by name" + "show http headers of a url" + "record terminal session" + "find files larger than 100mb" + "convert image to different format" + "show all listening ports" + "find files modified in the last 7 days" +) + +for prompt in "${PROMPTS[@]}"; do + echo -n "Q: $prompt → " + python3 -m examples.generate \ + --prompt "$prompt" \ + --checkpoint "../$CHECKPOINT_DIR/adapter-final.pt" \ + --max-tokens 50 \ + --precision f16 2>/dev/null | tail -1 +done + +echo "" +echo "=== Exporting .fmadapter ===" +python3 -m export.export_fmadapter \ + --adapter-name hunch \ + --checkpoint "../$CHECKPOINT_DIR/adapter-final.pt" \ + --draft-checkpoint "../$CHECKPOINT_DIR/draft-model-final.pt" \ + --output-dir "../$EXPORT_DIR/" + +cd .. +echo "" +echo "=== Done ===" +echo "Adapter: $EXPORT_DIR/hunch.fmadapter" +echo "Size: $(du -sh $EXPORT_DIR/hunch.fmadapter 2>/dev/null | cut -f1)" +echo "" +echo "Download hunch.fmadapter and test locally on macOS 26." From aee7d265ce5b3a70c7fc6077f862aecf252b18b3 Mon Sep 17 00:00:00 2001 From: Enrico Santagati Date: Sun, 12 Apr 2026 11:00:23 -0400 Subject: [PATCH 02/11] Remove training README from git, keep locally --- .gitignore | 1 + training/README.md | 249 --------------------------------------------- 2 files changed, 1 insertion(+), 249 deletions(-) delete mode 100644 training/README.md diff --git a/.gitignore b/.gitignore index b61c4bb..4d326fc 100644 --- a/.gitignore +++ b/.gitignore @@ -19,3 +19,4 @@ benchmark/GUIDED_GENERATION.md training/train.jsonl training/eval.jsonl training/adapter_training_toolkit* +training/README.md diff --git a/training/README.md b/training/README.md deleted file mode 100644 index 109d1b5..0000000 --- a/training/README.md +++ /dev/null @@ -1,249 +0,0 @@ -# LoRA Adapter Training for Apple's On-Device Model - -An experiment in fine-tuning Apple's 3B on-device Foundation Model (AFM) using LoRA adapters. This is an alternative approach to hunch's retrieval-based pipeline — instead of feeding examples at runtime, we bake command knowledge directly into the model weights. - -This is primarily an academic exercise and a resource for others exploring Apple's adapter training toolkit. The retrieval approach in hunch already achieves ~83% accuracy; this explores whether fine-tuning can do better, and documents the process for anyone who wants to train their own adapter. - -## Background - -### Two approaches to the same problem - -The base 3B model gets ~40% accuracy on shell command generation. Two ways to improve it: - -1. **Retrieval (what hunch ships)**: Search a bank of 21k examples, inject the 8 most relevant into the prompt. The model copies patterns. ~83% accuracy, 0.5s latency, 4MB database, works across OS updates. - -2. 
**Fine-tuning (this experiment)**: Train a LoRA adapter that teaches the model shell commands directly. No retrieval needed at runtime. Potentially higher accuracy, but ~160MB adapter, tied to one OS version, requires retraining on each macOS update. - -Neither is strictly better — they have different tradeoffs. A hybrid approach (adapter for base knowledge + retrieval for rare commands) could combine the best of both. - -### How LoRA works - -The base model has 3.18B parameters across 56 transformer layers. LoRA (Low-Rank Adaptation) freezes all of them and adds small trainable matrices alongside the existing layers: - -``` -Original: output = W(input) # W is frozen (huge) -With LoRA: output = W(input) + A(B(input)) # A, B are small, trainable -``` - -B compresses input to a low-rank space (rank 32), A expands it back. The product A*B is the same shape as W but built from ~66M parameters instead of 3.18B. Only A and B are trained — the base model doesn't change. - -The output is a `.fmadapter` file (~160MB) containing only these correction matrices. On-device, it loads on top of the existing model at runtime. - -### Caveats - -- **Version lock**: Each adapter is tied to one macOS version. New OS release = retrain. -- **Size**: ~160MB per adapter. Too large to bundle in a CLI tool — would need separate download. -- **Entitlement**: Not needed for training or local testing. Required for App Store distribution. -- **Memory**: Training needs 24GB+ VRAM. The 12GB base model weights must fit in memory for the forward pass. - -## Prerequisites - -1. **Apple Developer Program membership** ($99/year) — needed to download the training toolkit -2. **Adapter training toolkit** — download from [developer.apple.com/apple-intelligence/foundation-models-adapter/](https://developer.apple.com/apple-intelligence/foundation-models-adapter/) -3. **GPU with 24GB+ VRAM** — A10G, A100, or similar. Mac with 32GB+ works via MPS. 24GB Mac will OOM. -4. **Python 3.11+** - -## Setup - -### 1. Download the toolkit - -Sign in at developer.apple.com, download the adapter training toolkit, extract into this directory: - -``` -training/ -├── adapter_training_toolkit_v26_0_0/ # Apple's toolkit (~12GB with model weights) -│ ├── assets/ # Base model weights (12GB) -│ ├── examples/ # Training scripts -│ ├── export/ # .fmadapter export (DO NOT MODIFY) -│ └── requirements.txt -├── prepare_data.py # Converts hunch bank → training JSONL -├── train_adapter.ipynb # Colab notebook -├── train_cloud.sh # CLI training script -└── README.md -``` - -### 2. Prepare training data - -```bash -cd training - -# Generate train.jsonl and eval.jsonl from the bank -python3 prepare_data.py - -# Show dataset statistics -python3 prepare_data.py --stats -``` - -This produces ~19k training and ~3k eval examples in Apple's expected format: -```json -[ - {"role": "system", "content": "Output a single shell command for zsh on macOS..."}, - {"role": "user", "content": "find files changed in the last hour"}, - {"role": "assistant", "content": "find . -mmin -60"} -] -``` - -Benchmark prompts are excluded from training data to avoid leakage. Override and tldr-osx entries appear in both train and eval sets (they're the most important examples). - -## Training - -### Option A: Google Colab (recommended) - -Easiest path. Requires Colab Pro ($10/month) for A100 GPU access. - -1. 
**Upload to Google Drive:** - - Create `My Drive/hunch-training/` - - Upload `adapter_training_toolkit_v26_0_0/` (~12GB, do this once) - - Upload `prepare_data.py` - - Upload `../bank/tldr_bank.db` and `../benchmark/prompts.jsonl` - -2. **Install VS Code extension:** - - In VS Code, install the "Google Colab" extension - - Open `train_adapter.ipynb` - - Click "Select Kernel" → Colab → choose A100 runtime - - Sign in with Google - -3. **Run the notebook cells in order.** Training takes ~30-60 min on A100. - -4. **Download the result:** The notebook saves `hunch.fmadapter` back to Google Drive. - -### Option B: Cloud GPU via SSH - -For Lambda Labs, GCP, AWS, or any machine with a CUDA GPU: - -```bash -# Upload training directory to the machine -rsync -avz --progress training/ user@gpu-machine:~/training/ - -# SSH in and run -ssh user@gpu-machine -cd training -bash train_cloud.sh -``` - -The script installs dependencies, generates data, trains, evaluates, and exports. Download `exports/hunch.fmadapter` when done. - -### Option C: Mac with 32GB+ RAM - -Works via MPS (Metal Performance Shaders) but slower than CUDA. 24GB Macs will OOM. - -```bash -cd training/adapter_training_toolkit_v26_0_0 -python3 -m venv venv -source venv/bin/activate -pip install -r requirements.txt - -# Train with f16 precision and activation checkpointing to save memory -python3 -m examples.train_adapter \ - --train-data ../train.jsonl \ - --eval-data ../eval.jsonl \ - --epochs 5 \ - --learning-rate 1e-3 \ - --batch-size 2 \ - --precision f16 \ - --activation-checkpointing \ - --checkpoint-dir ../checkpoints/ -``` - -## Evaluation - -After training, compare base vs adapted: - -```bash -cd adapter_training_toolkit_v26_0_0 - -# Base model (no adapter) -python3 -m examples.generate --prompt "find files changed in the last hour" --precision f16 - -# With adapter -python3 -m examples.generate \ - --prompt "find files changed in the last hour" \ - --checkpoint ../checkpoints/adapter-final.pt \ - --precision f16 -``` - -## Export - -```bash -python3 -m export.export_fmadapter \ - --adapter-name hunch \ - --checkpoint ../checkpoints/adapter-final.pt \ - --output-dir ../exports/ -``` - -Output: `exports/hunch.fmadapter` (~160MB) - -**Do not modify the export code** — the `.fmadapter` format must match exactly for on-device compatibility. - -## Loading in Swift - -```swift -// Local testing (no entitlement needed) -let localURL = URL(filePath: "/path/to/hunch.fmadapter") -let adapter = try SystemLanguageModel.Adapter(fileURL: localURL) -let model = SystemLanguageModel(adapter: adapter) -let session = LanguageModelSession(model: model) -let response = try await session.respond(to: "find files changed in the last hour") -print(response.content) -``` - -## Dataset - -~21k training examples from: - -| Source | Entries | Description | -|--------|---------|-------------| -| tldr-common | ~18k | Cross-platform shell commands | -| tldr-osx | ~950 | macOS-specific commands | -| override | ~130 | Curated corrections and macOS mappings | - -After dedup and benchmark exclusion: ~19k train / ~3k eval. - -## QLoRA: Training on 24GB Macs - -Standard LoRA needs ~24GB+ because the full base model (12GB in fp32) must fit in memory for the forward pass. QLoRA (Quantized LoRA) quantizes the base model to 4-bit before training, reducing memory to ~3GB for weights + overhead for activations and optimizer. - -QLoRA doesn't modify Apple's `tamm` library. 
You add a quantization step **between** loading the model and training — a small wrapper: - -```python -# Conceptual approach (write your own implementation): - -# 1. Load base model normally (Apple's code) -model = load_base_model(...) # fp32, ~12GB - -# 2. Quantize frozen weights to 4-bit (your code) -model = quantize_to_4bit(model) # ~3GB — only frozen W matrices - -# 3. LoRA matrices stay in fp16 (Apple's code, unchanged) -# 4. Train as normal — gradients only flow through A, B -train(model, train_data, ...) -``` - -The frozen base weights (W) are quantized and never updated — they just need to run the forward pass. The LoRA matrices (A, B) stay in fp16 for training precision. This is the key insight: you only need full precision for the parameters you're actually training. - -Libraries like `bitsandbytes` provide 4-bit quantization for PyTorch. The implementation is ~20-30 lines wrapping the model loading step. - -If this works on a 24GB Mac, it's a big deal — anyone with an M-series Mac could train their own adapter without cloud GPUs. - -## Sharing and licensing - -**What can be shared (MIT / CC-BY):** -- `prepare_data.py` — training data preparation script -- `train_adapter.ipynb` — Colab notebook -- `train_cloud.sh` — CLI training script -- `train.jsonl` / `eval.jsonl` — derived from tldr (CC-BY 4.0) + overrides (MIT) -- Benchmark results, analysis, blog posts - -**What cannot be shared (Apple proprietary license):** -- Apple's adapter training toolkit or model weights -- Modified versions of the `tamm` library -- The `.fmadapter` export code - -Each developer needs their own Apple Developer Program membership ($99/year) to download the toolkit. - -## Future work - -- **QLoRA implementation**: Get training running on 24GB Macs -- **Adapter + retrieval hybrid**: Use the adapter for base knowledge, retrieval only for rare/new commands -- **Benchmark comparison**: Full 100-prompt benchmark — retrieval vs adapter vs hybrid -- **Blog post**: Document the full process for others exploring Apple's adapter training From 29d0779135be3ca33024365447b45f6ed14832d4 Mon Sep 17 00:00:00 2001 From: Enrico Santagati Date: Wed, 15 Apr 2026 15:25:11 -0400 Subject: [PATCH 03/11] Add QLoRA training notebooks and adapter benchmark approaches Adds --adapter flag to hunch CLI, QLoRA/LoRA benchmark approaches in run.py, source filtering in prepare_data.py, and training notebooks for LoRA, fp16 LoRA, and QLoRA experiments. 
--- .gitignore | 2 + README.md | 10 + benchmark/run.py | 57 +++++ cli/Sources/Hunch/main.swift | 17 +- training/TRAINING.md | 159 ++++++++++++ training/prepare_data.py | 61 +++-- training/train_adapter.ipynb | 282 --------------------- training/train_cloud.sh | 96 ------- training/train_lora.ipynb | 408 ++++++++++++++++++++++++++++++ training/train_lora_fp16.ipynb | 228 +++++++++++++++++ training/train_qlora.ipynb | 450 +++++++++++++++++++++++++++++++++ training/train_qlora_full.py | 371 +++++++++++++++++++++++++++ training/train_qlora_test.py | 216 ++++++++++++++++ 13 files changed, 1953 insertions(+), 404 deletions(-) create mode 100644 training/TRAINING.md delete mode 100644 training/train_adapter.ipynb delete mode 100755 training/train_cloud.sh create mode 100644 training/train_lora.ipynb create mode 100644 training/train_lora_fp16.ipynb create mode 100644 training/train_qlora.ipynb create mode 100644 training/train_qlora_full.py create mode 100644 training/train_qlora_test.py diff --git a/.gitignore b/.gitignore index 4d326fc..3e783e0 100644 --- a/.gitignore +++ b/.gitignore @@ -20,3 +20,5 @@ training/train.jsonl training/eval.jsonl training/adapter_training_toolkit* training/README.md +training/exports/ +training/qlora-checkpoints/ diff --git a/README.md b/README.md index cbd9bed..ee401a8 100644 --- a/README.md +++ b/README.md @@ -171,6 +171,16 @@ make install This clones [tldr-pages](https://github.com/tldr-pages/tldr), parses all entries into Q/A pairs, adds macOS-specific overrides, and rebuilds the FTS5 index. +## LoRA Adapter Training (experimental) + +The `training/` directory contains infrastructure for fine-tuning Apple's on-device 3B model using LoRA adapters. This is an exploration of whether fine-tuning can complement or replace the retrieval approach — see `training/README.md` for full details, results, and notebooks. + +```bash +hunch --adapter path/to/hunch.fmadapter "find files changed in the last hour" +``` + +Current finding: retrieval alone (~83%) still outperforms adapter-based approaches, but the training pipeline and QLoRA patches (enabling training on free Colab T4 GPUs) may be useful to others exploring Apple's adapter toolkit. + ## Known limitations - **4K token context window** — the system prompt + 8 examples + query + output must fit. Current prompts use ~200-400 tokens, well within budget. 
diff --git a/benchmark/run.py b/benchmark/run.py index b7365ed..b8328c6 100755 --- a/benchmark/run.py +++ b/benchmark/run.py @@ -504,6 +504,53 @@ def approach_hunch_multi_warm(prompt): return _run_hunch(prompt, ["--guided", "multi", "--temperature", "0.3"]) +ADAPTER_PATH = str(Path(__file__).parent.parent / "training" / "exports" / "hunch.fmadapter") +QLORA_FP16_ADAPTER_PATH = str(Path(__file__).parent.parent / "training" / "qlora-checkpoints" / "hunch_qlora_fp16.fmadapter") +QLORA_NF4_ADAPTER_PATH = str(Path(__file__).parent.parent / "training" / "qlora-checkpoints" / "hunch_qlora.fmadapter") +QLORA_OVERRIDE_ADAPTER_PATH = str(Path(__file__).parent.parent / "training" / "qlora-checkpoints" / "hunch_qlora_overrides.fmadapter") +LORA_OVERRIDE_ADAPTER_PATH = str(Path(__file__).parent.parent / "training" / "exports" / "hunch_overrides.fmadapter") + + +def approach_adapter_only(prompt): + """LoRA adapter only, no retrieval.""" + return _run_hunch(prompt, ["--adapter", ADAPTER_PATH, "--limit", "0"]) + + +def approach_adapter_retrieval(prompt): + """LoRA adapter + retrieval.""" + return _run_hunch(prompt, ["--adapter", ADAPTER_PATH]) + + +def approach_fp16lora_only(prompt): + """fp16 LoRA adapter only, no retrieval.""" + return _run_hunch(prompt, ["--adapter", QLORA_FP16_ADAPTER_PATH, "--limit", "0"]) + + +def approach_fp16lora_retrieval(prompt): + """fp16 LoRA adapter + retrieval.""" + return _run_hunch(prompt, ["--adapter", QLORA_FP16_ADAPTER_PATH]) + + +def approach_qlora_only(prompt): + """True QLoRA (NF4) adapter only, no retrieval.""" + return _run_hunch(prompt, ["--adapter", QLORA_NF4_ADAPTER_PATH, "--limit", "0"]) + + +def approach_qlora_retrieval(prompt): + """True QLoRA (NF4) adapter + retrieval.""" + return _run_hunch(prompt, ["--adapter", QLORA_NF4_ADAPTER_PATH]) + + +def approach_qlora_override_only(prompt): + """QLoRA trained on overrides only, no retrieval.""" + return _run_hunch(prompt, ["--adapter", QLORA_OVERRIDE_ADAPTER_PATH, "--limit", "0"]) + + +def approach_qlora_override_retrieval(prompt): + """QLoRA trained on overrides only + retrieval.""" + return _run_hunch(prompt, ["--adapter", QLORA_OVERRIDE_ADAPTER_PATH]) + + def approach_dynshot_tldr(prompt): """Dynamic few-shot using tldr+overrides FTS5 index (21k entries).""" import sqlite3 @@ -576,6 +623,16 @@ def approach_dynshot_holdout(prompt): "hunch-multi": approach_hunch_multi, "hunch-cotmulti": approach_hunch_cotmulti, "hunch-multi-warm": approach_hunch_multi_warm, + "adapter-only": approach_adapter_only, + "adapter-retrieval": approach_adapter_retrieval, + "fp16lora-only": approach_fp16lora_only, + "fp16lora-retrieval": approach_fp16lora_retrieval, + "qlora-only": approach_qlora_only, + "qlora-retrieval": approach_qlora_retrieval, + "qlora-override-only": approach_qlora_override_only, + "qlora-override-retrieval": approach_qlora_override_retrieval, + "lora-override-only": lambda p: _run_hunch(p, ["--adapter", LORA_OVERRIDE_ADAPTER_PATH, "--limit", "0"]), + "lora-override-retrieval": lambda p: _run_hunch(p, ["--adapter", LORA_OVERRIDE_ADAPTER_PATH]), "hunch-sc": approach_hunch_sc, "sc-dynshot": approach_selfconsist_dynshot, "sc-warm": approach_selfconsist_warm, diff --git a/cli/Sources/Hunch/main.swift b/cli/Sources/Hunch/main.swift index 18c9c65..ba6a296 100644 --- a/cli/Sources/Hunch/main.swift +++ b/cli/Sources/Hunch/main.swift @@ -158,6 +158,7 @@ struct Hunch { let samples = parseFlag(&args, flag: "--samples").flatMap(Int.init) ?? 1 let limit = parseFlag(&args, flag: "--limit").flatMap(Int.init) ?? 
8 let guided = parseFlag(&args, flag: "--guided") + let adapterPath = parseFlag(&args, flag: "--adapter") // Parse mode var mode: Mode = .suggest @@ -238,9 +239,19 @@ struct Hunch { let systemPrompt = buildSystemPrompt(mode: mode, examples: examples) do { - let model = SystemLanguageModel( - guardrails: .permissiveContentTransformations - ) + let model: SystemLanguageModel + if let adapterPath { + let adapterURL = URL(fileURLWithPath: adapterPath) + let adapter = try SystemLanguageModel.Adapter(fileURL: adapterURL) + model = SystemLanguageModel( + adapter: adapter, + guardrails: .permissiveContentTransformations + ) + } else { + model = SystemLanguageModel( + guardrails: .permissiveContentTransformations + ) + } // Build generation options only when temperature is set let genOptions: GenerationOptions? = temperature.map { diff --git a/training/TRAINING.md b/training/TRAINING.md new file mode 100644 index 0000000..5fc782a --- /dev/null +++ b/training/TRAINING.md @@ -0,0 +1,159 @@ +# Training Guide + +How to train a LoRA adapter for Apple's on-device 3B Foundation Model using the hunch dataset. + +## Prerequisites + +1. **Apple Developer Program** ($99/year) — needed to download the training toolkit +2. **Adapter training toolkit** — download from [developer.apple.com/apple-intelligence/foundation-models-adapter/](https://developer.apple.com/apple-intelligence/foundation-models-adapter/) +3. **Google account** — for Colab (free tier works for fp16 LoRA, Pro/pay-as-you-go needed for standard LoRA) + +## Files + +``` +training/ +├── train_lora.ipynb # LoRA training notebook (needs A100) +├── train_lora_fp16.ipynb # fp16 LoRA training notebook (works on free T4) +├── prepare_data.py # Converts hunch bank → training JSONL +└── README.md # Full experiment writeup and results +``` + +## Quick Start + +### 1. Download the toolkit + +Download from developer.apple.com, extract into this directory: + +``` +training/adapter_training_toolkit_v26_0_0/ +├── assets/ # Base model weights (12GB) +├── examples/ # Training scripts +├── export/ # .fmadapter export +└── requirements.txt +``` + +### 2. Upload to Google Drive + +Create `My Drive/hunch-training/` and upload: + +``` +hunch-training/ +├── adapter_training_toolkit_v26_0_0/ # The extracted toolkit +├── prepare_data.py # From this directory +├── tldr_bank.db # From ../bank/ +└── prompts.jsonl # From ../benchmark/ +``` + +### 3. Choose your notebook + +| Notebook | GPU | Cost | Time | Patches needed | +|----------|-----|------|------|----------------| +| `train_lora.ipynb` | A100 40GB | Colab Pro ($10/mo) | ~1.5 hours | None | +| `train_lora_fp16.ipynb` | T4 16GB | Free | ~2 hours | 3 patches (applied automatically) | + +### 4. Open in Colab + +Install the Google Colab extension in VS Code, open the notebook, select a Colab kernel with the appropriate GPU, and run the cells in order. + +Alternatively, upload the notebook to [colab.research.google.com](https://colab.research.google.com) directly. + +### 5. 
Test on-device + +Download the exported `.fmadapter` from Google Drive and test with hunch: + +```bash +hunch --adapter path/to/hunch.fmadapter "find files changed in the last hour" +``` + +## Training Data + +`prepare_data.py` converts the hunch bank into training JSONL: + +```bash +python3 prepare_data.py # generates train.jsonl + eval.jsonl +python3 prepare_data.py --stats # show dataset statistics +``` + +Each training example: +```json +[ + {"role": "system", "content": "Output a single shell command for zsh on macOS..."}, + {"role": "user", "content": "find files changed in the last hour"}, + {"role": "assistant", "content": "find . -mmin -60"} +] +``` + +- ~19k training / ~3k eval examples +- Benchmark prompts excluded to avoid data leakage +- Override and tldr-osx entries appear in both splits + +## fp16 LoRA Patches Explained + +The fp16 LoRA notebook (`train_lora_fp16.ipynb`) applies three patches to Apple's toolkit to fit training on a T4 (16GB GPU, 12GB system RAM): + +**Patch 1 — `utils.py`: mmap + fp16 model + fp32 adapters** +- `mmap=True` on `torch.load`: reads weights from disk on demand instead of loading 12GB into RAM +- `model_config.dtype = torch.float16`: creates the model in fp16 (6GB GPU instead of 12GB) +- Casts adapter weights back to fp32: GradScaler needs fp32 gradients + +**Patch 2 — `train_adapter.py`: gradient scaling for f16-mixed** +- Apple's code only enables GradScaler for a `"f16"` precision mode that isn't exposed as a CLI option +- When running with `f16-mixed` and an fp16 model, gradients overflow without scaling → loss = NaN +- Fix: enable GradScaler for `f16-mixed` too + +**Patch 3 — `tamm/layers/functional.py`: rms_norm dtype fix** +- `torch.rms_norm` requires input and weight to have the same dtype +- fp16 model has fp16 weights, but mixed-precision casts input to fp32 +- Fix: cast weight to match input dtype before calling rms_norm + +All patches are applied automatically by the notebook. To restore originals, re-copy from the toolkit on Drive. + +## Export + +The export step packages the LoRA weights into a `.fmadapter` file that can be loaded on-device: + +```bash +python3 -m export.export_fmadapter \ + --adapter-name hunch \ + --checkpoint ../checkpoints/adapter-final.pt \ + --output-dir ../exports/ +``` + +Output is ~127MB. The adapter name can only contain letters, numbers, and underscores. + +**Do not modify the export code** — the `.fmadapter` format must match exactly for on-device compatibility. + +## Loading in Swift + +```swift +let adapter = try SystemLanguageModel.Adapter(fileURL: localURL) +let model = SystemLanguageModel(adapter: adapter) +let session = LanguageModelSession(model: model) +let response = try await session.respond(to: "find files changed in the last hour") +``` + +No entitlement needed for local testing. Entitlement required only for App Store distribution. + +## Key Training Parameters + +| Parameter | LoRA (A100) | fp16 LoRA (T4) | +|-----------|-------------|------------| +| `--precision` | bf16-mixed | f16-mixed | +| `--batch-size` | 8 | 8 | +| `--learning-rate` | 1e-4 | 1e-4 | +| `--epochs` | 3 | 3 | +| `--activation-checkpointing` | yes | yes | + +**Note:** lr=1e-3 diverges. Always use 1e-4. + +## Troubleshooting + +**OOM on T4:** Make sure all three fp16 LoRA patches are applied. Run the patch cell before training. + +**loss = NaN:** The rms_norm patch didn't apply, or the pycache is stale. The notebook clears pycache automatically, but if you see NaN, restart the kernel and re-run from the patch cell. 
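For reference when checking whether Patch 3 applied, the fix amounts to a one-line dtype cast before the `torch.rms_norm` call. A minimal sketch, assuming the toolkit's `tamm/layers/functional.py` wraps `torch.rms_norm` roughly like the stock PyTorch signature — the wrapper name below is illustrative, not the actual `tamm` function:

```python
import torch

def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    # torch.rms_norm requires input and weight to share a dtype; under
    # f16-mixed the activations are promoted to fp32 while the fp16 model's
    # weight stays fp16, so cast the weight to match before the call.
    return torch.rms_norm(x, (x.shape[-1],), weight.to(x.dtype), eps)
```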
+ +**Return code -9:** The OS killed the process for memory. On T4, this means system RAM (12GB) is full. Make sure mmap is patched (check for `mmap=True` in utils.py). + +**Adapter name error:** Use only letters, numbers, and underscores. No hyphens. + +**coremltools warnings:** Ignore them. The export works despite the warnings. diff --git a/training/prepare_data.py b/training/prepare_data.py index cf98d90..dea923c 100644 --- a/training/prepare_data.py +++ b/training/prepare_data.py @@ -55,11 +55,30 @@ def to_training_example(entry): ] -def prepare_dataset(eval_split=0.1, exclude_benchmark=True, seed=42): - """Prepare train/eval splits from the bank.""" +def prepare_dataset(eval_split=0.1, exclude_benchmark=True, seed=42, sources=None): + """Prepare train/eval splits from the bank. + + Args: + sources: filter by source. Options: + None or "all" — everything (default) + "override" — overrides only (~130 examples) + "macos" — overrides + tldr-osx (~1k examples) + "override,tldr-osx" — comma-separated list + """ bank = load_bank() print(f"Loaded {len(bank)} entries from bank") + # Filter by source if specified + if sources and sources != "all": + allowed = set(s.strip() for s in sources.split(",")) + # "macos" is a shorthand for override + tldr-osx + if "macos" in allowed: + allowed.discard("macos") + allowed.update(["override", "tldr-osx"]) + before = len(bank) + bank = [e for e in bank if e["source"] in allowed] + print(f"Filtered to sources {allowed}: {len(bank)} entries (from {before})") + # Count by source by_source = {} for entry in bank: @@ -86,28 +105,22 @@ def prepare_dataset(eval_split=0.1, exclude_benchmark=True, seed=42): print(f"After dedup: {len(unique)} unique entries (removed {len(bank) - len(unique)})") bank = unique - # Prioritize overrides and osx entries by including them in both train and eval - overrides = [e for e in bank if e["source"] in ("override", "tldr-osx")] - common = [e for e in bank if e["source"] not in ("override", "tldr-osx")] - - # Split common entries + # Split into train/eval random.seed(seed) - random.shuffle(common) - eval_size = int(len(common) * eval_split) - eval_common = common[:eval_size] - train_common = common[eval_size:] - - # Overrides go in both train and eval - train = overrides + train_common - eval_data = overrides + eval_common - - random.shuffle(train) - random.shuffle(eval_data) - - print(f"\nDataset split:") - print(f" Train: {len(train)} examples") - print(f" Eval: {len(eval_data)} examples") - print(f" Override/osx entries in both: {len(overrides)}") + random.shuffle(bank) + eval_size = max(int(len(bank) * eval_split), 1) + eval_data = bank[:eval_size] + train = bank[eval_size:] + + # For small datasets, put everything in both + if len(bank) < 500: + train = bank + eval_data = bank + print(f"Small dataset — using all {len(bank)} examples for both train and eval") + else: + print(f"\nDataset split:") + print(f" Train: {len(train)} examples") + print(f" Eval: {len(eval_data)} examples") return train, eval_data @@ -151,11 +164,13 @@ def main(): parser.add_argument("--eval-split", type=float, default=0.1) parser.add_argument("--stats", action="store_true") parser.add_argument("--no-exclude-benchmark", action="store_true") + parser.add_argument("--sources", default=None, help="Filter sources: override, macos, tldr-osx, tldr-common, or all") args = parser.parse_args() train, eval_data = prepare_dataset( eval_split=args.eval_split, exclude_benchmark=not args.no_exclude_benchmark, + sources=args.sources, ) if args.stats: diff --git 
a/training/train_adapter.ipynb b/training/train_adapter.ipynb deleted file mode 100644 index 33c20e3..0000000 --- a/training/train_adapter.ipynb +++ /dev/null @@ -1,282 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# hunch LoRA Adapter Training\n", - "\n", - "Train a custom adapter for Apple's 3B on-device Foundation Model.\n", - "\n", - "**Requirements:**\n", - "- Colab Pro (A100 GPU) or any CUDA machine with 24GB+ VRAM\n", - "- Apple's adapter training toolkit zip uploaded to Google Drive\n", - "- hunch training files (prepare_data.py, bank db, benchmark prompts)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 1. Setup\n", - "\n", - "Upload to `My Drive/hunch-training/`:\n", - "- `adapter_training_toolkit_v26_0_0.zip` (from developer.apple.com)\n", - "- `prepare_data.py`\n", - "- `tldr_bank.db` (from bank/)\n", - "- `prompts.jsonl` (from benchmark/)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from google.colab import drive\n", - "drive.mount('/content/drive')\n", - "\n", - "DRIVE_DIR = '/content/drive/MyDrive/hunch-training'\n", - "WORK_DIR = '/content/hunch-training'" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Extract toolkit (only on first run)\n", - "import os\n", - "if not os.path.exists(f'{DRIVE_DIR}/adapter_training_toolkit_v26_0_0'):\n", - " !cd {DRIVE_DIR} && unzip -q adapter_training_toolkit_v26_0_0.zip\n", - " print('Extracted toolkit')\n", - "else:\n", - " print('Toolkit already extracted')" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Copy to local VM (faster I/O during training)\n", - "!mkdir -p {WORK_DIR}\n", - "!cp -r {DRIVE_DIR}/adapter_training_toolkit_v26_0_0 {WORK_DIR}/\n", - "!cp {DRIVE_DIR}/prepare_data.py {WORK_DIR}/\n", - "\n", - "# Set up bank and benchmark paths for prepare_data.py\n", - "!mkdir -p {WORK_DIR}/../bank {WORK_DIR}/../benchmark\n", - "!cp {DRIVE_DIR}/tldr_bank.db {WORK_DIR}/../bank/\n", - "!cp {DRIVE_DIR}/prompts.jsonl {WORK_DIR}/../benchmark/\n", - "!ls {WORK_DIR}" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Install dependencies\n", - "!cd {WORK_DIR}/adapter_training_toolkit_v26_0_0 && pip install -r requirements.txt -q" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Verify GPU\n", - "import torch\n", - "print(f'CUDA available: {torch.cuda.is_available()}')\n", - "if torch.cuda.is_available():\n", - " print(f'GPU: {torch.cuda.get_device_name(0)}')\n", - " print(f'VRAM: {torch.cuda.get_device_properties(0).total_mem / 1024**3:.1f} GB')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 2. Prepare Training Data" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "!cd {WORK_DIR} && python3 prepare_data.py --stats" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "!cd {WORK_DIR} && python3 prepare_data.py" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 3. Train Adapter\n", - "\n", - "~30-60 minutes on A100. Reduce batch_size to 2 if OOM." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "!cd {WORK_DIR}/adapter_training_toolkit_v26_0_0 && python3 -m examples.train_adapter \\\n", - " --train-data ../train.jsonl \\\n", - " --eval-data ../eval.jsonl \\\n", - " --epochs 5 \\\n", - " --learning-rate 1e-3 \\\n", - " --batch-size 4 \\\n", - " --precision f16 \\\n", - " --activation-checkpointing \\\n", - " --checkpoint-dir ../checkpoints/" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 4. Evaluate" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "test_prompts = [\n", - " 'find files changed in the last hour',\n", - " 'show disk usage',\n", - " 'generate a random password',\n", - " 'kill a process by name',\n", - " 'show http headers of a url',\n", - " 'record terminal session',\n", - " 'find files larger than 100mb',\n", - " 'convert image to different format',\n", - " 'show all listening ports',\n", - " 'find files modified in the last 7 days',\n", - " 'find files owned by root',\n", - " 'count lines in all python files',\n", - " 'show all environment variables',\n", - " 'clear the terminal',\n", - " 'compare two files',\n", - "]\n", - "\n", - "import subprocess\n", - "for prompt in test_prompts:\n", - " result = subprocess.run(\n", - " ['python3', '-m', 'examples.generate',\n", - " '--prompt', prompt,\n", - " '--checkpoint', '../checkpoints/adapter-final.pt',\n", - " '--max-tokens', '50',\n", - " '--precision', 'f16'],\n", - " capture_output=True, text=True,\n", - " cwd=f'{WORK_DIR}/adapter_training_toolkit_v26_0_0'\n", - " )\n", - " output = result.stdout.strip().split('\\n')[-1] if result.stdout else result.stderr[:100]\n", - " print(f'Q: {prompt:<45} A: {output}')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 5. Compare: Base vs Adapted" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "for prompt in test_prompts:\n", - " # Base model (no adapter)\n", - " base = subprocess.run(\n", - " ['python3', '-m', 'examples.generate',\n", - " '--prompt', prompt, '--max-tokens', '50', '--precision', 'f16'],\n", - " capture_output=True, text=True,\n", - " cwd=f'{WORK_DIR}/adapter_training_toolkit_v26_0_0'\n", - " )\n", - " # Adapted model\n", - " adapted = subprocess.run(\n", - " ['python3', '-m', 'examples.generate',\n", - " '--prompt', prompt,\n", - " '--checkpoint', '../checkpoints/adapter-final.pt',\n", - " '--max-tokens', '50', '--precision', 'f16'],\n", - " capture_output=True, text=True,\n", - " cwd=f'{WORK_DIR}/adapter_training_toolkit_v26_0_0'\n", - " )\n", - " b = base.stdout.strip().split('\\n')[-1] if base.stdout else '?'\n", - " a = adapted.stdout.strip().split('\\n')[-1] if adapted.stdout else '?'\n", - " print(f'Q: {prompt}')\n", - " print(f' base: {b}')\n", - " print(f' adapted: {a}')\n", - " print()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 6. 
Export .fmadapter" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "!cd {WORK_DIR}/adapter_training_toolkit_v26_0_0 && python3 -m export.export_fmadapter \\\n", - " --adapter-name hunch \\\n", - " --checkpoint ../checkpoints/adapter-final.pt \\\n", - " --output-dir ../exports/\n", - "\n", - "!ls -lh {WORK_DIR}/exports/" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Copy results back to Google Drive\n", - "!cp -r {WORK_DIR}/exports {DRIVE_DIR}/\n", - "!cp -r {WORK_DIR}/checkpoints {DRIVE_DIR}/\n", - "print('Saved to Google Drive. Download hunch.fmadapter and test on macOS 26.')" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "name": "python", - "version": "3.11.0" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} diff --git a/training/train_cloud.sh b/training/train_cloud.sh deleted file mode 100755 index 3dd820f..0000000 --- a/training/train_cloud.sh +++ /dev/null @@ -1,96 +0,0 @@ -#!/bin/bash -# Train a LoRA adapter on a cloud GPU. -# -# Prerequisites: -# - GPU machine with CUDA (A10G, A100, etc.) -# - Python 3.11+ -# - Upload this entire training/ directory to the machine -# -# Usage: -# # On the cloud machine: -# cd training -# bash train_cloud.sh -# -# The script will: -# 1. Install dependencies -# 2. Generate training data from the bank -# 3. Train the adapter (~30-60 min on A10G) -# 4. Export .fmadapter -# 5. Evaluate on sample prompts - -set -euo pipefail - -TOOLKIT_DIR="adapter_training_toolkit_v26_0_0" -CHECKPOINT_DIR="checkpoints" -EXPORT_DIR="exports" - -echo "=== Setting up environment ===" -cd "$TOOLKIT_DIR" -pip install -r requirements.txt -cd .. - -echo "=== Generating training data ===" -python3 prepare_data.py - -echo "=== Training adapter ===" -echo "This may take 30-60 minutes on an A10G/A100..." -cd "$TOOLKIT_DIR" -python3 -m examples.train_adapter \ - --train-data ../train.jsonl \ - --eval-data ../eval.jsonl \ - --epochs 5 \ - --learning-rate 1e-3 \ - --batch-size 4 \ - --precision f16 \ - --activation-checkpointing \ - --checkpoint-dir "../$CHECKPOINT_DIR/" - -echo "=== Training draft model (for faster inference) ===" -python3 -m examples.train_draft_model \ - --checkpoint "../$CHECKPOINT_DIR/adapter-final.pt" \ - --train-data ../train.jsonl \ - --eval-data ../eval.jsonl \ - --epochs 5 \ - --learning-rate 1e-3 \ - --batch-size 4 \ - --precision f16 \ - --checkpoint-dir "../$CHECKPOINT_DIR/" - -echo "=== Evaluating ===" -PROMPTS=( - "find files changed in the last hour" - "show disk usage" - "generate a random password" - "kill a process by name" - "show http headers of a url" - "record terminal session" - "find files larger than 100mb" - "convert image to different format" - "show all listening ports" - "find files modified in the last 7 days" -) - -for prompt in "${PROMPTS[@]}"; do - echo -n "Q: $prompt → " - python3 -m examples.generate \ - --prompt "$prompt" \ - --checkpoint "../$CHECKPOINT_DIR/adapter-final.pt" \ - --max-tokens 50 \ - --precision f16 2>/dev/null | tail -1 -done - -echo "" -echo "=== Exporting .fmadapter ===" -python3 -m export.export_fmadapter \ - --adapter-name hunch \ - --checkpoint "../$CHECKPOINT_DIR/adapter-final.pt" \ - --draft-checkpoint "../$CHECKPOINT_DIR/draft-model-final.pt" \ - --output-dir "../$EXPORT_DIR/" - -cd .. 
-echo "" -echo "=== Done ===" -echo "Adapter: $EXPORT_DIR/hunch.fmadapter" -echo "Size: $(du -sh $EXPORT_DIR/hunch.fmadapter 2>/dev/null | cut -f1)" -echo "" -echo "Download hunch.fmadapter and test locally on macOS 26." diff --git a/training/train_lora.ipynb b/training/train_lora.ipynb new file mode 100644 index 0000000..94371ad --- /dev/null +++ b/training/train_lora.ipynb @@ -0,0 +1,408 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# LoRA: Training Apple's 3B Model on A100\n", + "\n", + "Standard LoRA training using Apple's adapter toolkit. Requires A100 (40GB GPU).\n", + "For free T4 training, see `train_qlora.ipynb`.\n", + "\n", + "**Results:** 3 epochs, ~2.5 hours. Loss: 1.94 → 1.17 → 0.61 (train), eval: 1.50 → 1.12 → 1.10" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1. Setup\n", + "\n", + "Upload to `My Drive/hunch-training/`:\n", + "- `adapter_training_toolkit_v26_0_0/` (from developer.apple.com)\n", + "- `prepare_data.py`, `tldr_bank.db`, `prompts.jsonl`" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Mounted at /content/drive\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m2.3/2.3 MB\u001b[0m \u001b[31m117.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m362.6/362.6 kB\u001b[0m \u001b[31m40.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m73.1/73.1 kB\u001b[0m \u001b[31m10.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m46.0/46.0 kB\u001b[0m \u001b[31m5.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m86.8/86.8 kB\u001b[0m \u001b[31m11.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.6/1.6 MB\u001b[0m \u001b[31m93.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hCUDA: True\n", + "GPU: NVIDIA A100-SXM4-40GB\n" + ] + } + ], + "source": [ + "from google.colab import drive\n", + "drive.mount('/content/drive')\n", + "\n", + "DRIVE_DIR = '/content/drive/MyDrive/hunch-training'\n", + "WORK_DIR = '/content/hunch-training'\n", + "\n", + "!mkdir -p {WORK_DIR}\n", + "!cp -r {DRIVE_DIR}/adapter_training_toolkit_v26_0_0 {WORK_DIR}/\n", + "!cp {DRIVE_DIR}/prepare_data.py {WORK_DIR}/\n", + "!mkdir -p {WORK_DIR}/../bank {WORK_DIR}/../benchmark\n", + "!cp {DRIVE_DIR}/tldr_bank.db {WORK_DIR}/../bank/\n", + "!cp {DRIVE_DIR}/prompts.jsonl {WORK_DIR}/../benchmark/\n", + "\n", + "!cd {WORK_DIR}/adapter_training_toolkit_v26_0_0 && pip install -r requirements.txt -q\n", + "\n", + "import torch\n", + "print(f'CUDA: {torch.cuda.is_available()}')\n", + "if torch.cuda.is_available():\n", + " print(f'GPU: {torch.cuda.get_device_name(0)}')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2. 
Prepare training data" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Loaded 21478 entries from bank\n", + "Filtered to sources {'override'}: 134 entries (from 21478)\n", + " override: 134\n", + "Excluded 38 entries matching benchmark prompts\n", + "After dedup: 96 unique entries (removed 0)\n", + "Small dataset — using all 96 examples for both train and eval\n", + "Wrote 96 examples to /content/hunch-training/train.jsonl\n", + "Wrote 96 examples to /content/hunch-training/eval.jsonl\n", + "\n", + "Sample training examples:\n", + " user: show response headers\n", + " asst: curl -I https://example.com\n", + "\n", + " user: dns lookup for a domain\n", + " asst: dig example.com\n", + "\n", + " user: record shell session to file\n", + " asst: script session.log\n", + "\n" + ] + } + ], + "source": [ + "!cd {WORK_DIR} && python3 prepare_data.py --sources override" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3. Train\n", + "\n", + "No patches needed for A100. ~25 min/epoch, ~1.5 hours total.\n", + "\n", + "**Note:** lr=1e-3 diverged in testing. Use 1e-4." + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Fine-tuning adapters with configuration: \n", + "AdapterTrainingConfiguration(epochs=20, learning_rate=0.0001, batch_size=8, linear_warmup_epochs=1, gradient_accumulation_steps=1, enable_activation_checkpointing=True, precision='bf16-mixed', compile_model=False, weight_decay=0.01, clip_grad_norm=1.0, max_sequence_length=None, fixed_sized_sequences=False, pack_sequences=False, loss_update_frequency=3)\n", + "Loading base model on cuda with precision torch.float32\n", + "/usr/local/lib/python3.12/dist-packages/tamm/layers/flash_attention.py:78: UserWarning: Failed to import flash-attn for Flash attention. Using flash attention may lead to significantly faster training. 
Please refer to tamm-scripts/install_flash_attn.sh for instructions.\n", + " _warnings.warn(\n", + "Total parameters 3178001792\n", + "Total trainable parameters 66633728\n", + "Gradient scaling is enabled: False\n", + "Epoch 1/20\n", + "Training: 100% 12/12 [00:08<00:00, 1.42it/s, loss=1.64]\n", + "Evaluation: 100% 12/12 [00:02<00:00, 5.12it/s, loss=1.05]\n", + "Epoch 2/20\n", + "INFO:examples.utils:Epoch 2/20\n", + "Training: 100% 12/12 [00:07<00:00, 1.69it/s, loss=0.795]\n", + "Evaluation: 100% 12/12 [00:02<00:00, 5.26it/s, loss=0.32] \n", + "Epoch 3/20\n", + "INFO:examples.utils:Epoch 3/20\n", + "Training: 100% 12/12 [00:07<00:00, 1.68it/s, loss=0.283]\n", + "Evaluation: 100% 12/12 [00:02<00:00, 5.27it/s, loss=0.116]\n", + "Epoch 4/20\n", + "INFO:examples.utils:Epoch 4/20\n", + "Training: 100% 12/12 [00:07<00:00, 1.69it/s, loss=0.0817]\n", + "Evaluation: 100% 12/12 [00:02<00:00, 5.25it/s, loss=0.0388] \n", + "Epoch 5/20\n", + "INFO:examples.utils:Epoch 5/20\n", + "Training: 100% 12/12 [00:06<00:00, 1.72it/s, loss=0.0895]\n", + "Evaluation: 100% 12/12 [00:02<00:00, 5.27it/s, loss=0.0261] \n", + "Epoch 6/20\n", + "INFO:examples.utils:Epoch 6/20\n", + "Training: 100% 12/12 [00:07<00:00, 1.69it/s, loss=0.0223]\n", + "Evaluation: 100% 12/12 [00:02<00:00, 5.28it/s, loss=0.0127]\n", + "Epoch 7/20\n", + "INFO:examples.utils:Epoch 7/20\n", + "Training: 100% 12/12 [00:07<00:00, 1.68it/s, loss=0.0104]\n", + "Evaluation: 100% 12/12 [00:02<00:00, 5.23it/s, loss=0.0147]\n", + "Epoch 8/20\n", + "INFO:examples.utils:Epoch 8/20\n", + "Training: 100% 12/12 [00:07<00:00, 1.70it/s, loss=0.0163] \n", + "Evaluation: 100% 12/12 [00:02<00:00, 5.37it/s, loss=0.00656]\n", + "Epoch 9/20\n", + "INFO:examples.utils:Epoch 9/20\n", + "Training: 100% 12/12 [00:07<00:00, 1.70it/s, loss=0.0194] \n", + "Evaluation: 100% 12/12 [00:02<00:00, 5.26it/s, loss=0.000864]\n", + "Epoch 10/20\n", + "INFO:examples.utils:Epoch 10/20\n", + "Training: 100% 12/12 [00:07<00:00, 1.68it/s, loss=0.000877]\n", + "Evaluation: 100% 12/12 [00:02<00:00, 5.29it/s, loss=0.000607]\n", + "Epoch 11/20\n", + "INFO:examples.utils:Epoch 11/20\n", + "Training: 100% 12/12 [00:07<00:00, 1.69it/s, loss=0.000526]\n", + "Evaluation: 100% 12/12 [00:02<00:00, 5.22it/s, loss=0.000396]\n", + "Epoch 12/20\n", + "INFO:examples.utils:Epoch 12/20\n", + "Training: 100% 12/12 [00:07<00:00, 1.70it/s, loss=0.000395]\n", + "Evaluation: 100% 12/12 [00:02<00:00, 5.32it/s, loss=0.000287]\n", + "Epoch 13/20\n", + "INFO:examples.utils:Epoch 13/20\n", + "Training: 100% 12/12 [00:07<00:00, 1.71it/s, loss=0.00031]\n", + "Evaluation: 100% 12/12 [00:02<00:00, 5.26it/s, loss=0.000221]\n", + "Epoch 14/20\n", + "INFO:examples.utils:Epoch 14/20\n", + "Training: 100% 12/12 [00:07<00:00, 1.70it/s, loss=0.000229]\n", + "Evaluation: 100% 12/12 [00:02<00:00, 5.34it/s, loss=0.000198]\n", + "Epoch 15/20\n", + "INFO:examples.utils:Epoch 15/20\n", + "Training: 100% 12/12 [00:06<00:00, 1.72it/s, loss=0.000201]\n", + "Evaluation: 100% 12/12 [00:02<00:00, 5.30it/s, loss=0.000169]\n", + "Epoch 16/20\n", + "INFO:examples.utils:Epoch 16/20\n", + "Training: 100% 12/12 [00:07<00:00, 1.70it/s, loss=0.000196]\n", + "Evaluation: 100% 12/12 [00:02<00:00, 5.19it/s, loss=0.000161]\n", + "Epoch 17/20\n", + "INFO:examples.utils:Epoch 17/20\n", + "Training: 100% 12/12 [00:07<00:00, 1.69it/s, loss=0.000155]\n", + "Evaluation: 100% 12/12 [00:02<00:00, 5.25it/s, loss=0.000165]\n", + "Epoch 18/20\n", + "INFO:examples.utils:Epoch 18/20\n", + "Training: 100% 12/12 [00:07<00:00, 1.68it/s, loss=0.000159]\n", + 
"Evaluation: 100% 12/12 [00:02<00:00, 5.22it/s, loss=0.000159]\n", + "Epoch 19/20\n", + "INFO:examples.utils:Epoch 19/20\n", + "Training: 100% 12/12 [00:07<00:00, 1.70it/s, loss=0.00016] \n", + "Evaluation: 100% 12/12 [00:02<00:00, 5.30it/s, loss=0.000156]\n", + "Epoch 20/20\n", + "INFO:examples.utils:Epoch 20/20\n", + "Training: 100% 12/12 [00:07<00:00, 1.69it/s, loss=0.000163]\n", + "Evaluation: 100% 12/12 [00:02<00:00, 5.28it/s, loss=0.000156]\n" + ] + } + ], + "source": [ + "!cd {WORK_DIR}/adapter_training_toolkit_v26_0_0 && python3 -m examples.train_adapter \\\n", + " --train-data ../train.jsonl \\\n", + " --eval-data ../eval.jsonl \\\n", + " --epochs 20 \\\n", + " --learning-rate 1e-4 \\\n", + " --batch-size 8 \\\n", + " --precision bf16-mixed \\\n", + " --activation-checkpointing \\\n", + " --checkpoint-dir ../lora-override-checkpoints/" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 4. Save checkpoints" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Checkpoints saved to Drive\n" + ] + } + ], + "source": [ + "!cp -r {WORK_DIR}/lora-override-checkpoints {DRIVE_DIR}/lora-override-checkpoints\n", + "!echo 'Checkpoints saved to Drive'" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 5. Export .fmadapter" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "scikit-learn version 1.6.1 is not supported. Minimum required version: 0.17. Maximum required version: 1.5.1. Disabling scikit-learn conversion API.\n", + "XGBoost version 3.2.0 has not been tested with coremltools. You may run into unexpected errors. XGBoost 1.4.2 is the most recent version that has been tested.\n", + "2026-04-15 17:21:10.930166: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.\n", + "2026-04-15 17:21:10.949095: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n", + "WARNING: All log messages before absl::InitializeLog() is called are written to STDERR\n", + "E0000 00:00:1776273670.972532 4269 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n", + "E0000 00:00:1776273670.980305 4269 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n", + "W0000 00:00:1776273671.000652 4269 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.\n", + "W0000 00:00:1776273671.000678 4269 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.\n", + "W0000 00:00:1776273671.000681 4269 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.\n", + "W0000 00:00:1776273671.000684 4269 computation_placer.cc:177] computation placer already registered. 
Please check linkage and avoid linking the same target more than once.\n", + "2026-04-15 17:21:11.005962: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.\n", + "To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.\n", + "TensorFlow version 2.19.0 has not been tested with coremltools. You may run into unexpected errors. TensorFlow 2.12.0 is the most recent version that has been tested.\n", + "Torch version 2.10.0+cu128 has not been tested with coremltools. You may run into unexpected errors. Torch 2.5.0 is the most recent version that has been tested.\n", + "WARNING:coremltools:Failed to load _MLModelProxy: No module named 'coremltools.libcoremlpython'\n", + "WARNING:coremltools:Failed to load _MLCPUComputeDeviceProxy: No module named 'coremltools.libcoremlpython'\n", + "WARNING:coremltools:Failed to load _MLGPUComputeDeviceProxy: No module named 'coremltools.libcoremlpython'\n", + "WARNING:coremltools:Failed to load _MLNeuralEngineComputeDeviceProxy: No module named 'coremltools.libcoremlpython'\n", + "WARNING:coremltools:Failed to load _MLModelProxy: No module named 'coremltools.libcoremlpython'\n", + "WARNING:coremltools:Failed to load _MLComputePlanProxy: No module named 'coremltools.libcoremlpython'\n", + "WARNING:coremltools:Failed to load _MLModelProxy: No module named 'coremltools.libcoremlpython'\n", + "WARNING:coremltools:Failed to load _MLModelAssetProxy: No module named 'coremltools.libcoremlpython'\n", + "total 4.0K\n", + "drwxr-xr-x 2 root root 4.0K Apr 15 17:21 hunch.fmadapter\n", + "Adapter exported and saved to Drive\n" + ] + } + ], + "source": [ + "!cd {WORK_DIR}/adapter_training_toolkit_v26_0_0 && python3 -m export.export_fmadapter \\\n", + " --adapter-name hunch \\\n", + " --checkpoint ../lora-override-checkpoints/adapter-final.pt \\\n", + " --output-dir ../lora-override-exports/\n", + "\n", + "!ls -lh {WORK_DIR}/lora-override-exports/\n", + "!cp -r {WORK_DIR}/lora-override-exports {DRIVE_DIR}/lora-override-exports\n", + "!echo 'Adapter exported and saved to Drive'" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 6. Evaluate" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "No output. 
Check error:\n", + "STDERR: ^^^^^^^\n", + "  File \"/content/hunch-training/adapter_training_toolkit_v26_0_0/examples/utils.py\", line 167, in load_base_model\n", + "    with Path(checkpoint_path).open(\"rb\") as f:\n", + "         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n", + "  File \"/usr/lib/python3.12/pathlib.py\", line 1013, in open\n", + "    return io.open(self, mode, buffering, encoding, errors, newline)\n", + "           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n", + "FileNotFoundError: [Errno 2] No such file or directory: '../checkpoints/adapter-final.pt'\n", + "\n", + "Return code: 1\n" + ] + } + ], + "source": [ + "import json, subprocess\n", + "\n", + "test_prompts = [\n", + "    'find files changed in the last hour',\n", + "    'show disk usage',\n", + "    'generate a random password',\n", + "    'kill a process by name',\n", + "    'show http headers of a url',\n", + "    'record terminal session',\n", + "    'find files larger than 100mb',\n", + "    'convert image to different format',\n", + "    'show all listening ports',\n", + "    'find files modified in the last 7 days',\n", + "    'find files owned by root',\n", + "    'count lines in all python files',\n", + "    'show all environment variables',\n", + "    'clear the terminal',\n", + "    'compare two files',\n", + "]\n", + "\n", + "system = 'Output a single shell command for zsh on macOS. No explanation, no markdown, no backticks. Just the command.'\n", + "\n", + "with open(f'{WORK_DIR}/test_prompts.jsonl', 'w') as f:\n", + "    for p in test_prompts:\n", + "        f.write(json.dumps([\n", + "            {'role': 'system', 'content': system},\n", + "            {'role': 'user', 'content': p}\n", + "        ]) + '\\n')\n", + "\n", + "result = subprocess.run(\n", + "    ['python3', '-m', 'examples.generate',\n", + "     '--prompt', '../test_prompts.jsonl',\n", + "     '--checkpoint', '../lora-override-checkpoints/adapter-final.pt',\n", + "     '--precision', 'bf16-mixed'],\n", + "    capture_output=True, text=True,\n", + "    cwd=f'{WORK_DIR}/adapter_training_toolkit_v26_0_0'\n", + ")\n", + "\n", + "lines = (result.stdout + result.stderr).strip().split('\\n')\n", + "idx = 0\n", + "for line in lines:\n", + "    if 'Response for prompt' in line:\n", + "        answer = line.split(': ', 2)[-1].replace('', '').strip()\n", + "        prompt = test_prompts[idx] if idx < len(test_prompts) else '?'\n", + "        print(f'Q: {prompt:<45} A: {answer}')\n", + "        idx += 1\n", + "\n", + "if idx == 0:\n", + "    print('No output. Check error:')\n", + "    print('STDERR:', result.stderr[-500:])\n", + "    print('Return code:', result.returncode)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/training/train_lora_fp16.ipynb b/training/train_lora_fp16.ipynb new file mode 100644 index 0000000..f8e498f --- /dev/null +++ b/training/train_lora_fp16.ipynb @@ -0,0 +1,228 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": "# fp16 LoRA: Training Apple's 3B Model on a Free T4\n\nThree patches to Apple's adapter training toolkit enable training on Colab's free T4 GPU (16GB):\n\n1. **mmap loading** — reads weights from disk on demand, avoids 12GB system RAM spike\n2. **fp16 model + fp32 adapters** — halves GPU memory from 12GB to 6GB\n3. **rms_norm fix + gradient scaling** — fixes dtype mismatches that cause NaN\n\nResult: ~2 hours training on free T4. This is half-precision LoRA (fp16 base), not true QLoRA (4-bit NF4)." }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1. 
Setup\n", + "\n", + "Upload to `My Drive/hunch-training/`:\n", + "- `adapter_training_toolkit_v26_0_0/` (from developer.apple.com)\n", + "- `prepare_data.py`, `tldr_bank.db`, `prompts.jsonl`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from google.colab import drive\n", + "drive.mount('/content/drive')\n", + "\n", + "DRIVE_DIR = '/content/drive/MyDrive/hunch-training'\n", + "WORK_DIR = '/content/hunch-training'\n", + "\n", + "!mkdir -p {WORK_DIR}\n", + "!cp -r {DRIVE_DIR}/adapter_training_toolkit_v26_0_0 {WORK_DIR}/\n", + "!cp {DRIVE_DIR}/prepare_data.py {WORK_DIR}/\n", + "!mkdir -p {WORK_DIR}/../bank {WORK_DIR}/../benchmark\n", + "!cp {DRIVE_DIR}/tldr_bank.db {WORK_DIR}/../bank/\n", + "!cp {DRIVE_DIR}/prompts.jsonl {WORK_DIR}/../benchmark/\n", + "\n", + "!cd {WORK_DIR}/adapter_training_toolkit_v26_0_0 && pip install -r requirements.txt -q\n", + "\n", + "import torch\n", + "print(f'CUDA: {torch.cuda.is_available()}')\n", + "if torch.cuda.is_available():\n", + " print(f'GPU: {torch.cuda.get_device_name(0)}')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2. Prepare training data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!cd {WORK_DIR} && python3 prepare_data.py" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3. Apply patches\n", + "\n", + "Three patches make training fit on T4 (16GB GPU, 12GB RAM):\n", + "\n", + "**Patch 1 — `utils.py`:** mmap loading (0 RAM), fp16 model (6GB GPU), fp32 adapters (stable gradients)\n", + "\n", + "**Patch 2 — `train_adapter.py`:** enable gradient scaling for f16-mixed (prevents NaN overflow)\n", + "\n", + "**Patch 3 — `tamm/layers/functional.py`:** cast rms_norm weight to match input dtype (prevents NaN from dtype mismatch)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import glob, shutil\n", + "\n", + "# --- Patch 1: utils.py ---\n", + "# Restore clean copy first\n", + "!cp {DRIVE_DIR}/adapter_training_toolkit_v26_0_0/examples/utils.py \\\n", + " {WORK_DIR}/adapter_training_toolkit_v26_0_0/examples/utils.py\n", + "\n", + "utils_path = f'{WORK_DIR}/adapter_training_toolkit_v26_0_0/examples/utils.py'\n", + "code = open(utils_path).read()\n", + "\n", + "# 1a: Force fp16 model creation (6GB instead of 12GB on GPU)\n", + "code = code.replace(\n", + " 'model_config.dtype = dtype or model_config.dtype',\n", + " 'model_config.dtype = torch.float16'\n", + ")\n", + "\n", + "# 1b: mmap loading (weights stay on disk, ~0 system RAM)\n", + "code = code.replace(\n", + " ''' with Path(base_model_checkpoint_path).open(\"rb\") as f:\\n sd = torch.load(f, map_location=device, weights_only=False)\\n _ = model.load_state_dict(sd, strict=True)''',\n", + " ''' sd = torch.load(str(base_model_checkpoint_path), map_location=device, mmap=True, weights_only=False)\\n _ = model.load_state_dict(sd, strict=True)\\n del sd; import gc; gc.collect()'''\n", + ")\n", + "\n", + "# 1c: Keep adapter weights in fp32 (GradScaler needs fp32 gradients)\n", + "code = code.replace(\n", + " ' return model.to(device=device, dtype=model_config.dtype)',\n", + " ''' model = model.to(device=device, dtype=model_config.dtype)\n", + "\n", + " # Keep adapter weights in fp32 for stable training\n", + " for name, parameter in model.named_parameters():\n", + " if \"adapter\" in name:\n", + " parameter.data = parameter.data.float()\n", + "\n", 
+ " return model'''\n", + ")\n", + "\n", + "open(utils_path, 'w').write(code)\n", + "print('Patch 1 applied: utils.py (mmap + fp16 + fp32 adapters)')\n", + "\n", + "# --- Patch 2: train_adapter.py ---\n", + "!cp {DRIVE_DIR}/adapter_training_toolkit_v26_0_0/examples/train_adapter.py \\\n", + " {WORK_DIR}/adapter_training_toolkit_v26_0_0/examples/train_adapter.py\n", + "\n", + "ta_path = f'{WORK_DIR}/adapter_training_toolkit_v26_0_0/examples/train_adapter.py'\n", + "code = open(ta_path).read()\n", + "code = code.replace(\n", + " 'return self.precision == \"f16\"',\n", + " 'return self.precision in (\"f16\", \"f16-mixed\")'\n", + ")\n", + "open(ta_path, 'w').write(code)\n", + "print('Patch 2 applied: train_adapter.py (gradient scaling for f16-mixed)')\n", + "\n", + "# --- Patch 3: tamm rms_norm ---\n", + "norm_files = glob.glob(f'{WORK_DIR}/**/tamm/layers/functional.py', recursive=True)\n", + "norm_files += glob.glob('/usr/local/lib/**/tamm/layers/functional.py', recursive=True)\n", + "for nf in norm_files:\n", + " code = open(nf).read()\n", + " if 'weight.to(tensor.dtype)' not in code:\n", + " old = ' tensor = _torch_compatibility.rms_norm(\\n tensor, normalized_shape=normalized_shape, weight=weight, eps=eps\\n )'\n", + " new = ' if weight is not None and weight.dtype != tensor.dtype:\\n weight = weight.to(tensor.dtype)\\n tensor = _torch_compatibility.rms_norm(\\n tensor, normalized_shape=normalized_shape, weight=weight, eps=eps\\n )'\n", + " code = code.replace(old, new)\n", + " open(nf, 'w').write(code)\n", + " print(f'Patch 3 applied: {nf} (rms_norm dtype fix)')\n", + " else:\n", + " print(f'Patch 3 already applied: {nf}')\n", + "\n", + "# Clear pycache\n", + "for d in glob.glob(f'{WORK_DIR}/**/tamm/**/__pycache__', recursive=True):\n", + " shutil.rmtree(d, ignore_errors=True)\n", + "for d in glob.glob('/usr/local/lib/**/tamm/**/__pycache__', recursive=True):\n", + " shutil.rmtree(d, ignore_errors=True)\n", + "print('\\nAll patches applied. Ready to train.')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 4. Train\n", + "\n", + "~40 min/epoch on T4, ~2 hours total for 3 epochs." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": "!cd {WORK_DIR}/adapter_training_toolkit_v26_0_0 && python3 -m examples.train_adapter \\\n --train-data ../train.jsonl \\\n --eval-data ../eval.jsonl \\\n --epochs 3 \\\n --learning-rate 1e-4 \\\n --batch-size 8 \\\n --precision f16-mixed \\\n --activation-checkpointing \\\n --checkpoint-dir ../fp16-lora-checkpoints/" + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 5. Save checkpoints" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": "!cp -r {WORK_DIR}/fp16-lora-checkpoints {DRIVE_DIR}/\n!echo 'Checkpoints saved to Drive'" + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 6. 
Evaluate" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": "import json, subprocess\n\ntest_prompts = [\n 'find files changed in the last hour',\n 'show disk usage',\n 'generate a random password',\n 'kill a process by name',\n 'show http headers of a url',\n 'record terminal session',\n 'find files larger than 100mb',\n 'convert image to different format',\n 'show all listening ports',\n 'find files modified in the last 7 days',\n 'find files owned by root',\n 'count lines in all python files',\n 'show all environment variables',\n 'clear the terminal',\n 'compare two files',\n]\n\nsystem = 'Output a single shell command for zsh on macOS. No explanation, no markdown, no backticks. Just the command.'\n\nwith open(f'{WORK_DIR}/test_prompts.jsonl', 'w') as f:\n for p in test_prompts:\n f.write(json.dumps([\n {'role': 'system', 'content': system},\n {'role': 'user', 'content': p}\n ]) + '\\n')\n\nresult = subprocess.run(\n ['python3', '-m', 'examples.generate',\n '--prompt', '../test_prompts.jsonl',\n '--checkpoint', '../fp16-lora-checkpoints/adapter-final.pt',\n '--precision', 'f16-mixed'],\n capture_output=True, text=True,\n cwd=f'{WORK_DIR}/adapter_training_toolkit_v26_0_0'\n)\n\nlines = (result.stdout + result.stderr).strip().split('\\n')\nidx = 0\nfor line in lines:\n if 'Response for prompt' in line:\n answer = line.split(': ', 2)[-1].replace('', '').strip()\n prompt = test_prompts[idx] if idx < len(test_prompts) else '?'\n print(f'Q: {prompt:<45} A: {answer}')\n idx += 1\n\nif idx == 0:\n print('No output. Check error:')\n print('STDERR:', result.stderr[-500:])\n print('Return code:', result.returncode)" + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 7. Export .fmadapter" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": "!cd {WORK_DIR}/adapter_training_toolkit_v26_0_0 && python3 -m export.export_fmadapter \\\n --adapter-name hunch_fp16 \\\n --checkpoint ../fp16-lora-checkpoints/adapter-final.pt \\\n --output-dir ../fp16-lora-exports/\n\n!ls -lh {WORK_DIR}/fp16-lora-exports/\n!cp -r {WORK_DIR}/fp16-lora-exports {DRIVE_DIR}/\n!echo 'Adapter exported and saved to Drive'" + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python", + "version": "3.11.0" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} \ No newline at end of file diff --git a/training/train_qlora.ipynb b/training/train_qlora.ipynb new file mode 100644 index 0000000..ee9d0e5 --- /dev/null +++ b/training/train_qlora.ipynb @@ -0,0 +1,450 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# True QLoRA: Training Apple's 3B Model with 4-bit NF4\n", + "\n", + "Uses bitsandbytes NF4 quantization on the frozen base model.\n", + "Only ~5GB GPU memory — fits on free T4 with massive headroom.\n", + "\n", + "This is proper QLoRA as defined by [Dettmers et al. 2023](https://arxiv.org/abs/2305.14314):\n", + "4-bit quantized base + fp32 LoRA adapters." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1. 
Setup\n", + "\n", + "Upload to `My Drive/hunch-training/`:\n", + "- `adapter_training_toolkit_v26_0_0/` (from developer.apple.com)\n", + "- `prepare_data.py`, `train_qlora_full.py`, `tldr_bank.db`, `prompts.jsonl`" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Mounted at /content/drive\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m2.3/2.3 MB\u001b[0m \u001b[31m62.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m:00:01\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m362.6/362.6 kB\u001b[0m \u001b[31m33.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m73.1/73.1 kB\u001b[0m \u001b[31m8.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m46.0/46.0 kB\u001b[0m \u001b[31m4.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m86.8/86.8 kB\u001b[0m \u001b[31m9.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.6/1.6 MB\u001b[0m \u001b[31m84.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m60.7/60.7 MB\u001b[0m \u001b[31m12.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m:00:01\u001b[0m00:01\u001b[0m\n", + "\u001b[?25hCUDA: True\n", + "GPU: Tesla T4\n" + ] + } + ], + "source": [ + "from google.colab import drive\n", + "drive.mount('/content/drive')\n", + "\n", + "DRIVE_DIR = '/content/drive/MyDrive/hunch-training'\n", + "WORK_DIR = '/content/hunch-training'\n", + "\n", + "!mkdir -p {WORK_DIR}\n", + "!cp -r {DRIVE_DIR}/adapter_training_toolkit_v26_0_0 {WORK_DIR}/\n", + "!cp {DRIVE_DIR}/prepare_data.py {WORK_DIR}/\n", + "!cp {DRIVE_DIR}/train_qlora_full.py {WORK_DIR}/\n", + "!mkdir -p {WORK_DIR}/../bank {WORK_DIR}/../benchmark\n", + "!cp {DRIVE_DIR}/tldr_bank.db {WORK_DIR}/../bank/\n", + "!cp {DRIVE_DIR}/prompts.jsonl {WORK_DIR}/../benchmark/\n", + "\n", + "!cd {WORK_DIR}/adapter_training_toolkit_v26_0_0 && pip install -r requirements.txt -q\n", + "!pip install bitsandbytes -q\n", + "\n", + "import torch\n", + "print(f'CUDA: {torch.cuda.is_available()}')\n", + "if torch.cuda.is_available():\n", + " print(f'GPU: {torch.cuda.get_device_name(0)}')" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [], + "source": [ + "!cp {DRIVE_DIR}/prepare_data.py {WORK_DIR}/" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2. 
Prepare training data" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Loaded 21478 entries from bank\n", + "Filtered to sources {'override'}: 134 entries (from 21478)\n", + " override: 134\n", + "Excluded 38 entries matching benchmark prompts\n", + "After dedup: 96 unique entries (removed 0)\n", + "Small dataset — using all 96 examples for both train and eval\n", + "Wrote 96 examples to /content/hunch-training/train.jsonl\n", + "Wrote 96 examples to /content/hunch-training/eval.jsonl\n", + "\n", + "Sample training examples:\n", + " user: show response headers\n", + " asst: curl -I https://example.com\n", + "\n", + " user: dns lookup for a domain\n", + " asst: dig example.com\n", + "\n", + " user: record shell session to file\n", + " asst: script session.log\n", + "\n" + ] + } + ], + "source": [ + "!cd {WORK_DIR} && python3 prepare_data.py --sources override" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3. Train\n", + "\n", + "No patches needed — `train_qlora_full.py` handles everything:\n", + "mmap loading, NF4 quantization, training loop.\n", + "\n", + "~5GB GPU memory. Can use large batch sizes on T4." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Device: cuda | RAM=0.6GB GPU=0.0GB\n", + "Quantized 280 layers to NF4\n", + "Trainable: 67M params | RAM=6.3GB GPU=2.2GB\n", + "Train: 96 examples, 12 batches\n", + "Eval: 96 examples\n", + "\n", + "============================================================\n", + "Training: 20 epochs, batch 8, lr 0.0001\n", + "============================================================\n", + "\n", + "Epoch 1/20\n", + "/usr/local/lib/python3.12/dist-packages/torch/nn/functional.py:2954: UserWarning: Mismatch dtype between input and weight: input dtype = float, weight dtype = c10::Half, Cannot dispatch to fused implementation. 
(Triggered internally at /pytorch/aten/src/ATen/native/layer_norm.cpp:344.)\n", + " return torch.rms_norm(input, normalized_shape, weight, eps)\n", + "Saved 700 adapter weights (254MB) to qlora-override-checkpoints//adapter-epoch1.pt\n", + " Train loss: 3.4408 | Eval loss: 1.6959 | RAM=1.5GB GPU=3.0GB\n", + "\n", + "Epoch 2/20\n", + "Saved 700 adapter weights (254MB) to qlora-override-checkpoints//adapter-epoch2.pt\n", + " Train loss: 0.8634 | Eval loss: 0.4109 | RAM=1.5GB GPU=3.0GB\n", + "\n", + "Epoch 3/20\n", + "Saved 700 adapter weights (254MB) to qlora-override-checkpoints//adapter-epoch3.pt\n", + " Train loss: 0.3740 | Eval loss: 0.2714 | RAM=1.5GB GPU=3.0GB\n", + "\n", + "Epoch 4/20\n", + "Saved 700 adapter weights (254MB) to qlora-override-checkpoints//adapter-epoch4.pt\n", + " Train loss: 0.2466 | Eval loss: 0.1851 | RAM=1.7GB GPU=3.0GB\n", + "\n", + "Epoch 5/20\n", + "Saved 700 adapter weights (254MB) to qlora-override-checkpoints//adapter-epoch5.pt\n", + " Train loss: 0.1741 | Eval loss: 0.1375 | RAM=1.5GB GPU=3.0GB\n", + "\n", + "Epoch 6/20\n", + "Saved 700 adapter weights (254MB) to qlora-override-checkpoints//adapter-epoch6.pt\n", + " Train loss: 0.1382 | Eval loss: 0.1082 | RAM=1.5GB GPU=3.0GB\n", + "\n", + "Epoch 7/20\n", + "Saved 700 adapter weights (254MB) to qlora-override-checkpoints//adapter-epoch7.pt\n", + " Train loss: 0.1167 | Eval loss: 0.1073 | RAM=1.5GB GPU=3.0GB\n", + "\n", + "Epoch 8/20\n", + "Saved 700 adapter weights (254MB) to qlora-override-checkpoints//adapter-epoch8.pt\n", + " Train loss: 0.1093 | Eval loss: 0.0957 | RAM=1.7GB GPU=3.0GB\n", + "\n", + "Epoch 9/20\n", + "Saved 700 adapter weights (254MB) to qlora-override-checkpoints//adapter-epoch9.pt\n", + " Train loss: 0.1067 | Eval loss: 0.0935 | RAM=1.5GB GPU=3.0GB\n", + "\n", + "Epoch 10/20\n", + "Saved 700 adapter weights (254MB) to qlora-override-checkpoints//adapter-epoch10.pt\n", + " Train loss: 0.1007 | Eval loss: 0.0905 | RAM=1.5GB GPU=3.0GB\n", + "\n", + "Epoch 11/20\n", + "Saved 700 adapter weights (254MB) to qlora-override-checkpoints//adapter-epoch11.pt\n", + " Train loss: 0.0995 | Eval loss: 0.0953 | RAM=1.5GB GPU=3.0GB\n", + "\n", + "Epoch 12/20\n", + "Saved 700 adapter weights (254MB) to qlora-override-checkpoints//adapter-epoch12.pt\n", + " Train loss: 0.1022 | Eval loss: 0.0891 | RAM=1.5GB GPU=3.0GB\n", + "\n", + "Epoch 13/20\n", + "Saved 700 adapter weights (254MB) to qlora-override-checkpoints//adapter-epoch13.pt\n", + " Train loss: 0.0969 | Eval loss: 0.0886 | RAM=1.5GB GPU=3.0GB\n", + "\n", + "Epoch 14/20\n", + "Saved 700 adapter weights (254MB) to qlora-override-checkpoints//adapter-epoch14.pt\n", + " Train loss: 0.0951 | Eval loss: 0.0883 | RAM=1.5GB GPU=3.0GB\n", + "\n", + "Epoch 15/20\n", + "Saved 700 adapter weights (254MB) to qlora-override-checkpoints//adapter-epoch15.pt\n", + " Train loss: 0.0939 | Eval loss: 0.0867 | RAM=1.7GB GPU=3.0GB\n", + "\n", + "Epoch 16/20\n", + "Saved 700 adapter weights (254MB) to qlora-override-checkpoints//adapter-epoch16.pt\n", + " Train loss: 0.0936 | Eval loss: 0.0867 | RAM=1.5GB GPU=3.0GB\n", + "\n", + "Epoch 17/20\n", + "Saved 700 adapter weights (254MB) to qlora-override-checkpoints//adapter-epoch17.pt\n", + " Train loss: 0.0933 | Eval loss: 0.0861 | RAM=1.5GB GPU=3.0GB\n", + "\n", + "Epoch 18/20\n", + "Saved 700 adapter weights (254MB) to qlora-override-checkpoints//adapter-epoch18.pt\n", + " Train loss: 0.0946 | Eval loss: 0.0865 | RAM=1.5GB GPU=3.0GB\n", + "\n", + "Epoch 19/20\n", + "Saved 700 adapter weights (254MB) to 
qlora-override-checkpoints//adapter-epoch19.pt\n", + " Train loss: 0.0925 | Eval loss: 0.0866 | RAM=1.7GB GPU=3.0GB\n", + "\n", + "Epoch 20/20\n", + "Saved 700 adapter weights (254MB) to qlora-override-checkpoints//adapter-epoch20.pt\n", + " Train loss: 0.0928 | Eval loss: 0.0857 | RAM=1.5GB GPU=3.0GB\n", + "Saved 700 adapter weights (254MB) to qlora-override-checkpoints//adapter-final.pt\n", + "\n", + "Done! Export with:\n", + " python3 -m export.export_fmadapter --adapter-name hunch_qlora --checkpoint qlora-override-checkpoints//adapter-final.pt --output-dir qlora-override-checkpoints///\n" + ] + } + ], + "source": [ + "!cd {WORK_DIR} && python3 train_qlora_full.py \\\n", + " --epochs 20 \\\n", + " --batch-size 8 \\\n", + " --learning-rate 1e-4 \\\n", + " --checkpoint-dir qlora-override-checkpoints/" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 4. Save checkpoints" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Checkpoints saved to Drive\n" + ] + } + ], + "source": [ + "!cp -r {WORK_DIR}/qlora-override-checkpoints/ {DRIVE_DIR}/qlora-override-checkpoints\n", + "!echo 'Checkpoints saved to Drive'" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 5. Evaluate" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "No output. Check error:\n", + "STDERR: \n", + "Return code: -9\n" + ] + } + ], + "source": [ + "import json, subprocess\n", + "\n", + "test_prompts = [\n", + " 'find files changed in the last hour',\n", + " 'show disk usage',\n", + " 'generate a random password',\n", + " 'kill a process by name',\n", + " 'show http headers of a url',\n", + " 'record terminal session',\n", + " 'find files larger than 100mb',\n", + " 'convert image to different format',\n", + " 'show all listening ports',\n", + " 'find files modified in the last 7 days',\n", + " 'find files owned by root',\n", + " 'count lines in all python files',\n", + " 'show all environment variables',\n", + " 'clear the terminal',\n", + " 'compare two files',\n", + "]\n", + "\n", + "system = 'Output a single shell command for zsh on macOS. No explanation, no markdown, no backticks. Just the command.'\n", + "\n", + "with open(f'{WORK_DIR}/test_prompts.jsonl', 'w') as f:\n", + " for p in test_prompts:\n", + " f.write(json.dumps([\n", + " {'role': 'system', 'content': system},\n", + " {'role': 'user', 'content': p}\n", + " ]) + '\\n')\n", + "\n", + "result = subprocess.run(\n", + " ['python3', '-m', 'examples.generate',\n", + " '--prompt', '../test_prompts.jsonl',\n", + " '--checkpoint', '../qlora-checkpoints/adapter-final.pt',\n", + " '--precision', 'f16-mixed'],\n", + " capture_output=True, text=True,\n", + " cwd=f'{WORK_DIR}/adapter_training_toolkit_v26_0_0'\n", + ")\n", + "\n", + "lines = (result.stdout + result.stderr).strip().split('\\n')\n", + "idx = 0\n", + "for line in lines:\n", + " if 'Response for prompt' in line:\n", + " answer = line.split(': ', 2)[-1].replace('', '').strip()\n", + " prompt = test_prompts[idx] if idx < len(test_prompts) else '?'\n", + " print(f'Q: {prompt:<45} A: {answer}')\n", + " idx += 1\n", + "\n", + "if idx == 0:\n", + " print('No output. 
Check error:')\n", + " print('STDERR:', result.stderr[-500:])\n", + " print('Return code:', result.returncode)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 6. Export .fmadapter" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "scikit-learn version 1.6.1 is not supported. Minimum required version: 0.17. Maximum required version: 1.5.1. Disabling scikit-learn conversion API.\n", + "XGBoost version 3.2.0 has not been tested with coremltools. You may run into unexpected errors. XGBoost 1.4.2 is the most recent version that has been tested.\n", + "2026-04-15 16:39:35.262475: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n", + "WARNING: All log messages before absl::InitializeLog() is called are written to STDERR\n", + "E0000 00:00:1776271175.518226 51164 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n", + "E0000 00:00:1776271175.586162 51164 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n", + "W0000 00:00:1776271176.102805 51164 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.\n", + "W0000 00:00:1776271176.102861 51164 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.\n", + "W0000 00:00:1776271176.102866 51164 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.\n", + "W0000 00:00:1776271176.102871 51164 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.\n", + "2026-04-15 16:39:36.148423: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.\n", + "To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.\n", + "TensorFlow version 2.19.0 has not been tested with coremltools. You may run into unexpected errors. TensorFlow 2.12.0 is the most recent version that has been tested.\n", + "Torch version 2.10.0+cu128 has not been tested with coremltools. You may run into unexpected errors. 
Torch 2.5.0 is the most recent version that has been tested.\n", + "WARNING:coremltools:Failed to load _MLModelProxy: No module named 'coremltools.libcoremlpython'\n", + "WARNING:coremltools:Failed to load _MLCPUComputeDeviceProxy: No module named 'coremltools.libcoremlpython'\n", + "WARNING:coremltools:Failed to load _MLGPUComputeDeviceProxy: No module named 'coremltools.libcoremlpython'\n", + "WARNING:coremltools:Failed to load _MLNeuralEngineComputeDeviceProxy: No module named 'coremltools.libcoremlpython'\n", + "WARNING:coremltools:Failed to load _MLModelProxy: No module named 'coremltools.libcoremlpython'\n", + "WARNING:coremltools:Failed to load _MLComputePlanProxy: No module named 'coremltools.libcoremlpython'\n", + "WARNING:coremltools:Failed to load _MLModelProxy: No module named 'coremltools.libcoremlpython'\n", + "WARNING:coremltools:Failed to load _MLModelAssetProxy: No module named 'coremltools.libcoremlpython'\n", + "total 4.0K\n", + "drwxr-xr-x 2 root root 4.0K Apr 15 16:39 hunch_qlora.fmadapter\n", + "Adapter exported and saved to Drive\n" + ] + } + ], + "source": [ + "!cd {WORK_DIR}/adapter_training_toolkit_v26_0_0 && python3 -m export.export_fmadapter \\\n", + " --adapter-name hunch_qlora \\\n", + " --checkpoint ../qlora-override-checkpoints/adapter-final.pt \\\n", + " --output-dir ../qlora-override-exports/\n", + "\n", + "!ls -lh {WORK_DIR}/qlora-override-exports/\n", + "!cp -r {WORK_DIR}/qlora-override-exports {DRIVE_DIR}/qlora-override-exports\n", + "!echo 'Adapter exported and saved to Drive'" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "total 4.0K\n", + "drwxr-xr-x 2 root root 4.0K Apr 15 15:49 hunch_qlora.fmadapter\n", + "Adapter exported and saved to Drive\n" + ] + } + ], + "source": [ + "!ls -lh {WORK_DIR}/qlora-exports/\n", + "!cp -r {WORK_DIR}/qlora-exports {DRIVE_DIR}/qlora-exports2\n", + "!echo 'Adapter exported and saved to Drive'" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.13" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/training/train_qlora_full.py b/training/train_qlora_full.py new file mode 100644 index 0000000..ddf6187 --- /dev/null +++ b/training/train_qlora_full.py @@ -0,0 +1,371 @@ +#!/usr/bin/env python3 +""" +True QLoRA training: 4-bit NF4 base model + fp32 LoRA adapters. + +Uses bitsandbytes for NF4 quantization. Trains on hunch dataset. +Works on 24GB Mac (MPS) and Colab T4 (CUDA). ~5GB GPU memory. 
+ +Usage: + python3 train_qlora_full.py # train 3 epochs + python3 train_qlora_full.py --epochs 1 --batch-size 4 # quick test + python3 train_qlora_full.py --eval-only --checkpoint checkpoints/adapter-final.pt + +Requirements: + pip install bitsandbytes psutil +""" + +import sys +import os +import gc +import json +import time +import argparse +import psutil +from pathlib import Path + +TOOLKIT_DIR = os.path.join(os.path.dirname(os.path.abspath(__file__)), "adapter_training_toolkit_v26_0_0") +sys.path.insert(0, TOOLKIT_DIR) + +import torch +import torch.nn as nn +from torch.utils.data import Dataset, DataLoader +import tamm.utils.json +from tamm.tokenizers.afm import AFMTokenizer + +ASSETS = Path(TOOLKIT_DIR) / "assets" +TRAINING_DIR = Path(__file__).parent + + +def patch_rms_norm(): + """Patch tamm's rms_norm to handle dtype mismatch (fp16 model + fp32 cast).""" + import glob + patterns = [ + os.path.join(TOOLKIT_DIR, "venv", "lib", "*", "site-packages", "tamm", "layers", "functional.py"), + os.path.join(sys.prefix, "lib", "*", "dist-packages", "tamm", "layers", "functional.py"), + ] + for pattern in patterns: + for path in glob.glob(pattern): + code = open(path).read() + if "weight.to(tensor.dtype)" not in code: + old = " tensor = _torch_compatibility.rms_norm(\n tensor, normalized_shape=normalized_shape, weight=weight, eps=eps\n )" + new = " if weight is not None and weight.dtype != tensor.dtype:\n weight = weight.to(tensor.dtype)\n tensor = _torch_compatibility.rms_norm(\n tensor, normalized_shape=normalized_shape, weight=weight, eps=eps\n )" + code = code.replace(old, new) + open(path, "w").write(code) + # Clear pycache + cache_dir = os.path.join(os.path.dirname(path), "__pycache__") + if os.path.exists(cache_dir): + import shutil; shutil.rmtree(cache_dir) + print(f"Patched rms_norm: {path}") + else: + print(f"rms_norm already patched: {path}") + + +def get_device(): + if torch.cuda.is_available(): + return torch.device("cuda") + if torch.backends.mps.is_available(): + return torch.device("mps") + return torch.device("cpu") + + +def mem_str(): + ram = psutil.Process().memory_info().rss / 1024**3 + if torch.cuda.is_available(): + gpu = torch.cuda.memory_allocated() / 1024**3 + elif torch.backends.mps.is_available(): + gpu = torch.mps.current_allocated_memory() / 1024**3 + else: + gpu = 0 + return f"RAM={ram:.1f}GB GPU={gpu:.1f}GB" + + +def load_model_qlora(device): + """Load base model with NF4 quantization.""" + import bitsandbytes as bnb + + # Load config and create model in fp16 (6GB instead of 12GB) + with open(ASSETS / "base-model-config.json") as f: + config = tamm.utils.json.load(f) + config.dtype = torch.float16 + model = config.create_model() + + # Load weights via mmap (minimal RAM) + sd = torch.load(str(ASSETS / "base-model.pt"), map_location="cpu", mmap=True, weights_only=False) + model.load_state_dict(sd, strict=True) + del sd; gc.collect() + + # Freeze non-adapter params + for name, param in model.named_parameters(): + param.requires_grad = "adapter" in name + + # Quantize frozen Linear layers to NF4 + replacements = [] + for name, module in model.named_modules(): + if not isinstance(module, nn.Linear): + continue + if "adapter" in name or any(p.requires_grad for p in module.parameters()): + continue + replacements.append((name, module)) + + for name, module in replacements: + new_module = bnb.nn.Linear4bit( + module.in_features, module.out_features, + bias=module.bias is not None, + compute_dtype=torch.float16, + quant_type="nf4", + ) + new_module.weight = 
bnb.nn.Params4bit( + module.weight.data, requires_grad=False, + quant_type="nf4", compress_statistics=True, + ) + if module.bias is not None: + new_module.bias = module.bias + + parts = name.rsplit(".", 1) + if len(parts) == 2: + parent = dict(model.named_modules())[parts[0]] + setattr(parent, parts[1], new_module) + else: + setattr(model, name, new_module) + + gc.collect() + print(f"Quantized {len(replacements)} layers to NF4") + + # Move to device + model = model.to(device) + + # Ensure adapter params are fp32 + for name, param in model.named_parameters(): + if param.requires_grad and param.dtype != torch.float32: + param.data = param.data.float() + + trainable = sum(p.numel() for p in model.parameters() if p.requires_grad) + print(f"Trainable: {trainable/1e6:.0f}M params | {mem_str()}") + return model + + +def load_model_with_checkpoint(device, checkpoint_path): + """Load QLoRA model and restore adapter weights from checkpoint.""" + model = load_model_qlora(device) + sd = torch.load(checkpoint_path, map_location=device, weights_only=False) + # Only load adapter weights + adapter_sd = {k: v for k, v in sd.items() if "adapter" in k} + model.load_state_dict(adapter_sd, strict=False) + print(f"Loaded {len(adapter_sd)} adapter weights from {checkpoint_path}") + return model + + +class CommandDataset(Dataset): + """Load JSONL training data.""" + def __init__(self, path, tokenizer, max_length=512): + self.examples = [] + self.tokenizer = tokenizer + self.max_length = max_length + + with open(path) as f: + for line in f: + messages = json.loads(line) + # Format: system + user + assistant + text = "" + for msg in messages: + if msg["role"] == "system": + text += f"system\n{msg['content']} " + elif msg["role"] == "user": + text += f"user\n {msg['content']} " + elif msg["role"] == "assistant": + text += f"assistant\n {msg['content']}" + self.examples.append(text) + + def __len__(self): + return len(self.examples) + + def __getitem__(self, idx): + tokens = self.tokenizer.encode(self.examples[idx]) + tokens = tokens[:self.max_length] + return torch.tensor(tokens, dtype=torch.long) + + +def collate_fn(batch): + """Pad sequences to same length.""" + max_len = max(len(x) for x in batch) + padded = torch.zeros(len(batch), max_len, dtype=torch.long) + for i, x in enumerate(batch): + padded[i, :len(x)] = x + return padded + + +def train_epoch(model, dataloader, optimizer, device, epoch, scaler=None): + model.train() + total_loss = 0 + n_batches = 0 + start = time.time() + + for i, batch in enumerate(dataloader): + input_ids = batch.to(device) + labels = input_ids.clone() + + # Forward + if scaler: + with torch.amp.autocast(device_type=str(device), dtype=torch.float16): + output = model(input_ids) + logits = output.logits if hasattr(output, 'logits') else output + loss = nn.CrossEntropyLoss()( + logits[:, :-1, :].contiguous().view(-1, logits.size(-1)), + labels[:, 1:].contiguous().view(-1) + ) + else: + output = model(input_ids) + logits = output.logits if hasattr(output, 'logits') else output + loss = nn.CrossEntropyLoss()( + logits[:, :-1, :].contiguous().view(-1, logits.size(-1)), + labels[:, 1:].contiguous().view(-1) + ) + + # Backward + optimizer.zero_grad() + if scaler: + scaler.scale(loss).backward() + scaler.unscale_(optimizer) + torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0) + scaler.step(optimizer) + scaler.update() + else: + loss.backward() + torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0) + optimizer.step() + + total_loss += loss.item() + n_batches += 1 + + if (i + 1) % 100 == 
0: + avg = total_loss / n_batches + elapsed = time.time() - start + it_s = (i + 1) / elapsed + remaining = (len(dataloader) - i - 1) / it_s / 60 + print(f" [{i+1}/{len(dataloader)}] loss={avg:.3f} {it_s:.1f}it/s ~{remaining:.0f}min left | {mem_str()}") + + return total_loss / max(n_batches, 1) + + +def evaluate(model, dataloader, device): + model.eval() + total_loss = 0 + n_batches = 0 + + with torch.no_grad(): + for batch in dataloader: + input_ids = batch.to(device) + labels = input_ids.clone() + with torch.amp.autocast(device_type=str(device), dtype=torch.float16): + output = model(input_ids) + logits = output.logits if hasattr(output, 'logits') else output + loss = nn.CrossEntropyLoss()( + logits[:, :-1, :].contiguous().view(-1, logits.size(-1)), + labels[:, 1:].contiguous().view(-1) + ) + total_loss += loss.item() + n_batches += 1 + + return total_loss / max(n_batches, 1) + + +def save_adapter_checkpoint(model, path, optimizer=None, epoch=None): + """Save adapter weights and optionally optimizer state for resume.""" + checkpoint = { + "adapter_weights": {k: v.cpu() for k, v in model.state_dict().items() if "adapter" in k}, + } + if optimizer: + checkpoint["optimizer"] = optimizer.state_dict() + if epoch is not None: + checkpoint["epoch"] = epoch + torch.save(checkpoint, path) + size_mb = os.path.getsize(path) / 1024**2 + print(f"Saved checkpoint ({size_mb:.0f}MB) to {path}") + + +def main(): + parser = argparse.ArgumentParser(description="QLoRA training for hunch") + parser.add_argument("--epochs", type=int, default=3) + parser.add_argument("--batch-size", type=int, default=4) + parser.add_argument("--learning-rate", type=float, default=1e-4) + parser.add_argument("--train-data", default=str(TRAINING_DIR / "train.jsonl")) + parser.add_argument("--eval-data", default=str(TRAINING_DIR / "eval.jsonl")) + parser.add_argument("--checkpoint-dir", default=str(TRAINING_DIR / "qlora-checkpoints")) + parser.add_argument("--checkpoint", type=str, help="Resume from checkpoint") + parser.add_argument("--eval-only", action="store_true") + args = parser.parse_args() + + device = get_device() + print(f"Device: {device} | {mem_str()}") + + # Patch rms_norm for fp16 compatibility + patch_rms_norm() + + # Generate training data if needed + if not os.path.exists(args.train_data): + print("Generating training data...") + os.system(f"cd {TRAINING_DIR} && python3 prepare_data.py") + + # Load tokenizer + tokenizer = AFMTokenizer(str(ASSETS / "tokenizer.model")) + + # Load model + if args.checkpoint: + model = load_model_with_checkpoint(device, args.checkpoint) + else: + model = load_model_qlora(device) + + if args.eval_only: + eval_dataset = CommandDataset(args.eval_data, tokenizer) + eval_loader = DataLoader(eval_dataset, batch_size=args.batch_size, collate_fn=collate_fn) + eval_loss = evaluate(model, eval_loader, device) + print(f"Eval loss: {eval_loss:.4f}") + return + + # Data + train_dataset = CommandDataset(args.train_data, tokenizer) + eval_dataset = CommandDataset(args.eval_data, tokenizer) + train_loader = DataLoader(train_dataset, batch_size=args.batch_size, shuffle=True, collate_fn=collate_fn) + eval_loader = DataLoader(eval_dataset, batch_size=args.batch_size, collate_fn=collate_fn) + + print(f"Train: {len(train_dataset)} examples, {len(train_loader)} batches") + print(f"Eval: {len(eval_dataset)} examples") + + # Optimizer + optimizer = torch.optim.AdamW( + [p for p in model.parameters() if p.requires_grad], + lr=args.learning_rate, + weight_decay=0.01 + ) + + # Gradient scaler for mixed 
precision on CUDA + scaler = torch.amp.GradScaler() if torch.cuda.is_available() else None + + # Checkpoint dir + os.makedirs(args.checkpoint_dir, exist_ok=True) + + # Training loop + print(f"\n{'='*60}") + print(f"Training: {args.epochs} epochs, batch {args.batch_size}, lr {args.learning_rate}") + print(f"{'='*60}") + + for epoch in range(args.epochs): + print(f"\nEpoch {epoch+1}/{args.epochs}") + train_loss = train_epoch(model, train_loader, optimizer, device, epoch, scaler) + + # Save checkpoint before eval (in case eval crashes) + ckpt_path = os.path.join(args.checkpoint_dir, f"adapter-epoch{epoch+1}.pt") + save_adapter_checkpoint(model, ckpt_path) + + eval_loss = evaluate(model, eval_loader, device) + print(f" Train loss: {train_loss:.4f} | Eval loss: {eval_loss:.4f} | {mem_str()}") + + # Save final + final_path = os.path.join(args.checkpoint_dir, "adapter-final.pt") + save_adapter_checkpoint(model, final_path) + print(f"\nDone! Export with:") + print(f" python3 -m export.export_fmadapter --adapter-name hunch_qlora --checkpoint {final_path} --output-dir {args.checkpoint_dir}/") + + +if __name__ == "__main__": + main() diff --git a/training/train_qlora_test.py b/training/train_qlora_test.py new file mode 100644 index 0000000..7424862 --- /dev/null +++ b/training/train_qlora_test.py @@ -0,0 +1,216 @@ +#!/usr/bin/env python3 +""" +True QLoRA training test: 4-bit NF4 base model + fp32 LoRA adapters. + +Uses bitsandbytes for NF4 quantization. Tests loading + one training step. + +Usage: + pip install bitsandbytes + python3 train_qlora_test.py + +This script: + 1. Loads the base model + 2. Replaces frozen Linear layers with 4-bit NF4 equivalents + 3. Runs one training batch to verify it works + 4. Reports memory usage at each step +""" + +import sys +import os +import gc +import time +import psutil + +TOOLKIT_DIR = os.path.join(os.path.dirname(__file__), "adapter_training_toolkit_v26_0_0") +sys.path.insert(0, TOOLKIT_DIR) + +import torch +import tamm.utils.json +from pathlib import Path + +ASSETS = Path(TOOLKIT_DIR) / "assets" + + +def mem(): + return psutil.Process().memory_info().rss / 1024**3 + +def gpu_mem(): + if torch.cuda.is_available(): + return torch.cuda.memory_allocated() / 1024**3 + elif torch.backends.mps.is_available(): + return torch.mps.current_allocated_memory() / 1024**3 + return 0 + +def get_device(): + if torch.cuda.is_available(): + return torch.device("cuda") + if torch.backends.mps.is_available(): + return torch.device("mps") + return torch.device("cpu") + + +def quantize_linear_to_4bit(model): + """Replace frozen nn.Linear layers with bitsandbytes 4-bit Linear.""" + try: + import bitsandbytes as bnb + except ImportError: + print("ERROR: pip install bitsandbytes") + sys.exit(1) + + quantized = 0 + skipped = 0 + + # Collect replacements (can't modify during iteration) + replacements = [] + for name, module in model.named_modules(): + if not isinstance(module, torch.nn.Linear): + continue + if "adapter" in name: + skipped += 1 + continue + if any(p.requires_grad for p in module.parameters()): + skipped += 1 + continue + replacements.append((name, module)) + + # Apply replacements + for name, module in replacements: + # Create 4-bit linear + new_module = bnb.nn.Linear4bit( + module.in_features, + module.out_features, + bias=module.bias is not None, + compute_dtype=torch.float16, + quant_type="nf4", + ) + + # Quantize weights + new_module.weight = bnb.nn.Params4bit( + module.weight.data, + requires_grad=False, + quant_type="nf4", + compress_statistics=True, + ) + if 
module.bias is not None: + new_module.bias = module.bias + + # Replace in parent module + parts = name.rsplit(".", 1) + if len(parts) == 2: + parent_name, child_name = parts + parent = dict(model.named_modules())[parent_name] + setattr(parent, child_name, new_module) + else: + setattr(model, name, new_module) + + quantized += 1 + + # Free memory + gc.collect() + if torch.cuda.is_available(): + torch.cuda.empty_cache() + + print(f"QLoRA: quantized {quantized} layers to NF4, skipped {skipped}") + return model + + +def main(): + device = get_device() + print(f"Device: {device}") + print(f"System RAM: {psutil.virtual_memory().total / 1024**3:.0f}GB") + print(f"Before: RAM={mem():.1f}GB, GPU={gpu_mem():.1f}GB") + + # Step 1: Load model config + with open(ASSETS / "base-model-config.json") as f: + config = tamm.utils.json.load(f) + + # Step 2: Create model on CPU + print("\n--- Creating model ---") + model = config.create_model() + print(f"After create_model: RAM={mem():.1f}GB, GPU={gpu_mem():.1f}GB") + + # Step 3: Load weights via mmap + print("\n--- Loading weights (mmap) ---") + sd = torch.load(str(ASSETS / "base-model.pt"), map_location="cpu", mmap=True, weights_only=False) + model.load_state_dict(sd, strict=True) + del sd; gc.collect() + print(f"After load+del: RAM={mem():.1f}GB, GPU={gpu_mem():.1f}GB") + + # Step 4: Freeze non-adapter params + for name, param in model.named_parameters(): + param.requires_grad = "adapter" in name + + trainable_before = sum(p.numel() for p in model.parameters() if p.requires_grad) + frozen_before = sum(p.numel() for p in model.parameters() if not p.requires_grad) + print(f"Trainable: {trainable_before/1e6:.0f}M, Frozen: {frozen_before/1e6:.0f}M") + + # Step 5: Quantize frozen layers to 4-bit NF4 + print("\n--- Quantizing to NF4 ---") + model = quantize_linear_to_4bit(model) + gc.collect() + print(f"After quantize: RAM={mem():.1f}GB, GPU={gpu_mem():.1f}GB") + + # Step 6: Move to device + print(f"\n--- Moving to {device} ---") + model = model.to(device) + gc.collect() + if torch.cuda.is_available(): + torch.cuda.empty_cache() + print(f"After to({device}): RAM={mem():.1f}GB, GPU={gpu_mem():.1f}GB") + + # Step 7: Verify trainable params are fp32 + for name, param in model.named_parameters(): + if param.requires_grad and "adapter" in name: + if param.dtype != torch.float32: + param.data = param.data.float() + + trainable_after = sum(p.numel() for p in model.parameters() if p.requires_grad) + print(f"Trainable params: {trainable_after/1e6:.0f}M") + + # Step 8: Test one forward + backward pass + print("\n--- Test forward/backward ---") + try: + tokenizer_path = ASSETS / "tokenizer.model" + from tamm.tokenizers.afm import AFMTokenizer + tokenizer = AFMTokenizer(str(tokenizer_path)) + + # Create a simple input + text = "Output a single shell command for zsh on macOS.\nfind files changed in the last hour" + tokens = tokenizer.encode(text) + input_ids = torch.tensor([tokens[:50]], device=device) + labels = input_ids.clone() + + # Forward pass + output = model(input_ids) + if hasattr(output, 'logits'): + logits = output.logits + else: + logits = output + + # Compute loss + loss_fn = torch.nn.CrossEntropyLoss() + shift_logits = logits[:, :-1, :].contiguous() + shift_labels = labels[:, 1:].contiguous() + loss = loss_fn(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1)) + print(f"Loss: {loss.item():.4f}") + + # Backward pass + loss.backward() + print(f"After backward: RAM={mem():.1f}GB, GPU={gpu_mem():.1f}GB") + + # Check gradients exist on adapter params 
+ grad_count = sum(1 for p in model.parameters() if p.grad is not None) + print(f"Params with gradients: {grad_count}") + + print("\nSUCCESS: QLoRA forward + backward works!") + + except Exception as e: + print(f"\nFailed at forward/backward: {e}") + import traceback + traceback.print_exc() + + print(f"\nFinal: RAM={mem():.1f}GB, GPU={gpu_mem():.1f}GB") + + +if __name__ == "__main__": + main() From 70e79b8a73e05b2db2f79e9c6e38a13164a1efe2 Mon Sep 17 00:00:00 2001 From: Enrico Santagati Date: Thu, 16 Apr 2026 16:43:57 -0400 Subject: [PATCH 04/11] Add --batch and --runs flags for single-process adapter benchmarking Works around TGOnDeviceInferenceProviderService disk leak where each process invocation caches ~160MB of the adapter. Batch mode loads the adapter once and runs all prompts in a single process. 4 runs of 100 prompts = 1 cached copy instead of 400. --- benchmark/run.py | 132 +++++++++++++++++++++++++++++++-- cli/Sources/Hunch/main.swift | 137 +++++++++++++++++++++++++++++++++++ 2 files changed, 263 insertions(+), 6 deletions(-) diff --git a/benchmark/run.py b/benchmark/run.py index b8328c6..3bed7bd 100755 --- a/benchmark/run.py +++ b/benchmark/run.py @@ -511,44 +511,111 @@ def approach_hunch_multi_warm(prompt): LORA_OVERRIDE_ADAPTER_PATH = str(Path(__file__).parent.parent / "training" / "exports" / "hunch_overrides.fmadapter") +def _run_hunch_batch(prompts, extra_args=None, runs=1): + """Run all prompts in a single hunch process using --batch mode. + + This avoids the TGOnDeviceInferenceProviderService disk leak where each + process invocation caches a ~160MB copy of the adapter. + + Returns: dict keyed by (run, id) if runs > 1, or by id if runs == 1. + """ + # Write prompts to a temp JSONL file + import tempfile + with tempfile.NamedTemporaryFile(mode="w", suffix=".jsonl", delete=False) as f: + for p in prompts: + f.write(json.dumps({"id": p["id"], "prompt": p["prompt"]}) + "\n") + batch_path = f.name + + cmd = ["hunch"] + if extra_args: + cmd.extend(extra_args) + cmd.extend(["--batch", batch_path]) + if runs > 1: + cmd.extend(["--runs", str(runs)]) + + try: + proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True) + results = {} + count = 0 + total = len(prompts) * runs + for line in proc.stdout: + line = line.strip() + if not line: + continue + try: + r = json.loads(line) + count += 1 + status = r.get("result", "")[:40] + print(f" [{count}/{total}] #{r.get('id', '?'):3d}: {r.get('prompt', '')[:50]:50s} → {status} ({r.get('total_time', 0)}s)") + if runs > 1: + results[(r["run"], r["id"])] = r + else: + results[r["id"]] = r + except (json.JSONDecodeError, KeyError): + continue + proc.wait() + return results + except Exception: + return {} + finally: + os.unlink(batch_path) + + +def _make_batch_approach(extra_args): + """Create a batch-aware approach function for adapter benchmarks.""" + def approach(prompt): + # Fallback for single-prompt calls (e.g. 
--ids) + return _run_hunch(prompt, extra_args) + approach._batch_args = extra_args + return approach + + def approach_adapter_only(prompt): """LoRA adapter only, no retrieval.""" return _run_hunch(prompt, ["--adapter", ADAPTER_PATH, "--limit", "0"]) +approach_adapter_only._batch_args = ["--adapter", ADAPTER_PATH, "--limit", "0"] def approach_adapter_retrieval(prompt): """LoRA adapter + retrieval.""" return _run_hunch(prompt, ["--adapter", ADAPTER_PATH]) +approach_adapter_retrieval._batch_args = ["--adapter", ADAPTER_PATH] def approach_fp16lora_only(prompt): """fp16 LoRA adapter only, no retrieval.""" return _run_hunch(prompt, ["--adapter", QLORA_FP16_ADAPTER_PATH, "--limit", "0"]) +approach_fp16lora_only._batch_args = ["--adapter", QLORA_FP16_ADAPTER_PATH, "--limit", "0"] def approach_fp16lora_retrieval(prompt): """fp16 LoRA adapter + retrieval.""" return _run_hunch(prompt, ["--adapter", QLORA_FP16_ADAPTER_PATH]) +approach_fp16lora_retrieval._batch_args = ["--adapter", QLORA_FP16_ADAPTER_PATH] def approach_qlora_only(prompt): """True QLoRA (NF4) adapter only, no retrieval.""" return _run_hunch(prompt, ["--adapter", QLORA_NF4_ADAPTER_PATH, "--limit", "0"]) +approach_qlora_only._batch_args = ["--adapter", QLORA_NF4_ADAPTER_PATH, "--limit", "0"] def approach_qlora_retrieval(prompt): """True QLoRA (NF4) adapter + retrieval.""" return _run_hunch(prompt, ["--adapter", QLORA_NF4_ADAPTER_PATH]) +approach_qlora_retrieval._batch_args = ["--adapter", QLORA_NF4_ADAPTER_PATH] def approach_qlora_override_only(prompt): """QLoRA trained on overrides only, no retrieval.""" return _run_hunch(prompt, ["--adapter", QLORA_OVERRIDE_ADAPTER_PATH, "--limit", "0"]) +approach_qlora_override_only._batch_args = ["--adapter", QLORA_OVERRIDE_ADAPTER_PATH, "--limit", "0"] def approach_qlora_override_retrieval(prompt): """QLoRA trained on overrides only + retrieval.""" return _run_hunch(prompt, ["--adapter", QLORA_OVERRIDE_ADAPTER_PATH]) +approach_qlora_override_retrieval._batch_args = ["--adapter", QLORA_OVERRIDE_ADAPTER_PATH] def approach_dynshot_tldr(prompt): @@ -631,8 +698,8 @@ def approach_dynshot_holdout(prompt): "qlora-retrieval": approach_qlora_retrieval, "qlora-override-only": approach_qlora_override_only, "qlora-override-retrieval": approach_qlora_override_retrieval, - "lora-override-only": lambda p: _run_hunch(p, ["--adapter", LORA_OVERRIDE_ADAPTER_PATH, "--limit", "0"]), - "lora-override-retrieval": lambda p: _run_hunch(p, ["--adapter", LORA_OVERRIDE_ADAPTER_PATH]), + "lora-override-only": _make_batch_approach(["--adapter", LORA_OVERRIDE_ADAPTER_PATH, "--limit", "0"]), + "lora-override-retrieval": _make_batch_approach(["--adapter", LORA_OVERRIDE_ADAPTER_PATH]), "hunch-sc": approach_hunch_sc, "sc-dynshot": approach_selfconsist_dynshot, "sc-warm": approach_selfconsist_warm, @@ -652,14 +719,51 @@ def load_prompts(ids=None, category=None): return prompts -def run_benchmark(approach_name, prompts): +def run_benchmark(approach_name, prompts, suffix="", runs=1): func = APPROACHES[approach_name] - outfile = RESULTS_DIR / f"{approach_name}.jsonl" print(f"\n{'=' * 60}") - print(f" APPROACH: {approach_name} ({len(prompts)} prompts)") + print(f" APPROACH: {approach_name} ({len(prompts)} prompts{f', {runs} runs' if runs > 1 else ''})") print(f"{'=' * 60}") + # Use batch mode for adapter approaches (avoids disk leak) + batch_args = getattr(func, '_batch_args', None) + if batch_args and len(prompts) > 1: + print(f" Using --batch mode (single process, avoids adapter disk leak)") + batch_results = _run_hunch_batch(prompts, 
batch_args, runs=runs) + + all_results = [] + for run_num in range(1, runs + 1): + run_suffix = f"-run{run_num}" if runs > 1 else "" + outfile = RESULTS_DIR / f"{approach_name}{suffix}{run_suffix}.jsonl" + + results = [] + with open(outfile, "w") as f: + for p in prompts: + if runs > 1: + br = batch_results.get((run_num, p["id"]), {}) + else: + br = batch_results.get(p["id"], {}) + r = { + "result": br.get("result", "[BATCH_ERROR]"), + "total_time": br.get("total_time", 0), + } + r["id"] = p["id"] + r["approach"] = approach_name + r["prompt"] = p["prompt"] + r["expected"] = p["expected"] + r["category"] = p["category"] + + f.write(json.dumps(r) + "\n") + f.flush() + results.append(r) + + print(f" Saved to {outfile}") + all_results.extend(results) + + return all_results + + outfile = RESULTS_DIR / f"{approach_name}{suffix}.jsonl" results = [] with open(outfile, "w") as f: for i, p in enumerate(prompts): @@ -694,6 +798,7 @@ def main(): parser.add_argument("approach", nargs="?", default="all", help="Approach name or 'all'") parser.add_argument("--ids", help="Comma-separated prompt IDs") parser.add_argument("--category", help="Filter by category: simple, flags, composed") + parser.add_argument("--runs", type=int, default=1, help="Number of runs (output files suffixed -run1, -run2, ...)") args = parser.parse_args() ids = [int(x) for x in args.ids.split(",")] if args.ids else None @@ -712,7 +817,22 @@ def main(): if a not in APPROACHES: print(f"Unknown approach: {a}. Available: {', '.join(APPROACHES.keys())}") sys.exit(1) - run_benchmark(a, prompts) + + for a in approaches: + func = APPROACHES[a] + batch_args = getattr(func, '_batch_args', None) + if batch_args and args.runs > 1: + # Adapter approaches: all runs in one process + run_benchmark(a, prompts, runs=args.runs) + elif args.runs > 1: + # Non-adapter approaches: loop externally + for run_num in range(1, args.runs + 1): + print(f"\n{'#' * 60}") + print(f" RUN {run_num}/{args.runs}") + print(f"{'#' * 60}") + run_benchmark(a, prompts, suffix=f"-run{run_num}") + else: + run_benchmark(a, prompts) print(f"\nDone. Run: python3 score.py") diff --git a/cli/Sources/Hunch/main.swift b/cli/Sources/Hunch/main.swift index ba6a296..09ce597 100644 --- a/cli/Sources/Hunch/main.swift +++ b/cli/Sources/Hunch/main.swift @@ -159,6 +159,8 @@ struct Hunch { let limit = parseFlag(&args, flag: "--limit").flatMap(Int.init) ?? 8 let guided = parseFlag(&args, flag: "--guided") let adapterPath = parseFlag(&args, flag: "--adapter") + let batchFile = parseFlag(&args, flag: "--batch") + let batchRuns = parseFlag(&args, flag: "--runs").flatMap(Int.init) ?? 
1 // Parse mode var mode: Mode = .suggest @@ -170,6 +172,20 @@ struct Hunch { args.removeFirst() } + // Batch mode: read prompts from JSONL, run all in one process + if let batchFile { + do { + try await runBatch( + file: batchFile, adapterPath: adapterPath, temperature: temperature, + limit: limit, guided: guided, runs: batchRuns + ) + } catch { + fputs("error: \(error.localizedDescription)\n", stderr) + exit(1) + } + return + } + guard !args.isEmpty else { printUsage() return @@ -391,6 +407,127 @@ struct Hunch { } } + static func runBatch( + file: String, adapterPath: String?, temperature: Double?, + limit: Int, guided: String?, runs: Int = 1 + ) async throws { + // Read JSONL file + let contents = try String(contentsOfFile: file, encoding: .utf8) + let lines = contents.components(separatedBy: .newlines).filter { !$0.isEmpty } + + // Load model once + let model: SystemLanguageModel + if let adapterPath { + let adapterURL = URL(fileURLWithPath: adapterPath) + let adapter = try SystemLanguageModel.Adapter(fileURL: adapterURL) + model = SystemLanguageModel( + adapter: adapter, + guardrails: .permissiveContentTransformations + ) + } else { + model = SystemLanguageModel( + guardrails: .permissiveContentTransformations + ) + } + + let genOptions: GenerationOptions? = temperature.map { + var opts = GenerationOptions() + opts.temperature = $0 + return opts + } + + let dbPath = findDatabase() + + for run in 1...runs { + for line in lines { + guard let data = line.data(using: .utf8), + let entry = try? JSONSerialization.jsonObject(with: data) as? [String: Any], + let idValue = entry["id"], let id = idValue as? Int ?? (idValue as? NSNumber)?.intValue, + let prompt = entry["prompt"] as? String else { + continue + } + + let start = CFAbsoluteTimeGetCurrent() + var result: String + + do { + let examples = dbPath != nil + ? 
searchBank(dbPath: dbPath!, query: prompt, limit: limit) + : [] + let systemPrompt = buildSystemPrompt(mode: .suggest, examples: examples) + + let session: LanguageModelSession + if !systemPrompt.isEmpty { + let segment = Transcript.TextSegment(content: systemPrompt) + let instructions = Transcript.Instructions( + segments: [.text(segment)], + toolDefinitions: [] + ) + session = LanguageModelSession( + model: model, + transcript: Transcript(entries: [.instructions(instructions)]) + ) + } else { + session = LanguageModelSession(model: model) + } + + if guided == "plain" { + let response: LanguageModelSession.Response + if let opts = genOptions { + response = try await session.respond(to: prompt, generating: ShellCommand.self, options: opts) + } else { + response = try await session.respond(to: prompt, generating: ShellCommand.self) + } + result = response.content.command + } else if guided == "cot" { + let response: LanguageModelSession.Response + if let opts = genOptions { + response = try await session.respond(to: prompt, generating: ShellCommandCoT.self, options: opts) + } else { + response = try await session.respond(to: prompt, generating: ShellCommandCoT.self) + } + result = response.content.command + } else if guided == "multi" { + let response: LanguageModelSession.Response + if let opts = genOptions { + response = try await session.respond(to: prompt, generating: ShellCommandMulti.self, options: opts) + } else { + response = try await session.respond(to: prompt, generating: ShellCommandMulti.self) + } + result = majorityVote([response.content.first, response.content.second, response.content.third]) + } else { + // Default: plain string + let response: LanguageModelSession.Response + if let opts = genOptions { + response = try await session.respond(to: prompt, options: opts) + } else { + response = try await session.respond(to: prompt) + } + result = stripMarkdown(response.content) + } + } catch { + result = "[ERROR] \(error.localizedDescription)" + } + + let elapsed = round((CFAbsoluteTimeGetCurrent() - start) * 100) / 100 + var output: [String: Any] = [ + "id": id, + "prompt": prompt, + "result": result, + "total_time": elapsed + ] + if runs > 1 { + output["run"] = run + } + if let jsonData = try? JSONSerialization.data(withJSONObject: output), + let jsonString = String(data: jsonData, encoding: .utf8) { + print(jsonString) + fflush(stdout) + } + } + } // end runs loop + } + static func printUsage() { let dbStatus = findDatabase() != nil ? "found" : "not found" let envTemp = ProcessInfo.processInfo.environment["HUNCH_TEMPERATURE"] ?? "not set" From 2a1b8beb6b950e8f3d22553c607f9378522e73a2 Mon Sep 17 00:00:00 2001 From: Enrico Santagati Date: Thu, 16 Apr 2026 16:44:07 -0400 Subject: [PATCH 05/11] Update accepted answers from 20-run adapter benchmark review Reviewed all non-exact results across 5 approaches x 4 runs. Added accepted alternates for placeholder variations, flag reordering, and equivalent commands. --- benchmark/alternates.json | 158 +++++++++++++++++++++++++++++--------- 1 file changed, 122 insertions(+), 36 deletions(-) diff --git a/benchmark/alternates.json b/benchmark/alternates.json index 239acf9..1319f7d 100644 --- a/benchmark/alternates.json +++ b/benchmark/alternates.json @@ -10,7 +10,9 @@ "ls", "ls -la", "ls -a", - "ls -l" + "ls -l", + "ls -1", + "ls ." ], "3": [ "df -h", @@ -62,7 +64,8 @@ "find . -name '*.png'", "find . -name \"*.png\"", "find . -iname '*.png'", - "find . -type f -name '*.png'" + "find . -type f -name '*.png'", + "find . 
-type f -name \"*.png\"" ], "14": [ "find . -type d -empty", @@ -83,7 +86,9 @@ "find . -name '.DS_Store' -delete", "find . -name .DS_Store -delete", "find . -name '.DS_Store' -exec rm {} +", - "find . -name '.DS_Store' -exec rm {} \\;" + "find . -name '.DS_Store' -exec rm {} \\;", + "find . -name \".DS_Store\" -delete", + "find . -name \".DS_Store\" -exec rm {} \\;" ], "18": [ "find . -type l" @@ -97,13 +102,16 @@ "find . -type d -name 'node_modules'", "find -name 'node_modules'", "find . -name 'node_modules'", - "find . -name \"node_modules\"" + "find . -name \"node_modules\"", + "find . -name node_modules" ], "21": [ "find . -name '*.py' -exec wc -l {} +", "find . -name '*.py' | xargs wc -l", "wc -l **/*.py", - "find . -name '*.py' -exec wc -l {} \\;" + "find . -name '*.py' -exec wc -l {} \\;", + "find . -name \"*.py\" -exec wc -l {} +", + "find . -name \"*.py\" | xargs wc -l" ], "22": [ "du -sh * | sort -hr", @@ -113,7 +121,8 @@ "23": [ "kill $(lsof -t -i :3000)", "lsof -t -i :3000 | xargs kill", - "fuser -k 3000/tcp" + "fuser -k 3000/tcp", + "kill $(lsof -t -i :3000 )" ], "24": [ "find . -size +1G", @@ -139,7 +148,13 @@ "tar -czf compressed_folder.tar.gz ./", "tar -czf folder.tar.gz /path/to/folder", "tar czf archive.tar.gz /path/to/folder", - "tar czf folder.tar.gz folder" + "tar czf folder.tar.gz folder", + "tar -czf file.tar.gz folder", + "tar -czf folder.tar.gz .", + "tar -czf folder.tar.gz folder", + "tar -czvf folder.tar.gz folder", + "tar -czf path/to/compressed.tar.gz path/to/folder", + "tar -czvf /path/to/output.tar.gz /path/to/folder" ], "28": [ "tar xzf file.tar.gz", @@ -148,14 +163,21 @@ "tar -xf archive.tar.gz", "tar xzvf archive.tar.gz", "tar -xvzf file.tar.gz", - "tar xvf file.tar.gz" + "tar xvf file.tar.gz", + "tar -xvf archive.tar.gz", + "tar -xvzf archive.tar.gz", + "tar -xvzf filename.tar.gz", + "tar -xzvf file.tar.gz", + "tar -zxvf archive.tar.gz", + "tar -zxvf file.tar.gz" ], "29": [ "git branch --sort=-committerdate", "git branch -a --sort=-committerdate" ], "30": [ - "git log --oneline" + "git log --oneline", + "git log --pretty=oneline" ], "31": [ "git diff --staged", @@ -185,7 +207,8 @@ "36": [ "netstat -an", "netstat", - "lsof -i" + "lsof -i", + "netstat -ln" ], "37": [ "lsof -i :8080", @@ -197,7 +220,11 @@ "curl -o file https://example.com/file", "wget https://example.com/file", "curl -O url", - "curl -o filename url" + "curl -o filename url", + "curl -O https://example.com/file.txt", + "curl -O https://example.com/file.zip", + "curl -o file.zip https://example.com/file.zip", + "wget https://example.com/file.pdf" ], "39": [ "curl -I https://example.com", @@ -210,7 +237,11 @@ "curl -X POST https://example.com/api/endpoint -H 'Content-Type: application/json' -d '{\"key\": \"value\"}'", "curl -X POST -H 'Content-Type: application/json' -d '{\"key\": \"value\"}' https://example.com", "curl -X POST https://example.com -H 'Content-Type: application/json' -d '{\"key1\": \"value1\", \"key2\": \"value2\"}'", - "curl -X POST -H 'Content-Type: application/json' -d '{\"name\": \"john\", \"age\": 25}' https://example.com" + "curl -X POST -H 'Content-Type: application/json' -d '{\"name\": \"john\", \"age\": 25}' https://example.com", + "curl -X POST -H \"Content-Type: application/json\" -d '{\"key\": \"value\"}' https://example.com", + "curl -X POST http://localhost:8000/api/endpoint -H \"Content-Type: application/json\" -d '{\"key\": \"value\"}'", + "curl -X POST http://localhost:8000/api/post -H \"Content-Type: application/json\" -d '{\"key\":\"value\"}'", + "curl -X 
POST https://api.example.com/endpoint -H \"Content-Type: application/json\" -d '{\"key1\": \"value1\", \"key2\": \"value2\"}'" ], "41": [ "tail -f logfile", @@ -249,25 +280,32 @@ "cp -R src dst", "cp -a src dst", "cp -r source_directory destination_directory", - "cp -r path/to/source_directory path/to/target_directory" + "cp -r path/to/source_directory path/to/target_directory", + "cp -R source_directory destination_directory", + "cp -r src/ dst/", + "cp -r src/ dest/" ], "48": [ "mkdir -p path/to/dir", "mkdir -p /path/to/create/directory", "mkdir -p \"path/to/directory\"", - "mkdir -p parent_directory_path" + "mkdir -p parent_directory_path", + "mkdir -p /path/to/directory", + "mkdir -p path/to/directory" ], "49": [ "chmod +x file", "chmod 755 file", - "chmod +x executable" + "chmod +x executable", + "chmod +x filename" ], "50": [ "stat -f '%A' file", "stat -f '%Lp' file" ], "51": [ - "find . -type f -exec md5 {} + | sort | uniq -d" + "find . -type f -exec md5 {} + | sort | uniq -d", + "find . -type f -exec md5 {} + | sort | uniq -d | sort" ], "52": [ "top", @@ -287,18 +325,23 @@ "55": [ "pkill processname", "killall processname", - "pkill process-name" + "pkill process-name", + "pkill myprocess", + "pkill bash", + "pkill shell_name" ], "56": [ "dig example.com", "nslookup example.com", - "host example.com" + "host example.com", + "dig domain.com" ], "57": [ "nc -zv host 80", "nc -z host 80", "nmap -p 80 host", - "nc -zv hostname port" + "nc -zv hostname port", + "nc -zv hostname 80" ], "58": [ "openssl rand -base64 32", @@ -307,7 +350,8 @@ ], "59": [ "md5 file", - "md5sum file" + "md5sum file", + "md5 file.txt" ], "60": [ "shasum -a 256 file", @@ -336,17 +380,21 @@ ], "65": [ "caffeinate", - "caffeinate -d" + "caffeinate -d", + "caffeinate -t 3600" ], "66": [ "say hello", "say 'hello'", "say \"hello\"", "say 'Hello, world!'", - "say 'hello world'" + "say 'hello world'", + "say \"Hello, world!\"", + "say \"hello world\"" ], "67": [ - "pmset -g batt" + "pmset -g batt", + "pmset -g" ], "68": [ "sudo dscacheutil -flushcache", @@ -380,7 +428,10 @@ "75": [ "sips -s format png input.jpg --out output.png", "convert input.jpg output.png", - "convert path/to/input_image.jpg path/to/output_image.png" + "convert path/to/input_image.jpg path/to/output_image.png", + "sips -s format jpeg input.png --out output.jpeg", + "sips -s format png input.jpg", + "sips -s format webp input.jpg --out output.webp" ], "76": [ "sips --resampleWidth 800 image.jpg", @@ -405,7 +456,10 @@ "ln -s source_path target_path", "ln -s source destination", "ln -s /path/to/file /path/to/symlink", - "ln -s path/to/file_or_directory path/to/symlink" + "ln -s path/to/file_or_directory path/to/symlink", + "ln -s source_path destination_path", + "ln -s src dest", + "ln -s src dst" ], "81": [ "lsof -i -P -n | grep LISTEN", @@ -425,12 +479,14 @@ "git cherry-pick ", "git cherry-pick commit", "git cherry-pick ", - "git cherry-pick HEAD~1" + "git cherry-pick HEAD~1", + "git cherry-pick HEAD^" ], "85": [ "ls -lh file", "ls -lh", - "du -sh file" + "du -sh file", + "du -h file" ], "86": [ "find . 
-perm 777", @@ -441,20 +497,30 @@ "87": [ "head -n 20 file", "head -20 file", - "head -n 20 < filename" + "head -n 20 < filename", + "head -20 filename", + "cat file | head -20" ], "88": [ "ssh -i key.pem user@host", "ssh -i path/to/key_file username@remote_host", "ssh -i /path/to/key username@host", "ssh username@host -i path/to/key", - "ssh username@remote_host -i path/to/key_file" + "ssh username@remote_host -i path/to/key_file", + "ssh user@hostname -i path/to/key_file.pem", + "ssh user@server -i path/to/key.pem", + "ssh user@server -i path/to/key_file.pem", + "ssh user@server.example.com -i ~/.ssh/id_rsa" ], "89": [ "rsync -avz src/ user@host:dst/", "rsync -avz src/ user@host:dst", "rsync -avz /path/to/source /path/to/destination", - "rsync -avz source_path destination_path" + "rsync -avz source_path destination_path", + "rsync -avz source_directory remote_server", + "rsync -avz . remote_server:destination_directory", + "rsync -avz --progress source_directory remote_server", + "rsync -avz --progress src_dir remote_server" ], "90": [ "crontab -l" @@ -462,7 +528,8 @@ "91": [ "grep -ri pattern .", "grep -ri 'pattern' .", - "grep -rni pattern ." + "grep -rni pattern .", + "find . -type f -exec grep -ri 'pattern' {} +" ], "92": [ "wc file", @@ -473,7 +540,14 @@ "zip -r archive.zip directory", "zip -r archive.zip dir", "zip -r archive.zip directory/", - "zip -r /path/to/directory.zip /path/to/directory" + "zip -r /path/to/directory.zip /path/to/directory", + "zip -r archive.zip .", + "zip -r archive.zip ./", + "zip -r directory_name.zip directory", + "zip -r file.zip .", + "zip -r file.zip ./", + "zip -r mydir.zip ./mydir", + "zip -r myfile.zip mydir" ], "94": [ "unzip file.zip -d directory", @@ -483,7 +557,13 @@ "unzip -d /path/to/output file.zip", "unzip -d destination file.zip", "unzip filename -d destination_directory", - "unzip filename -d destination" + "unzip filename -d destination", + "unzip -d target_dir filename", + "unzip file.zip -d /path/to/directory", + "unzip file.zip -d /path/to/unzip", + "unzip file.zip -d destination", + "unzip file.zip -d destination_directory", + "unzip file.zip -d /path/to/unzipped" ], "95": [ "system_profiler SPHardwareDataType" @@ -492,7 +572,9 @@ "system_profiler SPUSBDataType" ], "97": [ - "date -r 1700000000" + "date -r 1700000000", + "date -r $TIMESTAMP", + "date -r $UNIX_TIMESTAMP" ], "98": [ "stat -f '%B' file | xargs date -r", @@ -503,11 +585,15 @@ "echo -n 'string' | base64", "printf 'string' | base64", "echo 'string' | base64", - "echo -n 'text' | base64" + "echo -n 'text' | base64", + "echo 'input string' | base64", + "echo 'your string' | base64", + "echo 'your_string' | base64" ], "100": [ "env | grep PATH", "printenv | grep PATH", - "echo $PATH" + "echo $PATH", + "export PATH" ] } \ No newline at end of file From f337665e226c8535c02c53ae9aec66564cee09ad Mon Sep 17 00:00:00 2001 From: Enrico Santagati Date: Fri, 17 Apr 2026 15:35:01 -0400 Subject: [PATCH 06/11] Add Mac QLoRA training, benchmark review criteria, and MPS adapter results - QLoRA training on Mac via native Metal kernels (bitsandbytes PR #1875) ~34 min for 20 epochs on M3, 3.4GB GPU, ~7x slower than T4 - MPS GradScaler fix for fp16 gradient overflow - Flat checkpoint format for export compatibility - Benchmark review criteria documented in REVIEW_CRITERIA.md - MPS adapter benchmark approaches added to run.py - Updated alternates.json with manual review of 28 runs - TRAINING.md rewritten: Mac + Colab paths, memory breakdowns, accuracy table - Removed failed eval cells from 
notebooks --- benchmark/REVIEW_CRITERIA.md | 95 +++++++++++++++++++ benchmark/alternates.json | 29 ++++-- benchmark/run.py | 17 ++++ training/TRAINING.md | 175 ++++++++++++++++++++++++++++------- training/train_lora.ipynb | 84 ----------------- training/train_qlora.ipynb | 78 +--------------- training/train_qlora_full.py | 18 ++-- 7 files changed, 282 insertions(+), 214 deletions(-) create mode 100644 benchmark/REVIEW_CRITERIA.md diff --git a/benchmark/REVIEW_CRITERIA.md b/benchmark/REVIEW_CRITERIA.md new file mode 100644 index 0000000..8e9ac7e --- /dev/null +++ b/benchmark/REVIEW_CRITERIA.md @@ -0,0 +1,95 @@ +# Benchmark Review Criteria + +Rules for deciding whether a non-exact result is "functionally correct" and should be added to alternates.json. Apply these consistently across all reviews. + +## ACCEPT — add to alternates.json + +### Placeholder variations +Different placeholder names for the same command structure: +- `file` vs `filename` vs `file.txt` — accept +- `src dst` vs `source destination` vs `source_directory destination_directory` — accept +- `user@host` vs `user@server` vs `username@remote_host` — accept +- `example.com` vs `api.example.com` vs `localhost:8000` — accept + +### Quote style +- Single vs double quotes: `'*.png'` vs `"*.png"` — accept +- With or without quotes when not ambiguous: `-name .DS_Store` vs `-name '.DS_Store'` — accept + +### Flag reordering +Same flags in different order: +- `tar -xvzf` vs `tar -zxvf` — accept +- `rsync -avz` vs `rsync -avzh` (extra harmless flag) — accept cautiously + +### Harmless extra flags +Flags that don't change the core behavior: +- `tar -czvf` (verbose) vs `tar -czf` — accept +- `cp -R` vs `cp -r` (same on macOS) — accept +- Adding `--progress` to rsync — accept + +### Format variations +Same result, slightly different format: +- `git log --oneline` vs `git log --pretty=oneline` — accept +- `echo $SHELL` vs `echo $0` — accept (both show shell) + +## REJECT — do not add to alternates.json + +### Wrong command entirely +- `system_profiler` for "monitor cpu usage" (should be `top`) — reject +- `pbcopy` for "paste from clipboard" (that's copy, not paste) — reject +- `cls` for "clear terminal" (Windows command) — reject + +### Wrong flags that change meaning +- `find . -mtime -60` for "files changed in last hour" (`-mtime` is days, not minutes) — reject +- `find . -mtime +1` for "files modified today" (opposite: MORE than 1 day ago) — reject +- `head -50` for "last 50 lines" (head shows FIRST, not last) — reject +- `tail -n 20` for "first 20 lines" (tail shows LAST, not first) — reject + +### Missing critical parts +- `cp -r directory` (missing destination) — reject +- `find .DS_Store -delete` (missing `.` path, only current dir entry) — reject +- `zip -r .` (missing output filename) — reject +- `ssh user@server` (missing `-i key` when prompt asks for specific key) — reject + +### Hallucinated commands/flags +- `git log --no-pushed` (not a real flag) — reject +- `git rename-branch` (not a real command) — reject +- `find . -type symlink` (invalid type, should be `l`) — reject +- `link -s` (not the same as `ln -s`) — reject +- `zipdir`, `pylist`, `mcal` — reject + +### Broadened scope +- `find . -empty` for "find empty directories" (also finds empty files) — reject +- `find . 
-name node_modules` for "find directories named node_modules" (also finds files) — accept only with `-type d` +- `git branch --merged | xargs git branch -d` without `grep -v main` (would delete main) — reject + +### Functionally different approach +- `comm -12 <(sort file1) <(sort file2)` for "compare two files" (shows common lines, not differences) — reject +- `du -sh /` for "show disk usage" (directory usage, not filesystem usage like `df`) — reject +- `find . -name '*.py' | wc -l` for "count lines in python files" (counts FILES, not lines IN them) — reject + +### Piped through unnecessary commands +- `cat file | head -20` for "first 20 lines" — accept (useless cat but correct) +- `find ... | wc -l` when it should be `find ... -exec wc -l` — reject (counts files not lines) + +## EDGE CASES + +### `find . -empty` for "find empty directories" +REJECT. `-empty` matches both empty files and directories. The prompt specifically asks for directories. Need `-type d -empty`. + +### `sips -s format jpeg input.jpg --out output.jpg` (same format in and out) +REJECT. The prompt says "convert to different format." While the command structure is correct, the example converts jpg→jpg. Accept only if input and output formats differ. + +### `sips -s format jpg` (without `--out`) +REJECT. `jpg` is not a valid sips format name (should be `jpeg`). + +### curl POST with different URLs/bodies +ACCEPT if structure is correct: has `-X POST`, has `-H "Content-Type: application/json"`, has `-d`. Different URLs and body content are just placeholder variations. + +### rsync with `--delete` +REJECT. Adding `--delete` removes files at destination that don't exist at source. That's a meaningfully different and potentially destructive operation. + +### `caffeinate -t 3600` for "prevent mac from sleeping" +ACCEPT. Keeps awake for 1 hour — reasonable interpretation. + +### `env | grep PATH` vs `export PATH` +ACCEPT both. Different mechanisms but both show PATH. diff --git a/benchmark/alternates.json b/benchmark/alternates.json index 1319f7d..f84dd53 100644 --- a/benchmark/alternates.json +++ b/benchmark/alternates.json @@ -111,7 +111,8 @@ "wc -l **/*.py", "find . -name '*.py' -exec wc -l {} \\;", "find . -name \"*.py\" -exec wc -l {} +", - "find . -name \"*.py\" | xargs wc -l" + "find . -name \"*.py\" | xargs wc -l", + "wc -l *.py" ], "22": [ "du -sh * | sort -hr", @@ -154,7 +155,8 @@ "tar -czf folder.tar.gz folder", "tar -czvf folder.tar.gz folder", "tar -czf path/to/compressed.tar.gz path/to/folder", - "tar -czvf /path/to/output.tar.gz /path/to/folder" + "tar -czvf /path/to/output.tar.gz /path/to/folder", + "tar -czf file.tar.gz ." 
], "28": [ "tar xzf file.tar.gz", @@ -191,14 +193,16 @@ "33": [ "git log origin/main..HEAD", "git log origin/master..HEAD", - "git log --oneline origin/main..HEAD" + "git log --oneline origin/main..HEAD", + "git log origin/main..HEAD --oneline" ], "34": [ "git branch -m old new", "git branch -m oldname newname", "git branch -m ", "git branch -m old_branch_name new_branch_name", - "git branch -m new_branch_name" + "git branch -m new_branch_name", + "git branch -m branch_name1 branch_name2" ], "35": [ "git branch --merged | grep -v main | xargs git branch -d", @@ -255,7 +259,9 @@ ], "42": [ "tail -n 50 file", - "tail -50 file" + "tail -50 file", + "tail -50 filename", + "tail -n 50 filename" ], "43": [ "ls | wc -l", @@ -291,7 +297,9 @@ "mkdir -p \"path/to/directory\"", "mkdir -p parent_directory_path", "mkdir -p /path/to/directory", - "mkdir -p path/to/directory" + "mkdir -p path/to/directory", + "mkdir -p directory", + "mkdir -p directory_name" ], "49": [ "chmod +x file", @@ -480,13 +488,15 @@ "git cherry-pick commit", "git cherry-pick ", "git cherry-pick HEAD~1", - "git cherry-pick HEAD^" + "git cherry-pick HEAD^", + "git cherry-pick commit-hash" ], "85": [ "ls -lh file", "ls -lh", "du -sh file", - "du -h file" + "du -h file", + "du -hs filename" ], "86": [ "find . -perm 777", @@ -547,7 +557,8 @@ "zip -r file.zip .", "zip -r file.zip ./", "zip -r mydir.zip ./mydir", - "zip -r myfile.zip mydir" + "zip -r myfile.zip mydir", + "zip -r archive.zip directory_to_zip" ], "94": [ "unzip file.zip -d directory", diff --git a/benchmark/run.py b/benchmark/run.py index 3bed7bd..6a1da41 100755 --- a/benchmark/run.py +++ b/benchmark/run.py @@ -618,6 +618,21 @@ def approach_qlora_override_retrieval(prompt): approach_qlora_override_retrieval._batch_args = ["--adapter", QLORA_OVERRIDE_ADAPTER_PATH] +QLORA_MPS_ADAPTER_PATH = str(Path(__file__).parent.parent / "training" / "qlora-checkpoints" / "hunch_qlora_mps.fmadapter") + + +def approach_qlora_mps_only(prompt): + """QLoRA trained on Mac (MPS), no retrieval.""" + return _run_hunch(prompt, ["--adapter", QLORA_MPS_ADAPTER_PATH, "--limit", "0"]) +approach_qlora_mps_only._batch_args = ["--adapter", QLORA_MPS_ADAPTER_PATH, "--limit", "0"] + + +def approach_qlora_mps_retrieval(prompt): + """QLoRA trained on Mac (MPS) + retrieval.""" + return _run_hunch(prompt, ["--adapter", QLORA_MPS_ADAPTER_PATH]) +approach_qlora_mps_retrieval._batch_args = ["--adapter", QLORA_MPS_ADAPTER_PATH] + + def approach_dynshot_tldr(prompt): """Dynamic few-shot using tldr+overrides FTS5 index (21k entries).""" import sqlite3 @@ -698,6 +713,8 @@ def approach_dynshot_holdout(prompt): "qlora-retrieval": approach_qlora_retrieval, "qlora-override-only": approach_qlora_override_only, "qlora-override-retrieval": approach_qlora_override_retrieval, + "qlora-mps-only": approach_qlora_mps_only, + "qlora-mps-retrieval": approach_qlora_mps_retrieval, "lora-override-only": _make_batch_approach(["--adapter", LORA_OVERRIDE_ADAPTER_PATH, "--limit", "0"]), "lora-override-retrieval": _make_batch_approach(["--adapter", LORA_OVERRIDE_ADAPTER_PATH]), "hunch-sc": approach_hunch_sc, diff --git a/training/TRAINING.md b/training/TRAINING.md index 5fc782a..c31d762 100644 --- a/training/TRAINING.md +++ b/training/TRAINING.md @@ -6,16 +6,18 @@ How to train a LoRA adapter for Apple's on-device 3B Foundation Model using the 1. **Apple Developer Program** ($99/year) — needed to download the training toolkit 2. 
**Adapter training toolkit** — download from [developer.apple.com/apple-intelligence/foundation-models-adapter/](https://developer.apple.com/apple-intelligence/foundation-models-adapter/) -3. **Google account** — for Colab (free tier works for fp16 LoRA, Pro/pay-as-you-go needed for standard LoRA) +3. **Google account** — for Colab (free tier works for QLoRA and fp16 LoRA) ## Files ``` training/ -├── train_lora.ipynb # LoRA training notebook (needs A100) +├── train_lora.ipynb # LoRA training notebook (needs A100) ├── train_lora_fp16.ipynb # fp16 LoRA training notebook (works on free T4) -├── prepare_data.py # Converts hunch bank → training JSONL -└── README.md # Full experiment writeup and results +├── train_qlora.ipynb # QLoRA training notebook (works on free T4, recommended) +├── train_qlora_full.py # QLoRA training script (T4 or Mac) +├── prepare_data.py # Converts hunch bank → training JSONL +└── README.md # Full experiment writeup and results ``` ## Quick Start @@ -32,34 +34,73 @@ training/adapter_training_toolkit_v26_0_0/ └── requirements.txt ``` -### 2. Upload to Google Drive +### 2. Choose your path -Create `My Drive/hunch-training/` and upload: +| Path | GPU | Cost | VRAM | Time (overrides) | Time (full bank) | +|------|-----|------|------|------------------|------------------| +| **QLoRA on Mac** | Apple Silicon | **Free, local** | **3.4GB** | **~34 min** | ~hours | +| QLoRA on Colab | T4 16GB | Free | ~5GB | ~5 min | ~1.7 hours | +| fp16 LoRA on Colab | T4 16GB | Free | ~8.5GB | ~10 min | ~2 hours | +| LoRA on Colab | A100 40GB | Colab Pro ($10/mo) | ~15GB | ~5 min | ~1.5 hours | + +**QLoRA is recommended.** Same adapter quality as full LoRA, lowest memory, fewest patches. Mac training is ~7x slower than T4 but fully local. + +### Path A: Train on Mac (recommended for small datasets) + +```bash +cd training/adapter_training_toolkit_v26_0_0 +source venv/bin/activate + +# Install native Metal kernel support for bitsandbytes +pip install kernels +pip install --force-reinstall git+https://github.com/bitsandbytes-foundation/bitsandbytes.git + +# Prepare data and train +cd .. +python3 prepare_data.py --sources override +python3 train_qlora_full.py --epochs 20 --batch-size 8 + +# Export (needs separate Python 3.12 env — coremltools doesn't support PyTorch 2.11) +cd adapter_training_toolkit_v26_0_0 +python3.12 -m venv export-env +source export-env/bin/activate +pip install torch==2.5.0 coremltools==8.3.0 +python3 -m export.export_fmadapter \ + --adapter-name hunch_qlora \ + --checkpoint ../qlora-checkpoints/adapter-final.pt \ + --output-dir ../qlora-checkpoints/ +``` + +Notes: +- Requires bitsandbytes from git main (pre-v0.50.0) with native MPS kernels (PR #1875) +- The `kernels` package downloads pre-compiled Metal shaders from HuggingFace Hub at runtime +- Don't use `bnb_4bit_use_double_quant=True` — not wired for MPS yet +- ~34 min for 20 epochs on M3/M4, 3.4GB GPU, 0.2GB RAM + +### Path B: Train on Colab + +Upload to Google Drive: ``` -hunch-training/ +My Drive/hunch-training/ ├── adapter_training_toolkit_v26_0_0/ # The extracted toolkit ├── prepare_data.py # From this directory +├── train_qlora_full.py # From this directory (for QLoRA) ├── tldr_bank.db # From ../bank/ └── prompts.jsonl # From ../benchmark/ ``` -### 3. 
Choose your notebook +Choose a notebook: -| Notebook | GPU | Cost | Time | Patches needed | -|----------|-----|------|------|----------------| -| `train_lora.ipynb` | A100 40GB | Colab Pro ($10/mo) | ~1.5 hours | None | -| `train_lora_fp16.ipynb` | T4 16GB | Free | ~2 hours | 3 patches (applied automatically) | +| Notebook | GPU | Patches | +|----------|-----|---------| +| `train_qlora.ipynb` | T4 16GB (free) | 1 (rms_norm) | +| `train_lora_fp16.ipynb` | T4 16GB (free) | 3 (mmap, grad scaling, rms_norm) | +| `train_lora.ipynb` | A100 40GB (Pro) | None | -### 4. Open in Colab +Open in Colab via the VS Code extension or upload directly to [colab.research.google.com](https://colab.research.google.com). Run cells in order. -Install the Google Colab extension in VS Code, open the notebook, select a Colab kernel with the appropriate GPU, and run the cells in order. - -Alternatively, upload the notebook to [colab.research.google.com](https://colab.research.google.com) directly. - -### 5. Test on-device - -Download the exported `.fmadapter` from Google Drive and test with hunch: +### 3. Test on-device ```bash hunch --adapter path/to/hunch.fmadapter "find files changed in the last hour" @@ -70,8 +111,11 @@ hunch --adapter path/to/hunch.fmadapter "find files changed in the last hour" `prepare_data.py` converts the hunch bank into training JSONL: ```bash -python3 prepare_data.py # generates train.jsonl + eval.jsonl -python3 prepare_data.py --stats # show dataset statistics +python3 prepare_data.py # full bank (~19k train / ~3k eval) +python3 prepare_data.py --sources override # overrides only (~96 examples, recommended) +python3 prepare_data.py --sources tldr-osx # macOS-specific tldr pages (~1k) +python3 prepare_data.py --sources override,tldr-osx # overrides + macOS (~1.1k) +python3 prepare_data.py --stats # show dataset statistics ``` Each training example: @@ -83,13 +127,37 @@ Each training example: ] ``` -- ~19k training / ~3k eval examples - Benchmark prompts excluded to avoid data leakage - Override and tldr-osx entries appear in both splits -## fp16 LoRA Patches Explained +**Use `--sources override` for best results.** Adapters trained on ~96 curated overrides (~5 min on T4) significantly outperform adapters trained on the full 19k bank (~1.7 hours on T4). Quality over quantity — see README.md for benchmark results. + +## How Each Approach Works + +### QLoRA (recommended) + +Quantizes the frozen base model to 4-bit NF4 via `bitsandbytes`, and uses `mmap=True` loading to avoid the 12GB CPU RAM spike. Only `nn.Linear` layers are quantized (attention Q/K/V/O, FFN — ~90% of params). Embeddings, norms, and other layers stay in fp16. Adapters train in fp32. + +Memory breakdown: +- CPU RAM peak: **~1GB** (mmap reads weights from disk on demand) +- Base model Linear layers: ~1.5GB (NF4) +- Base model non-Linear: ~0.65GB (fp16) +- Adapters + gradients + optimizer: ~0.6GB (fp32) +- Activations: ~2-3GB +- **GPU total: ~5GB** + +Only one patch needed: rms_norm dtype fix for mixed fp16/fp32/quantized tensors through norm layers. + +### fp16 LoRA -The fp16 LoRA notebook (`train_lora_fp16.ipynb`) applies three patches to Apple's toolkit to fit training on a T4 (16GB GPU, 12GB system RAM): +Forces the base model to fp16 and uses `mmap=True` loading. Both changes are patches to Apple's toolkit — the default loads fp32 without mmap, which requires ~24GB CPU RAM and 12GB GPU. Requires three patches total. 
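+A minimal sketch of what the fp16 + mmap loading change amounts to (illustrative only — the real change edits the toolkit's `examples/utils.py`, whose exact structure may differ; `model` is assumed to already exist via `config.create_model()`):
+
+```python
+import torch
+
+# mmap=True reads the 12GB checkpoint lazily from disk instead of
+# materializing the whole state dict in CPU RAM (~1GB peak vs ~24GB)
+state_dict = torch.load(
+    "assets/base-model.pt", map_location="cpu", mmap=True, weights_only=False
+)
+model.load_state_dict(state_dict, strict=True)
+del state_dict
+
+# Frozen base weights go to fp16 (~6GB on GPU); trainable LoRA adapter
+# weights stay in fp32 so gradient scaling remains numerically stable
+for name, param in model.named_parameters():
+    if "adapter" in name:
+        param.requires_grad = True
+        param.data = param.data.float()
+    else:
+        param.requires_grad = False
+        param.data = param.data.half()
+```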
+ +Memory breakdown: +- CPU RAM peak: **~1GB** (mmap, vs ~24GB without) +- Base model: ~6GB (fp16, vs ~12GB fp32) +- Adapters + gradients + optimizer: ~0.6GB (fp32) +- Activations: ~2-3GB +- **GPU total: ~8.5GB** **Patch 1 — `utils.py`: mmap + fp16 model + fp32 adapters** - `mmap=True` on `torch.load`: reads weights from disk on demand instead of loading 12GB into RAM @@ -108,21 +176,39 @@ The fp16 LoRA notebook (`train_lora_fp16.ipynb`) applies three patches to Apple' All patches are applied automatically by the notebook. To restore originals, re-copy from the toolkit on Drive. +### Standard LoRA + +Loads the base model in fp32. No patches needed but requires an A100 (40GB) — doesn't fit on a T4. + +Memory breakdown: +- CPU RAM peak: **~24GB** during loading (12GB model + 12GB state dict simultaneously — no mmap) +- Base model on GPU: ~12GB (fp32) +- Adapters + gradients + optimizer: ~0.6GB (fp32) +- Activations: ~2-3GB (fp32) +- **GPU total: ~15GB** + +The CPU RAM spike is why standard LoRA OOMs on a 24GB Mac and on T4 (12GB system RAM). The A100's 80GB system RAM hides this. fp16 LoRA and QLoRA avoid this with `mmap=True` loading (~1GB RAM peak instead of 24GB). + ## Export The export step packages the LoRA weights into a `.fmadapter` file that can be loaded on-device: ```bash +cd adapter_training_toolkit_v26_0_0 python3 -m export.export_fmadapter \ --adapter-name hunch \ --checkpoint ../checkpoints/adapter-final.pt \ --output-dir ../exports/ ``` +**Note for Mac training:** The training venv has PyTorch 2.11 (from bitsandbytes main) which is too new for coremltools. Export in a separate Python 3.12 environment — see Path A in Quick Start above. + Output is ~127MB. The adapter name can only contain letters, numbers, and underscores. **Do not modify the export code** — the `.fmadapter` format must match exactly for on-device compatibility. +The `.fmadapter` format doesn't record training precision — adapters trained via QLoRA, fp16 LoRA, or fp32 LoRA all export identically and load the same on-device. + ## Loading in Swift ```swift @@ -136,19 +222,44 @@ No entitlement needed for local testing. Entitlement required only for App Store ## Key Training Parameters -| Parameter | LoRA (A100) | fp16 LoRA (T4) | -|-----------|-------------|------------| -| `--precision` | bf16-mixed | f16-mixed | +| Parameter | Override-only (recommended) | Full bank | +|-----------|---------------------------|-----------| | `--batch-size` | 8 | 8 | | `--learning-rate` | 1e-4 | 1e-4 | -| `--epochs` | 3 | 3 | -| `--activation-checkpointing` | yes | yes | +| `--epochs` | 20 | 3 | +| `--sources` (prepare_data.py) | `override` | (default) | + +These apply to all three approaches (LoRA, fp16 LoRA, QLoRA). Override-only trains on ~96 examples and needs more epochs to converge. Full bank has ~19k examples and overfits after 3. + +## On-Device Accuracy -**Note:** lr=1e-3 diverges. Always use 1e-4. +All three approaches produce comparable adapters. QLoRA is recommended — same quality, lowest cost. + +| Approach | + Retrieval | Standalone | Trained on | +|---|---|---|---| +| LoRA (A100) | ~85% | ~72.5% | T4/A100 | +| QLoRA (T4) | ~83% | ~73% | T4 free | +| QLoRA (Mac) | ~78.5% | ~72% | Local | +| Retrieval only | ~79% | — | — | +| Bare model | — | ~41% | — | + +Full benchmark details and analysis in README.md. + +## Known Issues + +### Adapter disk space leak + +`TGOnDeviceInferenceProviderService` caches a full copy of the adapter (~160MB) in a SIP-protected directory on every process invocation. 
The copies are never cleaned up. Running benchmarks (hundreds of adapter calls) can consume tens of GB invisibly. + +**Workaround:** Use `hunch --batch` to run multiple prompts in a single process (1 cached copy instead of 1 per prompt). To reclaim space, boot Recovery Mode and delete `/Volumes/Data/private/var/db/AppleIntelligencePlatform/AppModelAssets/*`. + +See `adapter-disk-leak-findings.md` for the full investigation. ## Troubleshooting -**OOM on T4:** Make sure all three fp16 LoRA patches are applied. Run the patch cell before training. +**OOM on T4 (QLoRA):** Make sure `bitsandbytes` is installed and the model is being quantized. Check for "Quantized 280 layers to NF4" in the output. + +**OOM on T4 (fp16 LoRA):** Make sure all three patches are applied. Run the patch cell before training. **loss = NaN:** The rms_norm patch didn't apply, or the pycache is stale. The notebook clears pycache automatically, but if you see NaN, restart the kernel and re-run from the patch cell. diff --git a/training/train_lora.ipynb b/training/train_lora.ipynb index 94371ad..d0b8702 100644 --- a/training/train_lora.ipynb +++ b/training/train_lora.ipynb @@ -310,90 +310,6 @@ "!cp -r {WORK_DIR}/lora-override-exports {DRIVE_DIR}/lora-override-exports\n", "!echo 'Adapter exported and saved to Drive'" ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 6. Evaluate" - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "No output. Check error:\n", - "STDERR: ^^^^^^^\n", - " File \"/content/hunch-training/adapter_training_toolkit_v26_0_0/examples/utils.py\", line 167, in load_base_model\n", - " with Path(checkpoint_path).open(\"rb\") as f:\n", - " ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n", - " File \"/usr/lib/python3.12/pathlib.py\", line 1013, in open\n", - " return io.open(self, mode, buffering, encoding, errors, newline)\n", - " ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n", - "FileNotFoundError: [Errno 2] No such file or directory: '../checkpoints/adapter-final.pt'\n", - "\n", - "Return code: 1\n" - ] - } - ], - "source": [ - "import json, subprocess\n", - "\n", - "test_prompts = [\n", - " 'find files changed in the last hour',\n", - " 'show disk usage',\n", - " 'generate a random password',\n", - " 'kill a process by name',\n", - " 'show http headers of a url',\n", - " 'record terminal session',\n", - " 'find files larger than 100mb',\n", - " 'convert image to different format',\n", - " 'show all listening ports',\n", - " 'find files modified in the last 7 days',\n", - " 'find files owned by root',\n", - " 'count lines in all python files',\n", - " 'show all environment variables',\n", - " 'clear the terminal',\n", - " 'compare two files',\n", - "]\n", - "\n", - "system = 'Output a single shell command for zsh on macOS. No explanation, no markdown, no backticks. 
Just the command.'\n", - "\n", - "with open(f'{WORK_DIR}/test_prompts.jsonl', 'w') as f:\n", - " for p in test_prompts:\n", - " f.write(json.dumps([\n", - " {'role': 'system', 'content': system},\n", - " {'role': 'user', 'content': p}\n", - " ]) + '\\n')\n", - "\n", - "result = subprocess.run(\n", - " ['python3', '-m', 'examples.generate',\n", - " '--prompt', '../test_prompts.jsonl',\n", - " '--checkpoint', '../checkpoints/adapter-final.pt',\n", - " '--precision', 'bf16-mixed'],\n", - " capture_output=True, text=True,\n", - " cwd=f'{WORK_DIR}/adapter_training_toolkit_v26_0_0'\n", - ")\n", - "\n", - "lines = (result.stdout + result.stderr).strip().split('\\n')\n", - "idx = 0\n", - "for line in lines:\n", - " if 'Response for prompt' in line:\n", - " answer = line.split(': ', 2)[-1].replace('', '').strip()\n", - " prompt = test_prompts[idx] if idx < len(test_prompts) else '?'\n", - " print(f'Q: {prompt:<45} A: {answer}')\n", - " idx += 1\n", - "\n", - "if idx == 0:\n", - " print('No output. Check error:')\n", - " print('STDERR:', result.stderr[-500:])\n", - " print('Return code:', result.returncode)" - ] } ], "metadata": { diff --git a/training/train_qlora.ipynb b/training/train_qlora.ipynb index ee9d0e5..570dfbd 100644 --- a/training/train_qlora.ipynb +++ b/training/train_qlora.ipynb @@ -277,83 +277,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## 5. Evaluate" - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "No output. Check error:\n", - "STDERR: \n", - "Return code: -9\n" - ] - } - ], - "source": [ - "import json, subprocess\n", - "\n", - "test_prompts = [\n", - " 'find files changed in the last hour',\n", - " 'show disk usage',\n", - " 'generate a random password',\n", - " 'kill a process by name',\n", - " 'show http headers of a url',\n", - " 'record terminal session',\n", - " 'find files larger than 100mb',\n", - " 'convert image to different format',\n", - " 'show all listening ports',\n", - " 'find files modified in the last 7 days',\n", - " 'find files owned by root',\n", - " 'count lines in all python files',\n", - " 'show all environment variables',\n", - " 'clear the terminal',\n", - " 'compare two files',\n", - "]\n", - "\n", - "system = 'Output a single shell command for zsh on macOS. No explanation, no markdown, no backticks. Just the command.'\n", - "\n", - "with open(f'{WORK_DIR}/test_prompts.jsonl', 'w') as f:\n", - " for p in test_prompts:\n", - " f.write(json.dumps([\n", - " {'role': 'system', 'content': system},\n", - " {'role': 'user', 'content': p}\n", - " ]) + '\\n')\n", - "\n", - "result = subprocess.run(\n", - " ['python3', '-m', 'examples.generate',\n", - " '--prompt', '../test_prompts.jsonl',\n", - " '--checkpoint', '../qlora-checkpoints/adapter-final.pt',\n", - " '--precision', 'f16-mixed'],\n", - " capture_output=True, text=True,\n", - " cwd=f'{WORK_DIR}/adapter_training_toolkit_v26_0_0'\n", - ")\n", - "\n", - "lines = (result.stdout + result.stderr).strip().split('\\n')\n", - "idx = 0\n", - "for line in lines:\n", - " if 'Response for prompt' in line:\n", - " answer = line.split(': ', 2)[-1].replace('', '').strip()\n", - " prompt = test_prompts[idx] if idx < len(test_prompts) else '?'\n", - " print(f'Q: {prompt:<45} A: {answer}')\n", - " idx += 1\n", - "\n", - "if idx == 0:\n", - " print('No output. 
Check error:')\n", - " print('STDERR:', result.stderr[-500:])\n", - " print('Return code:', result.returncode)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 6. Export .fmadapter" + "## 5. Export .fmadapter" ] }, { diff --git a/training/train_qlora_full.py b/training/train_qlora_full.py index ddf6187..975020d 100644 --- a/training/train_qlora_full.py +++ b/training/train_qlora_full.py @@ -134,7 +134,7 @@ def load_model_qlora(device): # Move to device model = model.to(device) - # Ensure adapter params are fp32 + # Keep adapters in fp32 for stable training with gradient scaling for name, param in model.named_parameters(): if param.requires_grad and param.dtype != torch.float32: param.data = param.data.float() @@ -270,15 +270,9 @@ def evaluate(model, dataloader, device): def save_adapter_checkpoint(model, path, optimizer=None, epoch=None): - """Save adapter weights and optionally optimizer state for resume.""" - checkpoint = { - "adapter_weights": {k: v.cpu() for k, v in model.state_dict().items() if "adapter" in k}, - } - if optimizer: - checkpoint["optimizer"] = optimizer.state_dict() - if epoch is not None: - checkpoint["epoch"] = epoch - torch.save(checkpoint, path) + """Save adapter weights as flat state dict (compatible with export_fmadapter).""" + adapter_sd = {k: v.cpu() for k, v in model.state_dict().items() if "adapter" in k} + torch.save(adapter_sd, path) size_mb = os.path.getsize(path) / 1024**2 print(f"Saved checkpoint ({size_mb:.0f}MB) to {path}") @@ -338,8 +332,8 @@ def main(): weight_decay=0.01 ) - # Gradient scaler for mixed precision on CUDA - scaler = torch.amp.GradScaler() if torch.cuda.is_available() else None + # Gradient scaler for mixed precision + scaler = torch.amp.GradScaler(device=str(device)) if (torch.cuda.is_available() or torch.backends.mps.is_available()) else None # Checkpoint dir os.makedirs(args.checkpoint_dir, exist_ok=True) From 1870e00db85cffecb3dac705f90565906b48318b Mon Sep 17 00:00:00 2001 From: Enrico Santagati Date: Fri, 17 Apr 2026 20:06:24 -0400 Subject: [PATCH 07/11] Clarify TRAINING.md: export env explanation, adapter size, Mac timing caveat --- training/TRAINING.md | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/training/TRAINING.md b/training/TRAINING.md index c31d762..0a6fcf7 100644 --- a/training/TRAINING.md +++ b/training/TRAINING.md @@ -60,7 +60,9 @@ cd .. python3 prepare_data.py --sources override python3 train_qlora_full.py --epochs 20 --batch-size 8 -# Export (needs separate Python 3.12 env — coremltools doesn't support PyTorch 2.11) +# Export — bitsandbytes from main pulls in PyTorch 2.11, but coremltools 8.3.0 +# ships native C extensions only for Python ≤3.13 and PyTorch ≤2.5. +# Create a separate env with compatible versions: cd adapter_training_toolkit_v26_0_0 python3.12 -m venv export-env source export-env/bin/activate @@ -75,7 +77,7 @@ Notes: - Requires bitsandbytes from git main (pre-v0.50.0) with native MPS kernels (PR #1875) - The `kernels` package downloads pre-compiled Metal shaders from HuggingFace Hub at runtime - Don't use `bnb_4bit_use_double_quant=True` — not wired for MPS yet -- ~34 min for 20 epochs on M3/M4, 3.4GB GPU, 0.2GB RAM +- ~34 min for 20 epochs of 96 examples on M4, 3.4GB GPU, 0.2GB RAM. Full bank (~19k) would take hours ### Path B: Train on Colab @@ -203,7 +205,7 @@ python3 -m export.export_fmadapter \ **Note for Mac training:** The training venv has PyTorch 2.11 (from bitsandbytes main) which is too new for coremltools. 
Export in a separate Python 3.12 environment — see Path A in Quick Start above. -Output is ~127MB. The adapter name can only contain letters, numbers, and underscores. +Output is ~130MB. The adapter name can only contain letters, numbers, and underscores. **Do not modify the export code** — the `.fmadapter` format must match exactly for on-device compatibility. From 6f6b47821c7bb9843e7d559fca65618f5547330b Mon Sep 17 00:00:00 2001 From: Enrico Santagati Date: Fri, 17 Apr 2026 22:28:12 -0400 Subject: [PATCH 08/11] Fix label masking in QLoRA training and update benchmarks MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Label masking was the main accuracy issue: the training loop computed loss over prompt tokens, wasting adapter capacity. Now only assistant response tokens contribute to the loss. This closed the MPS vs T4 gap entirely — Mac-trained adapters now match T4 quality (~86% with retrieval). Also: flat checkpoint format, conditional compress_statistics for MPS, batch_size default 8, better logging granularity. --- benchmark/alternates.json | 29 +++++-- training/TRAINING.md | 2 +- training/train_lora.ipynb | 4 +- training/train_qlora.ipynb | 156 ++++++++++++++--------------------- training/train_qlora_full.py | 67 ++++++++------- 5 files changed, 124 insertions(+), 134 deletions(-) diff --git a/benchmark/alternates.json b/benchmark/alternates.json index f84dd53..5079c1d 100644 --- a/benchmark/alternates.json +++ b/benchmark/alternates.json @@ -156,7 +156,8 @@ "tar -czvf folder.tar.gz folder", "tar -czf path/to/compressed.tar.gz path/to/folder", "tar -czvf /path/to/output.tar.gz /path/to/folder", - "tar -czf file.tar.gz ." + "tar -czf file.tar.gz .", + "tar -czf archive.tar.gz folder" ], "28": [ "tar xzf file.tar.gz", @@ -171,7 +172,8 @@ "tar -xvzf filename.tar.gz", "tar -xzvf file.tar.gz", "tar -zxvf archive.tar.gz", - "tar -zxvf file.tar.gz" + "tar -zxvf file.tar.gz", + "tar -xvf file.tar.gz" ], "29": [ "git branch --sort=-committerdate", @@ -228,7 +230,8 @@ "curl -O https://example.com/file.txt", "curl -O https://example.com/file.zip", "curl -o file.zip https://example.com/file.zip", - "wget https://example.com/file.pdf" + "wget https://example.com/file.pdf", + "wget https://example.com/file.zip -O file.zip" ], "39": [ "curl -I https://example.com", @@ -389,7 +392,8 @@ "65": [ "caffeinate", "caffeinate -d", - "caffeinate -t 3600" + "caffeinate -t 3600", + "caffeinate -t 86400" ], "66": [ "say hello", @@ -520,7 +524,9 @@ "ssh user@hostname -i path/to/key_file.pem", "ssh user@server -i path/to/key.pem", "ssh user@server -i path/to/key_file.pem", - "ssh user@server.example.com -i ~/.ssh/id_rsa" + "ssh user@server.example.com -i ~/.ssh/id_rsa", + "ssh user@host -i path/to/key", + "ssh user@host -i path/to/key_file.pem" ], "89": [ "rsync -avz src/ user@host:dst/", @@ -530,7 +536,9 @@ "rsync -avz source_directory remote_server", "rsync -avz . remote_server:destination_directory", "rsync -avz --progress source_directory remote_server", - "rsync -avz --progress src_dir remote_server" + "rsync -avz --progress src_dir remote_server", + "rsync -avz /path/to/local/directory remote_server:destination_directory", + "rsync -avz /path/to/local/directory remote_server:path/to/remote/directory" ], "90": [ "crontab -l" @@ -539,7 +547,8 @@ "grep -ri pattern .", "grep -ri 'pattern' .", "grep -rni pattern .", - "find . -type f -exec grep -ri 'pattern' {} +" + "find . -type f -exec grep -ri 'pattern' {} +", + "find . 
-type f -exec grep -ri 'pattern' +" ], "92": [ "wc file", @@ -558,7 +567,8 @@ "zip -r file.zip ./", "zip -r mydir.zip ./mydir", "zip -r myfile.zip mydir", - "zip -r archive.zip directory_to_zip" + "zip -r archive.zip directory_to_zip", + "zip -r file.zip directory" ], "94": [ "unzip file.zip -d directory", @@ -574,7 +584,8 @@ "unzip file.zip -d /path/to/unzip", "unzip file.zip -d destination", "unzip file.zip -d destination_directory", - "unzip file.zip -d /path/to/unzipped" + "unzip file.zip -d /path/to/unzipped", + "unzip file.zip -d destination/" ], "95": [ "system_profiler SPHardwareDataType" diff --git a/training/TRAINING.md b/training/TRAINING.md index 0a6fcf7..37c3d82 100644 --- a/training/TRAINING.md +++ b/training/TRAINING.md @@ -41,7 +41,7 @@ training/adapter_training_toolkit_v26_0_0/ | **QLoRA on Mac** | Apple Silicon | **Free, local** | **3.4GB** | **~34 min** | ~hours | | QLoRA on Colab | T4 16GB | Free | ~5GB | ~5 min | ~1.7 hours | | fp16 LoRA on Colab | T4 16GB | Free | ~8.5GB | ~10 min | ~2 hours | -| LoRA on Colab | A100 40GB | Colab Pro ($10/mo) | ~15GB | ~5 min | ~1.5 hours | +| LoRA on Colab | A100 40GB | Colab Pro ($10/mo) | ~15GB | ~5 min | ~2.5 hours | **QLoRA is recommended.** Same adapter quality as full LoRA, lowest memory, fewest patches. Mac training is ~7x slower than T4 but fully local. diff --git a/training/train_lora.ipynb b/training/train_lora.ipynb index d0b8702..2291e72 100644 --- a/training/train_lora.ipynb +++ b/training/train_lora.ipynb @@ -7,9 +7,7 @@ "# LoRA: Training Apple's 3B Model on A100\n", "\n", "Standard LoRA training using Apple's adapter toolkit. Requires A100 (40GB GPU).\n", - "For free T4 training, see `train_qlora.ipynb`.\n", - "\n", - "**Results:** 3 epochs, ~2.5 hours. Loss: 1.94 → 1.17 → 0.61 (train), eval: 1.50 → 1.12 → 1.10" + "For free T4 training, see `train_qlora.ipynb`." ] }, { diff --git a/training/train_qlora.ipynb b/training/train_qlora.ipynb index 570dfbd..1f6b5c9 100644 --- a/training/train_qlora.ipynb +++ b/training/train_qlora.ipynb @@ -7,7 +7,7 @@ "# True QLoRA: Training Apple's 3B Model with 4-bit NF4\n", "\n", "Uses bitsandbytes NF4 quantization on the frozen base model.\n", - "Only ~5GB GPU memory — fits on free T4 with massive headroom.\n", + "Only ~5GB GPU memory — fits on free T4 with headroom.\n", "\n", "This is proper QLoRA as defined by [Dettmers et al. 2023](https://arxiv.org/abs/2305.14314):\n", "4-bit quantized base + fp32 LoRA adapters." 
@@ -34,13 +34,13 @@ "output_type": "stream", "text": [ "Mounted at /content/drive\n", - "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m2.3/2.3 MB\u001b[0m \u001b[31m62.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m:00:01\u001b[0m\n", - "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m362.6/362.6 kB\u001b[0m \u001b[31m33.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", - "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m73.1/73.1 kB\u001b[0m \u001b[31m8.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", - "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m46.0/46.0 kB\u001b[0m \u001b[31m4.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", - "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m86.8/86.8 kB\u001b[0m \u001b[31m9.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", - "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.6/1.6 MB\u001b[0m \u001b[31m84.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", - "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m60.7/60.7 MB\u001b[0m \u001b[31m12.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m:00:01\u001b[0m00:01\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m2.3/2.3 MB\u001b[0m \u001b[31m19.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m362.6/362.6 kB\u001b[0m \u001b[31m39.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m73.1/73.1 kB\u001b[0m \u001b[31m8.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m46.0/46.0 kB\u001b[0m \u001b[31m5.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m86.8/86.8 kB\u001b[0m \u001b[31m11.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.6/1.6 MB\u001b[0m \u001b[31m58.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m60.7/60.7 MB\u001b[0m \u001b[31m12.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m:00:01\u001b[0m00:01\u001b[0m\n", "\u001b[?25hCUDA: True\n", "GPU: Tesla T4\n" ] @@ -70,15 +70,6 @@ " print(f'GPU: {torch.cuda.get_device_name(0)}')" ] }, - { - "cell_type": "code", - "execution_count": 10, - "metadata": {}, - "outputs": [], - "source": [ - "!cp {DRIVE_DIR}/prepare_data.py {WORK_DIR}/" - ] - }, { "cell_type": "markdown", "metadata": {}, @@ -88,7 +79,7 @@ }, { "cell_type": "code", - "execution_count": 11, + "execution_count": 2, "metadata": {}, "outputs": [ { @@ -135,7 +126,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 3, "metadata": {}, "outputs": [ { @@ -155,88 +146,88 @@ "Epoch 1/20\n", "/usr/local/lib/python3.12/dist-packages/torch/nn/functional.py:2954: UserWarning: Mismatch dtype between input and weight: input dtype = float, weight dtype = c10::Half, Cannot dispatch to fused implementation. 
(Triggered internally at /pytorch/aten/src/ATen/native/layer_norm.cpp:344.)\n", " return torch.rms_norm(input, normalized_shape, weight, eps)\n", - "Saved 700 adapter weights (254MB) to qlora-override-checkpoints//adapter-epoch1.pt\n", - " Train loss: 3.4408 | Eval loss: 1.6959 | RAM=1.5GB GPU=3.0GB\n", + "Saved checkpoint (254MB) to qlora-override-checkpoints/adapter-epoch1.pt\n", + " Train loss: 1.4963 | Eval loss: 0.7162 | RAM=1.6GB GPU=3.0GB\n", "\n", "Epoch 2/20\n", - "Saved 700 adapter weights (254MB) to qlora-override-checkpoints//adapter-epoch2.pt\n", - " Train loss: 0.8634 | Eval loss: 0.4109 | RAM=1.5GB GPU=3.0GB\n", + "Saved checkpoint (254MB) to qlora-override-checkpoints/adapter-epoch2.pt\n", + " Train loss: 0.5486 | Eval loss: 0.2153 | RAM=1.6GB GPU=3.0GB\n", "\n", "Epoch 3/20\n", - "Saved 700 adapter weights (254MB) to qlora-override-checkpoints//adapter-epoch3.pt\n", - " Train loss: 0.3740 | Eval loss: 0.2714 | RAM=1.5GB GPU=3.0GB\n", + "Saved checkpoint (254MB) to qlora-override-checkpoints/adapter-epoch3.pt\n", + " Train loss: 0.1835 | Eval loss: 0.0547 | RAM=1.6GB GPU=3.0GB\n", "\n", "Epoch 4/20\n", - "Saved 700 adapter weights (254MB) to qlora-override-checkpoints//adapter-epoch4.pt\n", - " Train loss: 0.2466 | Eval loss: 0.1851 | RAM=1.7GB GPU=3.0GB\n", + "Saved checkpoint (254MB) to qlora-override-checkpoints/adapter-epoch4.pt\n", + " Train loss: 0.0840 | Eval loss: 0.0401 | RAM=1.6GB GPU=3.0GB\n", "\n", "Epoch 5/20\n", - "Saved 700 adapter weights (254MB) to qlora-override-checkpoints//adapter-epoch5.pt\n", - " Train loss: 0.1741 | Eval loss: 0.1375 | RAM=1.5GB GPU=3.0GB\n", + "Saved checkpoint (254MB) to qlora-override-checkpoints/adapter-epoch5.pt\n", + " Train loss: 0.0463 | Eval loss: 0.0093 | RAM=1.6GB GPU=3.0GB\n", "\n", "Epoch 6/20\n", - "Saved 700 adapter weights (254MB) to qlora-override-checkpoints//adapter-epoch6.pt\n", - " Train loss: 0.1382 | Eval loss: 0.1082 | RAM=1.5GB GPU=3.0GB\n", + "Saved checkpoint (254MB) to qlora-override-checkpoints/adapter-epoch6.pt\n", + " Train loss: 0.0166 | Eval loss: 0.0046 | RAM=1.6GB GPU=3.0GB\n", "\n", "Epoch 7/20\n", - "Saved 700 adapter weights (254MB) to qlora-override-checkpoints//adapter-epoch7.pt\n", - " Train loss: 0.1167 | Eval loss: 0.1073 | RAM=1.5GB GPU=3.0GB\n", + "Saved checkpoint (254MB) to qlora-override-checkpoints/adapter-epoch7.pt\n", + " Train loss: 0.0043 | Eval loss: 0.0013 | RAM=1.6GB GPU=3.0GB\n", "\n", "Epoch 8/20\n", - "Saved 700 adapter weights (254MB) to qlora-override-checkpoints//adapter-epoch8.pt\n", - " Train loss: 0.1093 | Eval loss: 0.0957 | RAM=1.7GB GPU=3.0GB\n", + "Saved checkpoint (254MB) to qlora-override-checkpoints/adapter-epoch8.pt\n", + " Train loss: 0.0013 | Eval loss: 0.0003 | RAM=1.6GB GPU=3.0GB\n", "\n", "Epoch 9/20\n", - "Saved 700 adapter weights (254MB) to qlora-override-checkpoints//adapter-epoch9.pt\n", - " Train loss: 0.1067 | Eval loss: 0.0935 | RAM=1.5GB GPU=3.0GB\n", + "Saved checkpoint (254MB) to qlora-override-checkpoints/adapter-epoch9.pt\n", + " Train loss: 0.0003 | Eval loss: 0.0001 | RAM=1.6GB GPU=3.0GB\n", "\n", "Epoch 10/20\n", - "Saved 700 adapter weights (254MB) to qlora-override-checkpoints//adapter-epoch10.pt\n", - " Train loss: 0.1007 | Eval loss: 0.0905 | RAM=1.5GB GPU=3.0GB\n", + "Saved checkpoint (254MB) to qlora-override-checkpoints/adapter-epoch10.pt\n", + " Train loss: 0.0001 | Eval loss: 0.0001 | RAM=1.6GB GPU=3.0GB\n", "\n", "Epoch 11/20\n", - "Saved 700 adapter weights (254MB) to qlora-override-checkpoints//adapter-epoch11.pt\n", - " Train 
loss: 0.0995 | Eval loss: 0.0953 | RAM=1.5GB GPU=3.0GB\n", + "Saved checkpoint (254MB) to qlora-override-checkpoints/adapter-epoch11.pt\n", + " Train loss: 0.0001 | Eval loss: 0.0001 | RAM=1.6GB GPU=3.0GB\n", "\n", "Epoch 12/20\n", - "Saved 700 adapter weights (254MB) to qlora-override-checkpoints//adapter-epoch12.pt\n", - " Train loss: 0.1022 | Eval loss: 0.0891 | RAM=1.5GB GPU=3.0GB\n", + "Saved checkpoint (254MB) to qlora-override-checkpoints/adapter-epoch12.pt\n", + " Train loss: 0.0000 | Eval loss: 0.0000 | RAM=1.6GB GPU=3.0GB\n", "\n", "Epoch 13/20\n", - "Saved 700 adapter weights (254MB) to qlora-override-checkpoints//adapter-epoch13.pt\n", - " Train loss: 0.0969 | Eval loss: 0.0886 | RAM=1.5GB GPU=3.0GB\n", + "Saved checkpoint (254MB) to qlora-override-checkpoints/adapter-epoch13.pt\n", + " Train loss: 0.0000 | Eval loss: 0.0000 | RAM=1.6GB GPU=3.0GB\n", "\n", "Epoch 14/20\n", - "Saved 700 adapter weights (254MB) to qlora-override-checkpoints//adapter-epoch14.pt\n", - " Train loss: 0.0951 | Eval loss: 0.0883 | RAM=1.5GB GPU=3.0GB\n", + "Saved checkpoint (254MB) to qlora-override-checkpoints/adapter-epoch14.pt\n", + " Train loss: 0.0000 | Eval loss: 0.0000 | RAM=1.6GB GPU=3.0GB\n", "\n", "Epoch 15/20\n", - "Saved 700 adapter weights (254MB) to qlora-override-checkpoints//adapter-epoch15.pt\n", - " Train loss: 0.0939 | Eval loss: 0.0867 | RAM=1.7GB GPU=3.0GB\n", + "Saved checkpoint (254MB) to qlora-override-checkpoints/adapter-epoch15.pt\n", + " Train loss: 0.0000 | Eval loss: 0.0000 | RAM=1.6GB GPU=3.0GB\n", "\n", "Epoch 16/20\n", - "Saved 700 adapter weights (254MB) to qlora-override-checkpoints//adapter-epoch16.pt\n", - " Train loss: 0.0936 | Eval loss: 0.0867 | RAM=1.5GB GPU=3.0GB\n", + "Saved checkpoint (254MB) to qlora-override-checkpoints/adapter-epoch16.pt\n", + " Train loss: 0.0000 | Eval loss: 0.0000 | RAM=1.6GB GPU=3.0GB\n", "\n", "Epoch 17/20\n", - "Saved 700 adapter weights (254MB) to qlora-override-checkpoints//adapter-epoch17.pt\n", - " Train loss: 0.0933 | Eval loss: 0.0861 | RAM=1.5GB GPU=3.0GB\n", + "Saved checkpoint (254MB) to qlora-override-checkpoints/adapter-epoch17.pt\n", + " Train loss: 0.0000 | Eval loss: 0.0000 | RAM=1.8GB GPU=3.0GB\n", "\n", "Epoch 18/20\n", - "Saved 700 adapter weights (254MB) to qlora-override-checkpoints//adapter-epoch18.pt\n", - " Train loss: 0.0946 | Eval loss: 0.0865 | RAM=1.5GB GPU=3.0GB\n", + "Saved checkpoint (254MB) to qlora-override-checkpoints/adapter-epoch18.pt\n", + " Train loss: 0.0000 | Eval loss: 0.0000 | RAM=1.6GB GPU=3.0GB\n", "\n", "Epoch 19/20\n", - "Saved 700 adapter weights (254MB) to qlora-override-checkpoints//adapter-epoch19.pt\n", - " Train loss: 0.0925 | Eval loss: 0.0866 | RAM=1.7GB GPU=3.0GB\n", + "Saved checkpoint (254MB) to qlora-override-checkpoints/adapter-epoch19.pt\n", + " Train loss: 0.0000 | Eval loss: 0.0000 | RAM=1.6GB GPU=3.0GB\n", "\n", "Epoch 20/20\n", - "Saved 700 adapter weights (254MB) to qlora-override-checkpoints//adapter-epoch20.pt\n", - " Train loss: 0.0928 | Eval loss: 0.0857 | RAM=1.5GB GPU=3.0GB\n", - "Saved 700 adapter weights (254MB) to qlora-override-checkpoints//adapter-final.pt\n", + "Saved checkpoint (254MB) to qlora-override-checkpoints/adapter-epoch20.pt\n", + " Train loss: 0.0000 | Eval loss: 0.0000 | RAM=1.8GB GPU=3.0GB\n", + "Saved checkpoint (254MB) to qlora-override-checkpoints/adapter-final.pt\n", "\n", "Done! 
Export with:\n", - " python3 -m export.export_fmadapter --adapter-name hunch_qlora --checkpoint qlora-override-checkpoints//adapter-final.pt --output-dir qlora-override-checkpoints///\n" + " python3 -m export.export_fmadapter --adapter-name hunch_qlora --checkpoint qlora-override-checkpoints/adapter-final.pt --output-dir qlora-override-checkpoints//\n" ] } ], @@ -257,7 +248,7 @@ }, { "cell_type": "code", - "execution_count": 13, + "execution_count": 4, "metadata": {}, "outputs": [ { @@ -282,7 +273,7 @@ }, { "cell_type": "code", - "execution_count": 14, + "execution_count": 5, "metadata": {}, "outputs": [ { @@ -291,15 +282,15 @@ "text": [ "scikit-learn version 1.6.1 is not supported. Minimum required version: 0.17. Maximum required version: 1.5.1. Disabling scikit-learn conversion API.\n", "XGBoost version 3.2.0 has not been tested with coremltools. You may run into unexpected errors. XGBoost 1.4.2 is the most recent version that has been tested.\n", - "2026-04-15 16:39:35.262475: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n", + "2026-04-18 01:46:36.123769: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n", "WARNING: All log messages before absl::InitializeLog() is called are written to STDERR\n", - "E0000 00:00:1776271175.518226 51164 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n", - "E0000 00:00:1776271175.586162 51164 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n", - "W0000 00:00:1776271176.102805 51164 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.\n", - "W0000 00:00:1776271176.102861 51164 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.\n", - "W0000 00:00:1776271176.102866 51164 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.\n", - "W0000 00:00:1776271176.102871 51164 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.\n", - "2026-04-15 16:39:36.148423: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.\n", + "E0000 00:00:1776476796.352370 4085 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n", + "E0000 00:00:1776476796.414439 4085 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n", + "W0000 00:00:1776476796.851699 4085 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.\n", + "W0000 00:00:1776476796.851750 4085 computation_placer.cc:177] computation placer already registered. 
Please check linkage and avoid linking the same target more than once.\n", + "W0000 00:00:1776476796.851754 4085 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.\n", + "W0000 00:00:1776476796.851758 4085 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.\n", + "2026-04-18 01:46:36.891354: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.\n", "To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.\n", "TensorFlow version 2.19.0 has not been tested with coremltools. You may run into unexpected errors. TensorFlow 2.12.0 is the most recent version that has been tested.\n", "Torch version 2.10.0+cu128 has not been tested with coremltools. You may run into unexpected errors. Torch 2.5.0 is the most recent version that has been tested.\n", @@ -312,7 +303,7 @@ "WARNING:coremltools:Failed to load _MLModelProxy: No module named 'coremltools.libcoremlpython'\n", "WARNING:coremltools:Failed to load _MLModelAssetProxy: No module named 'coremltools.libcoremlpython'\n", "total 4.0K\n", - "drwxr-xr-x 2 root root 4.0K Apr 15 16:39 hunch_qlora.fmadapter\n", + "drwxr-xr-x 2 root root 4.0K Apr 18 01:46 hunch_qlora.fmadapter\n", "Adapter exported and saved to Drive\n" ] } @@ -327,27 +318,6 @@ "!cp -r {WORK_DIR}/qlora-override-exports {DRIVE_DIR}/qlora-override-exports\n", "!echo 'Adapter exported and saved to Drive'" ] - }, - { - "cell_type": "code", - "execution_count": 8, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "total 4.0K\n", - "drwxr-xr-x 2 root root 4.0K Apr 15 15:49 hunch_qlora.fmadapter\n", - "Adapter exported and saved to Drive\n" - ] - } - ], - "source": [ - "!ls -lh {WORK_DIR}/qlora-exports/\n", - "!cp -r {WORK_DIR}/qlora-exports {DRIVE_DIR}/qlora-exports2\n", - "!echo 'Adapter exported and saved to Drive'" - ] } ], "metadata": { diff --git a/training/train_qlora_full.py b/training/train_qlora_full.py index 975020d..944c5c5 100644 --- a/training/train_qlora_full.py +++ b/training/train_qlora_full.py @@ -116,7 +116,7 @@ def load_model_qlora(device): ) new_module.weight = bnb.nn.Params4bit( module.weight.data, requires_grad=False, - quant_type="nf4", compress_statistics=True, + quant_type="nf4", compress_statistics=torch.cuda.is_available(), ) if module.bias is not None: new_module.bias = module.bias @@ -166,32 +166,42 @@ def __init__(self, path, tokenizer, max_length=512): for line in f: messages = json.loads(line) # Format: system + user + assistant - text = "" + prompt = "" for msg in messages: if msg["role"] == "system": - text += f"system\n{msg['content']} " + prompt += f"system\n{msg['content']} " elif msg["role"] == "user": - text += f"user\n {msg['content']} " - elif msg["role"] == "assistant": - text += f"assistant\n {msg['content']}" - self.examples.append(text) + prompt += f"user\n {msg['content']} " + response = "" + for msg in messages: + if msg["role"] == "assistant": + response = f"assistant\n {msg['content']}" + full_text = prompt + response + prompt_len = len(tokenizer.encode(prompt)) + self.examples.append((full_text, prompt_len)) def __len__(self): return len(self.examples) def __getitem__(self, idx): - tokens = self.tokenizer.encode(self.examples[idx]) + text, prompt_len 
= self.examples[idx] + tokens = self.tokenizer.encode(text) tokens = tokens[:self.max_length] - return torch.tensor(tokens, dtype=torch.long) + prompt_len = min(prompt_len, len(tokens)) + return torch.tensor(tokens, dtype=torch.long), prompt_len def collate_fn(batch): - """Pad sequences to same length.""" - max_len = max(len(x) for x in batch) - padded = torch.zeros(len(batch), max_len, dtype=torch.long) - for i, x in enumerate(batch): - padded[i, :len(x)] = x - return padded + """Pad sequences and create labels with masking for prompt and padding tokens.""" + tokens_list, prompt_lens = zip(*batch) + max_len = max(len(x) for x in tokens_list) + input_ids = torch.zeros(len(tokens_list), max_len, dtype=torch.long) + labels = torch.full((len(tokens_list), max_len), -100, dtype=torch.long) + for i, (tokens, prompt_len) in enumerate(zip(tokens_list, prompt_lens)): + input_ids[i, :len(tokens)] = tokens + # Only compute loss on assistant response tokens (after prompt) + labels[i, prompt_len:len(tokens)] = tokens[prompt_len:] + return input_ids, labels def train_epoch(model, dataloader, optimizer, device, epoch, scaler=None): @@ -200,23 +210,23 @@ def train_epoch(model, dataloader, optimizer, device, epoch, scaler=None): n_batches = 0 start = time.time() - for i, batch in enumerate(dataloader): - input_ids = batch.to(device) - labels = input_ids.clone() + for i, (input_ids, labels) in enumerate(dataloader): + input_ids = input_ids.to(device) + labels = labels.to(device) - # Forward + # Forward — labels have -100 for prompt and padding tokens (ignored by CrossEntropyLoss) if scaler: with torch.amp.autocast(device_type=str(device), dtype=torch.float16): output = model(input_ids) logits = output.logits if hasattr(output, 'logits') else output - loss = nn.CrossEntropyLoss()( + loss = nn.CrossEntropyLoss(ignore_index=-100)( logits[:, :-1, :].contiguous().view(-1, logits.size(-1)), labels[:, 1:].contiguous().view(-1) ) else: output = model(input_ids) logits = output.logits if hasattr(output, 'logits') else output - loss = nn.CrossEntropyLoss()( + loss = nn.CrossEntropyLoss(ignore_index=-100)( logits[:, :-1, :].contiguous().view(-1, logits.size(-1)), labels[:, 1:].contiguous().view(-1) ) @@ -237,12 +247,13 @@ def train_epoch(model, dataloader, optimizer, device, epoch, scaler=None): total_loss += loss.item() n_batches += 1 - if (i + 1) % 100 == 0: + log_every = 10 if len(dataloader) < 100 else 100 + if (i + 1) % log_every == 0: avg = total_loss / n_batches elapsed = time.time() - start it_s = (i + 1) / elapsed remaining = (len(dataloader) - i - 1) / it_s / 60 - print(f" [{i+1}/{len(dataloader)}] loss={avg:.3f} {it_s:.1f}it/s ~{remaining:.0f}min left | {mem_str()}") + print(f" [{i+1}/{len(dataloader)}] loss={avg:.3f} {it_s:.2f}it/s ~{remaining:.0f}min left | {mem_str()}") return total_loss / max(n_batches, 1) @@ -253,13 +264,13 @@ def evaluate(model, dataloader, device): n_batches = 0 with torch.no_grad(): - for batch in dataloader: - input_ids = batch.to(device) - labels = input_ids.clone() + for input_ids, labels in dataloader: + input_ids = input_ids.to(device) + labels = labels.to(device) with torch.amp.autocast(device_type=str(device), dtype=torch.float16): output = model(input_ids) logits = output.logits if hasattr(output, 'logits') else output - loss = nn.CrossEntropyLoss()( + loss = nn.CrossEntropyLoss(ignore_index=-100)( logits[:, :-1, :].contiguous().view(-1, logits.size(-1)), labels[:, 1:].contiguous().view(-1) ) @@ -280,7 +291,7 @@ def save_adapter_checkpoint(model, path, optimizer=None, 
epoch=None): def main(): parser = argparse.ArgumentParser(description="QLoRA training for hunch") parser.add_argument("--epochs", type=int, default=3) - parser.add_argument("--batch-size", type=int, default=4) + parser.add_argument("--batch-size", type=int, default=8) parser.add_argument("--learning-rate", type=float, default=1e-4) parser.add_argument("--train-data", default=str(TRAINING_DIR / "train.jsonl")) parser.add_argument("--eval-data", default=str(TRAINING_DIR / "eval.jsonl")) From a90c867e32c72c210c1dd1bc57bb1aba6776bf07 Mon Sep 17 00:00:00 2001 From: Enrico Santagati Date: Sat, 18 Apr 2026 17:26:24 -0400 Subject: [PATCH 09/11] Add MPS benchmark script, update TRAINING.md with accurate numbers - bench_mps.py: structured benchmark for Metal vs CPU fallback comparison - TRAINING.md: ~5GB GPU peak (not 3.4GB), LoRA T4 OOM is system RAM not GPU, accuracy table updated with latest results - train_qlora_full.py: log every 20 steps instead of 100 for shorter runs --- training/TRAINING.md | 12 +-- training/bench_mps.py | 191 +++++++++++++++++++++++++++++++++++ training/train_qlora_full.py | 2 +- 3 files changed, 198 insertions(+), 7 deletions(-) create mode 100644 training/bench_mps.py diff --git a/training/TRAINING.md b/training/TRAINING.md index 37c3d82..89cc929 100644 --- a/training/TRAINING.md +++ b/training/TRAINING.md @@ -38,7 +38,7 @@ training/adapter_training_toolkit_v26_0_0/ | Path | GPU | Cost | VRAM | Time (overrides) | Time (full bank) | |------|-----|------|------|------------------|------------------| -| **QLoRA on Mac** | Apple Silicon | **Free, local** | **3.4GB** | **~34 min** | ~hours | +| **QLoRA on Mac** | Apple Silicon | **Free, local** | **~5GB** | **~34 min** | ~hours | | QLoRA on Colab | T4 16GB | Free | ~5GB | ~5 min | ~1.7 hours | | fp16 LoRA on Colab | T4 16GB | Free | ~8.5GB | ~10 min | ~2 hours | | LoRA on Colab | A100 40GB | Colab Pro ($10/mo) | ~15GB | ~5 min | ~2.5 hours | @@ -77,7 +77,7 @@ Notes: - Requires bitsandbytes from git main (pre-v0.50.0) with native MPS kernels (PR #1875) - The `kernels` package downloads pre-compiled Metal shaders from HuggingFace Hub at runtime - Don't use `bnb_4bit_use_double_quant=True` — not wired for MPS yet -- ~34 min for 20 epochs of 96 examples on M4, 3.4GB GPU, 0.2GB RAM. Full bank (~19k) would take hours +- ~34 min for 20 epochs of 96 examples on M4, ~5GB GPU peak. Full bank (~19k) would take hours ### Path B: Train on Colab @@ -180,7 +180,7 @@ All patches are applied automatically by the notebook. To restore originals, re- ### Standard LoRA -Loads the base model in fp32. No patches needed but requires an A100 (40GB) — doesn't fit on a T4. +Loads the base model in fp32. No patches needed. The ~15GB GPU footprint barely fits a T4 (16GB) with no headroom, but loading crashes first — the 12GB checkpoint must be fully loaded into CPU RAM alongside the model, peaking at ~24GB. T4 only has 12GB system RAM. The A100 works because it has 80GB system RAM. Memory breakdown: - CPU RAM peak: **~24GB** during loading (12GB model + 12GB state dict simultaneously — no mmap) @@ -239,9 +239,9 @@ All three approaches produce comparable adapters. 
QLoRA is recommended — same | Approach | + Retrieval | Standalone | Trained on | |---|---|---|---| -| LoRA (A100) | ~85% | ~72.5% | T4/A100 | -| QLoRA (T4) | ~83% | ~73% | T4 free | -| QLoRA (Mac) | ~78.5% | ~72% | Local | +| QLoRA (Mac) | ~86% | ~76% | Local | +| QLoRA (T4) | ~85% | ~74% | T4 free | +| LoRA (A100) | ~85% | ~72.5% | A100 | | Retrieval only | ~79% | — | — | | Bare model | — | ~41% | — | diff --git a/training/bench_mps.py b/training/bench_mps.py new file mode 100644 index 0000000..a1bd6c4 --- /dev/null +++ b/training/bench_mps.py @@ -0,0 +1,191 @@ +#!/usr/bin/env python3 +""" +Benchmark QLoRA training on MPS: Metal kernels vs CPU fallback. + +Measures load time, training throughput, and memory usage. +Run with both bitsandbytes versions to compare: + + # With Metal kernels (bitsandbytes from main) + python3 bench_mps.py --epochs 3 --label metal + + # Without Metal kernels (bitsandbytes 0.49.2) + python3 bench_mps.py --epochs 3 --label cpu-fallback + + # Longer sequences (override + tldr-osx) + python3 bench_mps.py --epochs 3 --sources override,tldr-osx --label metal-long + +Results are appended to bench_mps_results.jsonl for comparison. +""" + +import sys +import os +import gc +import json +import time +import argparse +import psutil +from pathlib import Path + +TOOLKIT_DIR = os.path.join(os.path.dirname(os.path.abspath(__file__)), "adapter_training_toolkit_v26_0_0") +sys.path.insert(0, TOOLKIT_DIR) +TRAINING_DIR = Path(__file__).parent + +import torch +import torch.nn as nn +from torch.utils.data import DataLoader + + +def mem_stats(): + ram = psutil.Process().memory_info().rss / 1024**3 + gpu = 0 + if torch.backends.mps.is_available(): + gpu = torch.mps.current_allocated_memory() / 1024**3 + elif torch.cuda.is_available(): + gpu = torch.cuda.memory_allocated() / 1024**3 + cpu_pct = psutil.cpu_percent(interval=None) + return {"ram_gb": round(ram, 2), "gpu_gb": round(gpu, 2), "cpu_pct": cpu_pct} + + +def main(): + parser = argparse.ArgumentParser() + parser.add_argument("--epochs", type=int, default=3) + parser.add_argument("--batch-size", type=int, default=8) + parser.add_argument("--sources", default="override") + parser.add_argument("--label", required=True, help="Label for this run (e.g. 
'metal', 'cpu-fallback')") + parser.add_argument("--repeat", type=int, default=1, help="Number of full runs to average") + args = parser.parse_args() + + # Check bitsandbytes version + import bitsandbytes as bnb + bnb_version = getattr(bnb, '__version__', 'unknown') + print(f"bitsandbytes: {bnb_version}") + print(f"Label: {args.label}") + print(f"Sources: {args.sources}") + print(f"Epochs: {args.epochs}, Batch: {args.batch_size}, Repeats: {args.repeat}") + print() + + # Prepare data if needed + train_path = TRAINING_DIR / "train.jsonl" + if not train_path.exists(): + os.system(f"cd {TRAINING_DIR} && python3 prepare_data.py --sources {args.sources}") + else: + # Regenerate with correct sources + os.system(f"cd {TRAINING_DIR} && python3 prepare_data.py --sources {args.sources}") + + # Import training components + from train_qlora_full import ( + CommandDataset, collate_fn, load_model_qlora, patch_rms_norm, + train_epoch, evaluate + ) + from tamm.tokenizers.afm import AFMTokenizer + + results = [] + + for run in range(1, args.repeat + 1): + print(f"{'='*60}") + print(f" Run {run}/{args.repeat}") + print(f"{'='*60}") + + # Start CPU monitoring + psutil.cpu_percent(interval=None) # reset + + # Phase 1: Load & quantize + t_load_start = time.time() + patch_rms_norm() + device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cuda") + model = load_model_qlora(device) + t_load = time.time() - t_load_start + mem_after_load = mem_stats() + print(f" Load+quantize: {t_load:.1f}s | {mem_after_load}") + + # Phase 2: Setup data + tokenizer = AFMTokenizer(str(Path(TOOLKIT_DIR) / "assets" / "tokenizer.model")) + train_dataset = CommandDataset(str(train_path), tokenizer) + eval_dataset = CommandDataset(str(TRAINING_DIR / "eval.jsonl"), tokenizer) + train_loader = DataLoader(train_dataset, batch_size=args.batch_size, shuffle=True, collate_fn=collate_fn) + eval_loader = DataLoader(eval_dataset, batch_size=args.batch_size, collate_fn=collate_fn) + print(f" Data: {len(train_dataset)} train, {len(eval_dataset)} eval, {len(train_loader)} batches/epoch") + + # Phase 3: Train + optimizer = torch.optim.AdamW( + [p for p in model.parameters() if p.requires_grad], + lr=1e-4, weight_decay=0.01 + ) + scaler = torch.amp.GradScaler(device=str(device)) if (torch.cuda.is_available() or torch.backends.mps.is_available()) else None + + epoch_times = [] + epoch_losses = [] + mem_during_training = [] + + for epoch in range(args.epochs): + t_epoch_start = time.time() + train_loss = train_epoch(model, train_loader, optimizer, device, epoch, scaler) + t_epoch = time.time() - t_epoch_start + epoch_times.append(t_epoch) + epoch_losses.append(train_loss) + mem = mem_stats() + mem_during_training.append(mem) + + batches = len(train_loader) + it_s = batches / t_epoch + s_it = t_epoch / batches + print(f" Epoch {epoch+1}: {t_epoch:.1f}s ({s_it:.2f}s/it, {it_s:.2f}it/s) loss={train_loss:.4f} | {mem}") + + # Phase 4: Eval + t_eval_start = time.time() + eval_loss = evaluate(model, eval_loader, device) + t_eval = time.time() - t_eval_start + print(f" Eval: {t_eval:.1f}s loss={eval_loss:.4f}") + + total_time = t_load + sum(epoch_times) + t_eval + avg_epoch = sum(epoch_times) / len(epoch_times) + avg_it_s = len(train_loader) / avg_epoch + avg_s_it = avg_epoch / len(train_loader) + + run_result = { + "label": args.label, + "run": run, + "bnb_version": bnb_version, + "sources": args.sources, + "epochs": args.epochs, + "batch_size": args.batch_size, + "train_examples": len(train_dataset), + "batches_per_epoch": 
len(train_loader), + "load_time_s": round(t_load, 1), + "avg_epoch_s": round(avg_epoch, 1), + "avg_s_per_it": round(avg_s_it, 2), + "avg_it_per_s": round(avg_it_s, 2), + "total_time_s": round(total_time, 1), + "final_train_loss": round(epoch_losses[-1], 4), + "eval_loss": round(eval_loss, 4), + "mem_after_load": mem_after_load, + "mem_training": mem_during_training[-1], + "epoch_times": [round(t, 1) for t in epoch_times], + } + results.append(run_result) + + print(f"\n Summary: {avg_s_it:.2f}s/it ({avg_it_s:.2f}it/s), total {total_time:.0f}s") + print() + + # Cleanup for next run + del model, optimizer, scaler, train_loader, eval_loader + gc.collect() + if torch.backends.mps.is_available(): + torch.mps.empty_cache() + + # Save results + results_file = TRAINING_DIR / "bench_mps_results.jsonl" + with open(results_file, "a") as f: + for r in results: + f.write(json.dumps(r) + "\n") + print(f"Results appended to {results_file}") + + # Print comparison-ready summary + if len(results) > 1: + avg_it = sum(r["avg_s_per_it"] for r in results) / len(results) + avg_total = sum(r["total_time_s"] for r in results) / len(results) + print(f"\nAverage across {len(results)} runs: {avg_it:.2f}s/it, {avg_total:.0f}s total") + + +if __name__ == "__main__": + main() diff --git a/training/train_qlora_full.py b/training/train_qlora_full.py index 944c5c5..780113f 100644 --- a/training/train_qlora_full.py +++ b/training/train_qlora_full.py @@ -247,7 +247,7 @@ def train_epoch(model, dataloader, optimizer, device, epoch, scaler=None): total_loss += loss.item() n_batches += 1 - log_every = 10 if len(dataloader) < 100 else 100 + log_every = 10 if len(dataloader) < 50 else 20 if (i + 1) % log_every == 0: avg = total_loss / n_batches elapsed = time.time() - start From c6cfa96372f64e6ad854df362054e05d771ccc58 Mon Sep 17 00:00:00 2001 From: Enrico Santagati Date: Sun, 19 Apr 2026 14:12:16 -0400 Subject: [PATCH 10/11] Update readme --- README.md | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index ee401a8..ff70ec5 100644 --- a/README.md +++ b/README.md @@ -173,13 +173,15 @@ This clones [tldr-pages](https://github.com/tldr-pages/tldr), parses all entries ## LoRA Adapter Training (experimental) -The `training/` directory contains infrastructure for fine-tuning Apple's on-device 3B model using LoRA adapters. This is an exploration of whether fine-tuning can complement or replace the retrieval approach — see `training/README.md` for full details, results, and notebooks. +The `training/` directory contains infrastructure for fine-tuning Apple's on-device 3B model using LoRA adapters. QLoRA training works on a free Colab T4 or locally on a 24GB Mac. See `training/README.md` for full details, results, and notebooks. ```bash hunch --adapter path/to/hunch.fmadapter "find files changed in the last hour" ``` -Current finding: retrieval alone (~83%) still outperforms adapter-based approaches, but the training pipeline and QLoRA patches (enabling training on free Colab T4 GPUs) may be useful to others exploring Apple's adapter toolkit. +Current finding: adapter + retrieval reaches ~86% accuracy (vs ~79% retrieval alone). QLoRA matches full LoRA quality, and Mac-trained adapters match T4-trained. + +> **Known bug (as of April 2026):** Apple's `TGOnDeviceInferenceProviderService` caches a full copy of the adapter (~160MB) on every CLI invocation and never cleans up. Repeated adapter calls from CLI tools can consume significant disk space. 
Apple has confirmed this as a known bug specific to CLI tools. See `training/adapter-disk-leak-findings.md` for details and workaround. ## Known limitations From 7d944302734b62e8af50b66a3bbd1f34c480d7f8 Mon Sep 17 00:00:00 2001 From: Enrico Santagati Date: Sun, 19 Apr 2026 14:36:57 -0400 Subject: [PATCH 11/11] Fix broken references, indentation, and stale docs before merge - README: link to TRAINING.md instead of gitignored README.md - TRAINING.md: inline disk leak workaround instead of referencing uncommitted file, update file listing, fix GPU number - main.swift: fix batch loop indentation - .gitignore: exclude bench_mps_results.jsonl --- .gitignore | 1 + README.md | 2 +- cli/Sources/Hunch/main.swift | 154 +++++++++++++++++------------------ training/TRAINING.md | 6 +- 4 files changed, 83 insertions(+), 80 deletions(-) diff --git a/.gitignore b/.gitignore index 3e783e0..d8ded18 100644 --- a/.gitignore +++ b/.gitignore @@ -22,3 +22,4 @@ training/adapter_training_toolkit* training/README.md training/exports/ training/qlora-checkpoints/ +training/bench_mps_results.jsonl diff --git a/README.md b/README.md index ff70ec5..480f58f 100644 --- a/README.md +++ b/README.md @@ -173,7 +173,7 @@ This clones [tldr-pages](https://github.com/tldr-pages/tldr), parses all entries ## LoRA Adapter Training (experimental) -The `training/` directory contains infrastructure for fine-tuning Apple's on-device 3B model using LoRA adapters. QLoRA training works on a free Colab T4 or locally on a 24GB Mac. See `training/README.md` for full details, results, and notebooks. +The `training/` directory contains infrastructure for fine-tuning Apple's on-device 3B model using LoRA adapters. QLoRA training works on a free Colab T4 or locally on a 24GB Mac. See `training/TRAINING.md` for full details, results, and notebooks. ```bash hunch --adapter path/to/hunch.fmadapter "find files changed in the last hour" diff --git a/cli/Sources/Hunch/main.swift b/cli/Sources/Hunch/main.swift index 09ce597..583d0b2 100644 --- a/cli/Sources/Hunch/main.swift +++ b/cli/Sources/Hunch/main.swift @@ -439,93 +439,93 @@ struct Hunch { let dbPath = findDatabase() for run in 1...runs { - for line in lines { - guard let data = line.data(using: .utf8), - let entry = try? JSONSerialization.jsonObject(with: data) as? [String: Any], - let idValue = entry["id"], let id = idValue as? Int ?? (idValue as? NSNumber)?.intValue, - let prompt = entry["prompt"] as? String else { - continue - } - - let start = CFAbsoluteTimeGetCurrent() - var result: String - - do { - let examples = dbPath != nil - ? searchBank(dbPath: dbPath!, query: prompt, limit: limit) - : [] - let systemPrompt = buildSystemPrompt(mode: .suggest, examples: examples) - - let session: LanguageModelSession - if !systemPrompt.isEmpty { - let segment = Transcript.TextSegment(content: systemPrompt) - let instructions = Transcript.Instructions( - segments: [.text(segment)], - toolDefinitions: [] - ) - session = LanguageModelSession( - model: model, - transcript: Transcript(entries: [.instructions(instructions)]) - ) - } else { - session = LanguageModelSession(model: model) + for line in lines { + guard let data = line.data(using: .utf8), + let entry = try? JSONSerialization.jsonObject(with: data) as? [String: Any], + let idValue = entry["id"], let id = idValue as? Int ?? (idValue as? NSNumber)?.intValue, + let prompt = entry["prompt"] as? 
String else { + continue } - if guided == "plain" { - let response: LanguageModelSession.Response - if let opts = genOptions { - response = try await session.respond(to: prompt, generating: ShellCommand.self, options: opts) - } else { - response = try await session.respond(to: prompt, generating: ShellCommand.self) - } - result = response.content.command - } else if guided == "cot" { - let response: LanguageModelSession.Response - if let opts = genOptions { - response = try await session.respond(to: prompt, generating: ShellCommandCoT.self, options: opts) - } else { - response = try await session.respond(to: prompt, generating: ShellCommandCoT.self) - } - result = response.content.command - } else if guided == "multi" { - let response: LanguageModelSession.Response - if let opts = genOptions { - response = try await session.respond(to: prompt, generating: ShellCommandMulti.self, options: opts) + let start = CFAbsoluteTimeGetCurrent() + var result: String + + do { + let examples = dbPath != nil + ? searchBank(dbPath: dbPath!, query: prompt, limit: limit) + : [] + let systemPrompt = buildSystemPrompt(mode: .suggest, examples: examples) + + let session: LanguageModelSession + if !systemPrompt.isEmpty { + let segment = Transcript.TextSegment(content: systemPrompt) + let instructions = Transcript.Instructions( + segments: [.text(segment)], + toolDefinitions: [] + ) + session = LanguageModelSession( + model: model, + transcript: Transcript(entries: [.instructions(instructions)]) + ) } else { - response = try await session.respond(to: prompt, generating: ShellCommandMulti.self) + session = LanguageModelSession(model: model) } - result = majorityVote([response.content.first, response.content.second, response.content.third]) - } else { - // Default: plain string - let response: LanguageModelSession.Response - if let opts = genOptions { - response = try await session.respond(to: prompt, options: opts) + + if guided == "plain" { + let response: LanguageModelSession.Response + if let opts = genOptions { + response = try await session.respond(to: prompt, generating: ShellCommand.self, options: opts) + } else { + response = try await session.respond(to: prompt, generating: ShellCommand.self) + } + result = response.content.command + } else if guided == "cot" { + let response: LanguageModelSession.Response + if let opts = genOptions { + response = try await session.respond(to: prompt, generating: ShellCommandCoT.self, options: opts) + } else { + response = try await session.respond(to: prompt, generating: ShellCommandCoT.self) + } + result = response.content.command + } else if guided == "multi" { + let response: LanguageModelSession.Response + if let opts = genOptions { + response = try await session.respond(to: prompt, generating: ShellCommandMulti.self, options: opts) + } else { + response = try await session.respond(to: prompt, generating: ShellCommandMulti.self) + } + result = majorityVote([response.content.first, response.content.second, response.content.third]) } else { - response = try await session.respond(to: prompt) + // Default: plain string + let response: LanguageModelSession.Response + if let opts = genOptions { + response = try await session.respond(to: prompt, options: opts) + } else { + response = try await session.respond(to: prompt) + } + result = stripMarkdown(response.content) } - result = stripMarkdown(response.content) + } catch { + result = "[ERROR] \(error.localizedDescription)" } - } catch { - result = "[ERROR] \(error.localizedDescription)" - } - let elapsed = 
round((CFAbsoluteTimeGetCurrent() - start) * 100) / 100 - var output: [String: Any] = [ - "id": id, - "prompt": prompt, - "result": result, - "total_time": elapsed - ] - if runs > 1 { - output["run"] = run - } - if let jsonData = try? JSONSerialization.data(withJSONObject: output), - let jsonString = String(data: jsonData, encoding: .utf8) { - print(jsonString) - fflush(stdout) + let elapsed = round((CFAbsoluteTimeGetCurrent() - start) * 100) / 100 + var output: [String: Any] = [ + "id": id, + "prompt": prompt, + "result": result, + "total_time": elapsed + ] + if runs > 1 { + output["run"] = run + } + if let jsonData = try? JSONSerialization.data(withJSONObject: output), + let jsonString = String(data: jsonData, encoding: .utf8) { + print(jsonString) + fflush(stdout) + } } } - } // end runs loop } static func printUsage() { diff --git a/training/TRAINING.md b/training/TRAINING.md index 89cc929..b3c0b88 100644 --- a/training/TRAINING.md +++ b/training/TRAINING.md @@ -16,8 +16,10 @@ training/ ├── train_lora_fp16.ipynb # fp16 LoRA training notebook (works on free T4) ├── train_qlora.ipynb # QLoRA training notebook (works on free T4, recommended) ├── train_qlora_full.py # QLoRA training script (T4 or Mac) +├── train_qlora_test.py # Quick smoke test (load model, one forward/backward pass) ├── prepare_data.py # Converts hunch bank → training JSONL -└── README.md # Full experiment writeup and results +├── bench_mps.py # Metal vs CPU fallback benchmark +└── TRAINING.md # This file ``` ## Quick Start @@ -255,7 +257,7 @@ Full benchmark details and analysis in README.md. **Workaround:** Use `hunch --batch` to run multiple prompts in a single process (1 cached copy instead of 1 per prompt). To reclaim space, boot Recovery Mode and delete `/Volumes/Data/private/var/db/AppleIntelligencePlatform/AppModelAssets/*`. -See `adapter-disk-leak-findings.md` for the full investigation. +To reclaim space, boot Recovery Mode and run `rm -rf /Volumes/Data/private/var/db/AppleIntelligencePlatform/AppModelAssets/*`. The service recreates what it needs on the next adapter load. ## Troubleshooting