diff --git a/.gitignore b/.gitignore index 4a7805a..d8ded18 100644 --- a/.gitignore +++ b/.gitignore @@ -16,3 +16,10 @@ __pycache__/ *.tar.gz CLAUDE.md benchmark/GUIDED_GENERATION.md +training/train.jsonl +training/eval.jsonl +training/adapter_training_toolkit* +training/README.md +training/exports/ +training/qlora-checkpoints/ +training/bench_mps_results.jsonl diff --git a/README.md b/README.md index cbd9bed..480f58f 100644 --- a/README.md +++ b/README.md @@ -171,6 +171,18 @@ make install This clones [tldr-pages](https://github.com/tldr-pages/tldr), parses all entries into Q/A pairs, adds macOS-specific overrides, and rebuilds the FTS5 index. +## LoRA Adapter Training (experimental) + +The `training/` directory contains infrastructure for fine-tuning Apple's on-device 3B model using LoRA adapters. QLoRA training works on a free Colab T4 or locally on a 24GB Mac. See `training/TRAINING.md` for full details, results, and notebooks. + +```bash +hunch --adapter path/to/hunch.fmadapter "find files changed in the last hour" +``` + +Current finding: adapter + retrieval reaches ~86% accuracy (vs ~79% retrieval alone). QLoRA matches full LoRA quality, and Mac-trained adapters match T4-trained. + +> **Known bug (as of April 2026):** Apple's `TGOnDeviceInferenceProviderService` caches a full copy of the adapter (~160MB) on every CLI invocation and never cleans up. Repeated adapter calls from CLI tools can consume significant disk space. Apple has confirmed this as a known bug specific to CLI tools. See `training/adapter-disk-leak-findings.md` for details and workaround. + ## Known limitations - **4K token context window** — the system prompt + 8 examples + query + output must fit. Current prompts use ~200-400 tokens, well within budget. diff --git a/benchmark/REVIEW_CRITERIA.md b/benchmark/REVIEW_CRITERIA.md new file mode 100644 index 0000000..8e9ac7e --- /dev/null +++ b/benchmark/REVIEW_CRITERIA.md @@ -0,0 +1,95 @@ +# Benchmark Review Criteria + +Rules for deciding whether a non-exact result is "functionally correct" and should be added to alternates.json. Apply these consistently across all reviews. 
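+
+For reference, `alternates.json` maps each benchmark prompt ID (a string key) to the list of command strings accepted for that prompt. A minimal excerpt, using entry 42 as it appears elsewhere in this change:
+
+```json
+{
+  "42": [
+    "tail -n 50 file",
+    "tail -50 file"
+  ]
+}
+```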
+ +## ACCEPT — add to alternates.json + +### Placeholder variations +Different placeholder names for the same command structure: +- `file` vs `filename` vs `file.txt` — accept +- `src dst` vs `source destination` vs `source_directory destination_directory` — accept +- `user@host` vs `user@server` vs `username@remote_host` — accept +- `example.com` vs `api.example.com` vs `localhost:8000` — accept + +### Quote style +- Single vs double quotes: `'*.png'` vs `"*.png"` — accept +- With or without quotes when not ambiguous: `-name .DS_Store` vs `-name '.DS_Store'` — accept + +### Flag reordering +Same flags in different order: +- `tar -xvzf` vs `tar -zxvf` — accept +- `rsync -avz` vs `rsync -avzh` (extra harmless flag) — accept cautiously + +### Harmless extra flags +Flags that don't change the core behavior: +- `tar -czvf` (verbose) vs `tar -czf` — accept +- `cp -R` vs `cp -r` (same on macOS) — accept +- Adding `--progress` to rsync — accept + +### Format variations +Same result, slightly different format: +- `git log --oneline` vs `git log --pretty=oneline` — accept +- `echo $SHELL` vs `echo $0` — accept (both show shell) + +## REJECT — do not add to alternates.json + +### Wrong command entirely +- `system_profiler` for "monitor cpu usage" (should be `top`) — reject +- `pbcopy` for "paste from clipboard" (that's copy, not paste) — reject +- `cls` for "clear terminal" (Windows command) — reject + +### Wrong flags that change meaning +- `find . -mtime -60` for "files changed in last hour" (`-mtime` is days, not minutes) — reject +- `find . -mtime +1` for "files modified today" (opposite: MORE than 1 day ago) — reject +- `head -50` for "last 50 lines" (head shows FIRST, not last) — reject +- `tail -n 20` for "first 20 lines" (tail shows LAST, not first) — reject + +### Missing critical parts +- `cp -r directory` (missing destination) — reject +- `find .DS_Store -delete` (missing `.` path, only current dir entry) — reject +- `zip -r .` (missing output filename) — reject +- `ssh user@server` (missing `-i key` when prompt asks for specific key) — reject + +### Hallucinated commands/flags +- `git log --no-pushed` (not a real flag) — reject +- `git rename-branch` (not a real command) — reject +- `find . -type symlink` (invalid type, should be `l`) — reject +- `link -s` (not the same as `ln -s`) — reject +- `zipdir`, `pylist`, `mcal` — reject + +### Broadened scope +- `find . -empty` for "find empty directories" (also finds empty files) — reject +- `find . -name node_modules` for "find directories named node_modules" (also finds files) — accept only with `-type d` +- `git branch --merged | xargs git branch -d` without `grep -v main` (would delete main) — reject + +### Functionally different approach +- `comm -12 <(sort file1) <(sort file2)` for "compare two files" (shows common lines, not differences) — reject +- `du -sh /` for "show disk usage" (directory usage, not filesystem usage like `df`) — reject +- `find . -name '*.py' | wc -l` for "count lines in python files" (counts FILES, not lines IN them) — reject + +### Piped through unnecessary commands +- `cat file | head -20` for "first 20 lines" — accept (useless cat but correct) +- `find ... | wc -l` when it should be `find ... -exec wc -l` — reject (counts files not lines) + +## EDGE CASES + +### `find . -empty` for "find empty directories" +REJECT. `-empty` matches both empty files and directories. The prompt specifically asks for directories. Need `-type d -empty`. 
+ +### `sips -s format jpeg input.jpg --out output.jpg` (same format in and out) +REJECT. The prompt says "convert to different format." While the command structure is correct, the example converts jpg→jpg. Accept only if input and output formats differ. + +### `sips -s format jpg` (without `--out`) +REJECT. `jpg` is not a valid sips format name (should be `jpeg`). + +### curl POST with different URLs/bodies +ACCEPT if structure is correct: has `-X POST`, has `-H "Content-Type: application/json"`, has `-d`. Different URLs and body content are just placeholder variations. + +### rsync with `--delete` +REJECT. Adding `--delete` removes files at destination that don't exist at source. That's a meaningfully different and potentially destructive operation. + +### `caffeinate -t 3600` for "prevent mac from sleeping" +ACCEPT. Keeps awake for 1 hour — reasonable interpretation. + +### `env | grep PATH` vs `export PATH` +ACCEPT both. Different mechanisms but both show PATH. diff --git a/benchmark/alternates.json b/benchmark/alternates.json index 239acf9..5079c1d 100644 --- a/benchmark/alternates.json +++ b/benchmark/alternates.json @@ -10,7 +10,9 @@ "ls", "ls -la", "ls -a", - "ls -l" + "ls -l", + "ls -1", + "ls ." ], "3": [ "df -h", @@ -62,7 +64,8 @@ "find . -name '*.png'", "find . -name \"*.png\"", "find . -iname '*.png'", - "find . -type f -name '*.png'" + "find . -type f -name '*.png'", + "find . -type f -name \"*.png\"" ], "14": [ "find . -type d -empty", @@ -83,7 +86,9 @@ "find . -name '.DS_Store' -delete", "find . -name .DS_Store -delete", "find . -name '.DS_Store' -exec rm {} +", - "find . -name '.DS_Store' -exec rm {} \\;" + "find . -name '.DS_Store' -exec rm {} \\;", + "find . -name \".DS_Store\" -delete", + "find . -name \".DS_Store\" -exec rm {} \\;" ], "18": [ "find . -type l" @@ -97,13 +102,17 @@ "find . -type d -name 'node_modules'", "find -name 'node_modules'", "find . -name 'node_modules'", - "find . -name \"node_modules\"" + "find . -name \"node_modules\"", + "find . -name node_modules" ], "21": [ "find . -name '*.py' -exec wc -l {} +", "find . -name '*.py' | xargs wc -l", "wc -l **/*.py", - "find . -name '*.py' -exec wc -l {} \\;" + "find . -name '*.py' -exec wc -l {} \\;", + "find . -name \"*.py\" -exec wc -l {} +", + "find . -name \"*.py\" | xargs wc -l", + "wc -l *.py" ], "22": [ "du -sh * | sort -hr", @@ -113,7 +122,8 @@ "23": [ "kill $(lsof -t -i :3000)", "lsof -t -i :3000 | xargs kill", - "fuser -k 3000/tcp" + "fuser -k 3000/tcp", + "kill $(lsof -t -i :3000 )" ], "24": [ "find . 
-size +1G", @@ -139,7 +149,15 @@ "tar -czf compressed_folder.tar.gz ./", "tar -czf folder.tar.gz /path/to/folder", "tar czf archive.tar.gz /path/to/folder", - "tar czf folder.tar.gz folder" + "tar czf folder.tar.gz folder", + "tar -czf file.tar.gz folder", + "tar -czf folder.tar.gz .", + "tar -czf folder.tar.gz folder", + "tar -czvf folder.tar.gz folder", + "tar -czf path/to/compressed.tar.gz path/to/folder", + "tar -czvf /path/to/output.tar.gz /path/to/folder", + "tar -czf file.tar.gz .", + "tar -czf archive.tar.gz folder" ], "28": [ "tar xzf file.tar.gz", @@ -148,14 +166,22 @@ "tar -xf archive.tar.gz", "tar xzvf archive.tar.gz", "tar -xvzf file.tar.gz", - "tar xvf file.tar.gz" + "tar xvf file.tar.gz", + "tar -xvf archive.tar.gz", + "tar -xvzf archive.tar.gz", + "tar -xvzf filename.tar.gz", + "tar -xzvf file.tar.gz", + "tar -zxvf archive.tar.gz", + "tar -zxvf file.tar.gz", + "tar -xvf file.tar.gz" ], "29": [ "git branch --sort=-committerdate", "git branch -a --sort=-committerdate" ], "30": [ - "git log --oneline" + "git log --oneline", + "git log --pretty=oneline" ], "31": [ "git diff --staged", @@ -169,14 +195,16 @@ "33": [ "git log origin/main..HEAD", "git log origin/master..HEAD", - "git log --oneline origin/main..HEAD" + "git log --oneline origin/main..HEAD", + "git log origin/main..HEAD --oneline" ], "34": [ "git branch -m old new", "git branch -m oldname newname", "git branch -m ", "git branch -m old_branch_name new_branch_name", - "git branch -m new_branch_name" + "git branch -m new_branch_name", + "git branch -m branch_name1 branch_name2" ], "35": [ "git branch --merged | grep -v main | xargs git branch -d", @@ -185,7 +213,8 @@ "36": [ "netstat -an", "netstat", - "lsof -i" + "lsof -i", + "netstat -ln" ], "37": [ "lsof -i :8080", @@ -197,7 +226,12 @@ "curl -o file https://example.com/file", "wget https://example.com/file", "curl -O url", - "curl -o filename url" + "curl -o filename url", + "curl -O https://example.com/file.txt", + "curl -O https://example.com/file.zip", + "curl -o file.zip https://example.com/file.zip", + "wget https://example.com/file.pdf", + "wget https://example.com/file.zip -O file.zip" ], "39": [ "curl -I https://example.com", @@ -210,7 +244,11 @@ "curl -X POST https://example.com/api/endpoint -H 'Content-Type: application/json' -d '{\"key\": \"value\"}'", "curl -X POST -H 'Content-Type: application/json' -d '{\"key\": \"value\"}' https://example.com", "curl -X POST https://example.com -H 'Content-Type: application/json' -d '{\"key1\": \"value1\", \"key2\": \"value2\"}'", - "curl -X POST -H 'Content-Type: application/json' -d '{\"name\": \"john\", \"age\": 25}' https://example.com" + "curl -X POST -H 'Content-Type: application/json' -d '{\"name\": \"john\", \"age\": 25}' https://example.com", + "curl -X POST -H \"Content-Type: application/json\" -d '{\"key\": \"value\"}' https://example.com", + "curl -X POST http://localhost:8000/api/endpoint -H \"Content-Type: application/json\" -d '{\"key\": \"value\"}'", + "curl -X POST http://localhost:8000/api/post -H \"Content-Type: application/json\" -d '{\"key\":\"value\"}'", + "curl -X POST https://api.example.com/endpoint -H \"Content-Type: application/json\" -d '{\"key1\": \"value1\", \"key2\": \"value2\"}'" ], "41": [ "tail -f logfile", @@ -224,7 +262,9 @@ ], "42": [ "tail -n 50 file", - "tail -50 file" + "tail -50 file", + "tail -50 filename", + "tail -n 50 filename" ], "43": [ "ls | wc -l", @@ -249,25 +289,34 @@ "cp -R src dst", "cp -a src dst", "cp -r source_directory destination_directory", - "cp -r 
path/to/source_directory path/to/target_directory" + "cp -r path/to/source_directory path/to/target_directory", + "cp -R source_directory destination_directory", + "cp -r src/ dst/", + "cp -r src/ dest/" ], "48": [ "mkdir -p path/to/dir", "mkdir -p /path/to/create/directory", "mkdir -p \"path/to/directory\"", - "mkdir -p parent_directory_path" + "mkdir -p parent_directory_path", + "mkdir -p /path/to/directory", + "mkdir -p path/to/directory", + "mkdir -p directory", + "mkdir -p directory_name" ], "49": [ "chmod +x file", "chmod 755 file", - "chmod +x executable" + "chmod +x executable", + "chmod +x filename" ], "50": [ "stat -f '%A' file", "stat -f '%Lp' file" ], "51": [ - "find . -type f -exec md5 {} + | sort | uniq -d" + "find . -type f -exec md5 {} + | sort | uniq -d", + "find . -type f -exec md5 {} + | sort | uniq -d | sort" ], "52": [ "top", @@ -287,18 +336,23 @@ "55": [ "pkill processname", "killall processname", - "pkill process-name" + "pkill process-name", + "pkill myprocess", + "pkill bash", + "pkill shell_name" ], "56": [ "dig example.com", "nslookup example.com", - "host example.com" + "host example.com", + "dig domain.com" ], "57": [ "nc -zv host 80", "nc -z host 80", "nmap -p 80 host", - "nc -zv hostname port" + "nc -zv hostname port", + "nc -zv hostname 80" ], "58": [ "openssl rand -base64 32", @@ -307,7 +361,8 @@ ], "59": [ "md5 file", - "md5sum file" + "md5sum file", + "md5 file.txt" ], "60": [ "shasum -a 256 file", @@ -336,17 +391,22 @@ ], "65": [ "caffeinate", - "caffeinate -d" + "caffeinate -d", + "caffeinate -t 3600", + "caffeinate -t 86400" ], "66": [ "say hello", "say 'hello'", "say \"hello\"", "say 'Hello, world!'", - "say 'hello world'" + "say 'hello world'", + "say \"Hello, world!\"", + "say \"hello world\"" ], "67": [ - "pmset -g batt" + "pmset -g batt", + "pmset -g" ], "68": [ "sudo dscacheutil -flushcache", @@ -380,7 +440,10 @@ "75": [ "sips -s format png input.jpg --out output.png", "convert input.jpg output.png", - "convert path/to/input_image.jpg path/to/output_image.png" + "convert path/to/input_image.jpg path/to/output_image.png", + "sips -s format jpeg input.png --out output.jpeg", + "sips -s format png input.jpg", + "sips -s format webp input.jpg --out output.webp" ], "76": [ "sips --resampleWidth 800 image.jpg", @@ -405,7 +468,10 @@ "ln -s source_path target_path", "ln -s source destination", "ln -s /path/to/file /path/to/symlink", - "ln -s path/to/file_or_directory path/to/symlink" + "ln -s path/to/file_or_directory path/to/symlink", + "ln -s source_path destination_path", + "ln -s src dest", + "ln -s src dst" ], "81": [ "lsof -i -P -n | grep LISTEN", @@ -425,12 +491,16 @@ "git cherry-pick ", "git cherry-pick commit", "git cherry-pick ", - "git cherry-pick HEAD~1" + "git cherry-pick HEAD~1", + "git cherry-pick HEAD^", + "git cherry-pick commit-hash" ], "85": [ "ls -lh file", "ls -lh", - "du -sh file" + "du -sh file", + "du -h file", + "du -hs filename" ], "86": [ "find . 
-perm 777", @@ -441,20 +511,34 @@ "87": [ "head -n 20 file", "head -20 file", - "head -n 20 < filename" + "head -n 20 < filename", + "head -20 filename", + "cat file | head -20" ], "88": [ "ssh -i key.pem user@host", "ssh -i path/to/key_file username@remote_host", "ssh -i /path/to/key username@host", "ssh username@host -i path/to/key", - "ssh username@remote_host -i path/to/key_file" + "ssh username@remote_host -i path/to/key_file", + "ssh user@hostname -i path/to/key_file.pem", + "ssh user@server -i path/to/key.pem", + "ssh user@server -i path/to/key_file.pem", + "ssh user@server.example.com -i ~/.ssh/id_rsa", + "ssh user@host -i path/to/key", + "ssh user@host -i path/to/key_file.pem" ], "89": [ "rsync -avz src/ user@host:dst/", "rsync -avz src/ user@host:dst", "rsync -avz /path/to/source /path/to/destination", - "rsync -avz source_path destination_path" + "rsync -avz source_path destination_path", + "rsync -avz source_directory remote_server", + "rsync -avz . remote_server:destination_directory", + "rsync -avz --progress source_directory remote_server", + "rsync -avz --progress src_dir remote_server", + "rsync -avz /path/to/local/directory remote_server:destination_directory", + "rsync -avz /path/to/local/directory remote_server:path/to/remote/directory" ], "90": [ "crontab -l" @@ -462,7 +546,9 @@ "91": [ "grep -ri pattern .", "grep -ri 'pattern' .", - "grep -rni pattern ." + "grep -rni pattern .", + "find . -type f -exec grep -ri 'pattern' {} +", + "find . -type f -exec grep -ri 'pattern' +" ], "92": [ "wc file", @@ -473,7 +559,16 @@ "zip -r archive.zip directory", "zip -r archive.zip dir", "zip -r archive.zip directory/", - "zip -r /path/to/directory.zip /path/to/directory" + "zip -r /path/to/directory.zip /path/to/directory", + "zip -r archive.zip .", + "zip -r archive.zip ./", + "zip -r directory_name.zip directory", + "zip -r file.zip .", + "zip -r file.zip ./", + "zip -r mydir.zip ./mydir", + "zip -r myfile.zip mydir", + "zip -r archive.zip directory_to_zip", + "zip -r file.zip directory" ], "94": [ "unzip file.zip -d directory", @@ -483,7 +578,14 @@ "unzip -d /path/to/output file.zip", "unzip -d destination file.zip", "unzip filename -d destination_directory", - "unzip filename -d destination" + "unzip filename -d destination", + "unzip -d target_dir filename", + "unzip file.zip -d /path/to/directory", + "unzip file.zip -d /path/to/unzip", + "unzip file.zip -d destination", + "unzip file.zip -d destination_directory", + "unzip file.zip -d /path/to/unzipped", + "unzip file.zip -d destination/" ], "95": [ "system_profiler SPHardwareDataType" @@ -492,7 +594,9 @@ "system_profiler SPUSBDataType" ], "97": [ - "date -r 1700000000" + "date -r 1700000000", + "date -r $TIMESTAMP", + "date -r $UNIX_TIMESTAMP" ], "98": [ "stat -f '%B' file | xargs date -r", @@ -503,11 +607,15 @@ "echo -n 'string' | base64", "printf 'string' | base64", "echo 'string' | base64", - "echo -n 'text' | base64" + "echo -n 'text' | base64", + "echo 'input string' | base64", + "echo 'your string' | base64", + "echo 'your_string' | base64" ], "100": [ "env | grep PATH", "printenv | grep PATH", - "echo $PATH" + "echo $PATH", + "export PATH" ] } \ No newline at end of file diff --git a/benchmark/run.py b/benchmark/run.py index b7365ed..6a1da41 100755 --- a/benchmark/run.py +++ b/benchmark/run.py @@ -504,6 +504,135 @@ def approach_hunch_multi_warm(prompt): return _run_hunch(prompt, ["--guided", "multi", "--temperature", "0.3"]) +ADAPTER_PATH = str(Path(__file__).parent.parent / "training" / "exports" / "hunch.fmadapter") 
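+# Additional adapter variants (fp16 and NF4 QLoRA, override-only QLoRA/LoRA) exported by training/; see training/TRAINING.md.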
+QLORA_FP16_ADAPTER_PATH = str(Path(__file__).parent.parent / "training" / "qlora-checkpoints" / "hunch_qlora_fp16.fmadapter") +QLORA_NF4_ADAPTER_PATH = str(Path(__file__).parent.parent / "training" / "qlora-checkpoints" / "hunch_qlora.fmadapter") +QLORA_OVERRIDE_ADAPTER_PATH = str(Path(__file__).parent.parent / "training" / "qlora-checkpoints" / "hunch_qlora_overrides.fmadapter") +LORA_OVERRIDE_ADAPTER_PATH = str(Path(__file__).parent.parent / "training" / "exports" / "hunch_overrides.fmadapter") + + +def _run_hunch_batch(prompts, extra_args=None, runs=1): + """Run all prompts in a single hunch process using --batch mode. + + This avoids the TGOnDeviceInferenceProviderService disk leak where each + process invocation caches a ~160MB copy of the adapter. + + Returns: dict keyed by (run, id) if runs > 1, or by id if runs == 1. + """ + # Write prompts to a temp JSONL file + import tempfile + with tempfile.NamedTemporaryFile(mode="w", suffix=".jsonl", delete=False) as f: + for p in prompts: + f.write(json.dumps({"id": p["id"], "prompt": p["prompt"]}) + "\n") + batch_path = f.name + + cmd = ["hunch"] + if extra_args: + cmd.extend(extra_args) + cmd.extend(["--batch", batch_path]) + if runs > 1: + cmd.extend(["--runs", str(runs)]) + + try: + proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True) + results = {} + count = 0 + total = len(prompts) * runs + for line in proc.stdout: + line = line.strip() + if not line: + continue + try: + r = json.loads(line) + count += 1 + status = r.get("result", "")[:40] + print(f" [{count}/{total}] #{r.get('id', '?'):3d}: {r.get('prompt', '')[:50]:50s} → {status} ({r.get('total_time', 0)}s)") + if runs > 1: + results[(r["run"], r["id"])] = r + else: + results[r["id"]] = r + except (json.JSONDecodeError, KeyError): + continue + proc.wait() + return results + except Exception: + return {} + finally: + os.unlink(batch_path) + + +def _make_batch_approach(extra_args): + """Create a batch-aware approach function for adapter benchmarks.""" + def approach(prompt): + # Fallback for single-prompt calls (e.g. 
--ids) + return _run_hunch(prompt, extra_args) + approach._batch_args = extra_args + return approach + + +def approach_adapter_only(prompt): + """LoRA adapter only, no retrieval.""" + return _run_hunch(prompt, ["--adapter", ADAPTER_PATH, "--limit", "0"]) +approach_adapter_only._batch_args = ["--adapter", ADAPTER_PATH, "--limit", "0"] + + +def approach_adapter_retrieval(prompt): + """LoRA adapter + retrieval.""" + return _run_hunch(prompt, ["--adapter", ADAPTER_PATH]) +approach_adapter_retrieval._batch_args = ["--adapter", ADAPTER_PATH] + + +def approach_fp16lora_only(prompt): + """fp16 LoRA adapter only, no retrieval.""" + return _run_hunch(prompt, ["--adapter", QLORA_FP16_ADAPTER_PATH, "--limit", "0"]) +approach_fp16lora_only._batch_args = ["--adapter", QLORA_FP16_ADAPTER_PATH, "--limit", "0"] + + +def approach_fp16lora_retrieval(prompt): + """fp16 LoRA adapter + retrieval.""" + return _run_hunch(prompt, ["--adapter", QLORA_FP16_ADAPTER_PATH]) +approach_fp16lora_retrieval._batch_args = ["--adapter", QLORA_FP16_ADAPTER_PATH] + + +def approach_qlora_only(prompt): + """True QLoRA (NF4) adapter only, no retrieval.""" + return _run_hunch(prompt, ["--adapter", QLORA_NF4_ADAPTER_PATH, "--limit", "0"]) +approach_qlora_only._batch_args = ["--adapter", QLORA_NF4_ADAPTER_PATH, "--limit", "0"] + + +def approach_qlora_retrieval(prompt): + """True QLoRA (NF4) adapter + retrieval.""" + return _run_hunch(prompt, ["--adapter", QLORA_NF4_ADAPTER_PATH]) +approach_qlora_retrieval._batch_args = ["--adapter", QLORA_NF4_ADAPTER_PATH] + + +def approach_qlora_override_only(prompt): + """QLoRA trained on overrides only, no retrieval.""" + return _run_hunch(prompt, ["--adapter", QLORA_OVERRIDE_ADAPTER_PATH, "--limit", "0"]) +approach_qlora_override_only._batch_args = ["--adapter", QLORA_OVERRIDE_ADAPTER_PATH, "--limit", "0"] + + +def approach_qlora_override_retrieval(prompt): + """QLoRA trained on overrides only + retrieval.""" + return _run_hunch(prompt, ["--adapter", QLORA_OVERRIDE_ADAPTER_PATH]) +approach_qlora_override_retrieval._batch_args = ["--adapter", QLORA_OVERRIDE_ADAPTER_PATH] + + +QLORA_MPS_ADAPTER_PATH = str(Path(__file__).parent.parent / "training" / "qlora-checkpoints" / "hunch_qlora_mps.fmadapter") + + +def approach_qlora_mps_only(prompt): + """QLoRA trained on Mac (MPS), no retrieval.""" + return _run_hunch(prompt, ["--adapter", QLORA_MPS_ADAPTER_PATH, "--limit", "0"]) +approach_qlora_mps_only._batch_args = ["--adapter", QLORA_MPS_ADAPTER_PATH, "--limit", "0"] + + +def approach_qlora_mps_retrieval(prompt): + """QLoRA trained on Mac (MPS) + retrieval.""" + return _run_hunch(prompt, ["--adapter", QLORA_MPS_ADAPTER_PATH]) +approach_qlora_mps_retrieval._batch_args = ["--adapter", QLORA_MPS_ADAPTER_PATH] + + def approach_dynshot_tldr(prompt): """Dynamic few-shot using tldr+overrides FTS5 index (21k entries).""" import sqlite3 @@ -576,6 +705,18 @@ def approach_dynshot_holdout(prompt): "hunch-multi": approach_hunch_multi, "hunch-cotmulti": approach_hunch_cotmulti, "hunch-multi-warm": approach_hunch_multi_warm, + "adapter-only": approach_adapter_only, + "adapter-retrieval": approach_adapter_retrieval, + "fp16lora-only": approach_fp16lora_only, + "fp16lora-retrieval": approach_fp16lora_retrieval, + "qlora-only": approach_qlora_only, + "qlora-retrieval": approach_qlora_retrieval, + "qlora-override-only": approach_qlora_override_only, + "qlora-override-retrieval": approach_qlora_override_retrieval, + "qlora-mps-only": approach_qlora_mps_only, + "qlora-mps-retrieval": approach_qlora_mps_retrieval, + 
"lora-override-only": _make_batch_approach(["--adapter", LORA_OVERRIDE_ADAPTER_PATH, "--limit", "0"]), + "lora-override-retrieval": _make_batch_approach(["--adapter", LORA_OVERRIDE_ADAPTER_PATH]), "hunch-sc": approach_hunch_sc, "sc-dynshot": approach_selfconsist_dynshot, "sc-warm": approach_selfconsist_warm, @@ -595,14 +736,51 @@ def load_prompts(ids=None, category=None): return prompts -def run_benchmark(approach_name, prompts): +def run_benchmark(approach_name, prompts, suffix="", runs=1): func = APPROACHES[approach_name] - outfile = RESULTS_DIR / f"{approach_name}.jsonl" print(f"\n{'=' * 60}") - print(f" APPROACH: {approach_name} ({len(prompts)} prompts)") + print(f" APPROACH: {approach_name} ({len(prompts)} prompts{f', {runs} runs' if runs > 1 else ''})") print(f"{'=' * 60}") + # Use batch mode for adapter approaches (avoids disk leak) + batch_args = getattr(func, '_batch_args', None) + if batch_args and len(prompts) > 1: + print(f" Using --batch mode (single process, avoids adapter disk leak)") + batch_results = _run_hunch_batch(prompts, batch_args, runs=runs) + + all_results = [] + for run_num in range(1, runs + 1): + run_suffix = f"-run{run_num}" if runs > 1 else "" + outfile = RESULTS_DIR / f"{approach_name}{suffix}{run_suffix}.jsonl" + + results = [] + with open(outfile, "w") as f: + for p in prompts: + if runs > 1: + br = batch_results.get((run_num, p["id"]), {}) + else: + br = batch_results.get(p["id"], {}) + r = { + "result": br.get("result", "[BATCH_ERROR]"), + "total_time": br.get("total_time", 0), + } + r["id"] = p["id"] + r["approach"] = approach_name + r["prompt"] = p["prompt"] + r["expected"] = p["expected"] + r["category"] = p["category"] + + f.write(json.dumps(r) + "\n") + f.flush() + results.append(r) + + print(f" Saved to {outfile}") + all_results.extend(results) + + return all_results + + outfile = RESULTS_DIR / f"{approach_name}{suffix}.jsonl" results = [] with open(outfile, "w") as f: for i, p in enumerate(prompts): @@ -637,6 +815,7 @@ def main(): parser.add_argument("approach", nargs="?", default="all", help="Approach name or 'all'") parser.add_argument("--ids", help="Comma-separated prompt IDs") parser.add_argument("--category", help="Filter by category: simple, flags, composed") + parser.add_argument("--runs", type=int, default=1, help="Number of runs (output files suffixed -run1, -run2, ...)") args = parser.parse_args() ids = [int(x) for x in args.ids.split(",")] if args.ids else None @@ -655,7 +834,22 @@ def main(): if a not in APPROACHES: print(f"Unknown approach: {a}. Available: {', '.join(APPROACHES.keys())}") sys.exit(1) - run_benchmark(a, prompts) + + for a in approaches: + func = APPROACHES[a] + batch_args = getattr(func, '_batch_args', None) + if batch_args and args.runs > 1: + # Adapter approaches: all runs in one process + run_benchmark(a, prompts, runs=args.runs) + elif args.runs > 1: + # Non-adapter approaches: loop externally + for run_num in range(1, args.runs + 1): + print(f"\n{'#' * 60}") + print(f" RUN {run_num}/{args.runs}") + print(f"{'#' * 60}") + run_benchmark(a, prompts, suffix=f"-run{run_num}") + else: + run_benchmark(a, prompts) print(f"\nDone. Run: python3 score.py") diff --git a/cli/Sources/Hunch/main.swift b/cli/Sources/Hunch/main.swift index 18c9c65..583d0b2 100644 --- a/cli/Sources/Hunch/main.swift +++ b/cli/Sources/Hunch/main.swift @@ -158,6 +158,9 @@ struct Hunch { let samples = parseFlag(&args, flag: "--samples").flatMap(Int.init) ?? 1 let limit = parseFlag(&args, flag: "--limit").flatMap(Int.init) ?? 
8 let guided = parseFlag(&args, flag: "--guided") + let adapterPath = parseFlag(&args, flag: "--adapter") + let batchFile = parseFlag(&args, flag: "--batch") + let batchRuns = parseFlag(&args, flag: "--runs").flatMap(Int.init) ?? 1 // Parse mode var mode: Mode = .suggest @@ -169,6 +172,20 @@ struct Hunch { args.removeFirst() } + // Batch mode: read prompts from JSONL, run all in one process + if let batchFile { + do { + try await runBatch( + file: batchFile, adapterPath: adapterPath, temperature: temperature, + limit: limit, guided: guided, runs: batchRuns + ) + } catch { + fputs("error: \(error.localizedDescription)\n", stderr) + exit(1) + } + return + } + guard !args.isEmpty else { printUsage() return @@ -238,9 +255,19 @@ struct Hunch { let systemPrompt = buildSystemPrompt(mode: mode, examples: examples) do { - let model = SystemLanguageModel( - guardrails: .permissiveContentTransformations - ) + let model: SystemLanguageModel + if let adapterPath { + let adapterURL = URL(fileURLWithPath: adapterPath) + let adapter = try SystemLanguageModel.Adapter(fileURL: adapterURL) + model = SystemLanguageModel( + adapter: adapter, + guardrails: .permissiveContentTransformations + ) + } else { + model = SystemLanguageModel( + guardrails: .permissiveContentTransformations + ) + } // Build generation options only when temperature is set let genOptions: GenerationOptions? = temperature.map { @@ -380,6 +407,127 @@ struct Hunch { } } + static func runBatch( + file: String, adapterPath: String?, temperature: Double?, + limit: Int, guided: String?, runs: Int = 1 + ) async throws { + // Read JSONL file + let contents = try String(contentsOfFile: file, encoding: .utf8) + let lines = contents.components(separatedBy: .newlines).filter { !$0.isEmpty } + + // Load model once + let model: SystemLanguageModel + if let adapterPath { + let adapterURL = URL(fileURLWithPath: adapterPath) + let adapter = try SystemLanguageModel.Adapter(fileURL: adapterURL) + model = SystemLanguageModel( + adapter: adapter, + guardrails: .permissiveContentTransformations + ) + } else { + model = SystemLanguageModel( + guardrails: .permissiveContentTransformations + ) + } + + let genOptions: GenerationOptions? = temperature.map { + var opts = GenerationOptions() + opts.temperature = $0 + return opts + } + + let dbPath = findDatabase() + + for run in 1...runs { + for line in lines { + guard let data = line.data(using: .utf8), + let entry = try? JSONSerialization.jsonObject(with: data) as? [String: Any], + let idValue = entry["id"], let id = idValue as? Int ?? (idValue as? NSNumber)?.intValue, + let prompt = entry["prompt"] as? String else { + continue + } + + let start = CFAbsoluteTimeGetCurrent() + var result: String + + do { + let examples = dbPath != nil + ? 
searchBank(dbPath: dbPath!, query: prompt, limit: limit)
+                        : []
+                    let systemPrompt = buildSystemPrompt(mode: .suggest, examples: examples)
+
+                    let session: LanguageModelSession
+                    if !systemPrompt.isEmpty {
+                        let segment = Transcript.TextSegment(content: systemPrompt)
+                        let instructions = Transcript.Instructions(
+                            segments: [.text(segment)],
+                            toolDefinitions: []
+                        )
+                        session = LanguageModelSession(
+                            model: model,
+                            transcript: Transcript(entries: [.instructions(instructions)])
+                        )
+                    } else {
+                        session = LanguageModelSession(model: model)
+                    }
+
+                    if guided == "plain" {
+                        let response: LanguageModelSession.Response<ShellCommand>
+                        if let opts = genOptions {
+                            response = try await session.respond(to: prompt, generating: ShellCommand.self, options: opts)
+                        } else {
+                            response = try await session.respond(to: prompt, generating: ShellCommand.self)
+                        }
+                        result = response.content.command
+                    } else if guided == "cot" {
+                        let response: LanguageModelSession.Response<ShellCommandCoT>
+                        if let opts = genOptions {
+                            response = try await session.respond(to: prompt, generating: ShellCommandCoT.self, options: opts)
+                        } else {
+                            response = try await session.respond(to: prompt, generating: ShellCommandCoT.self)
+                        }
+                        result = response.content.command
+                    } else if guided == "multi" {
+                        let response: LanguageModelSession.Response<ShellCommandMulti>
+                        if let opts = genOptions {
+                            response = try await session.respond(to: prompt, generating: ShellCommandMulti.self, options: opts)
+                        } else {
+                            response = try await session.respond(to: prompt, generating: ShellCommandMulti.self)
+                        }
+                        result = majorityVote([response.content.first, response.content.second, response.content.third])
+                    } else {
+                        // Default: plain string
+                        let response: LanguageModelSession.Response<String>
+                        if let opts = genOptions {
+                            response = try await session.respond(to: prompt, options: opts)
+                        } else {
+                            response = try await session.respond(to: prompt)
+                        }
+                        result = stripMarkdown(response.content)
+                    }
+                } catch {
+                    result = "[ERROR] \(error.localizedDescription)"
+                }
+
+                let elapsed = round((CFAbsoluteTimeGetCurrent() - start) * 100) / 100
+                var output: [String: Any] = [
+                    "id": id,
+                    "prompt": prompt,
+                    "result": result,
+                    "total_time": elapsed
+                ]
+                if runs > 1 {
+                    output["run"] = run
+                }
+                if let jsonData = try? JSONSerialization.data(withJSONObject: output),
+                   let jsonString = String(data: jsonData, encoding: .utf8) {
+                    print(jsonString)
+                    fflush(stdout)
+                }
+            }
+        }
+    }
+
     static func printUsage() {
         let dbStatus = findDatabase() != nil ? "found" : "not found"
         let envTemp = ProcessInfo.processInfo.environment["HUNCH_TEMPERATURE"] ?? "not set"
diff --git a/training/TRAINING.md b/training/TRAINING.md
new file mode 100644
index 0000000..b3c0b88
--- /dev/null
+++ b/training/TRAINING.md
@@ -0,0 +1,274 @@
+# Training Guide
+
+How to train a LoRA adapter for Apple's on-device 3B Foundation Model using the hunch dataset.
+
+## Prerequisites
+
+1. **Apple Developer Program** ($99/year) — needed to download the training toolkit
+2. **Adapter training toolkit** — download from [developer.apple.com/apple-intelligence/foundation-models-adapter/](https://developer.apple.com/apple-intelligence/foundation-models-adapter/)
+3. 
**Google account** — for Colab (free tier works for QLoRA and fp16 LoRA) + +## Files + +``` +training/ +├── train_lora.ipynb # LoRA training notebook (needs A100) +├── train_lora_fp16.ipynb # fp16 LoRA training notebook (works on free T4) +├── train_qlora.ipynb # QLoRA training notebook (works on free T4, recommended) +├── train_qlora_full.py # QLoRA training script (T4 or Mac) +├── train_qlora_test.py # Quick smoke test (load model, one forward/backward pass) +├── prepare_data.py # Converts hunch bank → training JSONL +├── bench_mps.py # Metal vs CPU fallback benchmark +└── TRAINING.md # This file +``` + +## Quick Start + +### 1. Download the toolkit + +Download from developer.apple.com, extract into this directory: + +``` +training/adapter_training_toolkit_v26_0_0/ +├── assets/ # Base model weights (12GB) +├── examples/ # Training scripts +├── export/ # .fmadapter export +└── requirements.txt +``` + +### 2. Choose your path + +| Path | GPU | Cost | VRAM | Time (overrides) | Time (full bank) | +|------|-----|------|------|------------------|------------------| +| **QLoRA on Mac** | Apple Silicon | **Free, local** | **~5GB** | **~34 min** | ~hours | +| QLoRA on Colab | T4 16GB | Free | ~5GB | ~5 min | ~1.7 hours | +| fp16 LoRA on Colab | T4 16GB | Free | ~8.5GB | ~10 min | ~2 hours | +| LoRA on Colab | A100 40GB | Colab Pro ($10/mo) | ~15GB | ~5 min | ~2.5 hours | + +**QLoRA is recommended.** Same adapter quality as full LoRA, lowest memory, fewest patches. Mac training is ~7x slower than T4 but fully local. + +### Path A: Train on Mac (recommended for small datasets) + +```bash +cd training/adapter_training_toolkit_v26_0_0 +source venv/bin/activate + +# Install native Metal kernel support for bitsandbytes +pip install kernels +pip install --force-reinstall git+https://github.com/bitsandbytes-foundation/bitsandbytes.git + +# Prepare data and train +cd .. +python3 prepare_data.py --sources override +python3 train_qlora_full.py --epochs 20 --batch-size 8 + +# Export — bitsandbytes from main pulls in PyTorch 2.11, but coremltools 8.3.0 +# ships native C extensions only for Python ≤3.13 and PyTorch ≤2.5. +# Create a separate env with compatible versions: +cd adapter_training_toolkit_v26_0_0 +python3.12 -m venv export-env +source export-env/bin/activate +pip install torch==2.5.0 coremltools==8.3.0 +python3 -m export.export_fmadapter \ + --adapter-name hunch_qlora \ + --checkpoint ../qlora-checkpoints/adapter-final.pt \ + --output-dir ../qlora-checkpoints/ +``` + +Notes: +- Requires bitsandbytes from git main (pre-v0.50.0) with native MPS kernels (PR #1875) +- The `kernels` package downloads pre-compiled Metal shaders from HuggingFace Hub at runtime +- Don't use `bnb_4bit_use_double_quant=True` — not wired for MPS yet +- ~34 min for 20 epochs of 96 examples on M4, ~5GB GPU peak. 
Full bank (~19k) would take hours + +### Path B: Train on Colab + +Upload to Google Drive: + +``` +My Drive/hunch-training/ +├── adapter_training_toolkit_v26_0_0/ # The extracted toolkit +├── prepare_data.py # From this directory +├── train_qlora_full.py # From this directory (for QLoRA) +├── tldr_bank.db # From ../bank/ +└── prompts.jsonl # From ../benchmark/ +``` + +Choose a notebook: + +| Notebook | GPU | Patches | +|----------|-----|---------| +| `train_qlora.ipynb` | T4 16GB (free) | 1 (rms_norm) | +| `train_lora_fp16.ipynb` | T4 16GB (free) | 3 (mmap, grad scaling, rms_norm) | +| `train_lora.ipynb` | A100 40GB (Pro) | None | + +Open in Colab via the VS Code extension or upload directly to [colab.research.google.com](https://colab.research.google.com). Run cells in order. + +### 3. Test on-device + +```bash +hunch --adapter path/to/hunch.fmadapter "find files changed in the last hour" +``` + +## Training Data + +`prepare_data.py` converts the hunch bank into training JSONL: + +```bash +python3 prepare_data.py # full bank (~19k train / ~3k eval) +python3 prepare_data.py --sources override # overrides only (~96 examples, recommended) +python3 prepare_data.py --sources tldr-osx # macOS-specific tldr pages (~1k) +python3 prepare_data.py --sources override,tldr-osx # overrides + macOS (~1.1k) +python3 prepare_data.py --stats # show dataset statistics +``` + +Each training example: +```json +[ + {"role": "system", "content": "Output a single shell command for zsh on macOS..."}, + {"role": "user", "content": "find files changed in the last hour"}, + {"role": "assistant", "content": "find . -mmin -60"} +] +``` + +- Benchmark prompts excluded to avoid data leakage +- Override and tldr-osx entries appear in both splits + +**Use `--sources override` for best results.** Adapters trained on ~96 curated overrides (~5 min on T4) significantly outperform adapters trained on the full 19k bank (~1.7 hours on T4). Quality over quantity — see README.md for benchmark results. + +## How Each Approach Works + +### QLoRA (recommended) + +Quantizes the frozen base model to 4-bit NF4 via `bitsandbytes`, and uses `mmap=True` loading to avoid the 12GB CPU RAM spike. Only `nn.Linear` layers are quantized (attention Q/K/V/O, FFN — ~90% of params). Embeddings, norms, and other layers stay in fp16. Adapters train in fp32. + +Memory breakdown: +- CPU RAM peak: **~1GB** (mmap reads weights from disk on demand) +- Base model Linear layers: ~1.5GB (NF4) +- Base model non-Linear: ~0.65GB (fp16) +- Adapters + gradients + optimizer: ~0.6GB (fp32) +- Activations: ~2-3GB +- **GPU total: ~5GB** + +Only one patch needed: rms_norm dtype fix for mixed fp16/fp32/quantized tensors through norm layers. + +### fp16 LoRA + +Forces the base model to fp16 and uses `mmap=True` loading. Both changes are patches to Apple's toolkit — the default loads fp32 without mmap, which requires ~24GB CPU RAM and 12GB GPU. Requires three patches total. 
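+
+The gist of the loading change (Patch 1, described below), as a rough sketch; `load_base_checkpoint` is an illustrative name, not the toolkit's actual API:
+
+```python
+import torch
+
+def load_base_checkpoint(model, checkpoint_path):
+    # The model itself is constructed with dtype=float16 (the other half of Patch 1).
+    # mmap=True streams tensors from disk on demand instead of materializing the
+    # full ~12GB state dict in CPU RAM (requires a zipfile-format checkpoint).
+    state_dict = torch.load(checkpoint_path, map_location="cpu", mmap=True)
+    model.load_state_dict(state_dict, strict=False)
+    # Trainable adapter parameters are cast back to fp32 so GradScaler sees fp32 gradients.
+    for param in model.parameters():
+        if param.requires_grad:
+            param.data = param.data.float()
+    return model
+```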
+ +Memory breakdown: +- CPU RAM peak: **~1GB** (mmap, vs ~24GB without) +- Base model: ~6GB (fp16, vs ~12GB fp32) +- Adapters + gradients + optimizer: ~0.6GB (fp32) +- Activations: ~2-3GB +- **GPU total: ~8.5GB** + +**Patch 1 — `utils.py`: mmap + fp16 model + fp32 adapters** +- `mmap=True` on `torch.load`: reads weights from disk on demand instead of loading 12GB into RAM +- `model_config.dtype = torch.float16`: creates the model in fp16 (6GB GPU instead of 12GB) +- Casts adapter weights back to fp32: GradScaler needs fp32 gradients + +**Patch 2 — `train_adapter.py`: gradient scaling for f16-mixed** +- Apple's code only enables GradScaler for a `"f16"` precision mode that isn't exposed as a CLI option +- When running with `f16-mixed` and an fp16 model, gradients overflow without scaling → loss = NaN +- Fix: enable GradScaler for `f16-mixed` too + +**Patch 3 — `tamm/layers/functional.py`: rms_norm dtype fix** +- `torch.rms_norm` requires input and weight to have the same dtype +- fp16 model has fp16 weights, but mixed-precision casts input to fp32 +- Fix: cast weight to match input dtype before calling rms_norm + +All patches are applied automatically by the notebook. To restore originals, re-copy from the toolkit on Drive. + +### Standard LoRA + +Loads the base model in fp32. No patches needed. The ~15GB GPU footprint barely fits a T4 (16GB) with no headroom, but loading crashes first — the 12GB checkpoint must be fully loaded into CPU RAM alongside the model, peaking at ~24GB. T4 only has 12GB system RAM. The A100 works because it has 80GB system RAM. + +Memory breakdown: +- CPU RAM peak: **~24GB** during loading (12GB model + 12GB state dict simultaneously — no mmap) +- Base model on GPU: ~12GB (fp32) +- Adapters + gradients + optimizer: ~0.6GB (fp32) +- Activations: ~2-3GB (fp32) +- **GPU total: ~15GB** + +The CPU RAM spike is why standard LoRA OOMs on a 24GB Mac and on T4 (12GB system RAM). The A100's 80GB system RAM hides this. fp16 LoRA and QLoRA avoid this with `mmap=True` loading (~1GB RAM peak instead of 24GB). + +## Export + +The export step packages the LoRA weights into a `.fmadapter` file that can be loaded on-device: + +```bash +cd adapter_training_toolkit_v26_0_0 +python3 -m export.export_fmadapter \ + --adapter-name hunch \ + --checkpoint ../checkpoints/adapter-final.pt \ + --output-dir ../exports/ +``` + +**Note for Mac training:** The training venv has PyTorch 2.11 (from bitsandbytes main) which is too new for coremltools. Export in a separate Python 3.12 environment — see Path A in Quick Start above. + +Output is ~130MB. The adapter name can only contain letters, numbers, and underscores. + +**Do not modify the export code** — the `.fmadapter` format must match exactly for on-device compatibility. + +The `.fmadapter` format doesn't record training precision — adapters trained via QLoRA, fp16 LoRA, or fp32 LoRA all export identically and load the same on-device. + +## Loading in Swift + +```swift +let adapter = try SystemLanguageModel.Adapter(fileURL: localURL) +let model = SystemLanguageModel(adapter: adapter) +let session = LanguageModelSession(model: model) +let response = try await session.respond(to: "find files changed in the last hour") +``` + +No entitlement needed for local testing. Entitlement required only for App Store distribution. 
+
+## Key Training Parameters
+
+| Parameter | Override-only (recommended) | Full bank |
+|-----------|---------------------------|-----------|
+| `--batch-size` | 8 | 8 |
+| `--learning-rate` | 1e-4 | 1e-4 |
+| `--epochs` | 20 | 3 |
+| `--sources` (prepare_data.py) | `override` | (default) |
+
+These apply to all three approaches (LoRA, fp16 LoRA, QLoRA). Override-only trains on ~96 examples and needs more epochs to converge. Full bank has ~19k examples and overfits after 3.
+
+## On-Device Accuracy
+
+All three approaches produce comparable adapters. QLoRA is recommended — same quality, lowest cost.
+
+| Approach | + Retrieval | Standalone | Trained on |
+|---|---|---|---|
+| QLoRA (Mac) | ~86% | ~76% | Local |
+| QLoRA (T4) | ~85% | ~74% | T4 free |
+| LoRA (A100) | ~85% | ~72.5% | A100 |
+| Retrieval only | ~79% | — | — |
+| Bare model | — | ~41% | — |
+
+Full benchmark details and analysis in README.md.
+
+## Known Issues
+
+### Adapter disk space leak
+
+`TGOnDeviceInferenceProviderService` caches a full copy of the adapter (~160MB) in a SIP-protected directory on every process invocation. The copies are never cleaned up. Running benchmarks (hundreds of adapter calls) can consume tens of GB invisibly.
+
+**Workaround:** Use `hunch --batch` to run multiple prompts in a single process (1 cached copy instead of 1 per prompt).
+
+To reclaim space, boot into Recovery Mode and run `rm -rf /Volumes/Data/private/var/db/AppleIntelligencePlatform/AppModelAssets/*`. The service recreates what it needs on the next adapter load.
+
+## Troubleshooting
+
+**OOM on T4 (QLoRA):** Make sure `bitsandbytes` is installed and the model is being quantized. Check for "Quantized 280 layers to NF4" in the output.
+
+**OOM on T4 (fp16 LoRA):** Make sure all three patches are applied. Run the patch cell before training.
+
+**loss = NaN:** The rms_norm patch didn't apply, or the pycache is stale. The notebook clears pycache automatically, but if you see NaN, restart the kernel and re-run from the patch cell.
+
+**Return code -9:** The OS killed the process for memory. On T4, this means system RAM (12GB) is full. Make sure mmap is patched (check for `mmap=True` in utils.py).
+
+**Adapter name error:** Use only letters, numbers, and underscores. No hyphens.
+
+**coremltools warnings:** Ignore them. The export works despite the warnings.
diff --git a/training/bench_mps.py b/training/bench_mps.py
new file mode 100644
index 0000000..a1bd6c4
--- /dev/null
+++ b/training/bench_mps.py
@@ -0,0 +1,191 @@
+#!/usr/bin/env python3
+"""
+Benchmark QLoRA training on MPS: Metal kernels vs CPU fallback.
+
+Measures load time, training throughput, and memory usage.
+Run with both bitsandbytes versions to compare:
+
+    # With Metal kernels (bitsandbytes from main)
+    python3 bench_mps.py --epochs 3 --label metal
+
+    # Without Metal kernels (bitsandbytes 0.49.2)
+    python3 bench_mps.py --epochs 3 --label cpu-fallback
+
+    # Longer sequences (override + tldr-osx)
+    python3 bench_mps.py --epochs 3 --sources override,tldr-osx --label metal-long
+
+Results are appended to bench_mps_results.jsonl for comparison.
+""" + +import sys +import os +import gc +import json +import time +import argparse +import psutil +from pathlib import Path + +TOOLKIT_DIR = os.path.join(os.path.dirname(os.path.abspath(__file__)), "adapter_training_toolkit_v26_0_0") +sys.path.insert(0, TOOLKIT_DIR) +TRAINING_DIR = Path(__file__).parent + +import torch +import torch.nn as nn +from torch.utils.data import DataLoader + + +def mem_stats(): + ram = psutil.Process().memory_info().rss / 1024**3 + gpu = 0 + if torch.backends.mps.is_available(): + gpu = torch.mps.current_allocated_memory() / 1024**3 + elif torch.cuda.is_available(): + gpu = torch.cuda.memory_allocated() / 1024**3 + cpu_pct = psutil.cpu_percent(interval=None) + return {"ram_gb": round(ram, 2), "gpu_gb": round(gpu, 2), "cpu_pct": cpu_pct} + + +def main(): + parser = argparse.ArgumentParser() + parser.add_argument("--epochs", type=int, default=3) + parser.add_argument("--batch-size", type=int, default=8) + parser.add_argument("--sources", default="override") + parser.add_argument("--label", required=True, help="Label for this run (e.g. 'metal', 'cpu-fallback')") + parser.add_argument("--repeat", type=int, default=1, help="Number of full runs to average") + args = parser.parse_args() + + # Check bitsandbytes version + import bitsandbytes as bnb + bnb_version = getattr(bnb, '__version__', 'unknown') + print(f"bitsandbytes: {bnb_version}") + print(f"Label: {args.label}") + print(f"Sources: {args.sources}") + print(f"Epochs: {args.epochs}, Batch: {args.batch_size}, Repeats: {args.repeat}") + print() + + # Prepare data if needed + train_path = TRAINING_DIR / "train.jsonl" + if not train_path.exists(): + os.system(f"cd {TRAINING_DIR} && python3 prepare_data.py --sources {args.sources}") + else: + # Regenerate with correct sources + os.system(f"cd {TRAINING_DIR} && python3 prepare_data.py --sources {args.sources}") + + # Import training components + from train_qlora_full import ( + CommandDataset, collate_fn, load_model_qlora, patch_rms_norm, + train_epoch, evaluate + ) + from tamm.tokenizers.afm import AFMTokenizer + + results = [] + + for run in range(1, args.repeat + 1): + print(f"{'='*60}") + print(f" Run {run}/{args.repeat}") + print(f"{'='*60}") + + # Start CPU monitoring + psutil.cpu_percent(interval=None) # reset + + # Phase 1: Load & quantize + t_load_start = time.time() + patch_rms_norm() + device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cuda") + model = load_model_qlora(device) + t_load = time.time() - t_load_start + mem_after_load = mem_stats() + print(f" Load+quantize: {t_load:.1f}s | {mem_after_load}") + + # Phase 2: Setup data + tokenizer = AFMTokenizer(str(Path(TOOLKIT_DIR) / "assets" / "tokenizer.model")) + train_dataset = CommandDataset(str(train_path), tokenizer) + eval_dataset = CommandDataset(str(TRAINING_DIR / "eval.jsonl"), tokenizer) + train_loader = DataLoader(train_dataset, batch_size=args.batch_size, shuffle=True, collate_fn=collate_fn) + eval_loader = DataLoader(eval_dataset, batch_size=args.batch_size, collate_fn=collate_fn) + print(f" Data: {len(train_dataset)} train, {len(eval_dataset)} eval, {len(train_loader)} batches/epoch") + + # Phase 3: Train + optimizer = torch.optim.AdamW( + [p for p in model.parameters() if p.requires_grad], + lr=1e-4, weight_decay=0.01 + ) + scaler = torch.amp.GradScaler(device=str(device)) if (torch.cuda.is_available() or torch.backends.mps.is_available()) else None + + epoch_times = [] + epoch_losses = [] + mem_during_training = [] + + for epoch in range(args.epochs): + 
t_epoch_start = time.time() + train_loss = train_epoch(model, train_loader, optimizer, device, epoch, scaler) + t_epoch = time.time() - t_epoch_start + epoch_times.append(t_epoch) + epoch_losses.append(train_loss) + mem = mem_stats() + mem_during_training.append(mem) + + batches = len(train_loader) + it_s = batches / t_epoch + s_it = t_epoch / batches + print(f" Epoch {epoch+1}: {t_epoch:.1f}s ({s_it:.2f}s/it, {it_s:.2f}it/s) loss={train_loss:.4f} | {mem}") + + # Phase 4: Eval + t_eval_start = time.time() + eval_loss = evaluate(model, eval_loader, device) + t_eval = time.time() - t_eval_start + print(f" Eval: {t_eval:.1f}s loss={eval_loss:.4f}") + + total_time = t_load + sum(epoch_times) + t_eval + avg_epoch = sum(epoch_times) / len(epoch_times) + avg_it_s = len(train_loader) / avg_epoch + avg_s_it = avg_epoch / len(train_loader) + + run_result = { + "label": args.label, + "run": run, + "bnb_version": bnb_version, + "sources": args.sources, + "epochs": args.epochs, + "batch_size": args.batch_size, + "train_examples": len(train_dataset), + "batches_per_epoch": len(train_loader), + "load_time_s": round(t_load, 1), + "avg_epoch_s": round(avg_epoch, 1), + "avg_s_per_it": round(avg_s_it, 2), + "avg_it_per_s": round(avg_it_s, 2), + "total_time_s": round(total_time, 1), + "final_train_loss": round(epoch_losses[-1], 4), + "eval_loss": round(eval_loss, 4), + "mem_after_load": mem_after_load, + "mem_training": mem_during_training[-1], + "epoch_times": [round(t, 1) for t in epoch_times], + } + results.append(run_result) + + print(f"\n Summary: {avg_s_it:.2f}s/it ({avg_it_s:.2f}it/s), total {total_time:.0f}s") + print() + + # Cleanup for next run + del model, optimizer, scaler, train_loader, eval_loader + gc.collect() + if torch.backends.mps.is_available(): + torch.mps.empty_cache() + + # Save results + results_file = TRAINING_DIR / "bench_mps_results.jsonl" + with open(results_file, "a") as f: + for r in results: + f.write(json.dumps(r) + "\n") + print(f"Results appended to {results_file}") + + # Print comparison-ready summary + if len(results) > 1: + avg_it = sum(r["avg_s_per_it"] for r in results) / len(results) + avg_total = sum(r["total_time_s"] for r in results) / len(results) + print(f"\nAverage across {len(results)} runs: {avg_it:.2f}s/it, {avg_total:.0f}s total") + + +if __name__ == "__main__": + main() diff --git a/training/prepare_data.py b/training/prepare_data.py new file mode 100644 index 0000000..dea923c --- /dev/null +++ b/training/prepare_data.py @@ -0,0 +1,193 @@ +#!/usr/bin/env python3 +"""Convert the hunch bank into training data for Apple FM adapter training. + +Produces JSONL files in the format expected by Apple's adapter training toolkit: + [{"role": "system", "content": "..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}] + +Usage: + python3 prepare_data.py # generate train.jsonl + eval.jsonl + python3 prepare_data.py --stats # show dataset statistics + python3 prepare_data.py --eval-split 0.1 # 10% eval split (default) +""" + +import json +import sqlite3 +import random +import argparse +from pathlib import Path + +BANK_DB = Path(__file__).parent.parent / "bank" / "tldr_bank.db" +BENCHMARK_PROMPTS = Path(__file__).parent.parent / "benchmark" / "prompts.jsonl" +TRAIN_FILE = Path(__file__).parent / "train.jsonl" +EVAL_FILE = Path(__file__).parent / "eval.jsonl" + +SYSTEM_PROMPT = "Output a single shell command for zsh on macOS. No explanation, no markdown, no backticks. Just the command." 
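+# One fixed system prompt for every example; mirrors the single-command output the CLI expects at inference time.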
+ + +def load_bank(): + """Load all Q/A pairs from the bank.""" + conn = sqlite3.connect(str(BANK_DB)) + rows = conn.execute( + "SELECT question, answer, cmd, source FROM bank" + ).fetchall() + conn.close() + return [{"q": q, "a": a, "cmd": cmd, "source": src} for q, a, cmd, src in rows] + + +def load_benchmark_prompts(): + """Load benchmark prompts to exclude from training data.""" + if not BENCHMARK_PROMPTS.exists(): + return set() + prompts = set() + with open(BENCHMARK_PROMPTS) as f: + for line in f: + p = json.loads(line) + prompts.add(p["prompt"].lower().strip()) + return prompts + + +def to_training_example(entry): + """Convert a bank entry to Apple FM training format.""" + return [ + {"role": "system", "content": SYSTEM_PROMPT}, + {"role": "user", "content": entry["q"]}, + {"role": "assistant", "content": entry["a"]}, + ] + + +def prepare_dataset(eval_split=0.1, exclude_benchmark=True, seed=42, sources=None): + """Prepare train/eval splits from the bank. + + Args: + sources: filter by source. Options: + None or "all" — everything (default) + "override" — overrides only (~130 examples) + "macos" — overrides + tldr-osx (~1k examples) + "override,tldr-osx" — comma-separated list + """ + bank = load_bank() + print(f"Loaded {len(bank)} entries from bank") + + # Filter by source if specified + if sources and sources != "all": + allowed = set(s.strip() for s in sources.split(",")) + # "macos" is a shorthand for override + tldr-osx + if "macos" in allowed: + allowed.discard("macos") + allowed.update(["override", "tldr-osx"]) + before = len(bank) + bank = [e for e in bank if e["source"] in allowed] + print(f"Filtered to sources {allowed}: {len(bank)} entries (from {before})") + + # Count by source + by_source = {} + for entry in bank: + by_source[entry["source"]] = by_source.get(entry["source"], 0) + 1 + for src, count in sorted(by_source.items()): + print(f" {src}: {count}") + + # Exclude benchmark prompts from training to avoid data leakage + if exclude_benchmark: + benchmark = load_benchmark_prompts() + before = len(bank) + bank = [e for e in bank if e["q"].lower().strip() not in benchmark] + excluded = before - len(bank) + print(f"Excluded {excluded} entries matching benchmark prompts") + + # Deduplicate by (question, answer) + seen = set() + unique = [] + for entry in bank: + key = (entry["q"].lower().strip(), entry["a"].strip()) + if key not in seen: + seen.add(key) + unique.append(entry) + print(f"After dedup: {len(unique)} unique entries (removed {len(bank) - len(unique)})") + bank = unique + + # Split into train/eval + random.seed(seed) + random.shuffle(bank) + eval_size = max(int(len(bank) * eval_split), 1) + eval_data = bank[:eval_size] + train = bank[eval_size:] + + # For small datasets, put everything in both + if len(bank) < 500: + train = bank + eval_data = bank + print(f"Small dataset — using all {len(bank)} examples for both train and eval") + else: + print(f"\nDataset split:") + print(f" Train: {len(train)} examples") + print(f" Eval: {len(eval_data)} examples") + + return train, eval_data + + +def write_jsonl(data, path): + """Write training data in Apple FM format.""" + with open(path, "w") as f: + for entry in data: + example = to_training_example(entry) + f.write(json.dumps(example) + "\n") + print(f"Wrote {len(data)} examples to {path}") + + +def show_stats(data, label): + """Show dataset statistics.""" + by_source = {} + by_cmd = {} + total_q_len = 0 + total_a_len = 0 + + for entry in data: + by_source[entry["source"]] = by_source.get(entry["source"], 0) + 1 + 
by_cmd[entry["cmd"]] = by_cmd.get(entry["cmd"], 0) + 1 + total_q_len += len(entry["q"]) + total_a_len += len(entry["a"]) + + print(f"\n{label} ({len(data)} examples):") + print(f" By source:") + for src, count in sorted(by_source.items(), key=lambda x: -x[1]): + print(f" {src}: {count}") + print(f" Unique commands: {len(by_cmd)}") + print(f" Avg question length: {total_q_len / len(data):.0f} chars") + print(f" Avg answer length: {total_a_len / len(data):.0f} chars") + print(f" Top commands:") + for cmd, count in sorted(by_cmd.items(), key=lambda x: -x[1])[:10]: + print(f" {cmd}: {count}") + + +def main(): + parser = argparse.ArgumentParser() + parser.add_argument("--eval-split", type=float, default=0.1) + parser.add_argument("--stats", action="store_true") + parser.add_argument("--no-exclude-benchmark", action="store_true") + parser.add_argument("--sources", default=None, help="Filter sources: override, macos, tldr-osx, tldr-common, or all") + args = parser.parse_args() + + train, eval_data = prepare_dataset( + eval_split=args.eval_split, + exclude_benchmark=not args.no_exclude_benchmark, + sources=args.sources, + ) + + if args.stats: + show_stats(train, "Train") + show_stats(eval_data, "Eval") + else: + write_jsonl(train, TRAIN_FILE) + write_jsonl(eval_data, EVAL_FILE) + + # Show a few examples + print("\nSample training examples:") + for entry in train[:3]: + ex = to_training_example(entry) + print(f" user: {ex[1]['content'][:60]}") + print(f" asst: {ex[2]['content'][:60]}") + print() + + +if __name__ == "__main__": + main() diff --git a/training/train_lora.ipynb b/training/train_lora.ipynb new file mode 100644 index 0000000..2291e72 --- /dev/null +++ b/training/train_lora.ipynb @@ -0,0 +1,322 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# LoRA: Training Apple's 3B Model on A100\n", + "\n", + "Standard LoRA training using Apple's adapter toolkit. Requires A100 (40GB GPU).\n", + "For free T4 training, see `train_qlora.ipynb`." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1. 
Setup\n", + "\n", + "Upload to `My Drive/hunch-training/`:\n", + "- `adapter_training_toolkit_v26_0_0/` (from developer.apple.com)\n", + "- `prepare_data.py`, `tldr_bank.db`, `prompts.jsonl`" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Mounted at /content/drive\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m2.3/2.3 MB\u001b[0m \u001b[31m117.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m362.6/362.6 kB\u001b[0m \u001b[31m40.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m73.1/73.1 kB\u001b[0m \u001b[31m10.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m46.0/46.0 kB\u001b[0m \u001b[31m5.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m86.8/86.8 kB\u001b[0m \u001b[31m11.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.6/1.6 MB\u001b[0m \u001b[31m93.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hCUDA: True\n", + "GPU: NVIDIA A100-SXM4-40GB\n" + ] + } + ], + "source": [ + "from google.colab import drive\n", + "drive.mount('/content/drive')\n", + "\n", + "DRIVE_DIR = '/content/drive/MyDrive/hunch-training'\n", + "WORK_DIR = '/content/hunch-training'\n", + "\n", + "!mkdir -p {WORK_DIR}\n", + "!cp -r {DRIVE_DIR}/adapter_training_toolkit_v26_0_0 {WORK_DIR}/\n", + "!cp {DRIVE_DIR}/prepare_data.py {WORK_DIR}/\n", + "!mkdir -p {WORK_DIR}/../bank {WORK_DIR}/../benchmark\n", + "!cp {DRIVE_DIR}/tldr_bank.db {WORK_DIR}/../bank/\n", + "!cp {DRIVE_DIR}/prompts.jsonl {WORK_DIR}/../benchmark/\n", + "\n", + "!cd {WORK_DIR}/adapter_training_toolkit_v26_0_0 && pip install -r requirements.txt -q\n", + "\n", + "import torch\n", + "print(f'CUDA: {torch.cuda.is_available()}')\n", + "if torch.cuda.is_available():\n", + " print(f'GPU: {torch.cuda.get_device_name(0)}')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2. Prepare training data" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Loaded 21478 entries from bank\n", + "Filtered to sources {'override'}: 134 entries (from 21478)\n", + " override: 134\n", + "Excluded 38 entries matching benchmark prompts\n", + "After dedup: 96 unique entries (removed 0)\n", + "Small dataset — using all 96 examples for both train and eval\n", + "Wrote 96 examples to /content/hunch-training/train.jsonl\n", + "Wrote 96 examples to /content/hunch-training/eval.jsonl\n", + "\n", + "Sample training examples:\n", + " user: show response headers\n", + " asst: curl -I https://example.com\n", + "\n", + " user: dns lookup for a domain\n", + " asst: dig example.com\n", + "\n", + " user: record shell session to file\n", + " asst: script session.log\n", + "\n" + ] + } + ], + "source": [ + "!cd {WORK_DIR} && python3 prepare_data.py --sources override" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3. Train\n", + "\n", + "No patches needed for A100. ~25 min/epoch, ~1.5 hours total.\n", + "\n", + "**Note:** lr=1e-3 diverged in testing. 
Use 1e-4." + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Fine-tuning adapters with configuration: \n", + "AdapterTrainingConfiguration(epochs=20, learning_rate=0.0001, batch_size=8, linear_warmup_epochs=1, gradient_accumulation_steps=1, enable_activation_checkpointing=True, precision='bf16-mixed', compile_model=False, weight_decay=0.01, clip_grad_norm=1.0, max_sequence_length=None, fixed_sized_sequences=False, pack_sequences=False, loss_update_frequency=3)\n", + "Loading base model on cuda with precision torch.float32\n", + "/usr/local/lib/python3.12/dist-packages/tamm/layers/flash_attention.py:78: UserWarning: Failed to import flash-attn for Flash attention. Using flash attention may lead to significantly faster training. Please refer to tamm-scripts/install_flash_attn.sh for instructions.\n", + " _warnings.warn(\n", + "Total parameters 3178001792\n", + "Total trainable parameters 66633728\n", + "Gradient scaling is enabled: False\n", + "Epoch 1/20\n", + "Training: 100% 12/12 [00:08<00:00, 1.42it/s, loss=1.64]\n", + "Evaluation: 100% 12/12 [00:02<00:00, 5.12it/s, loss=1.05]\n", + "Epoch 2/20\n", + "INFO:examples.utils:Epoch 2/20\n", + "Training: 100% 12/12 [00:07<00:00, 1.69it/s, loss=0.795]\n", + "Evaluation: 100% 12/12 [00:02<00:00, 5.26it/s, loss=0.32] \n", + "Epoch 3/20\n", + "INFO:examples.utils:Epoch 3/20\n", + "Training: 100% 12/12 [00:07<00:00, 1.68it/s, loss=0.283]\n", + "Evaluation: 100% 12/12 [00:02<00:00, 5.27it/s, loss=0.116]\n", + "Epoch 4/20\n", + "INFO:examples.utils:Epoch 4/20\n", + "Training: 100% 12/12 [00:07<00:00, 1.69it/s, loss=0.0817]\n", + "Evaluation: 100% 12/12 [00:02<00:00, 5.25it/s, loss=0.0388] \n", + "Epoch 5/20\n", + "INFO:examples.utils:Epoch 5/20\n", + "Training: 100% 12/12 [00:06<00:00, 1.72it/s, loss=0.0895]\n", + "Evaluation: 100% 12/12 [00:02<00:00, 5.27it/s, loss=0.0261] \n", + "Epoch 6/20\n", + "INFO:examples.utils:Epoch 6/20\n", + "Training: 100% 12/12 [00:07<00:00, 1.69it/s, loss=0.0223]\n", + "Evaluation: 100% 12/12 [00:02<00:00, 5.28it/s, loss=0.0127]\n", + "Epoch 7/20\n", + "INFO:examples.utils:Epoch 7/20\n", + "Training: 100% 12/12 [00:07<00:00, 1.68it/s, loss=0.0104]\n", + "Evaluation: 100% 12/12 [00:02<00:00, 5.23it/s, loss=0.0147]\n", + "Epoch 8/20\n", + "INFO:examples.utils:Epoch 8/20\n", + "Training: 100% 12/12 [00:07<00:00, 1.70it/s, loss=0.0163] \n", + "Evaluation: 100% 12/12 [00:02<00:00, 5.37it/s, loss=0.00656]\n", + "Epoch 9/20\n", + "INFO:examples.utils:Epoch 9/20\n", + "Training: 100% 12/12 [00:07<00:00, 1.70it/s, loss=0.0194] \n", + "Evaluation: 100% 12/12 [00:02<00:00, 5.26it/s, loss=0.000864]\n", + "Epoch 10/20\n", + "INFO:examples.utils:Epoch 10/20\n", + "Training: 100% 12/12 [00:07<00:00, 1.68it/s, loss=0.000877]\n", + "Evaluation: 100% 12/12 [00:02<00:00, 5.29it/s, loss=0.000607]\n", + "Epoch 11/20\n", + "INFO:examples.utils:Epoch 11/20\n", + "Training: 100% 12/12 [00:07<00:00, 1.69it/s, loss=0.000526]\n", + "Evaluation: 100% 12/12 [00:02<00:00, 5.22it/s, loss=0.000396]\n", + "Epoch 12/20\n", + "INFO:examples.utils:Epoch 12/20\n", + "Training: 100% 12/12 [00:07<00:00, 1.70it/s, loss=0.000395]\n", + "Evaluation: 100% 12/12 [00:02<00:00, 5.32it/s, loss=0.000287]\n", + "Epoch 13/20\n", + "INFO:examples.utils:Epoch 13/20\n", + "Training: 100% 12/12 [00:07<00:00, 1.71it/s, loss=0.00031]\n", + "Evaluation: 100% 12/12 [00:02<00:00, 5.26it/s, loss=0.000221]\n", + "Epoch 14/20\n", + "INFO:examples.utils:Epoch 14/20\n", + 
"Training: 100% 12/12 [00:07<00:00, 1.70it/s, loss=0.000229]\n", + "Evaluation: 100% 12/12 [00:02<00:00, 5.34it/s, loss=0.000198]\n", + "Epoch 15/20\n", + "INFO:examples.utils:Epoch 15/20\n", + "Training: 100% 12/12 [00:06<00:00, 1.72it/s, loss=0.000201]\n", + "Evaluation: 100% 12/12 [00:02<00:00, 5.30it/s, loss=0.000169]\n", + "Epoch 16/20\n", + "INFO:examples.utils:Epoch 16/20\n", + "Training: 100% 12/12 [00:07<00:00, 1.70it/s, loss=0.000196]\n", + "Evaluation: 100% 12/12 [00:02<00:00, 5.19it/s, loss=0.000161]\n", + "Epoch 17/20\n", + "INFO:examples.utils:Epoch 17/20\n", + "Training: 100% 12/12 [00:07<00:00, 1.69it/s, loss=0.000155]\n", + "Evaluation: 100% 12/12 [00:02<00:00, 5.25it/s, loss=0.000165]\n", + "Epoch 18/20\n", + "INFO:examples.utils:Epoch 18/20\n", + "Training: 100% 12/12 [00:07<00:00, 1.68it/s, loss=0.000159]\n", + "Evaluation: 100% 12/12 [00:02<00:00, 5.22it/s, loss=0.000159]\n", + "Epoch 19/20\n", + "INFO:examples.utils:Epoch 19/20\n", + "Training: 100% 12/12 [00:07<00:00, 1.70it/s, loss=0.00016] \n", + "Evaluation: 100% 12/12 [00:02<00:00, 5.30it/s, loss=0.000156]\n", + "Epoch 20/20\n", + "INFO:examples.utils:Epoch 20/20\n", + "Training: 100% 12/12 [00:07<00:00, 1.69it/s, loss=0.000163]\n", + "Evaluation: 100% 12/12 [00:02<00:00, 5.28it/s, loss=0.000156]\n" + ] + } + ], + "source": [ + "!cd {WORK_DIR}/adapter_training_toolkit_v26_0_0 && python3 -m examples.train_adapter \\\n", + " --train-data ../train.jsonl \\\n", + " --eval-data ../eval.jsonl \\\n", + " --epochs 20 \\\n", + " --learning-rate 1e-4 \\\n", + " --batch-size 8 \\\n", + " --precision bf16-mixed \\\n", + " --activation-checkpointing \\\n", + " --checkpoint-dir ../lora-override-checkpoints/" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 4. Save checkpoints" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Checkpoints saved to Drive\n" + ] + } + ], + "source": [ + "!cp -r {WORK_DIR}/lora-override-checkpoints {DRIVE_DIR}/lora-override-checkpoints\n", + "!echo 'Checkpoints saved to Drive'" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 5. Export .fmadapter" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "scikit-learn version 1.6.1 is not supported. Minimum required version: 0.17. Maximum required version: 1.5.1. Disabling scikit-learn conversion API.\n", + "XGBoost version 3.2.0 has not been tested with coremltools. You may run into unexpected errors. XGBoost 1.4.2 is the most recent version that has been tested.\n", + "2026-04-15 17:21:10.930166: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. 
To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.\n", + "2026-04-15 17:21:10.949095: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n", + "WARNING: All log messages before absl::InitializeLog() is called are written to STDERR\n", + "E0000 00:00:1776273670.972532 4269 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n", + "E0000 00:00:1776273670.980305 4269 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n", + "W0000 00:00:1776273671.000652 4269 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.\n", + "W0000 00:00:1776273671.000678 4269 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.\n", + "W0000 00:00:1776273671.000681 4269 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.\n", + "W0000 00:00:1776273671.000684 4269 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.\n", + "2026-04-15 17:21:11.005962: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.\n", + "To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.\n", + "TensorFlow version 2.19.0 has not been tested with coremltools. You may run into unexpected errors. TensorFlow 2.12.0 is the most recent version that has been tested.\n", + "Torch version 2.10.0+cu128 has not been tested with coremltools. You may run into unexpected errors. 
Torch 2.5.0 is the most recent version that has been tested.\n", + "WARNING:coremltools:Failed to load _MLModelProxy: No module named 'coremltools.libcoremlpython'\n", + "WARNING:coremltools:Failed to load _MLCPUComputeDeviceProxy: No module named 'coremltools.libcoremlpython'\n", + "WARNING:coremltools:Failed to load _MLGPUComputeDeviceProxy: No module named 'coremltools.libcoremlpython'\n", + "WARNING:coremltools:Failed to load _MLNeuralEngineComputeDeviceProxy: No module named 'coremltools.libcoremlpython'\n", + "WARNING:coremltools:Failed to load _MLModelProxy: No module named 'coremltools.libcoremlpython'\n", + "WARNING:coremltools:Failed to load _MLComputePlanProxy: No module named 'coremltools.libcoremlpython'\n", + "WARNING:coremltools:Failed to load _MLModelProxy: No module named 'coremltools.libcoremlpython'\n", + "WARNING:coremltools:Failed to load _MLModelAssetProxy: No module named 'coremltools.libcoremlpython'\n", + "total 4.0K\n", + "drwxr-xr-x 2 root root 4.0K Apr 15 17:21 hunch.fmadapter\n", + "Adapter exported and saved to Drive\n" + ] + } + ], + "source": [ + "!cd {WORK_DIR}/adapter_training_toolkit_v26_0_0 && python3 -m export.export_fmadapter \\\n", + " --adapter-name hunch \\\n", + " --checkpoint ../lora-override-checkpoints/adapter-final.pt \\\n", + " --output-dir ../lora-override-exports/\n", + "\n", + "!ls -lh {WORK_DIR}/lora-override-exports/\n", + "!cp -r {WORK_DIR}/lora-override-exports {DRIVE_DIR}/lora-override-exports\n", + "!echo 'Adapter exported and saved to Drive'" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/training/train_lora_fp16.ipynb b/training/train_lora_fp16.ipynb new file mode 100644 index 0000000..f8e498f --- /dev/null +++ b/training/train_lora_fp16.ipynb @@ -0,0 +1,228 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": "# fp16 LoRA: Training Apple's 3B Model on a Free T4\n\nThree patches to Apple's adapter training toolkit enable training on Colab's free T4 GPU (16GB):\n\n1. **mmap loading** — reads weights from disk on demand, avoids 12GB system RAM spike\n2. **fp16 model + fp32 adapters** — halves GPU memory from 12GB to 6GB\n3. **rms_norm fix + gradient scaling** — fixes dtype mismatches that cause NaN\n\nResult: ~2 hours training on free T4. This is half-precision LoRA (fp16 base), not true QLoRA (4-bit NF4)." + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1. 
Setup\n", + "\n", + "Upload to `My Drive/hunch-training/`:\n", + "- `adapter_training_toolkit_v26_0_0/` (from developer.apple.com)\n", + "- `prepare_data.py`, `tldr_bank.db`, `prompts.jsonl`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from google.colab import drive\n", + "drive.mount('/content/drive')\n", + "\n", + "DRIVE_DIR = '/content/drive/MyDrive/hunch-training'\n", + "WORK_DIR = '/content/hunch-training'\n", + "\n", + "!mkdir -p {WORK_DIR}\n", + "!cp -r {DRIVE_DIR}/adapter_training_toolkit_v26_0_0 {WORK_DIR}/\n", + "!cp {DRIVE_DIR}/prepare_data.py {WORK_DIR}/\n", + "!mkdir -p {WORK_DIR}/../bank {WORK_DIR}/../benchmark\n", + "!cp {DRIVE_DIR}/tldr_bank.db {WORK_DIR}/../bank/\n", + "!cp {DRIVE_DIR}/prompts.jsonl {WORK_DIR}/../benchmark/\n", + "\n", + "!cd {WORK_DIR}/adapter_training_toolkit_v26_0_0 && pip install -r requirements.txt -q\n", + "\n", + "import torch\n", + "print(f'CUDA: {torch.cuda.is_available()}')\n", + "if torch.cuda.is_available():\n", + " print(f'GPU: {torch.cuda.get_device_name(0)}')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2. Prepare training data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!cd {WORK_DIR} && python3 prepare_data.py" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3. Apply patches\n", + "\n", + "Three patches make training fit on T4 (16GB GPU, 12GB RAM):\n", + "\n", + "**Patch 1 — `utils.py`:** mmap loading (0 RAM), fp16 model (6GB GPU), fp32 adapters (stable gradients)\n", + "\n", + "**Patch 2 — `train_adapter.py`:** enable gradient scaling for f16-mixed (prevents NaN overflow)\n", + "\n", + "**Patch 3 — `tamm/layers/functional.py`:** cast rms_norm weight to match input dtype (prevents NaN from dtype mismatch)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import glob, shutil\n", + "\n", + "# --- Patch 1: utils.py ---\n", + "# Restore clean copy first\n", + "!cp {DRIVE_DIR}/adapter_training_toolkit_v26_0_0/examples/utils.py \\\n", + " {WORK_DIR}/adapter_training_toolkit_v26_0_0/examples/utils.py\n", + "\n", + "utils_path = f'{WORK_DIR}/adapter_training_toolkit_v26_0_0/examples/utils.py'\n", + "code = open(utils_path).read()\n", + "\n", + "# 1a: Force fp16 model creation (6GB instead of 12GB on GPU)\n", + "code = code.replace(\n", + " 'model_config.dtype = dtype or model_config.dtype',\n", + " 'model_config.dtype = torch.float16'\n", + ")\n", + "\n", + "# 1b: mmap loading (weights stay on disk, ~0 system RAM)\n", + "code = code.replace(\n", + " ''' with Path(base_model_checkpoint_path).open(\"rb\") as f:\\n sd = torch.load(f, map_location=device, weights_only=False)\\n _ = model.load_state_dict(sd, strict=True)''',\n", + " ''' sd = torch.load(str(base_model_checkpoint_path), map_location=device, mmap=True, weights_only=False)\\n _ = model.load_state_dict(sd, strict=True)\\n del sd; import gc; gc.collect()'''\n", + ")\n", + "\n", + "# 1c: Keep adapter weights in fp32 (GradScaler needs fp32 gradients)\n", + "code = code.replace(\n", + " ' return model.to(device=device, dtype=model_config.dtype)',\n", + " ''' model = model.to(device=device, dtype=model_config.dtype)\n", + "\n", + " # Keep adapter weights in fp32 for stable training\n", + " for name, parameter in model.named_parameters():\n", + " if \"adapter\" in name:\n", + " parameter.data = parameter.data.float()\n", + "\n", 
+ " return model'''\n", + ")\n", + "\n", + "open(utils_path, 'w').write(code)\n", + "print('Patch 1 applied: utils.py (mmap + fp16 + fp32 adapters)')\n", + "\n", + "# --- Patch 2: train_adapter.py ---\n", + "!cp {DRIVE_DIR}/adapter_training_toolkit_v26_0_0/examples/train_adapter.py \\\n", + " {WORK_DIR}/adapter_training_toolkit_v26_0_0/examples/train_adapter.py\n", + "\n", + "ta_path = f'{WORK_DIR}/adapter_training_toolkit_v26_0_0/examples/train_adapter.py'\n", + "code = open(ta_path).read()\n", + "code = code.replace(\n", + " 'return self.precision == \"f16\"',\n", + " 'return self.precision in (\"f16\", \"f16-mixed\")'\n", + ")\n", + "open(ta_path, 'w').write(code)\n", + "print('Patch 2 applied: train_adapter.py (gradient scaling for f16-mixed)')\n", + "\n", + "# --- Patch 3: tamm rms_norm ---\n", + "norm_files = glob.glob(f'{WORK_DIR}/**/tamm/layers/functional.py', recursive=True)\n", + "norm_files += glob.glob('/usr/local/lib/**/tamm/layers/functional.py', recursive=True)\n", + "for nf in norm_files:\n", + " code = open(nf).read()\n", + " if 'weight.to(tensor.dtype)' not in code:\n", + " old = ' tensor = _torch_compatibility.rms_norm(\\n tensor, normalized_shape=normalized_shape, weight=weight, eps=eps\\n )'\n", + " new = ' if weight is not None and weight.dtype != tensor.dtype:\\n weight = weight.to(tensor.dtype)\\n tensor = _torch_compatibility.rms_norm(\\n tensor, normalized_shape=normalized_shape, weight=weight, eps=eps\\n )'\n", + " code = code.replace(old, new)\n", + " open(nf, 'w').write(code)\n", + " print(f'Patch 3 applied: {nf} (rms_norm dtype fix)')\n", + " else:\n", + " print(f'Patch 3 already applied: {nf}')\n", + "\n", + "# Clear pycache\n", + "for d in glob.glob(f'{WORK_DIR}/**/tamm/**/__pycache__', recursive=True):\n", + " shutil.rmtree(d, ignore_errors=True)\n", + "for d in glob.glob('/usr/local/lib/**/tamm/**/__pycache__', recursive=True):\n", + " shutil.rmtree(d, ignore_errors=True)\n", + "print('\\nAll patches applied. Ready to train.')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 4. Train\n", + "\n", + "~40 min/epoch on T4, ~2 hours total for 3 epochs." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": "!cd {WORK_DIR}/adapter_training_toolkit_v26_0_0 && python3 -m examples.train_adapter \\\n --train-data ../train.jsonl \\\n --eval-data ../eval.jsonl \\\n --epochs 3 \\\n --learning-rate 1e-4 \\\n --batch-size 8 \\\n --precision f16-mixed \\\n --activation-checkpointing \\\n --checkpoint-dir ../fp16-lora-checkpoints/" + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 5. Save checkpoints" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": "!cp -r {WORK_DIR}/fp16-lora-checkpoints {DRIVE_DIR}/\n!echo 'Checkpoints saved to Drive'" + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 6. 
Evaluate" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": "import json, subprocess\n\ntest_prompts = [\n 'find files changed in the last hour',\n 'show disk usage',\n 'generate a random password',\n 'kill a process by name',\n 'show http headers of a url',\n 'record terminal session',\n 'find files larger than 100mb',\n 'convert image to different format',\n 'show all listening ports',\n 'find files modified in the last 7 days',\n 'find files owned by root',\n 'count lines in all python files',\n 'show all environment variables',\n 'clear the terminal',\n 'compare two files',\n]\n\nsystem = 'Output a single shell command for zsh on macOS. No explanation, no markdown, no backticks. Just the command.'\n\nwith open(f'{WORK_DIR}/test_prompts.jsonl', 'w') as f:\n for p in test_prompts:\n f.write(json.dumps([\n {'role': 'system', 'content': system},\n {'role': 'user', 'content': p}\n ]) + '\\n')\n\nresult = subprocess.run(\n ['python3', '-m', 'examples.generate',\n '--prompt', '../test_prompts.jsonl',\n '--checkpoint', '../fp16-lora-checkpoints/adapter-final.pt',\n '--precision', 'f16-mixed'],\n capture_output=True, text=True,\n cwd=f'{WORK_DIR}/adapter_training_toolkit_v26_0_0'\n)\n\nlines = (result.stdout + result.stderr).strip().split('\\n')\nidx = 0\nfor line in lines:\n if 'Response for prompt' in line:\n answer = line.split(': ', 2)[-1].replace('', '').strip()\n prompt = test_prompts[idx] if idx < len(test_prompts) else '?'\n print(f'Q: {prompt:<45} A: {answer}')\n idx += 1\n\nif idx == 0:\n print('No output. Check error:')\n print('STDERR:', result.stderr[-500:])\n print('Return code:', result.returncode)" + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 7. Export .fmadapter" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": "!cd {WORK_DIR}/adapter_training_toolkit_v26_0_0 && python3 -m export.export_fmadapter \\\n --adapter-name hunch_fp16 \\\n --checkpoint ../fp16-lora-checkpoints/adapter-final.pt \\\n --output-dir ../fp16-lora-exports/\n\n!ls -lh {WORK_DIR}/fp16-lora-exports/\n!cp -r {WORK_DIR}/fp16-lora-exports {DRIVE_DIR}/\n!echo 'Adapter exported and saved to Drive'" + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python", + "version": "3.11.0" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} \ No newline at end of file diff --git a/training/train_qlora.ipynb b/training/train_qlora.ipynb new file mode 100644 index 0000000..1f6b5c9 --- /dev/null +++ b/training/train_qlora.ipynb @@ -0,0 +1,344 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# True QLoRA: Training Apple's 3B Model with 4-bit NF4\n", + "\n", + "Uses bitsandbytes NF4 quantization on the frozen base model.\n", + "Only ~5GB GPU memory — fits on free T4 with headroom.\n", + "\n", + "This is proper QLoRA as defined by [Dettmers et al. 2023](https://arxiv.org/abs/2305.14314):\n", + "4-bit quantized base + fp32 LoRA adapters." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1. 
Setup\n", + "\n", + "Upload to `My Drive/hunch-training/`:\n", + "- `adapter_training_toolkit_v26_0_0/` (from developer.apple.com)\n", + "- `prepare_data.py`, `train_qlora_full.py`, `tldr_bank.db`, `prompts.jsonl`" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Mounted at /content/drive\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m2.3/2.3 MB\u001b[0m \u001b[31m19.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m362.6/362.6 kB\u001b[0m \u001b[31m39.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m73.1/73.1 kB\u001b[0m \u001b[31m8.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m46.0/46.0 kB\u001b[0m \u001b[31m5.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m86.8/86.8 kB\u001b[0m \u001b[31m11.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.6/1.6 MB\u001b[0m \u001b[31m58.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m60.7/60.7 MB\u001b[0m \u001b[31m12.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m:00:01\u001b[0m00:01\u001b[0m\n", + "\u001b[?25hCUDA: True\n", + "GPU: Tesla T4\n" + ] + } + ], + "source": [ + "from google.colab import drive\n", + "drive.mount('/content/drive')\n", + "\n", + "DRIVE_DIR = '/content/drive/MyDrive/hunch-training'\n", + "WORK_DIR = '/content/hunch-training'\n", + "\n", + "!mkdir -p {WORK_DIR}\n", + "!cp -r {DRIVE_DIR}/adapter_training_toolkit_v26_0_0 {WORK_DIR}/\n", + "!cp {DRIVE_DIR}/prepare_data.py {WORK_DIR}/\n", + "!cp {DRIVE_DIR}/train_qlora_full.py {WORK_DIR}/\n", + "!mkdir -p {WORK_DIR}/../bank {WORK_DIR}/../benchmark\n", + "!cp {DRIVE_DIR}/tldr_bank.db {WORK_DIR}/../bank/\n", + "!cp {DRIVE_DIR}/prompts.jsonl {WORK_DIR}/../benchmark/\n", + "\n", + "!cd {WORK_DIR}/adapter_training_toolkit_v26_0_0 && pip install -r requirements.txt -q\n", + "!pip install bitsandbytes -q\n", + "\n", + "import torch\n", + "print(f'CUDA: {torch.cuda.is_available()}')\n", + "if torch.cuda.is_available():\n", + " print(f'GPU: {torch.cuda.get_device_name(0)}')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2. 
Prepare training data" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Loaded 21478 entries from bank\n", + "Filtered to sources {'override'}: 134 entries (from 21478)\n", + " override: 134\n", + "Excluded 38 entries matching benchmark prompts\n", + "After dedup: 96 unique entries (removed 0)\n", + "Small dataset — using all 96 examples for both train and eval\n", + "Wrote 96 examples to /content/hunch-training/train.jsonl\n", + "Wrote 96 examples to /content/hunch-training/eval.jsonl\n", + "\n", + "Sample training examples:\n", + " user: show response headers\n", + " asst: curl -I https://example.com\n", + "\n", + " user: dns lookup for a domain\n", + " asst: dig example.com\n", + "\n", + " user: record shell session to file\n", + " asst: script session.log\n", + "\n" + ] + } + ], + "source": [ + "!cd {WORK_DIR} && python3 prepare_data.py --sources override" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3. Train\n", + "\n", + "No patches needed — `train_qlora_full.py` handles everything:\n", + "mmap loading, NF4 quantization, training loop.\n", + "\n", + "~5GB GPU memory. Can use large batch sizes on T4." + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Device: cuda | RAM=0.6GB GPU=0.0GB\n", + "Quantized 280 layers to NF4\n", + "Trainable: 67M params | RAM=6.3GB GPU=2.2GB\n", + "Train: 96 examples, 12 batches\n", + "Eval: 96 examples\n", + "\n", + "============================================================\n", + "Training: 20 epochs, batch 8, lr 0.0001\n", + "============================================================\n", + "\n", + "Epoch 1/20\n", + "/usr/local/lib/python3.12/dist-packages/torch/nn/functional.py:2954: UserWarning: Mismatch dtype between input and weight: input dtype = float, weight dtype = c10::Half, Cannot dispatch to fused implementation. 
(Triggered internally at /pytorch/aten/src/ATen/native/layer_norm.cpp:344.)\n", + " return torch.rms_norm(input, normalized_shape, weight, eps)\n", + "Saved checkpoint (254MB) to qlora-override-checkpoints/adapter-epoch1.pt\n", + " Train loss: 1.4963 | Eval loss: 0.7162 | RAM=1.6GB GPU=3.0GB\n", + "\n", + "Epoch 2/20\n", + "Saved checkpoint (254MB) to qlora-override-checkpoints/adapter-epoch2.pt\n", + " Train loss: 0.5486 | Eval loss: 0.2153 | RAM=1.6GB GPU=3.0GB\n", + "\n", + "Epoch 3/20\n", + "Saved checkpoint (254MB) to qlora-override-checkpoints/adapter-epoch3.pt\n", + " Train loss: 0.1835 | Eval loss: 0.0547 | RAM=1.6GB GPU=3.0GB\n", + "\n", + "Epoch 4/20\n", + "Saved checkpoint (254MB) to qlora-override-checkpoints/adapter-epoch4.pt\n", + " Train loss: 0.0840 | Eval loss: 0.0401 | RAM=1.6GB GPU=3.0GB\n", + "\n", + "Epoch 5/20\n", + "Saved checkpoint (254MB) to qlora-override-checkpoints/adapter-epoch5.pt\n", + " Train loss: 0.0463 | Eval loss: 0.0093 | RAM=1.6GB GPU=3.0GB\n", + "\n", + "Epoch 6/20\n", + "Saved checkpoint (254MB) to qlora-override-checkpoints/adapter-epoch6.pt\n", + " Train loss: 0.0166 | Eval loss: 0.0046 | RAM=1.6GB GPU=3.0GB\n", + "\n", + "Epoch 7/20\n", + "Saved checkpoint (254MB) to qlora-override-checkpoints/adapter-epoch7.pt\n", + " Train loss: 0.0043 | Eval loss: 0.0013 | RAM=1.6GB GPU=3.0GB\n", + "\n", + "Epoch 8/20\n", + "Saved checkpoint (254MB) to qlora-override-checkpoints/adapter-epoch8.pt\n", + " Train loss: 0.0013 | Eval loss: 0.0003 | RAM=1.6GB GPU=3.0GB\n", + "\n", + "Epoch 9/20\n", + "Saved checkpoint (254MB) to qlora-override-checkpoints/adapter-epoch9.pt\n", + " Train loss: 0.0003 | Eval loss: 0.0001 | RAM=1.6GB GPU=3.0GB\n", + "\n", + "Epoch 10/20\n", + "Saved checkpoint (254MB) to qlora-override-checkpoints/adapter-epoch10.pt\n", + " Train loss: 0.0001 | Eval loss: 0.0001 | RAM=1.6GB GPU=3.0GB\n", + "\n", + "Epoch 11/20\n", + "Saved checkpoint (254MB) to qlora-override-checkpoints/adapter-epoch11.pt\n", + " Train loss: 0.0001 | Eval loss: 0.0001 | RAM=1.6GB GPU=3.0GB\n", + "\n", + "Epoch 12/20\n", + "Saved checkpoint (254MB) to qlora-override-checkpoints/adapter-epoch12.pt\n", + " Train loss: 0.0000 | Eval loss: 0.0000 | RAM=1.6GB GPU=3.0GB\n", + "\n", + "Epoch 13/20\n", + "Saved checkpoint (254MB) to qlora-override-checkpoints/adapter-epoch13.pt\n", + " Train loss: 0.0000 | Eval loss: 0.0000 | RAM=1.6GB GPU=3.0GB\n", + "\n", + "Epoch 14/20\n", + "Saved checkpoint (254MB) to qlora-override-checkpoints/adapter-epoch14.pt\n", + " Train loss: 0.0000 | Eval loss: 0.0000 | RAM=1.6GB GPU=3.0GB\n", + "\n", + "Epoch 15/20\n", + "Saved checkpoint (254MB) to qlora-override-checkpoints/adapter-epoch15.pt\n", + " Train loss: 0.0000 | Eval loss: 0.0000 | RAM=1.6GB GPU=3.0GB\n", + "\n", + "Epoch 16/20\n", + "Saved checkpoint (254MB) to qlora-override-checkpoints/adapter-epoch16.pt\n", + " Train loss: 0.0000 | Eval loss: 0.0000 | RAM=1.6GB GPU=3.0GB\n", + "\n", + "Epoch 17/20\n", + "Saved checkpoint (254MB) to qlora-override-checkpoints/adapter-epoch17.pt\n", + " Train loss: 0.0000 | Eval loss: 0.0000 | RAM=1.8GB GPU=3.0GB\n", + "\n", + "Epoch 18/20\n", + "Saved checkpoint (254MB) to qlora-override-checkpoints/adapter-epoch18.pt\n", + " Train loss: 0.0000 | Eval loss: 0.0000 | RAM=1.6GB GPU=3.0GB\n", + "\n", + "Epoch 19/20\n", + "Saved checkpoint (254MB) to qlora-override-checkpoints/adapter-epoch19.pt\n", + " Train loss: 0.0000 | Eval loss: 0.0000 | RAM=1.6GB GPU=3.0GB\n", + "\n", + "Epoch 20/20\n", + "Saved checkpoint (254MB) to 
qlora-override-checkpoints/adapter-epoch20.pt\n", + " Train loss: 0.0000 | Eval loss: 0.0000 | RAM=1.8GB GPU=3.0GB\n", + "Saved checkpoint (254MB) to qlora-override-checkpoints/adapter-final.pt\n", + "\n", + "Done! Export with:\n", + " python3 -m export.export_fmadapter --adapter-name hunch_qlora --checkpoint qlora-override-checkpoints/adapter-final.pt --output-dir qlora-override-checkpoints//\n" + ] + } + ], + "source": [ + "!cd {WORK_DIR} && python3 train_qlora_full.py \\\n", + " --epochs 20 \\\n", + " --batch-size 8 \\\n", + " --learning-rate 1e-4 \\\n", + " --checkpoint-dir qlora-override-checkpoints/" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 4. Save checkpoints" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Checkpoints saved to Drive\n" + ] + } + ], + "source": [ + "!cp -r {WORK_DIR}/qlora-override-checkpoints/ {DRIVE_DIR}/qlora-override-checkpoints\n", + "!echo 'Checkpoints saved to Drive'" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 5. Export .fmadapter" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "scikit-learn version 1.6.1 is not supported. Minimum required version: 0.17. Maximum required version: 1.5.1. Disabling scikit-learn conversion API.\n", + "XGBoost version 3.2.0 has not been tested with coremltools. You may run into unexpected errors. XGBoost 1.4.2 is the most recent version that has been tested.\n", + "2026-04-18 01:46:36.123769: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n", + "WARNING: All log messages before absl::InitializeLog() is called are written to STDERR\n", + "E0000 00:00:1776476796.352370 4085 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n", + "E0000 00:00:1776476796.414439 4085 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n", + "W0000 00:00:1776476796.851699 4085 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.\n", + "W0000 00:00:1776476796.851750 4085 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.\n", + "W0000 00:00:1776476796.851754 4085 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.\n", + "W0000 00:00:1776476796.851758 4085 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.\n", + "2026-04-18 01:46:36.891354: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.\n", + "To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.\n", + "TensorFlow version 2.19.0 has not been tested with coremltools. You may run into unexpected errors. 
TensorFlow 2.12.0 is the most recent version that has been tested.\n", + "Torch version 2.10.0+cu128 has not been tested with coremltools. You may run into unexpected errors. Torch 2.5.0 is the most recent version that has been tested.\n", + "WARNING:coremltools:Failed to load _MLModelProxy: No module named 'coremltools.libcoremlpython'\n", + "WARNING:coremltools:Failed to load _MLCPUComputeDeviceProxy: No module named 'coremltools.libcoremlpython'\n", + "WARNING:coremltools:Failed to load _MLGPUComputeDeviceProxy: No module named 'coremltools.libcoremlpython'\n", + "WARNING:coremltools:Failed to load _MLNeuralEngineComputeDeviceProxy: No module named 'coremltools.libcoremlpython'\n", + "WARNING:coremltools:Failed to load _MLModelProxy: No module named 'coremltools.libcoremlpython'\n", + "WARNING:coremltools:Failed to load _MLComputePlanProxy: No module named 'coremltools.libcoremlpython'\n", + "WARNING:coremltools:Failed to load _MLModelProxy: No module named 'coremltools.libcoremlpython'\n", + "WARNING:coremltools:Failed to load _MLModelAssetProxy: No module named 'coremltools.libcoremlpython'\n", + "total 4.0K\n", + "drwxr-xr-x 2 root root 4.0K Apr 18 01:46 hunch_qlora.fmadapter\n", + "Adapter exported and saved to Drive\n" + ] + } + ], + "source": [ + "!cd {WORK_DIR}/adapter_training_toolkit_v26_0_0 && python3 -m export.export_fmadapter \\\n", + " --adapter-name hunch_qlora \\\n", + " --checkpoint ../qlora-override-checkpoints/adapter-final.pt \\\n", + " --output-dir ../qlora-override-exports/\n", + "\n", + "!ls -lh {WORK_DIR}/qlora-override-exports/\n", + "!cp -r {WORK_DIR}/qlora-override-exports {DRIVE_DIR}/qlora-override-exports\n", + "!echo 'Adapter exported and saved to Drive'" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.13" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/training/train_qlora_full.py b/training/train_qlora_full.py new file mode 100644 index 0000000..780113f --- /dev/null +++ b/training/train_qlora_full.py @@ -0,0 +1,376 @@ +#!/usr/bin/env python3 +""" +True QLoRA training: 4-bit NF4 base model + fp32 LoRA adapters. + +Uses bitsandbytes for NF4 quantization. Trains on hunch dataset. +Works on 24GB Mac (MPS) and Colab T4 (CUDA). ~5GB GPU memory. 
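For context, a back-of-envelope sketch of where the memory goes (an illustrative estimate, not part of this script; the parameter counts come from the training logs above):

```python
# Illustrative estimate only; real usage also includes activations, the CUDA
# context, and allocator overhead. Parameter counts are from the logs above.
total_params   = 3_178_001_792   # "Total parameters" in the LoRA training log
adapter_params = 66_633_728      # "Total trainable parameters"

GiB = 1024 ** 3
base_nf4      = total_params * 0.5 / GiB        # ~4 bits per frozen weight
quant_consts  = total_params / 64 * 2 / GiB     # per-block scales, rough guess
adapters_fp32 = adapter_params * 4 / GiB        # LoRA weights kept in fp32
adamw_states  = adapter_params * 2 * 4 / GiB    # two fp32 moments per param

total = base_nf4 + quant_consts + adapters_fp32 + adamw_states
print(f"weights + optimizer ≈ {total:.1f} GiB")  # roughly 2.3 GiB
```

Activations and allocator overhead account for the gap between this figure, the ~3GB peaks in the QLoRA training log, and the conservative ~5GB quoted here.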
+ +Usage: + python3 train_qlora_full.py # train 3 epochs + python3 train_qlora_full.py --epochs 1 --batch-size 4 # quick test + python3 train_qlora_full.py --eval-only --checkpoint checkpoints/adapter-final.pt + +Requirements: + pip install bitsandbytes psutil +""" + +import sys +import os +import gc +import json +import time +import argparse +import psutil +from pathlib import Path + +TOOLKIT_DIR = os.path.join(os.path.dirname(os.path.abspath(__file__)), "adapter_training_toolkit_v26_0_0") +sys.path.insert(0, TOOLKIT_DIR) + +import torch +import torch.nn as nn +from torch.utils.data import Dataset, DataLoader +import tamm.utils.json +from tamm.tokenizers.afm import AFMTokenizer + +ASSETS = Path(TOOLKIT_DIR) / "assets" +TRAINING_DIR = Path(__file__).parent + + +def patch_rms_norm(): + """Patch tamm's rms_norm to handle dtype mismatch (fp16 model + fp32 cast).""" + import glob + patterns = [ + os.path.join(TOOLKIT_DIR, "venv", "lib", "*", "site-packages", "tamm", "layers", "functional.py"), + os.path.join(sys.prefix, "lib", "*", "dist-packages", "tamm", "layers", "functional.py"), + ] + for pattern in patterns: + for path in glob.glob(pattern): + code = open(path).read() + if "weight.to(tensor.dtype)" not in code: + old = " tensor = _torch_compatibility.rms_norm(\n tensor, normalized_shape=normalized_shape, weight=weight, eps=eps\n )" + new = " if weight is not None and weight.dtype != tensor.dtype:\n weight = weight.to(tensor.dtype)\n tensor = _torch_compatibility.rms_norm(\n tensor, normalized_shape=normalized_shape, weight=weight, eps=eps\n )" + code = code.replace(old, new) + open(path, "w").write(code) + # Clear pycache + cache_dir = os.path.join(os.path.dirname(path), "__pycache__") + if os.path.exists(cache_dir): + import shutil; shutil.rmtree(cache_dir) + print(f"Patched rms_norm: {path}") + else: + print(f"rms_norm already patched: {path}") + + +def get_device(): + if torch.cuda.is_available(): + return torch.device("cuda") + if torch.backends.mps.is_available(): + return torch.device("mps") + return torch.device("cpu") + + +def mem_str(): + ram = psutil.Process().memory_info().rss / 1024**3 + if torch.cuda.is_available(): + gpu = torch.cuda.memory_allocated() / 1024**3 + elif torch.backends.mps.is_available(): + gpu = torch.mps.current_allocated_memory() / 1024**3 + else: + gpu = 0 + return f"RAM={ram:.1f}GB GPU={gpu:.1f}GB" + + +def load_model_qlora(device): + """Load base model with NF4 quantization.""" + import bitsandbytes as bnb + + # Load config and create model in fp16 (6GB instead of 12GB) + with open(ASSETS / "base-model-config.json") as f: + config = tamm.utils.json.load(f) + config.dtype = torch.float16 + model = config.create_model() + + # Load weights via mmap (minimal RAM) + sd = torch.load(str(ASSETS / "base-model.pt"), map_location="cpu", mmap=True, weights_only=False) + model.load_state_dict(sd, strict=True) + del sd; gc.collect() + + # Freeze non-adapter params + for name, param in model.named_parameters(): + param.requires_grad = "adapter" in name + + # Quantize frozen Linear layers to NF4 + replacements = [] + for name, module in model.named_modules(): + if not isinstance(module, nn.Linear): + continue + if "adapter" in name or any(p.requires_grad for p in module.parameters()): + continue + replacements.append((name, module)) + + for name, module in replacements: + new_module = bnb.nn.Linear4bit( + module.in_features, module.out_features, + bias=module.bias is not None, + compute_dtype=torch.float16, + quant_type="nf4", + ) + new_module.weight = 
bnb.nn.Params4bit( + module.weight.data, requires_grad=False, + quant_type="nf4", compress_statistics=torch.cuda.is_available(), + ) + if module.bias is not None: + new_module.bias = module.bias + + parts = name.rsplit(".", 1) + if len(parts) == 2: + parent = dict(model.named_modules())[parts[0]] + setattr(parent, parts[1], new_module) + else: + setattr(model, name, new_module) + + gc.collect() + print(f"Quantized {len(replacements)} layers to NF4") + + # Move to device + model = model.to(device) + + # Keep adapters in fp32 for stable training with gradient scaling + for name, param in model.named_parameters(): + if param.requires_grad and param.dtype != torch.float32: + param.data = param.data.float() + + trainable = sum(p.numel() for p in model.parameters() if p.requires_grad) + print(f"Trainable: {trainable/1e6:.0f}M params | {mem_str()}") + return model + + +def load_model_with_checkpoint(device, checkpoint_path): + """Load QLoRA model and restore adapter weights from checkpoint.""" + model = load_model_qlora(device) + sd = torch.load(checkpoint_path, map_location=device, weights_only=False) + # Only load adapter weights + adapter_sd = {k: v for k, v in sd.items() if "adapter" in k} + model.load_state_dict(adapter_sd, strict=False) + print(f"Loaded {len(adapter_sd)} adapter weights from {checkpoint_path}") + return model + + +class CommandDataset(Dataset): + """Load JSONL training data.""" + def __init__(self, path, tokenizer, max_length=512): + self.examples = [] + self.tokenizer = tokenizer + self.max_length = max_length + + with open(path) as f: + for line in f: + messages = json.loads(line) + # Format: system + user + assistant + prompt = "" + for msg in messages: + if msg["role"] == "system": + prompt += f"system\n{msg['content']} " + elif msg["role"] == "user": + prompt += f"user\n {msg['content']} " + response = "" + for msg in messages: + if msg["role"] == "assistant": + response = f"assistant\n {msg['content']}" + full_text = prompt + response + prompt_len = len(tokenizer.encode(prompt)) + self.examples.append((full_text, prompt_len)) + + def __len__(self): + return len(self.examples) + + def __getitem__(self, idx): + text, prompt_len = self.examples[idx] + tokens = self.tokenizer.encode(text) + tokens = tokens[:self.max_length] + prompt_len = min(prompt_len, len(tokens)) + return torch.tensor(tokens, dtype=torch.long), prompt_len + + +def collate_fn(batch): + """Pad sequences and create labels with masking for prompt and padding tokens.""" + tokens_list, prompt_lens = zip(*batch) + max_len = max(len(x) for x in tokens_list) + input_ids = torch.zeros(len(tokens_list), max_len, dtype=torch.long) + labels = torch.full((len(tokens_list), max_len), -100, dtype=torch.long) + for i, (tokens, prompt_len) in enumerate(zip(tokens_list, prompt_lens)): + input_ids[i, :len(tokens)] = tokens + # Only compute loss on assistant response tokens (after prompt) + labels[i, prompt_len:len(tokens)] = tokens[prompt_len:] + return input_ids, labels + + +def train_epoch(model, dataloader, optimizer, device, epoch, scaler=None): + model.train() + total_loss = 0 + n_batches = 0 + start = time.time() + + for i, (input_ids, labels) in enumerate(dataloader): + input_ids = input_ids.to(device) + labels = labels.to(device) + + # Forward — labels have -100 for prompt and padding tokens (ignored by CrossEntropyLoss) + if scaler: + with torch.amp.autocast(device_type=str(device), dtype=torch.float16): + output = model(input_ids) + logits = output.logits if hasattr(output, 'logits') else output + loss = 
nn.CrossEntropyLoss(ignore_index=-100)( + logits[:, :-1, :].contiguous().view(-1, logits.size(-1)), + labels[:, 1:].contiguous().view(-1) + ) + else: + output = model(input_ids) + logits = output.logits if hasattr(output, 'logits') else output + loss = nn.CrossEntropyLoss(ignore_index=-100)( + logits[:, :-1, :].contiguous().view(-1, logits.size(-1)), + labels[:, 1:].contiguous().view(-1) + ) + + # Backward + optimizer.zero_grad() + if scaler: + scaler.scale(loss).backward() + scaler.unscale_(optimizer) + torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0) + scaler.step(optimizer) + scaler.update() + else: + loss.backward() + torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0) + optimizer.step() + + total_loss += loss.item() + n_batches += 1 + + log_every = 10 if len(dataloader) < 50 else 20 + if (i + 1) % log_every == 0: + avg = total_loss / n_batches + elapsed = time.time() - start + it_s = (i + 1) / elapsed + remaining = (len(dataloader) - i - 1) / it_s / 60 + print(f" [{i+1}/{len(dataloader)}] loss={avg:.3f} {it_s:.2f}it/s ~{remaining:.0f}min left | {mem_str()}") + + return total_loss / max(n_batches, 1) + + +def evaluate(model, dataloader, device): + model.eval() + total_loss = 0 + n_batches = 0 + + with torch.no_grad(): + for input_ids, labels in dataloader: + input_ids = input_ids.to(device) + labels = labels.to(device) + with torch.amp.autocast(device_type=str(device), dtype=torch.float16): + output = model(input_ids) + logits = output.logits if hasattr(output, 'logits') else output + loss = nn.CrossEntropyLoss(ignore_index=-100)( + logits[:, :-1, :].contiguous().view(-1, logits.size(-1)), + labels[:, 1:].contiguous().view(-1) + ) + total_loss += loss.item() + n_batches += 1 + + return total_loss / max(n_batches, 1) + + +def save_adapter_checkpoint(model, path, optimizer=None, epoch=None): + """Save adapter weights as flat state dict (compatible with export_fmadapter).""" + adapter_sd = {k: v.cpu() for k, v in model.state_dict().items() if "adapter" in k} + torch.save(adapter_sd, path) + size_mb = os.path.getsize(path) / 1024**2 + print(f"Saved checkpoint ({size_mb:.0f}MB) to {path}") + + +def main(): + parser = argparse.ArgumentParser(description="QLoRA training for hunch") + parser.add_argument("--epochs", type=int, default=3) + parser.add_argument("--batch-size", type=int, default=8) + parser.add_argument("--learning-rate", type=float, default=1e-4) + parser.add_argument("--train-data", default=str(TRAINING_DIR / "train.jsonl")) + parser.add_argument("--eval-data", default=str(TRAINING_DIR / "eval.jsonl")) + parser.add_argument("--checkpoint-dir", default=str(TRAINING_DIR / "qlora-checkpoints")) + parser.add_argument("--checkpoint", type=str, help="Resume from checkpoint") + parser.add_argument("--eval-only", action="store_true") + args = parser.parse_args() + + device = get_device() + print(f"Device: {device} | {mem_str()}") + + # Patch rms_norm for fp16 compatibility + patch_rms_norm() + + # Generate training data if needed + if not os.path.exists(args.train_data): + print("Generating training data...") + os.system(f"cd {TRAINING_DIR} && python3 prepare_data.py") + + # Load tokenizer + tokenizer = AFMTokenizer(str(ASSETS / "tokenizer.model")) + + # Load model + if args.checkpoint: + model = load_model_with_checkpoint(device, args.checkpoint) + else: + model = load_model_qlora(device) + + if args.eval_only: + eval_dataset = CommandDataset(args.eval_data, tokenizer) + eval_loader = DataLoader(eval_dataset, batch_size=args.batch_size, collate_fn=collate_fn) + eval_loss = 
evaluate(model, eval_loader, device) + print(f"Eval loss: {eval_loss:.4f}") + return + + # Data + train_dataset = CommandDataset(args.train_data, tokenizer) + eval_dataset = CommandDataset(args.eval_data, tokenizer) + train_loader = DataLoader(train_dataset, batch_size=args.batch_size, shuffle=True, collate_fn=collate_fn) + eval_loader = DataLoader(eval_dataset, batch_size=args.batch_size, collate_fn=collate_fn) + + print(f"Train: {len(train_dataset)} examples, {len(train_loader)} batches") + print(f"Eval: {len(eval_dataset)} examples") + + # Optimizer + optimizer = torch.optim.AdamW( + [p for p in model.parameters() if p.requires_grad], + lr=args.learning_rate, + weight_decay=0.01 + ) + + # Gradient scaler for mixed precision + scaler = torch.amp.GradScaler(device=str(device)) if (torch.cuda.is_available() or torch.backends.mps.is_available()) else None + + # Checkpoint dir + os.makedirs(args.checkpoint_dir, exist_ok=True) + + # Training loop + print(f"\n{'='*60}") + print(f"Training: {args.epochs} epochs, batch {args.batch_size}, lr {args.learning_rate}") + print(f"{'='*60}") + + for epoch in range(args.epochs): + print(f"\nEpoch {epoch+1}/{args.epochs}") + train_loss = train_epoch(model, train_loader, optimizer, device, epoch, scaler) + + # Save checkpoint before eval (in case eval crashes) + ckpt_path = os.path.join(args.checkpoint_dir, f"adapter-epoch{epoch+1}.pt") + save_adapter_checkpoint(model, ckpt_path) + + eval_loss = evaluate(model, eval_loader, device) + print(f" Train loss: {train_loss:.4f} | Eval loss: {eval_loss:.4f} | {mem_str()}") + + # Save final + final_path = os.path.join(args.checkpoint_dir, "adapter-final.pt") + save_adapter_checkpoint(model, final_path) + print(f"\nDone! Export with:") + print(f" python3 -m export.export_fmadapter --adapter-name hunch_qlora --checkpoint {final_path} --output-dir {args.checkpoint_dir}/") + + +if __name__ == "__main__": + main() diff --git a/training/train_qlora_test.py b/training/train_qlora_test.py new file mode 100644 index 0000000..7424862 --- /dev/null +++ b/training/train_qlora_test.py @@ -0,0 +1,216 @@ +#!/usr/bin/env python3 +""" +True QLoRA training test: 4-bit NF4 base model + fp32 LoRA adapters. + +Uses bitsandbytes for NF4 quantization. Tests loading + one training step. + +Usage: + pip install bitsandbytes + python3 train_qlora_test.py + +This script: + 1. Loads the base model + 2. Replaces frozen Linear layers with 4-bit NF4 equivalents + 3. Runs one training batch to verify it works + 4. 
Reports memory usage at each step +""" + +import sys +import os +import gc +import time +import psutil + +TOOLKIT_DIR = os.path.join(os.path.dirname(__file__), "adapter_training_toolkit_v26_0_0") +sys.path.insert(0, TOOLKIT_DIR) + +import torch +import tamm.utils.json +from pathlib import Path + +ASSETS = Path(TOOLKIT_DIR) / "assets" + + +def mem(): + return psutil.Process().memory_info().rss / 1024**3 + +def gpu_mem(): + if torch.cuda.is_available(): + return torch.cuda.memory_allocated() / 1024**3 + elif torch.backends.mps.is_available(): + return torch.mps.current_allocated_memory() / 1024**3 + return 0 + +def get_device(): + if torch.cuda.is_available(): + return torch.device("cuda") + if torch.backends.mps.is_available(): + return torch.device("mps") + return torch.device("cpu") + + +def quantize_linear_to_4bit(model): + """Replace frozen nn.Linear layers with bitsandbytes 4-bit Linear.""" + try: + import bitsandbytes as bnb + except ImportError: + print("ERROR: pip install bitsandbytes") + sys.exit(1) + + quantized = 0 + skipped = 0 + + # Collect replacements (can't modify during iteration) + replacements = [] + for name, module in model.named_modules(): + if not isinstance(module, torch.nn.Linear): + continue + if "adapter" in name: + skipped += 1 + continue + if any(p.requires_grad for p in module.parameters()): + skipped += 1 + continue + replacements.append((name, module)) + + # Apply replacements + for name, module in replacements: + # Create 4-bit linear + new_module = bnb.nn.Linear4bit( + module.in_features, + module.out_features, + bias=module.bias is not None, + compute_dtype=torch.float16, + quant_type="nf4", + ) + + # Quantize weights + new_module.weight = bnb.nn.Params4bit( + module.weight.data, + requires_grad=False, + quant_type="nf4", + compress_statistics=True, + ) + if module.bias is not None: + new_module.bias = module.bias + + # Replace in parent module + parts = name.rsplit(".", 1) + if len(parts) == 2: + parent_name, child_name = parts + parent = dict(model.named_modules())[parent_name] + setattr(parent, child_name, new_module) + else: + setattr(model, name, new_module) + + quantized += 1 + + # Free memory + gc.collect() + if torch.cuda.is_available(): + torch.cuda.empty_cache() + + print(f"QLoRA: quantized {quantized} layers to NF4, skipped {skipped}") + return model + + +def main(): + device = get_device() + print(f"Device: {device}") + print(f"System RAM: {psutil.virtual_memory().total / 1024**3:.0f}GB") + print(f"Before: RAM={mem():.1f}GB, GPU={gpu_mem():.1f}GB") + + # Step 1: Load model config + with open(ASSETS / "base-model-config.json") as f: + config = tamm.utils.json.load(f) + + # Step 2: Create model on CPU + print("\n--- Creating model ---") + model = config.create_model() + print(f"After create_model: RAM={mem():.1f}GB, GPU={gpu_mem():.1f}GB") + + # Step 3: Load weights via mmap + print("\n--- Loading weights (mmap) ---") + sd = torch.load(str(ASSETS / "base-model.pt"), map_location="cpu", mmap=True, weights_only=False) + model.load_state_dict(sd, strict=True) + del sd; gc.collect() + print(f"After load+del: RAM={mem():.1f}GB, GPU={gpu_mem():.1f}GB") + + # Step 4: Freeze non-adapter params + for name, param in model.named_parameters(): + param.requires_grad = "adapter" in name + + trainable_before = sum(p.numel() for p in model.parameters() if p.requires_grad) + frozen_before = sum(p.numel() for p in model.parameters() if not p.requires_grad) + print(f"Trainable: {trainable_before/1e6:.0f}M, Frozen: {frozen_before/1e6:.0f}M") + + # Step 5: Quantize 
frozen layers to 4-bit NF4 + print("\n--- Quantizing to NF4 ---") + model = quantize_linear_to_4bit(model) + gc.collect() + print(f"After quantize: RAM={mem():.1f}GB, GPU={gpu_mem():.1f}GB") + + # Step 6: Move to device + print(f"\n--- Moving to {device} ---") + model = model.to(device) + gc.collect() + if torch.cuda.is_available(): + torch.cuda.empty_cache() + print(f"After to({device}): RAM={mem():.1f}GB, GPU={gpu_mem():.1f}GB") + + # Step 7: Verify trainable params are fp32 + for name, param in model.named_parameters(): + if param.requires_grad and "adapter" in name: + if param.dtype != torch.float32: + param.data = param.data.float() + + trainable_after = sum(p.numel() for p in model.parameters() if p.requires_grad) + print(f"Trainable params: {trainable_after/1e6:.0f}M") + + # Step 8: Test one forward + backward pass + print("\n--- Test forward/backward ---") + try: + tokenizer_path = ASSETS / "tokenizer.model" + from tamm.tokenizers.afm import AFMTokenizer + tokenizer = AFMTokenizer(str(tokenizer_path)) + + # Create a simple input + text = "Output a single shell command for zsh on macOS.\nfind files changed in the last hour" + tokens = tokenizer.encode(text) + input_ids = torch.tensor([tokens[:50]], device=device) + labels = input_ids.clone() + + # Forward pass + output = model(input_ids) + if hasattr(output, 'logits'): + logits = output.logits + else: + logits = output + + # Compute loss + loss_fn = torch.nn.CrossEntropyLoss() + shift_logits = logits[:, :-1, :].contiguous() + shift_labels = labels[:, 1:].contiguous() + loss = loss_fn(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1)) + print(f"Loss: {loss.item():.4f}") + + # Backward pass + loss.backward() + print(f"After backward: RAM={mem():.1f}GB, GPU={gpu_mem():.1f}GB") + + # Check gradients exist on adapter params + grad_count = sum(1 for p in model.parameters() if p.grad is not None) + print(f"Params with gradients: {grad_count}") + + print("\nSUCCESS: QLoRA forward + backward works!") + + except Exception as e: + print(f"\nFailed at forward/backward: {e}") + import traceback + traceback.print_exc() + + print(f"\nFinal: RAM={mem():.1f}GB, GPU={gpu_mem():.1f}GB") + + +if __name__ == "__main__": + main()
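After the smoke test passes, one optional sanity check before export (a hypothetical snippet, not part of either script; the checkpoint path is illustrative) is to confirm that a saved checkpoint contains only the flat adapter state dict written by `save_adapter_checkpoint`:

```python
# Hypothetical sanity check (not part of the toolkit); the path is illustrative.
import torch

ckpt = torch.load("qlora-override-checkpoints/adapter-final.pt",
                  map_location="cpu", weights_only=False)

# save_adapter_checkpoint writes a flat dict of fp32 adapter tensors.
assert all("adapter" in k for k in ckpt), "unexpected non-adapter keys"
n_params = sum(v.numel() for v in ckpt.values())
size_mb = sum(v.numel() * v.element_size() for v in ckpt.values()) / 1024 ** 2
print(f"{len(ckpt)} adapter tensors, {n_params / 1e6:.0f}M params, ~{size_mb:.0f}MB")
# Expect roughly 67M params and ~254MB, matching the checkpoint sizes logged above.
```

If the assertion fails or the size is far off, the checkpoint likely includes frozen or quantized base weights and `export_fmadapter` is the wrong next step.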