diff --git a/.gitignore b/.gitignore index 4a7805a..d8ded18 100644 --- a/.gitignore +++ b/.gitignore @@ -16,3 +16,10 @@ __pycache__/ *.tar.gz CLAUDE.md benchmark/GUIDED_GENERATION.md +training/train.jsonl +training/eval.jsonl +training/adapter_training_toolkit* +training/README.md +training/exports/ +training/qlora-checkpoints/ +training/bench_mps_results.jsonl diff --git a/README.md b/README.md index cbd9bed..480f58f 100644 --- a/README.md +++ b/README.md @@ -171,6 +171,18 @@ make install This clones [tldr-pages](https://github.com/tldr-pages/tldr), parses all entries into Q/A pairs, adds macOS-specific overrides, and rebuilds the FTS5 index. +## LoRA Adapter Training (experimental) + +The `training/` directory contains infrastructure for fine-tuning Apple's on-device 3B model using LoRA adapters. QLoRA training works on a free Colab T4 or locally on a 24GB Mac. See `training/TRAINING.md` for full details, results, and notebooks. + +```bash +hunch --adapter path/to/hunch.fmadapter "find files changed in the last hour" +``` + +Current finding: adapter + retrieval reaches ~86% accuracy (vs ~79% retrieval alone). QLoRA matches full LoRA quality, and Mac-trained adapters match T4-trained. + +> **Known bug (as of April 2026):** Apple's `TGOnDeviceInferenceProviderService` caches a full copy of the adapter (~160MB) on every CLI invocation and never cleans up. Repeated adapter calls from CLI tools can consume significant disk space. Apple has confirmed this as a known bug specific to CLI tools. See `training/adapter-disk-leak-findings.md` for details and workaround. + ## Known limitations - **4K token context window** — the system prompt + 8 examples + query + output must fit. Current prompts use ~200-400 tokens, well within budget. diff --git a/benchmark/REVIEW_CRITERIA.md b/benchmark/REVIEW_CRITERIA.md new file mode 100644 index 0000000..8e9ac7e --- /dev/null +++ b/benchmark/REVIEW_CRITERIA.md @@ -0,0 +1,95 @@ +# Benchmark Review Criteria + +Rules for deciding whether a non-exact result is "functionally correct" and should be added to alternates.json. Apply these consistently across all reviews. 
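+
+For reference, `alternates.json` maps each benchmark prompt ID (a string key) to the list of command strings accepted for that prompt. A minimal excerpt, using entry 42 as it appears elsewhere in this change:
+
+```json
+{
+  "42": [
+    "tail -n 50 file",
+    "tail -50 file"
+  ]
+}
+```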
+ +## ACCEPT — add to alternates.json + +### Placeholder variations +Different placeholder names for the same command structure: +- `file` vs `filename` vs `file.txt` — accept +- `src dst` vs `source destination` vs `source_directory destination_directory` — accept +- `user@host` vs `user@server` vs `username@remote_host` — accept +- `example.com` vs `api.example.com` vs `localhost:8000` — accept + +### Quote style +- Single vs double quotes: `'*.png'` vs `"*.png"` — accept +- With or without quotes when not ambiguous: `-name .DS_Store` vs `-name '.DS_Store'` — accept + +### Flag reordering +Same flags in different order: +- `tar -xvzf` vs `tar -zxvf` — accept +- `rsync -avz` vs `rsync -avzh` (extra harmless flag) — accept cautiously + +### Harmless extra flags +Flags that don't change the core behavior: +- `tar -czvf` (verbose) vs `tar -czf` — accept +- `cp -R` vs `cp -r` (same on macOS) — accept +- Adding `--progress` to rsync — accept + +### Format variations +Same result, slightly different format: +- `git log --oneline` vs `git log --pretty=oneline` — accept +- `echo $SHELL` vs `echo $0` — accept (both show shell) + +## REJECT — do not add to alternates.json + +### Wrong command entirely +- `system_profiler` for "monitor cpu usage" (should be `top`) — reject +- `pbcopy` for "paste from clipboard" (that's copy, not paste) — reject +- `cls` for "clear terminal" (Windows command) — reject + +### Wrong flags that change meaning +- `find . -mtime -60` for "files changed in last hour" (`-mtime` is days, not minutes) — reject +- `find . -mtime +1` for "files modified today" (opposite: MORE than 1 day ago) — reject +- `head -50` for "last 50 lines" (head shows FIRST, not last) — reject +- `tail -n 20` for "first 20 lines" (tail shows LAST, not first) — reject + +### Missing critical parts +- `cp -r directory` (missing destination) — reject +- `find .DS_Store -delete` (missing `.` path, only current dir entry) — reject +- `zip -r .` (missing output filename) — reject +- `ssh user@server` (missing `-i key` when prompt asks for specific key) — reject + +### Hallucinated commands/flags +- `git log --no-pushed` (not a real flag) — reject +- `git rename-branch` (not a real command) — reject +- `find . -type symlink` (invalid type, should be `l`) — reject +- `link -s` (not the same as `ln -s`) — reject +- `zipdir`, `pylist`, `mcal` — reject + +### Broadened scope +- `find . -empty` for "find empty directories" (also finds empty files) — reject +- `find . -name node_modules` for "find directories named node_modules" (also finds files) — accept only with `-type d` +- `git branch --merged | xargs git branch -d` without `grep -v main` (would delete main) — reject + +### Functionally different approach +- `comm -12 <(sort file1) <(sort file2)` for "compare two files" (shows common lines, not differences) — reject +- `du -sh /` for "show disk usage" (directory usage, not filesystem usage like `df`) — reject +- `find . -name '*.py' | wc -l` for "count lines in python files" (counts FILES, not lines IN them) — reject + +### Piped through unnecessary commands +- `cat file | head -20` for "first 20 lines" — accept (useless cat but correct) +- `find ... | wc -l` when it should be `find ... -exec wc -l` — reject (counts files not lines) + +## EDGE CASES + +### `find . -empty` for "find empty directories" +REJECT. `-empty` matches both empty files and directories. The prompt specifically asks for directories. Need `-type d -empty`. 
+ +### `sips -s format jpeg input.jpg --out output.jpg` (same format in and out) +REJECT. The prompt says "convert to different format." While the command structure is correct, the example converts jpg→jpg. Accept only if input and output formats differ. + +### `sips -s format jpg` (without `--out`) +REJECT. `jpg` is not a valid sips format name (should be `jpeg`). + +### curl POST with different URLs/bodies +ACCEPT if structure is correct: has `-X POST`, has `-H "Content-Type: application/json"`, has `-d`. Different URLs and body content are just placeholder variations. + +### rsync with `--delete` +REJECT. Adding `--delete` removes files at destination that don't exist at source. That's a meaningfully different and potentially destructive operation. + +### `caffeinate -t 3600` for "prevent mac from sleeping" +ACCEPT. Keeps awake for 1 hour — reasonable interpretation. + +### `env | grep PATH` vs `export PATH` +ACCEPT both. Different mechanisms but both show PATH. diff --git a/benchmark/alternates.json b/benchmark/alternates.json index 239acf9..5079c1d 100644 --- a/benchmark/alternates.json +++ b/benchmark/alternates.json @@ -10,7 +10,9 @@ "ls", "ls -la", "ls -a", - "ls -l" + "ls -l", + "ls -1", + "ls ." ], "3": [ "df -h", @@ -62,7 +64,8 @@ "find . -name '*.png'", "find . -name \"*.png\"", "find . -iname '*.png'", - "find . -type f -name '*.png'" + "find . -type f -name '*.png'", + "find . -type f -name \"*.png\"" ], "14": [ "find . -type d -empty", @@ -83,7 +86,9 @@ "find . -name '.DS_Store' -delete", "find . -name .DS_Store -delete", "find . -name '.DS_Store' -exec rm {} +", - "find . -name '.DS_Store' -exec rm {} \\;" + "find . -name '.DS_Store' -exec rm {} \\;", + "find . -name \".DS_Store\" -delete", + "find . -name \".DS_Store\" -exec rm {} \\;" ], "18": [ "find . -type l" @@ -97,13 +102,17 @@ "find . -type d -name 'node_modules'", "find -name 'node_modules'", "find . -name 'node_modules'", - "find . -name \"node_modules\"" + "find . -name \"node_modules\"", + "find . -name node_modules" ], "21": [ "find . -name '*.py' -exec wc -l {} +", "find . -name '*.py' | xargs wc -l", "wc -l **/*.py", - "find . -name '*.py' -exec wc -l {} \\;" + "find . -name '*.py' -exec wc -l {} \\;", + "find . -name \"*.py\" -exec wc -l {} +", + "find . -name \"*.py\" | xargs wc -l", + "wc -l *.py" ], "22": [ "du -sh * | sort -hr", @@ -113,7 +122,8 @@ "23": [ "kill $(lsof -t -i :3000)", "lsof -t -i :3000 | xargs kill", - "fuser -k 3000/tcp" + "fuser -k 3000/tcp", + "kill $(lsof -t -i :3000 )" ], "24": [ "find . 
-size +1G", @@ -139,7 +149,15 @@ "tar -czf compressed_folder.tar.gz ./", "tar -czf folder.tar.gz /path/to/folder", "tar czf archive.tar.gz /path/to/folder", - "tar czf folder.tar.gz folder" + "tar czf folder.tar.gz folder", + "tar -czf file.tar.gz folder", + "tar -czf folder.tar.gz .", + "tar -czf folder.tar.gz folder", + "tar -czvf folder.tar.gz folder", + "tar -czf path/to/compressed.tar.gz path/to/folder", + "tar -czvf /path/to/output.tar.gz /path/to/folder", + "tar -czf file.tar.gz .", + "tar -czf archive.tar.gz folder" ], "28": [ "tar xzf file.tar.gz", @@ -148,14 +166,22 @@ "tar -xf archive.tar.gz", "tar xzvf archive.tar.gz", "tar -xvzf file.tar.gz", - "tar xvf file.tar.gz" + "tar xvf file.tar.gz", + "tar -xvf archive.tar.gz", + "tar -xvzf archive.tar.gz", + "tar -xvzf filename.tar.gz", + "tar -xzvf file.tar.gz", + "tar -zxvf archive.tar.gz", + "tar -zxvf file.tar.gz", + "tar -xvf file.tar.gz" ], "29": [ "git branch --sort=-committerdate", "git branch -a --sort=-committerdate" ], "30": [ - "git log --oneline" + "git log --oneline", + "git log --pretty=oneline" ], "31": [ "git diff --staged", @@ -169,14 +195,16 @@ "33": [ "git log origin/main..HEAD", "git log origin/master..HEAD", - "git log --oneline origin/main..HEAD" + "git log --oneline origin/main..HEAD", + "git log origin/main..HEAD --oneline" ], "34": [ "git branch -m old new", "git branch -m oldname newname", "git branch -m ", "git branch -m old_branch_name new_branch_name", - "git branch -m new_branch_name" + "git branch -m new_branch_name", + "git branch -m branch_name1 branch_name2" ], "35": [ "git branch --merged | grep -v main | xargs git branch -d", @@ -185,7 +213,8 @@ "36": [ "netstat -an", "netstat", - "lsof -i" + "lsof -i", + "netstat -ln" ], "37": [ "lsof -i :8080", @@ -197,7 +226,12 @@ "curl -o file https://example.com/file", "wget https://example.com/file", "curl -O url", - "curl -o filename url" + "curl -o filename url", + "curl -O https://example.com/file.txt", + "curl -O https://example.com/file.zip", + "curl -o file.zip https://example.com/file.zip", + "wget https://example.com/file.pdf", + "wget https://example.com/file.zip -O file.zip" ], "39": [ "curl -I https://example.com", @@ -210,7 +244,11 @@ "curl -X POST https://example.com/api/endpoint -H 'Content-Type: application/json' -d '{\"key\": \"value\"}'", "curl -X POST -H 'Content-Type: application/json' -d '{\"key\": \"value\"}' https://example.com", "curl -X POST https://example.com -H 'Content-Type: application/json' -d '{\"key1\": \"value1\", \"key2\": \"value2\"}'", - "curl -X POST -H 'Content-Type: application/json' -d '{\"name\": \"john\", \"age\": 25}' https://example.com" + "curl -X POST -H 'Content-Type: application/json' -d '{\"name\": \"john\", \"age\": 25}' https://example.com", + "curl -X POST -H \"Content-Type: application/json\" -d '{\"key\": \"value\"}' https://example.com", + "curl -X POST http://localhost:8000/api/endpoint -H \"Content-Type: application/json\" -d '{\"key\": \"value\"}'", + "curl -X POST http://localhost:8000/api/post -H \"Content-Type: application/json\" -d '{\"key\":\"value\"}'", + "curl -X POST https://api.example.com/endpoint -H \"Content-Type: application/json\" -d '{\"key1\": \"value1\", \"key2\": \"value2\"}'" ], "41": [ "tail -f logfile", @@ -224,7 +262,9 @@ ], "42": [ "tail -n 50 file", - "tail -50 file" + "tail -50 file", + "tail -50 filename", + "tail -n 50 filename" ], "43": [ "ls | wc -l", @@ -249,25 +289,34 @@ "cp -R src dst", "cp -a src dst", "cp -r source_directory destination_directory", - "cp -r 
path/to/source_directory path/to/target_directory" + "cp -r path/to/source_directory path/to/target_directory", + "cp -R source_directory destination_directory", + "cp -r src/ dst/", + "cp -r src/ dest/" ], "48": [ "mkdir -p path/to/dir", "mkdir -p /path/to/create/directory", "mkdir -p \"path/to/directory\"", - "mkdir -p parent_directory_path" + "mkdir -p parent_directory_path", + "mkdir -p /path/to/directory", + "mkdir -p path/to/directory", + "mkdir -p directory", + "mkdir -p directory_name" ], "49": [ "chmod +x file", "chmod 755 file", - "chmod +x executable" + "chmod +x executable", + "chmod +x filename" ], "50": [ "stat -f '%A' file", "stat -f '%Lp' file" ], "51": [ - "find . -type f -exec md5 {} + | sort | uniq -d" + "find . -type f -exec md5 {} + | sort | uniq -d", + "find . -type f -exec md5 {} + | sort | uniq -d | sort" ], "52": [ "top", @@ -287,18 +336,23 @@ "55": [ "pkill processname", "killall processname", - "pkill process-name" + "pkill process-name", + "pkill myprocess", + "pkill bash", + "pkill shell_name" ], "56": [ "dig example.com", "nslookup example.com", - "host example.com" + "host example.com", + "dig domain.com" ], "57": [ "nc -zv host 80", "nc -z host 80", "nmap -p 80 host", - "nc -zv hostname port" + "nc -zv hostname port", + "nc -zv hostname 80" ], "58": [ "openssl rand -base64 32", @@ -307,7 +361,8 @@ ], "59": [ "md5 file", - "md5sum file" + "md5sum file", + "md5 file.txt" ], "60": [ "shasum -a 256 file", @@ -336,17 +391,22 @@ ], "65": [ "caffeinate", - "caffeinate -d" + "caffeinate -d", + "caffeinate -t 3600", + "caffeinate -t 86400" ], "66": [ "say hello", "say 'hello'", "say \"hello\"", "say 'Hello, world!'", - "say 'hello world'" + "say 'hello world'", + "say \"Hello, world!\"", + "say \"hello world\"" ], "67": [ - "pmset -g batt" + "pmset -g batt", + "pmset -g" ], "68": [ "sudo dscacheutil -flushcache", @@ -380,7 +440,10 @@ "75": [ "sips -s format png input.jpg --out output.png", "convert input.jpg output.png", - "convert path/to/input_image.jpg path/to/output_image.png" + "convert path/to/input_image.jpg path/to/output_image.png", + "sips -s format jpeg input.png --out output.jpeg", + "sips -s format png input.jpg", + "sips -s format webp input.jpg --out output.webp" ], "76": [ "sips --resampleWidth 800 image.jpg", @@ -405,7 +468,10 @@ "ln -s source_path target_path", "ln -s source destination", "ln -s /path/to/file /path/to/symlink", - "ln -s path/to/file_or_directory path/to/symlink" + "ln -s path/to/file_or_directory path/to/symlink", + "ln -s source_path destination_path", + "ln -s src dest", + "ln -s src dst" ], "81": [ "lsof -i -P -n | grep LISTEN", @@ -425,12 +491,16 @@ "git cherry-pick ", "git cherry-pick commit", "git cherry-pick ", - "git cherry-pick HEAD~1" + "git cherry-pick HEAD~1", + "git cherry-pick HEAD^", + "git cherry-pick commit-hash" ], "85": [ "ls -lh file", "ls -lh", - "du -sh file" + "du -sh file", + "du -h file", + "du -hs filename" ], "86": [ "find . 
-perm 777", @@ -441,20 +511,34 @@ "87": [ "head -n 20 file", "head -20 file", - "head -n 20 < filename" + "head -n 20 < filename", + "head -20 filename", + "cat file | head -20" ], "88": [ "ssh -i key.pem user@host", "ssh -i path/to/key_file username@remote_host", "ssh -i /path/to/key username@host", "ssh username@host -i path/to/key", - "ssh username@remote_host -i path/to/key_file" + "ssh username@remote_host -i path/to/key_file", + "ssh user@hostname -i path/to/key_file.pem", + "ssh user@server -i path/to/key.pem", + "ssh user@server -i path/to/key_file.pem", + "ssh user@server.example.com -i ~/.ssh/id_rsa", + "ssh user@host -i path/to/key", + "ssh user@host -i path/to/key_file.pem" ], "89": [ "rsync -avz src/ user@host:dst/", "rsync -avz src/ user@host:dst", "rsync -avz /path/to/source /path/to/destination", - "rsync -avz source_path destination_path" + "rsync -avz source_path destination_path", + "rsync -avz source_directory remote_server", + "rsync -avz . remote_server:destination_directory", + "rsync -avz --progress source_directory remote_server", + "rsync -avz --progress src_dir remote_server", + "rsync -avz /path/to/local/directory remote_server:destination_directory", + "rsync -avz /path/to/local/directory remote_server:path/to/remote/directory" ], "90": [ "crontab -l" @@ -462,7 +546,9 @@ "91": [ "grep -ri pattern .", "grep -ri 'pattern' .", - "grep -rni pattern ." + "grep -rni pattern .", + "find . -type f -exec grep -ri 'pattern' {} +", + "find . -type f -exec grep -ri 'pattern' +" ], "92": [ "wc file", @@ -473,7 +559,16 @@ "zip -r archive.zip directory", "zip -r archive.zip dir", "zip -r archive.zip directory/", - "zip -r /path/to/directory.zip /path/to/directory" + "zip -r /path/to/directory.zip /path/to/directory", + "zip -r archive.zip .", + "zip -r archive.zip ./", + "zip -r directory_name.zip directory", + "zip -r file.zip .", + "zip -r file.zip ./", + "zip -r mydir.zip ./mydir", + "zip -r myfile.zip mydir", + "zip -r archive.zip directory_to_zip", + "zip -r file.zip directory" ], "94": [ "unzip file.zip -d directory", @@ -483,7 +578,14 @@ "unzip -d /path/to/output file.zip", "unzip -d destination file.zip", "unzip filename -d destination_directory", - "unzip filename -d destination" + "unzip filename -d destination", + "unzip -d target_dir filename", + "unzip file.zip -d /path/to/directory", + "unzip file.zip -d /path/to/unzip", + "unzip file.zip -d destination", + "unzip file.zip -d destination_directory", + "unzip file.zip -d /path/to/unzipped", + "unzip file.zip -d destination/" ], "95": [ "system_profiler SPHardwareDataType" @@ -492,7 +594,9 @@ "system_profiler SPUSBDataType" ], "97": [ - "date -r 1700000000" + "date -r 1700000000", + "date -r $TIMESTAMP", + "date -r $UNIX_TIMESTAMP" ], "98": [ "stat -f '%B' file | xargs date -r", @@ -503,11 +607,15 @@ "echo -n 'string' | base64", "printf 'string' | base64", "echo 'string' | base64", - "echo -n 'text' | base64" + "echo -n 'text' | base64", + "echo 'input string' | base64", + "echo 'your string' | base64", + "echo 'your_string' | base64" ], "100": [ "env | grep PATH", "printenv | grep PATH", - "echo $PATH" + "echo $PATH", + "export PATH" ] } \ No newline at end of file diff --git a/benchmark/run.py b/benchmark/run.py index b7365ed..6a1da41 100755 --- a/benchmark/run.py +++ b/benchmark/run.py @@ -504,6 +504,135 @@ def approach_hunch_multi_warm(prompt): return _run_hunch(prompt, ["--guided", "multi", "--temperature", "0.3"]) +ADAPTER_PATH = str(Path(__file__).parent.parent / "training" / "exports" / "hunch.fmadapter") 
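+# Additional adapter variants (fp16 and NF4 QLoRA, override-only QLoRA/LoRA) exported by training/; see training/TRAINING.md.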
+QLORA_FP16_ADAPTER_PATH = str(Path(__file__).parent.parent / "training" / "qlora-checkpoints" / "hunch_qlora_fp16.fmadapter") +QLORA_NF4_ADAPTER_PATH = str(Path(__file__).parent.parent / "training" / "qlora-checkpoints" / "hunch_qlora.fmadapter") +QLORA_OVERRIDE_ADAPTER_PATH = str(Path(__file__).parent.parent / "training" / "qlora-checkpoints" / "hunch_qlora_overrides.fmadapter") +LORA_OVERRIDE_ADAPTER_PATH = str(Path(__file__).parent.parent / "training" / "exports" / "hunch_overrides.fmadapter") + + +def _run_hunch_batch(prompts, extra_args=None, runs=1): + """Run all prompts in a single hunch process using --batch mode. + + This avoids the TGOnDeviceInferenceProviderService disk leak where each + process invocation caches a ~160MB copy of the adapter. + + Returns: dict keyed by (run, id) if runs > 1, or by id if runs == 1. + """ + # Write prompts to a temp JSONL file + import tempfile + with tempfile.NamedTemporaryFile(mode="w", suffix=".jsonl", delete=False) as f: + for p in prompts: + f.write(json.dumps({"id": p["id"], "prompt": p["prompt"]}) + "\n") + batch_path = f.name + + cmd = ["hunch"] + if extra_args: + cmd.extend(extra_args) + cmd.extend(["--batch", batch_path]) + if runs > 1: + cmd.extend(["--runs", str(runs)]) + + try: + proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True) + results = {} + count = 0 + total = len(prompts) * runs + for line in proc.stdout: + line = line.strip() + if not line: + continue + try: + r = json.loads(line) + count += 1 + status = r.get("result", "")[:40] + print(f" [{count}/{total}] #{r.get('id', '?'):3d}: {r.get('prompt', '')[:50]:50s} → {status} ({r.get('total_time', 0)}s)") + if runs > 1: + results[(r["run"], r["id"])] = r + else: + results[r["id"]] = r + except (json.JSONDecodeError, KeyError): + continue + proc.wait() + return results + except Exception: + return {} + finally: + os.unlink(batch_path) + + +def _make_batch_approach(extra_args): + """Create a batch-aware approach function for adapter benchmarks.""" + def approach(prompt): + # Fallback for single-prompt calls (e.g. 
--ids) + return _run_hunch(prompt, extra_args) + approach._batch_args = extra_args + return approach + + +def approach_adapter_only(prompt): + """LoRA adapter only, no retrieval.""" + return _run_hunch(prompt, ["--adapter", ADAPTER_PATH, "--limit", "0"]) +approach_adapter_only._batch_args = ["--adapter", ADAPTER_PATH, "--limit", "0"] + + +def approach_adapter_retrieval(prompt): + """LoRA adapter + retrieval.""" + return _run_hunch(prompt, ["--adapter", ADAPTER_PATH]) +approach_adapter_retrieval._batch_args = ["--adapter", ADAPTER_PATH] + + +def approach_fp16lora_only(prompt): + """fp16 LoRA adapter only, no retrieval.""" + return _run_hunch(prompt, ["--adapter", QLORA_FP16_ADAPTER_PATH, "--limit", "0"]) +approach_fp16lora_only._batch_args = ["--adapter", QLORA_FP16_ADAPTER_PATH, "--limit", "0"] + + +def approach_fp16lora_retrieval(prompt): + """fp16 LoRA adapter + retrieval.""" + return _run_hunch(prompt, ["--adapter", QLORA_FP16_ADAPTER_PATH]) +approach_fp16lora_retrieval._batch_args = ["--adapter", QLORA_FP16_ADAPTER_PATH] + + +def approach_qlora_only(prompt): + """True QLoRA (NF4) adapter only, no retrieval.""" + return _run_hunch(prompt, ["--adapter", QLORA_NF4_ADAPTER_PATH, "--limit", "0"]) +approach_qlora_only._batch_args = ["--adapter", QLORA_NF4_ADAPTER_PATH, "--limit", "0"] + + +def approach_qlora_retrieval(prompt): + """True QLoRA (NF4) adapter + retrieval.""" + return _run_hunch(prompt, ["--adapter", QLORA_NF4_ADAPTER_PATH]) +approach_qlora_retrieval._batch_args = ["--adapter", QLORA_NF4_ADAPTER_PATH] + + +def approach_qlora_override_only(prompt): + """QLoRA trained on overrides only, no retrieval.""" + return _run_hunch(prompt, ["--adapter", QLORA_OVERRIDE_ADAPTER_PATH, "--limit", "0"]) +approach_qlora_override_only._batch_args = ["--adapter", QLORA_OVERRIDE_ADAPTER_PATH, "--limit", "0"] + + +def approach_qlora_override_retrieval(prompt): + """QLoRA trained on overrides only + retrieval.""" + return _run_hunch(prompt, ["--adapter", QLORA_OVERRIDE_ADAPTER_PATH]) +approach_qlora_override_retrieval._batch_args = ["--adapter", QLORA_OVERRIDE_ADAPTER_PATH] + + +QLORA_MPS_ADAPTER_PATH = str(Path(__file__).parent.parent / "training" / "qlora-checkpoints" / "hunch_qlora_mps.fmadapter") + + +def approach_qlora_mps_only(prompt): + """QLoRA trained on Mac (MPS), no retrieval.""" + return _run_hunch(prompt, ["--adapter", QLORA_MPS_ADAPTER_PATH, "--limit", "0"]) +approach_qlora_mps_only._batch_args = ["--adapter", QLORA_MPS_ADAPTER_PATH, "--limit", "0"] + + +def approach_qlora_mps_retrieval(prompt): + """QLoRA trained on Mac (MPS) + retrieval.""" + return _run_hunch(prompt, ["--adapter", QLORA_MPS_ADAPTER_PATH]) +approach_qlora_mps_retrieval._batch_args = ["--adapter", QLORA_MPS_ADAPTER_PATH] + + def approach_dynshot_tldr(prompt): """Dynamic few-shot using tldr+overrides FTS5 index (21k entries).""" import sqlite3 @@ -576,6 +705,18 @@ def approach_dynshot_holdout(prompt): "hunch-multi": approach_hunch_multi, "hunch-cotmulti": approach_hunch_cotmulti, "hunch-multi-warm": approach_hunch_multi_warm, + "adapter-only": approach_adapter_only, + "adapter-retrieval": approach_adapter_retrieval, + "fp16lora-only": approach_fp16lora_only, + "fp16lora-retrieval": approach_fp16lora_retrieval, + "qlora-only": approach_qlora_only, + "qlora-retrieval": approach_qlora_retrieval, + "qlora-override-only": approach_qlora_override_only, + "qlora-override-retrieval": approach_qlora_override_retrieval, + "qlora-mps-only": approach_qlora_mps_only, + "qlora-mps-retrieval": approach_qlora_mps_retrieval, + 
"lora-override-only": _make_batch_approach(["--adapter", LORA_OVERRIDE_ADAPTER_PATH, "--limit", "0"]), + "lora-override-retrieval": _make_batch_approach(["--adapter", LORA_OVERRIDE_ADAPTER_PATH]), "hunch-sc": approach_hunch_sc, "sc-dynshot": approach_selfconsist_dynshot, "sc-warm": approach_selfconsist_warm, @@ -595,14 +736,51 @@ def load_prompts(ids=None, category=None): return prompts -def run_benchmark(approach_name, prompts): +def run_benchmark(approach_name, prompts, suffix="", runs=1): func = APPROACHES[approach_name] - outfile = RESULTS_DIR / f"{approach_name}.jsonl" print(f"\n{'=' * 60}") - print(f" APPROACH: {approach_name} ({len(prompts)} prompts)") + print(f" APPROACH: {approach_name} ({len(prompts)} prompts{f', {runs} runs' if runs > 1 else ''})") print(f"{'=' * 60}") + # Use batch mode for adapter approaches (avoids disk leak) + batch_args = getattr(func, '_batch_args', None) + if batch_args and len(prompts) > 1: + print(f" Using --batch mode (single process, avoids adapter disk leak)") + batch_results = _run_hunch_batch(prompts, batch_args, runs=runs) + + all_results = [] + for run_num in range(1, runs + 1): + run_suffix = f"-run{run_num}" if runs > 1 else "" + outfile = RESULTS_DIR / f"{approach_name}{suffix}{run_suffix}.jsonl" + + results = [] + with open(outfile, "w") as f: + for p in prompts: + if runs > 1: + br = batch_results.get((run_num, p["id"]), {}) + else: + br = batch_results.get(p["id"], {}) + r = { + "result": br.get("result", "[BATCH_ERROR]"), + "total_time": br.get("total_time", 0), + } + r["id"] = p["id"] + r["approach"] = approach_name + r["prompt"] = p["prompt"] + r["expected"] = p["expected"] + r["category"] = p["category"] + + f.write(json.dumps(r) + "\n") + f.flush() + results.append(r) + + print(f" Saved to {outfile}") + all_results.extend(results) + + return all_results + + outfile = RESULTS_DIR / f"{approach_name}{suffix}.jsonl" results = [] with open(outfile, "w") as f: for i, p in enumerate(prompts): @@ -637,6 +815,7 @@ def main(): parser.add_argument("approach", nargs="?", default="all", help="Approach name or 'all'") parser.add_argument("--ids", help="Comma-separated prompt IDs") parser.add_argument("--category", help="Filter by category: simple, flags, composed") + parser.add_argument("--runs", type=int, default=1, help="Number of runs (output files suffixed -run1, -run2, ...)") args = parser.parse_args() ids = [int(x) for x in args.ids.split(",")] if args.ids else None @@ -655,7 +834,22 @@ def main(): if a not in APPROACHES: print(f"Unknown approach: {a}. Available: {', '.join(APPROACHES.keys())}") sys.exit(1) - run_benchmark(a, prompts) + + for a in approaches: + func = APPROACHES[a] + batch_args = getattr(func, '_batch_args', None) + if batch_args and args.runs > 1: + # Adapter approaches: all runs in one process + run_benchmark(a, prompts, runs=args.runs) + elif args.runs > 1: + # Non-adapter approaches: loop externally + for run_num in range(1, args.runs + 1): + print(f"\n{'#' * 60}") + print(f" RUN {run_num}/{args.runs}") + print(f"{'#' * 60}") + run_benchmark(a, prompts, suffix=f"-run{run_num}") + else: + run_benchmark(a, prompts) print(f"\nDone. Run: python3 score.py") diff --git a/cli/Sources/Hunch/main.swift b/cli/Sources/Hunch/main.swift index 18c9c65..583d0b2 100644 --- a/cli/Sources/Hunch/main.swift +++ b/cli/Sources/Hunch/main.swift @@ -158,6 +158,9 @@ struct Hunch { let samples = parseFlag(&args, flag: "--samples").flatMap(Int.init) ?? 1 let limit = parseFlag(&args, flag: "--limit").flatMap(Int.init) ?? 
8 let guided = parseFlag(&args, flag: "--guided") + let adapterPath = parseFlag(&args, flag: "--adapter") + let batchFile = parseFlag(&args, flag: "--batch") + let batchRuns = parseFlag(&args, flag: "--runs").flatMap(Int.init) ?? 1 // Parse mode var mode: Mode = .suggest @@ -169,6 +172,20 @@ struct Hunch { args.removeFirst() } + // Batch mode: read prompts from JSONL, run all in one process + if let batchFile { + do { + try await runBatch( + file: batchFile, adapterPath: adapterPath, temperature: temperature, + limit: limit, guided: guided, runs: batchRuns + ) + } catch { + fputs("error: \(error.localizedDescription)\n", stderr) + exit(1) + } + return + } + guard !args.isEmpty else { printUsage() return @@ -238,9 +255,19 @@ struct Hunch { let systemPrompt = buildSystemPrompt(mode: mode, examples: examples) do { - let model = SystemLanguageModel( - guardrails: .permissiveContentTransformations - ) + let model: SystemLanguageModel + if let adapterPath { + let adapterURL = URL(fileURLWithPath: adapterPath) + let adapter = try SystemLanguageModel.Adapter(fileURL: adapterURL) + model = SystemLanguageModel( + adapter: adapter, + guardrails: .permissiveContentTransformations + ) + } else { + model = SystemLanguageModel( + guardrails: .permissiveContentTransformations + ) + } // Build generation options only when temperature is set let genOptions: GenerationOptions? = temperature.map { @@ -380,6 +407,127 @@ struct Hunch { } } + static func runBatch( + file: String, adapterPath: String?, temperature: Double?, + limit: Int, guided: String?, runs: Int = 1 + ) async throws { + // Read JSONL file + let contents = try String(contentsOfFile: file, encoding: .utf8) + let lines = contents.components(separatedBy: .newlines).filter { !$0.isEmpty } + + // Load model once + let model: SystemLanguageModel + if let adapterPath { + let adapterURL = URL(fileURLWithPath: adapterPath) + let adapter = try SystemLanguageModel.Adapter(fileURL: adapterURL) + model = SystemLanguageModel( + adapter: adapter, + guardrails: .permissiveContentTransformations + ) + } else { + model = SystemLanguageModel( + guardrails: .permissiveContentTransformations + ) + } + + let genOptions: GenerationOptions? = temperature.map { + var opts = GenerationOptions() + opts.temperature = $0 + return opts + } + + let dbPath = findDatabase() + + for run in 1...runs { + for line in lines { + guard let data = line.data(using: .utf8), + let entry = try? JSONSerialization.jsonObject(with: data) as? [String: Any], + let idValue = entry["id"], let id = idValue as? Int ?? (idValue as? NSNumber)?.intValue, + let prompt = entry["prompt"] as? String else { + continue + } + + let start = CFAbsoluteTimeGetCurrent() + var result: String + + do { + let examples = dbPath != nil + ? 
searchBank(dbPath: dbPath!, query: prompt, limit: limit)
+                        : []
+                    let systemPrompt = buildSystemPrompt(mode: .suggest, examples: examples)
+
+                    let session: LanguageModelSession
+                    if !systemPrompt.isEmpty {
+                        let segment = Transcript.TextSegment(content: systemPrompt)
+                        let instructions = Transcript.Instructions(
+                            segments: [.text(segment)],
+                            toolDefinitions: []
+                        )
+                        session = LanguageModelSession(
+                            model: model,
+                            transcript: Transcript(entries: [.instructions(instructions)])
+                        )
+                    } else {
+                        session = LanguageModelSession(model: model)
+                    }
+
+                    if guided == "plain" {
+                        let response: LanguageModelSession.Response<ShellCommand>
+                        if let opts = genOptions {
+                            response = try await session.respond(to: prompt, generating: ShellCommand.self, options: opts)
+                        } else {
+                            response = try await session.respond(to: prompt, generating: ShellCommand.self)
+                        }
+                        result = response.content.command
+                    } else if guided == "cot" {
+                        let response: LanguageModelSession.Response<ShellCommandCoT>
+                        if let opts = genOptions {
+                            response = try await session.respond(to: prompt, generating: ShellCommandCoT.self, options: opts)
+                        } else {
+                            response = try await session.respond(to: prompt, generating: ShellCommandCoT.self)
+                        }
+                        result = response.content.command
+                    } else if guided == "multi" {
+                        let response: LanguageModelSession.Response<ShellCommandMulti>
+                        if let opts = genOptions {
+                            response = try await session.respond(to: prompt, generating: ShellCommandMulti.self, options: opts)
+                        } else {
+                            response = try await session.respond(to: prompt, generating: ShellCommandMulti.self)
+                        }
+                        result = majorityVote([response.content.first, response.content.second, response.content.third])
+                    } else {
+                        // Default: plain string
+                        let response: LanguageModelSession.Response<String>
+                        if let opts = genOptions {
+                            response = try await session.respond(to: prompt, options: opts)
+                        } else {
+                            response = try await session.respond(to: prompt)
+                        }
+                        result = stripMarkdown(response.content)
+                    }
+                } catch {
+                    result = "[ERROR] \(error.localizedDescription)"
+                }
+
+                let elapsed = round((CFAbsoluteTimeGetCurrent() - start) * 100) / 100
+                var output: [String: Any] = [
+                    "id": id,
+                    "prompt": prompt,
+                    "result": result,
+                    "total_time": elapsed
+                ]
+                if runs > 1 {
+                    output["run"] = run
+                }
+                if let jsonData = try? JSONSerialization.data(withJSONObject: output),
+                   let jsonString = String(data: jsonData, encoding: .utf8) {
+                    print(jsonString)
+                    fflush(stdout)
+                }
+            }
+        }
+    }
+
     static func printUsage() {
         let dbStatus = findDatabase() != nil ? "found" : "not found"
         let envTemp = ProcessInfo.processInfo.environment["HUNCH_TEMPERATURE"] ?? "not set"
diff --git a/training/TRAINING.md b/training/TRAINING.md
new file mode 100644
index 0000000..b3c0b88
--- /dev/null
+++ b/training/TRAINING.md
@@ -0,0 +1,274 @@
+# Training Guide
+
+How to train a LoRA adapter for Apple's on-device 3B Foundation Model using the hunch dataset.
+
+## Prerequisites
+
+1. **Apple Developer Program** ($99/year) — needed to download the training toolkit
+2. **Adapter training toolkit** — download from [developer.apple.com/apple-intelligence/foundation-models-adapter/](https://developer.apple.com/apple-intelligence/foundation-models-adapter/)
+3. 
**Google account** — for Colab (free tier works for QLoRA and fp16 LoRA) + +## Files + +``` +training/ +├── train_lora.ipynb # LoRA training notebook (needs A100) +├── train_lora_fp16.ipynb # fp16 LoRA training notebook (works on free T4) +├── train_qlora.ipynb # QLoRA training notebook (works on free T4, recommended) +├── train_qlora_full.py # QLoRA training script (T4 or Mac) +├── train_qlora_test.py # Quick smoke test (load model, one forward/backward pass) +├── prepare_data.py # Converts hunch bank → training JSONL +├── bench_mps.py # Metal vs CPU fallback benchmark +└── TRAINING.md # This file +``` + +## Quick Start + +### 1. Download the toolkit + +Download from developer.apple.com, extract into this directory: + +``` +training/adapter_training_toolkit_v26_0_0/ +├── assets/ # Base model weights (12GB) +├── examples/ # Training scripts +├── export/ # .fmadapter export +└── requirements.txt +``` + +### 2. Choose your path + +| Path | GPU | Cost | VRAM | Time (overrides) | Time (full bank) | +|------|-----|------|------|------------------|------------------| +| **QLoRA on Mac** | Apple Silicon | **Free, local** | **~5GB** | **~34 min** | ~hours | +| QLoRA on Colab | T4 16GB | Free | ~5GB | ~5 min | ~1.7 hours | +| fp16 LoRA on Colab | T4 16GB | Free | ~8.5GB | ~10 min | ~2 hours | +| LoRA on Colab | A100 40GB | Colab Pro ($10/mo) | ~15GB | ~5 min | ~2.5 hours | + +**QLoRA is recommended.** Same adapter quality as full LoRA, lowest memory, fewest patches. Mac training is ~7x slower than T4 but fully local. + +### Path A: Train on Mac (recommended for small datasets) + +```bash +cd training/adapter_training_toolkit_v26_0_0 +source venv/bin/activate + +# Install native Metal kernel support for bitsandbytes +pip install kernels +pip install --force-reinstall git+https://github.com/bitsandbytes-foundation/bitsandbytes.git + +# Prepare data and train +cd .. +python3 prepare_data.py --sources override +python3 train_qlora_full.py --epochs 20 --batch-size 8 + +# Export — bitsandbytes from main pulls in PyTorch 2.11, but coremltools 8.3.0 +# ships native C extensions only for Python ≤3.13 and PyTorch ≤2.5. +# Create a separate env with compatible versions: +cd adapter_training_toolkit_v26_0_0 +python3.12 -m venv export-env +source export-env/bin/activate +pip install torch==2.5.0 coremltools==8.3.0 +python3 -m export.export_fmadapter \ + --adapter-name hunch_qlora \ + --checkpoint ../qlora-checkpoints/adapter-final.pt \ + --output-dir ../qlora-checkpoints/ +``` + +Notes: +- Requires bitsandbytes from git main (pre-v0.50.0) with native MPS kernels (PR #1875) +- The `kernels` package downloads pre-compiled Metal shaders from HuggingFace Hub at runtime +- Don't use `bnb_4bit_use_double_quant=True` — not wired for MPS yet +- ~34 min for 20 epochs of 96 examples on M4, ~5GB GPU peak. 
Full bank (~19k) would take hours + +### Path B: Train on Colab + +Upload to Google Drive: + +``` +My Drive/hunch-training/ +├── adapter_training_toolkit_v26_0_0/ # The extracted toolkit +├── prepare_data.py # From this directory +├── train_qlora_full.py # From this directory (for QLoRA) +├── tldr_bank.db # From ../bank/ +└── prompts.jsonl # From ../benchmark/ +``` + +Choose a notebook: + +| Notebook | GPU | Patches | +|----------|-----|---------| +| `train_qlora.ipynb` | T4 16GB (free) | 1 (rms_norm) | +| `train_lora_fp16.ipynb` | T4 16GB (free) | 3 (mmap, grad scaling, rms_norm) | +| `train_lora.ipynb` | A100 40GB (Pro) | None | + +Open in Colab via the VS Code extension or upload directly to [colab.research.google.com](https://colab.research.google.com). Run cells in order. + +### 3. Test on-device + +```bash +hunch --adapter path/to/hunch.fmadapter "find files changed in the last hour" +``` + +## Training Data + +`prepare_data.py` converts the hunch bank into training JSONL: + +```bash +python3 prepare_data.py # full bank (~19k train / ~3k eval) +python3 prepare_data.py --sources override # overrides only (~96 examples, recommended) +python3 prepare_data.py --sources tldr-osx # macOS-specific tldr pages (~1k) +python3 prepare_data.py --sources override,tldr-osx # overrides + macOS (~1.1k) +python3 prepare_data.py --stats # show dataset statistics +``` + +Each training example: +```json +[ + {"role": "system", "content": "Output a single shell command for zsh on macOS..."}, + {"role": "user", "content": "find files changed in the last hour"}, + {"role": "assistant", "content": "find . -mmin -60"} +] +``` + +- Benchmark prompts excluded to avoid data leakage +- Override and tldr-osx entries appear in both splits + +**Use `--sources override` for best results.** Adapters trained on ~96 curated overrides (~5 min on T4) significantly outperform adapters trained on the full 19k bank (~1.7 hours on T4). Quality over quantity — see README.md for benchmark results. + +## How Each Approach Works + +### QLoRA (recommended) + +Quantizes the frozen base model to 4-bit NF4 via `bitsandbytes`, and uses `mmap=True` loading to avoid the 12GB CPU RAM spike. Only `nn.Linear` layers are quantized (attention Q/K/V/O, FFN — ~90% of params). Embeddings, norms, and other layers stay in fp16. Adapters train in fp32. + +Memory breakdown: +- CPU RAM peak: **~1GB** (mmap reads weights from disk on demand) +- Base model Linear layers: ~1.5GB (NF4) +- Base model non-Linear: ~0.65GB (fp16) +- Adapters + gradients + optimizer: ~0.6GB (fp32) +- Activations: ~2-3GB +- **GPU total: ~5GB** + +Only one patch needed: rms_norm dtype fix for mixed fp16/fp32/quantized tensors through norm layers. + +### fp16 LoRA + +Forces the base model to fp16 and uses `mmap=True` loading. Both changes are patches to Apple's toolkit — the default loads fp32 without mmap, which requires ~24GB CPU RAM and 12GB GPU. Requires three patches total. 
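+
+The gist of the loading change (Patch 1, described below), as a rough sketch; `load_base_checkpoint` is an illustrative name, not the toolkit's actual API:
+
+```python
+import torch
+
+def load_base_checkpoint(model, checkpoint_path):
+    # The model itself is constructed with dtype=float16 (the other half of Patch 1).
+    # mmap=True streams tensors from disk on demand instead of materializing the
+    # full ~12GB state dict in CPU RAM (requires a zipfile-format checkpoint).
+    state_dict = torch.load(checkpoint_path, map_location="cpu", mmap=True)
+    model.load_state_dict(state_dict, strict=False)
+    # Trainable adapter parameters are cast back to fp32 so GradScaler sees fp32 gradients.
+    for param in model.parameters():
+        if param.requires_grad:
+            param.data = param.data.float()
+    return model
+```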
+ +Memory breakdown: +- CPU RAM peak: **~1GB** (mmap, vs ~24GB without) +- Base model: ~6GB (fp16, vs ~12GB fp32) +- Adapters + gradients + optimizer: ~0.6GB (fp32) +- Activations: ~2-3GB +- **GPU total: ~8.5GB** + +**Patch 1 — `utils.py`: mmap + fp16 model + fp32 adapters** +- `mmap=True` on `torch.load`: reads weights from disk on demand instead of loading 12GB into RAM +- `model_config.dtype = torch.float16`: creates the model in fp16 (6GB GPU instead of 12GB) +- Casts adapter weights back to fp32: GradScaler needs fp32 gradients + +**Patch 2 — `train_adapter.py`: gradient scaling for f16-mixed** +- Apple's code only enables GradScaler for a `"f16"` precision mode that isn't exposed as a CLI option +- When running with `f16-mixed` and an fp16 model, gradients overflow without scaling → loss = NaN +- Fix: enable GradScaler for `f16-mixed` too + +**Patch 3 — `tamm/layers/functional.py`: rms_norm dtype fix** +- `torch.rms_norm` requires input and weight to have the same dtype +- fp16 model has fp16 weights, but mixed-precision casts input to fp32 +- Fix: cast weight to match input dtype before calling rms_norm + +All patches are applied automatically by the notebook. To restore originals, re-copy from the toolkit on Drive. + +### Standard LoRA + +Loads the base model in fp32. No patches needed. The ~15GB GPU footprint barely fits a T4 (16GB) with no headroom, but loading crashes first — the 12GB checkpoint must be fully loaded into CPU RAM alongside the model, peaking at ~24GB. T4 only has 12GB system RAM. The A100 works because it has 80GB system RAM. + +Memory breakdown: +- CPU RAM peak: **~24GB** during loading (12GB model + 12GB state dict simultaneously — no mmap) +- Base model on GPU: ~12GB (fp32) +- Adapters + gradients + optimizer: ~0.6GB (fp32) +- Activations: ~2-3GB (fp32) +- **GPU total: ~15GB** + +The CPU RAM spike is why standard LoRA OOMs on a 24GB Mac and on T4 (12GB system RAM). The A100's 80GB system RAM hides this. fp16 LoRA and QLoRA avoid this with `mmap=True` loading (~1GB RAM peak instead of 24GB). + +## Export + +The export step packages the LoRA weights into a `.fmadapter` file that can be loaded on-device: + +```bash +cd adapter_training_toolkit_v26_0_0 +python3 -m export.export_fmadapter \ + --adapter-name hunch \ + --checkpoint ../checkpoints/adapter-final.pt \ + --output-dir ../exports/ +``` + +**Note for Mac training:** The training venv has PyTorch 2.11 (from bitsandbytes main) which is too new for coremltools. Export in a separate Python 3.12 environment — see Path A in Quick Start above. + +Output is ~130MB. The adapter name can only contain letters, numbers, and underscores. + +**Do not modify the export code** — the `.fmadapter` format must match exactly for on-device compatibility. + +The `.fmadapter` format doesn't record training precision — adapters trained via QLoRA, fp16 LoRA, or fp32 LoRA all export identically and load the same on-device. + +## Loading in Swift + +```swift +let adapter = try SystemLanguageModel.Adapter(fileURL: localURL) +let model = SystemLanguageModel(adapter: adapter) +let session = LanguageModelSession(model: model) +let response = try await session.respond(to: "find files changed in the last hour") +``` + +No entitlement needed for local testing. Entitlement required only for App Store distribution. 
+
+## Key Training Parameters
+
+| Parameter | Override-only (recommended) | Full bank |
+|-----------|---------------------------|-----------|
+| `--batch-size` | 8 | 8 |
+| `--learning-rate` | 1e-4 | 1e-4 |
+| `--epochs` | 20 | 3 |
+| `--sources` (prepare_data.py) | `override` | (default) |
+
+These apply to all three approaches (LoRA, fp16 LoRA, QLoRA). Override-only trains on ~96 examples and needs more epochs to converge. Full bank has ~19k examples and overfits after 3.
+
+## On-Device Accuracy
+
+All three approaches produce comparable adapters. QLoRA is recommended — same quality, lowest cost.
+
+| Approach | + Retrieval | Standalone | Trained on |
+|---|---|---|---|
+| QLoRA (Mac) | ~86% | ~76% | Local |
+| QLoRA (T4) | ~85% | ~74% | T4 free |
+| LoRA (A100) | ~85% | ~72.5% | A100 |
+| Retrieval only | ~79% | — | — |
+| Bare model | — | ~41% | — |
+
+Full benchmark details and analysis in README.md.
+
+## Known Issues
+
+### Adapter disk space leak
+
+`TGOnDeviceInferenceProviderService` caches a full copy of the adapter (~160MB) in a SIP-protected directory on every process invocation. The copies are never cleaned up. Running benchmarks (hundreds of adapter calls) can consume tens of GB invisibly.
+
+**Workaround:** Use `hunch --batch` to run multiple prompts in a single process (1 cached copy instead of 1 per prompt).
+
+To reclaim space, boot into Recovery Mode and run `rm -rf /Volumes/Data/private/var/db/AppleIntelligencePlatform/AppModelAssets/*`. The service recreates what it needs on the next adapter load.
+
+## Troubleshooting
+
+**OOM on T4 (QLoRA):** Make sure `bitsandbytes` is installed and the model is being quantized. Check for "Quantized 280 layers to NF4" in the output.
+
+**OOM on T4 (fp16 LoRA):** Make sure all three patches are applied. Run the patch cell before training.
+
+**loss = NaN:** The rms_norm patch didn't apply, or the pycache is stale. The notebook clears pycache automatically, but if you see NaN, restart the kernel and re-run from the patch cell.
+
+**Return code -9:** The OS killed the process for memory. On T4, this means system RAM (12GB) is full. Make sure mmap is patched (check for `mmap=True` in utils.py).
+
+**Adapter name error:** Use only letters, numbers, and underscores. No hyphens.
+
+**coremltools warnings:** Ignore them. The export works despite the warnings.
diff --git a/training/bench_mps.py b/training/bench_mps.py
new file mode 100644
index 0000000..a1bd6c4
--- /dev/null
+++ b/training/bench_mps.py
@@ -0,0 +1,191 @@
+#!/usr/bin/env python3
+"""
+Benchmark QLoRA training on MPS: Metal kernels vs CPU fallback.
+
+Measures load time, training throughput, and memory usage.
+Run with both bitsandbytes versions to compare:
+
+    # With Metal kernels (bitsandbytes from main)
+    python3 bench_mps.py --epochs 3 --label metal
+
+    # Without Metal kernels (bitsandbytes 0.49.2)
+    python3 bench_mps.py --epochs 3 --label cpu-fallback
+
+    # Longer sequences (override + tldr-osx)
+    python3 bench_mps.py --epochs 3 --sources override,tldr-osx --label metal-long
+
+Results are appended to bench_mps_results.jsonl for comparison.
+""" + +import sys +import os +import gc +import json +import time +import argparse +import psutil +from pathlib import Path + +TOOLKIT_DIR = os.path.join(os.path.dirname(os.path.abspath(__file__)), "adapter_training_toolkit_v26_0_0") +sys.path.insert(0, TOOLKIT_DIR) +TRAINING_DIR = Path(__file__).parent + +import torch +import torch.nn as nn +from torch.utils.data import DataLoader + + +def mem_stats(): + ram = psutil.Process().memory_info().rss / 1024**3 + gpu = 0 + if torch.backends.mps.is_available(): + gpu = torch.mps.current_allocated_memory() / 1024**3 + elif torch.cuda.is_available(): + gpu = torch.cuda.memory_allocated() / 1024**3 + cpu_pct = psutil.cpu_percent(interval=None) + return {"ram_gb": round(ram, 2), "gpu_gb": round(gpu, 2), "cpu_pct": cpu_pct} + + +def main(): + parser = argparse.ArgumentParser() + parser.add_argument("--epochs", type=int, default=3) + parser.add_argument("--batch-size", type=int, default=8) + parser.add_argument("--sources", default="override") + parser.add_argument("--label", required=True, help="Label for this run (e.g. 'metal', 'cpu-fallback')") + parser.add_argument("--repeat", type=int, default=1, help="Number of full runs to average") + args = parser.parse_args() + + # Check bitsandbytes version + import bitsandbytes as bnb + bnb_version = getattr(bnb, '__version__', 'unknown') + print(f"bitsandbytes: {bnb_version}") + print(f"Label: {args.label}") + print(f"Sources: {args.sources}") + print(f"Epochs: {args.epochs}, Batch: {args.batch_size}, Repeats: {args.repeat}") + print() + + # Prepare data if needed + train_path = TRAINING_DIR / "train.jsonl" + if not train_path.exists(): + os.system(f"cd {TRAINING_DIR} && python3 prepare_data.py --sources {args.sources}") + else: + # Regenerate with correct sources + os.system(f"cd {TRAINING_DIR} && python3 prepare_data.py --sources {args.sources}") + + # Import training components + from train_qlora_full import ( + CommandDataset, collate_fn, load_model_qlora, patch_rms_norm, + train_epoch, evaluate + ) + from tamm.tokenizers.afm import AFMTokenizer + + results = [] + + for run in range(1, args.repeat + 1): + print(f"{'='*60}") + print(f" Run {run}/{args.repeat}") + print(f"{'='*60}") + + # Start CPU monitoring + psutil.cpu_percent(interval=None) # reset + + # Phase 1: Load & quantize + t_load_start = time.time() + patch_rms_norm() + device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cuda") + model = load_model_qlora(device) + t_load = time.time() - t_load_start + mem_after_load = mem_stats() + print(f" Load+quantize: {t_load:.1f}s | {mem_after_load}") + + # Phase 2: Setup data + tokenizer = AFMTokenizer(str(Path(TOOLKIT_DIR) / "assets" / "tokenizer.model")) + train_dataset = CommandDataset(str(train_path), tokenizer) + eval_dataset = CommandDataset(str(TRAINING_DIR / "eval.jsonl"), tokenizer) + train_loader = DataLoader(train_dataset, batch_size=args.batch_size, shuffle=True, collate_fn=collate_fn) + eval_loader = DataLoader(eval_dataset, batch_size=args.batch_size, collate_fn=collate_fn) + print(f" Data: {len(train_dataset)} train, {len(eval_dataset)} eval, {len(train_loader)} batches/epoch") + + # Phase 3: Train + optimizer = torch.optim.AdamW( + [p for p in model.parameters() if p.requires_grad], + lr=1e-4, weight_decay=0.01 + ) + scaler = torch.amp.GradScaler(device=str(device)) if (torch.cuda.is_available() or torch.backends.mps.is_available()) else None + + epoch_times = [] + epoch_losses = [] + mem_during_training = [] + + for epoch in range(args.epochs): + 
t_epoch_start = time.time() + train_loss = train_epoch(model, train_loader, optimizer, device, epoch, scaler) + t_epoch = time.time() - t_epoch_start + epoch_times.append(t_epoch) + epoch_losses.append(train_loss) + mem = mem_stats() + mem_during_training.append(mem) + + batches = len(train_loader) + it_s = batches / t_epoch + s_it = t_epoch / batches + print(f" Epoch {epoch+1}: {t_epoch:.1f}s ({s_it:.2f}s/it, {it_s:.2f}it/s) loss={train_loss:.4f} | {mem}") + + # Phase 4: Eval + t_eval_start = time.time() + eval_loss = evaluate(model, eval_loader, device) + t_eval = time.time() - t_eval_start + print(f" Eval: {t_eval:.1f}s loss={eval_loss:.4f}") + + total_time = t_load + sum(epoch_times) + t_eval + avg_epoch = sum(epoch_times) / len(epoch_times) + avg_it_s = len(train_loader) / avg_epoch + avg_s_it = avg_epoch / len(train_loader) + + run_result = { + "label": args.label, + "run": run, + "bnb_version": bnb_version, + "sources": args.sources, + "epochs": args.epochs, + "batch_size": args.batch_size, + "train_examples": len(train_dataset), + "batches_per_epoch": len(train_loader), + "load_time_s": round(t_load, 1), + "avg_epoch_s": round(avg_epoch, 1), + "avg_s_per_it": round(avg_s_it, 2), + "avg_it_per_s": round(avg_it_s, 2), + "total_time_s": round(total_time, 1), + "final_train_loss": round(epoch_losses[-1], 4), + "eval_loss": round(eval_loss, 4), + "mem_after_load": mem_after_load, + "mem_training": mem_during_training[-1], + "epoch_times": [round(t, 1) for t in epoch_times], + } + results.append(run_result) + + print(f"\n Summary: {avg_s_it:.2f}s/it ({avg_it_s:.2f}it/s), total {total_time:.0f}s") + print() + + # Cleanup for next run + del model, optimizer, scaler, train_loader, eval_loader + gc.collect() + if torch.backends.mps.is_available(): + torch.mps.empty_cache() + + # Save results + results_file = TRAINING_DIR / "bench_mps_results.jsonl" + with open(results_file, "a") as f: + for r in results: + f.write(json.dumps(r) + "\n") + print(f"Results appended to {results_file}") + + # Print comparison-ready summary + if len(results) > 1: + avg_it = sum(r["avg_s_per_it"] for r in results) / len(results) + avg_total = sum(r["total_time_s"] for r in results) / len(results) + print(f"\nAverage across {len(results)} runs: {avg_it:.2f}s/it, {avg_total:.0f}s total") + + +if __name__ == "__main__": + main() diff --git a/training/prepare_data.py b/training/prepare_data.py new file mode 100644 index 0000000..dea923c --- /dev/null +++ b/training/prepare_data.py @@ -0,0 +1,193 @@ +#!/usr/bin/env python3 +"""Convert the hunch bank into training data for Apple FM adapter training. + +Produces JSONL files in the format expected by Apple's adapter training toolkit: + [{"role": "system", "content": "..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}] + +Usage: + python3 prepare_data.py # generate train.jsonl + eval.jsonl + python3 prepare_data.py --stats # show dataset statistics + python3 prepare_data.py --eval-split 0.1 # 10% eval split (default) +""" + +import json +import sqlite3 +import random +import argparse +from pathlib import Path + +BANK_DB = Path(__file__).parent.parent / "bank" / "tldr_bank.db" +BENCHMARK_PROMPTS = Path(__file__).parent.parent / "benchmark" / "prompts.jsonl" +TRAIN_FILE = Path(__file__).parent / "train.jsonl" +EVAL_FILE = Path(__file__).parent / "eval.jsonl" + +SYSTEM_PROMPT = "Output a single shell command for zsh on macOS. No explanation, no markdown, no backticks. Just the command." 
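+# One fixed system prompt for every example; mirrors the single-command output the CLI expects at inference time.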
+ + +def load_bank(): + """Load all Q/A pairs from the bank.""" + conn = sqlite3.connect(str(BANK_DB)) + rows = conn.execute( + "SELECT question, answer, cmd, source FROM bank" + ).fetchall() + conn.close() + return [{"q": q, "a": a, "cmd": cmd, "source": src} for q, a, cmd, src in rows] + + +def load_benchmark_prompts(): + """Load benchmark prompts to exclude from training data.""" + if not BENCHMARK_PROMPTS.exists(): + return set() + prompts = set() + with open(BENCHMARK_PROMPTS) as f: + for line in f: + p = json.loads(line) + prompts.add(p["prompt"].lower().strip()) + return prompts + + +def to_training_example(entry): + """Convert a bank entry to Apple FM training format.""" + return [ + {"role": "system", "content": SYSTEM_PROMPT}, + {"role": "user", "content": entry["q"]}, + {"role": "assistant", "content": entry["a"]}, + ] + + +def prepare_dataset(eval_split=0.1, exclude_benchmark=True, seed=42, sources=None): + """Prepare train/eval splits from the bank. + + Args: + sources: filter by source. Options: + None or "all" — everything (default) + "override" — overrides only (~130 examples) + "macos" — overrides + tldr-osx (~1k examples) + "override,tldr-osx" — comma-separated list + """ + bank = load_bank() + print(f"Loaded {len(bank)} entries from bank") + + # Filter by source if specified + if sources and sources != "all": + allowed = set(s.strip() for s in sources.split(",")) + # "macos" is a shorthand for override + tldr-osx + if "macos" in allowed: + allowed.discard("macos") + allowed.update(["override", "tldr-osx"]) + before = len(bank) + bank = [e for e in bank if e["source"] in allowed] + print(f"Filtered to sources {allowed}: {len(bank)} entries (from {before})") + + # Count by source + by_source = {} + for entry in bank: + by_source[entry["source"]] = by_source.get(entry["source"], 0) + 1 + for src, count in sorted(by_source.items()): + print(f" {src}: {count}") + + # Exclude benchmark prompts from training to avoid data leakage + if exclude_benchmark: + benchmark = load_benchmark_prompts() + before = len(bank) + bank = [e for e in bank if e["q"].lower().strip() not in benchmark] + excluded = before - len(bank) + print(f"Excluded {excluded} entries matching benchmark prompts") + + # Deduplicate by (question, answer) + seen = set() + unique = [] + for entry in bank: + key = (entry["q"].lower().strip(), entry["a"].strip()) + if key not in seen: + seen.add(key) + unique.append(entry) + print(f"After dedup: {len(unique)} unique entries (removed {len(bank) - len(unique)})") + bank = unique + + # Split into train/eval + random.seed(seed) + random.shuffle(bank) + eval_size = max(int(len(bank) * eval_split), 1) + eval_data = bank[:eval_size] + train = bank[eval_size:] + + # For small datasets, put everything in both + if len(bank) < 500: + train = bank + eval_data = bank + print(f"Small dataset — using all {len(bank)} examples for both train and eval") + else: + print(f"\nDataset split:") + print(f" Train: {len(train)} examples") + print(f" Eval: {len(eval_data)} examples") + + return train, eval_data + + +def write_jsonl(data, path): + """Write training data in Apple FM format.""" + with open(path, "w") as f: + for entry in data: + example = to_training_example(entry) + f.write(json.dumps(example) + "\n") + print(f"Wrote {len(data)} examples to {path}") + + +def show_stats(data, label): + """Show dataset statistics.""" + by_source = {} + by_cmd = {} + total_q_len = 0 + total_a_len = 0 + + for entry in data: + by_source[entry["source"]] = by_source.get(entry["source"], 0) + 1 + 
by_cmd[entry["cmd"]] = by_cmd.get(entry["cmd"], 0) + 1 + total_q_len += len(entry["q"]) + total_a_len += len(entry["a"]) + + print(f"\n{label} ({len(data)} examples):") + print(f" By source:") + for src, count in sorted(by_source.items(), key=lambda x: -x[1]): + print(f" {src}: {count}") + print(f" Unique commands: {len(by_cmd)}") + print(f" Avg question length: {total_q_len / len(data):.0f} chars") + print(f" Avg answer length: {total_a_len / len(data):.0f} chars") + print(f" Top commands:") + for cmd, count in sorted(by_cmd.items(), key=lambda x: -x[1])[:10]: + print(f" {cmd}: {count}") + + +def main(): + parser = argparse.ArgumentParser() + parser.add_argument("--eval-split", type=float, default=0.1) + parser.add_argument("--stats", action="store_true") + parser.add_argument("--no-exclude-benchmark", action="store_true") + parser.add_argument("--sources", default=None, help="Filter sources: override, macos, tldr-osx, tldr-common, or all") + args = parser.parse_args() + + train, eval_data = prepare_dataset( + eval_split=args.eval_split, + exclude_benchmark=not args.no_exclude_benchmark, + sources=args.sources, + ) + + if args.stats: + show_stats(train, "Train") + show_stats(eval_data, "Eval") + else: + write_jsonl(train, TRAIN_FILE) + write_jsonl(eval_data, EVAL_FILE) + + # Show a few examples + print("\nSample training examples:") + for entry in train[:3]: + ex = to_training_example(entry) + print(f" user: {ex[1]['content'][:60]}") + print(f" asst: {ex[2]['content'][:60]}") + print() + + +if __name__ == "__main__": + main() diff --git a/training/train_lora.ipynb b/training/train_lora.ipynb new file mode 100644 index 0000000..2291e72 --- /dev/null +++ b/training/train_lora.ipynb @@ -0,0 +1,322 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# LoRA: Training Apple's 3B Model on A100\n", + "\n", + "Standard LoRA training using Apple's adapter toolkit. Requires A100 (40GB GPU).\n", + "For free T4 training, see `train_qlora.ipynb`." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1. 
Setup\n", + "\n", + "Upload to `My Drive/hunch-training/`:\n", + "- `adapter_training_toolkit_v26_0_0/` (from developer.apple.com)\n", + "- `prepare_data.py`, `tldr_bank.db`, `prompts.jsonl`" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Mounted at /content/drive\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m2.3/2.3 MB\u001b[0m \u001b[31m117.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m362.6/362.6 kB\u001b[0m \u001b[31m40.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m73.1/73.1 kB\u001b[0m \u001b[31m10.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m46.0/46.0 kB\u001b[0m \u001b[31m5.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m86.8/86.8 kB\u001b[0m \u001b[31m11.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.6/1.6 MB\u001b[0m \u001b[31m93.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hCUDA: True\n", + "GPU: NVIDIA A100-SXM4-40GB\n" + ] + } + ], + "source": [ + "from google.colab import drive\n", + "drive.mount('/content/drive')\n", + "\n", + "DRIVE_DIR = '/content/drive/MyDrive/hunch-training'\n", + "WORK_DIR = '/content/hunch-training'\n", + "\n", + "!mkdir -p {WORK_DIR}\n", + "!cp -r {DRIVE_DIR}/adapter_training_toolkit_v26_0_0 {WORK_DIR}/\n", + "!cp {DRIVE_DIR}/prepare_data.py {WORK_DIR}/\n", + "!mkdir -p {WORK_DIR}/../bank {WORK_DIR}/../benchmark\n", + "!cp {DRIVE_DIR}/tldr_bank.db {WORK_DIR}/../bank/\n", + "!cp {DRIVE_DIR}/prompts.jsonl {WORK_DIR}/../benchmark/\n", + "\n", + "!cd {WORK_DIR}/adapter_training_toolkit_v26_0_0 && pip install -r requirements.txt -q\n", + "\n", + "import torch\n", + "print(f'CUDA: {torch.cuda.is_available()}')\n", + "if torch.cuda.is_available():\n", + " print(f'GPU: {torch.cuda.get_device_name(0)}')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2. Prepare training data" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Loaded 21478 entries from bank\n", + "Filtered to sources {'override'}: 134 entries (from 21478)\n", + " override: 134\n", + "Excluded 38 entries matching benchmark prompts\n", + "After dedup: 96 unique entries (removed 0)\n", + "Small dataset — using all 96 examples for both train and eval\n", + "Wrote 96 examples to /content/hunch-training/train.jsonl\n", + "Wrote 96 examples to /content/hunch-training/eval.jsonl\n", + "\n", + "Sample training examples:\n", + " user: show response headers\n", + " asst: curl -I https://example.com\n", + "\n", + " user: dns lookup for a domain\n", + " asst: dig example.com\n", + "\n", + " user: record shell session to file\n", + " asst: script session.log\n", + "\n" + ] + } + ], + "source": [ + "!cd {WORK_DIR} && python3 prepare_data.py --sources override" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3. Train\n", + "\n", + "No patches needed for A100. ~25 min/epoch, ~1.5 hours total.\n", + "\n", + "**Note:** lr=1e-3 diverged in testing. 
Use 1e-4." + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Fine-tuning adapters with configuration: \n", + "AdapterTrainingConfiguration(epochs=20, learning_rate=0.0001, batch_size=8, linear_warmup_epochs=1, gradient_accumulation_steps=1, enable_activation_checkpointing=True, precision='bf16-mixed', compile_model=False, weight_decay=0.01, clip_grad_norm=1.0, max_sequence_length=None, fixed_sized_sequences=False, pack_sequences=False, loss_update_frequency=3)\n", + "Loading base model on cuda with precision torch.float32\n", + "/usr/local/lib/python3.12/dist-packages/tamm/layers/flash_attention.py:78: UserWarning: Failed to import flash-attn for Flash attention. Using flash attention may lead to significantly faster training. Please refer to tamm-scripts/install_flash_attn.sh for instructions.\n", + " _warnings.warn(\n", + "Total parameters 3178001792\n", + "Total trainable parameters 66633728\n", + "Gradient scaling is enabled: False\n", + "Epoch 1/20\n", + "Training: 100% 12/12 [00:08<00:00, 1.42it/s, loss=1.64]\n", + "Evaluation: 100% 12/12 [00:02<00:00, 5.12it/s, loss=1.05]\n", + "Epoch 2/20\n", + "INFO:examples.utils:Epoch 2/20\n", + "Training: 100% 12/12 [00:07<00:00, 1.69it/s, loss=0.795]\n", + "Evaluation: 100% 12/12 [00:02<00:00, 5.26it/s, loss=0.32] \n", + "Epoch 3/20\n", + "INFO:examples.utils:Epoch 3/20\n", + "Training: 100% 12/12 [00:07<00:00, 1.68it/s, loss=0.283]\n", + "Evaluation: 100% 12/12 [00:02<00:00, 5.27it/s, loss=0.116]\n", + "Epoch 4/20\n", + "INFO:examples.utils:Epoch 4/20\n", + "Training: 100% 12/12 [00:07<00:00, 1.69it/s, loss=0.0817]\n", + "Evaluation: 100% 12/12 [00:02<00:00, 5.25it/s, loss=0.0388] \n", + "Epoch 5/20\n", + "INFO:examples.utils:Epoch 5/20\n", + "Training: 100% 12/12 [00:06<00:00, 1.72it/s, loss=0.0895]\n", + "Evaluation: 100% 12/12 [00:02<00:00, 5.27it/s, loss=0.0261] \n", + "Epoch 6/20\n", + "INFO:examples.utils:Epoch 6/20\n", + "Training: 100% 12/12 [00:07<00:00, 1.69it/s, loss=0.0223]\n", + "Evaluation: 100% 12/12 [00:02<00:00, 5.28it/s, loss=0.0127]\n", + "Epoch 7/20\n", + "INFO:examples.utils:Epoch 7/20\n", + "Training: 100% 12/12 [00:07<00:00, 1.68it/s, loss=0.0104]\n", + "Evaluation: 100% 12/12 [00:02<00:00, 5.23it/s, loss=0.0147]\n", + "Epoch 8/20\n", + "INFO:examples.utils:Epoch 8/20\n", + "Training: 100% 12/12 [00:07<00:00, 1.70it/s, loss=0.0163] \n", + "Evaluation: 100% 12/12 [00:02<00:00, 5.37it/s, loss=0.00656]\n", + "Epoch 9/20\n", + "INFO:examples.utils:Epoch 9/20\n", + "Training: 100% 12/12 [00:07<00:00, 1.70it/s, loss=0.0194] \n", + "Evaluation: 100% 12/12 [00:02<00:00, 5.26it/s, loss=0.000864]\n", + "Epoch 10/20\n", + "INFO:examples.utils:Epoch 10/20\n", + "Training: 100% 12/12 [00:07<00:00, 1.68it/s, loss=0.000877]\n", + "Evaluation: 100% 12/12 [00:02<00:00, 5.29it/s, loss=0.000607]\n", + "Epoch 11/20\n", + "INFO:examples.utils:Epoch 11/20\n", + "Training: 100% 12/12 [00:07<00:00, 1.69it/s, loss=0.000526]\n", + "Evaluation: 100% 12/12 [00:02<00:00, 5.22it/s, loss=0.000396]\n", + "Epoch 12/20\n", + "INFO:examples.utils:Epoch 12/20\n", + "Training: 100% 12/12 [00:07<00:00, 1.70it/s, loss=0.000395]\n", + "Evaluation: 100% 12/12 [00:02<00:00, 5.32it/s, loss=0.000287]\n", + "Epoch 13/20\n", + "INFO:examples.utils:Epoch 13/20\n", + "Training: 100% 12/12 [00:07<00:00, 1.71it/s, loss=0.00031]\n", + "Evaluation: 100% 12/12 [00:02<00:00, 5.26it/s, loss=0.000221]\n", + "Epoch 14/20\n", + "INFO:examples.utils:Epoch 14/20\n", + 
"Training: 100% 12/12 [00:07<00:00, 1.70it/s, loss=0.000229]\n", + "Evaluation: 100% 12/12 [00:02<00:00, 5.34it/s, loss=0.000198]\n", + "Epoch 15/20\n", + "INFO:examples.utils:Epoch 15/20\n", + "Training: 100% 12/12 [00:06<00:00, 1.72it/s, loss=0.000201]\n", + "Evaluation: 100% 12/12 [00:02<00:00, 5.30it/s, loss=0.000169]\n", + "Epoch 16/20\n", + "INFO:examples.utils:Epoch 16/20\n", + "Training: 100% 12/12 [00:07<00:00, 1.70it/s, loss=0.000196]\n", + "Evaluation: 100% 12/12 [00:02<00:00, 5.19it/s, loss=0.000161]\n", + "Epoch 17/20\n", + "INFO:examples.utils:Epoch 17/20\n", + "Training: 100% 12/12 [00:07<00:00, 1.69it/s, loss=0.000155]\n", + "Evaluation: 100% 12/12 [00:02<00:00, 5.25it/s, loss=0.000165]\n", + "Epoch 18/20\n", + "INFO:examples.utils:Epoch 18/20\n", + "Training: 100% 12/12 [00:07<00:00, 1.68it/s, loss=0.000159]\n", + "Evaluation: 100% 12/12 [00:02<00:00, 5.22it/s, loss=0.000159]\n", + "Epoch 19/20\n", + "INFO:examples.utils:Epoch 19/20\n", + "Training: 100% 12/12 [00:07<00:00, 1.70it/s, loss=0.00016] \n", + "Evaluation: 100% 12/12 [00:02<00:00, 5.30it/s, loss=0.000156]\n", + "Epoch 20/20\n", + "INFO:examples.utils:Epoch 20/20\n", + "Training: 100% 12/12 [00:07<00:00, 1.69it/s, loss=0.000163]\n", + "Evaluation: 100% 12/12 [00:02<00:00, 5.28it/s, loss=0.000156]\n" + ] + } + ], + "source": [ + "!cd {WORK_DIR}/adapter_training_toolkit_v26_0_0 && python3 -m examples.train_adapter \\\n", + " --train-data ../train.jsonl \\\n", + " --eval-data ../eval.jsonl \\\n", + " --epochs 20 \\\n", + " --learning-rate 1e-4 \\\n", + " --batch-size 8 \\\n", + " --precision bf16-mixed \\\n", + " --activation-checkpointing \\\n", + " --checkpoint-dir ../lora-override-checkpoints/" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 4. Save checkpoints" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Checkpoints saved to Drive\n" + ] + } + ], + "source": [ + "!cp -r {WORK_DIR}/lora-override-checkpoints {DRIVE_DIR}/lora-override-checkpoints\n", + "!echo 'Checkpoints saved to Drive'" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 5. Export .fmadapter" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "scikit-learn version 1.6.1 is not supported. Minimum required version: 0.17. Maximum required version: 1.5.1. Disabling scikit-learn conversion API.\n", + "XGBoost version 3.2.0 has not been tested with coremltools. You may run into unexpected errors. XGBoost 1.4.2 is the most recent version that has been tested.\n", + "2026-04-15 17:21:10.930166: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. 
To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.\n", + "2026-04-15 17:21:10.949095: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n", + "WARNING: All log messages before absl::InitializeLog() is called are written to STDERR\n", + "E0000 00:00:1776273670.972532 4269 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n", + "E0000 00:00:1776273670.980305 4269 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n", + "W0000 00:00:1776273671.000652 4269 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.\n", + "W0000 00:00:1776273671.000678 4269 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.\n", + "W0000 00:00:1776273671.000681 4269 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.\n", + "W0000 00:00:1776273671.000684 4269 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.\n", + "2026-04-15 17:21:11.005962: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.\n", + "To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.\n", + "TensorFlow version 2.19.0 has not been tested with coremltools. You may run into unexpected errors. TensorFlow 2.12.0 is the most recent version that has been tested.\n", + "Torch version 2.10.0+cu128 has not been tested with coremltools. You may run into unexpected errors. 
Torch 2.5.0 is the most recent version that has been tested.\n", + "WARNING:coremltools:Failed to load _MLModelProxy: No module named 'coremltools.libcoremlpython'\n", + "WARNING:coremltools:Failed to load _MLCPUComputeDeviceProxy: No module named 'coremltools.libcoremlpython'\n", + "WARNING:coremltools:Failed to load _MLGPUComputeDeviceProxy: No module named 'coremltools.libcoremlpython'\n", + "WARNING:coremltools:Failed to load _MLNeuralEngineComputeDeviceProxy: No module named 'coremltools.libcoremlpython'\n", + "WARNING:coremltools:Failed to load _MLModelProxy: No module named 'coremltools.libcoremlpython'\n", + "WARNING:coremltools:Failed to load _MLComputePlanProxy: No module named 'coremltools.libcoremlpython'\n", + "WARNING:coremltools:Failed to load _MLModelProxy: No module named 'coremltools.libcoremlpython'\n", + "WARNING:coremltools:Failed to load _MLModelAssetProxy: No module named 'coremltools.libcoremlpython'\n", + "total 4.0K\n", + "drwxr-xr-x 2 root root 4.0K Apr 15 17:21 hunch.fmadapter\n", + "Adapter exported and saved to Drive\n" + ] + } + ], + "source": [ + "!cd {WORK_DIR}/adapter_training_toolkit_v26_0_0 && python3 -m export.export_fmadapter \\\n", + " --adapter-name hunch \\\n", + " --checkpoint ../lora-override-checkpoints/adapter-final.pt \\\n", + " --output-dir ../lora-override-exports/\n", + "\n", + "!ls -lh {WORK_DIR}/lora-override-exports/\n", + "!cp -r {WORK_DIR}/lora-override-exports {DRIVE_DIR}/lora-override-exports\n", + "!echo 'Adapter exported and saved to Drive'" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/training/train_lora_fp16.ipynb b/training/train_lora_fp16.ipynb new file mode 100644 index 0000000..f8e498f --- /dev/null +++ b/training/train_lora_fp16.ipynb @@ -0,0 +1,228 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": "# fp16 LoRA: Training Apple's 3B Model on a Free T4\n\nThree patches to Apple's adapter training toolkit enable training on Colab's free T4 GPU (16GB):\n\n1. **mmap loading** — reads weights from disk on demand, avoids 12GB system RAM spike\n2. **fp16 model + fp32 adapters** — halves GPU memory from 12GB to 6GB\n3. **rms_norm fix + gradient scaling** — fixes dtype mismatches that cause NaN\n\nResult: ~2 hours training on free T4. This is half-precision LoRA (fp16 base), not true QLoRA (4-bit NF4)." + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1. 
Setup\n", + "\n", + "Upload to `My Drive/hunch-training/`:\n", + "- `adapter_training_toolkit_v26_0_0/` (from developer.apple.com)\n", + "- `prepare_data.py`, `tldr_bank.db`, `prompts.jsonl`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from google.colab import drive\n", + "drive.mount('/content/drive')\n", + "\n", + "DRIVE_DIR = '/content/drive/MyDrive/hunch-training'\n", + "WORK_DIR = '/content/hunch-training'\n", + "\n", + "!mkdir -p {WORK_DIR}\n", + "!cp -r {DRIVE_DIR}/adapter_training_toolkit_v26_0_0 {WORK_DIR}/\n", + "!cp {DRIVE_DIR}/prepare_data.py {WORK_DIR}/\n", + "!mkdir -p {WORK_DIR}/../bank {WORK_DIR}/../benchmark\n", + "!cp {DRIVE_DIR}/tldr_bank.db {WORK_DIR}/../bank/\n", + "!cp {DRIVE_DIR}/prompts.jsonl {WORK_DIR}/../benchmark/\n", + "\n", + "!cd {WORK_DIR}/adapter_training_toolkit_v26_0_0 && pip install -r requirements.txt -q\n", + "\n", + "import torch\n", + "print(f'CUDA: {torch.cuda.is_available()}')\n", + "if torch.cuda.is_available():\n", + " print(f'GPU: {torch.cuda.get_device_name(0)}')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2. Prepare training data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!cd {WORK_DIR} && python3 prepare_data.py" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3. Apply patches\n", + "\n", + "Three patches make training fit on T4 (16GB GPU, 12GB RAM):\n", + "\n", + "**Patch 1 — `utils.py`:** mmap loading (0 RAM), fp16 model (6GB GPU), fp32 adapters (stable gradients)\n", + "\n", + "**Patch 2 — `train_adapter.py`:** enable gradient scaling for f16-mixed (prevents NaN overflow)\n", + "\n", + "**Patch 3 — `tamm/layers/functional.py`:** cast rms_norm weight to match input dtype (prevents NaN from dtype mismatch)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import glob, shutil\n", + "\n", + "# --- Patch 1: utils.py ---\n", + "# Restore clean copy first\n", + "!cp {DRIVE_DIR}/adapter_training_toolkit_v26_0_0/examples/utils.py \\\n", + " {WORK_DIR}/adapter_training_toolkit_v26_0_0/examples/utils.py\n", + "\n", + "utils_path = f'{WORK_DIR}/adapter_training_toolkit_v26_0_0/examples/utils.py'\n", + "code = open(utils_path).read()\n", + "\n", + "# 1a: Force fp16 model creation (6GB instead of 12GB on GPU)\n", + "code = code.replace(\n", + " 'model_config.dtype = dtype or model_config.dtype',\n", + " 'model_config.dtype = torch.float16'\n", + ")\n", + "\n", + "# 1b: mmap loading (weights stay on disk, ~0 system RAM)\n", + "code = code.replace(\n", + " ''' with Path(base_model_checkpoint_path).open(\"rb\") as f:\\n sd = torch.load(f, map_location=device, weights_only=False)\\n _ = model.load_state_dict(sd, strict=True)''',\n", + " ''' sd = torch.load(str(base_model_checkpoint_path), map_location=device, mmap=True, weights_only=False)\\n _ = model.load_state_dict(sd, strict=True)\\n del sd; import gc; gc.collect()'''\n", + ")\n", + "\n", + "# 1c: Keep adapter weights in fp32 (GradScaler needs fp32 gradients)\n", + "code = code.replace(\n", + " ' return model.to(device=device, dtype=model_config.dtype)',\n", + " ''' model = model.to(device=device, dtype=model_config.dtype)\n", + "\n", + " # Keep adapter weights in fp32 for stable training\n", + " for name, parameter in model.named_parameters():\n", + " if \"adapter\" in name:\n", + " parameter.data = parameter.data.float()\n", + "\n", 
+ " return model'''\n", + ")\n", + "\n", + "open(utils_path, 'w').write(code)\n", + "print('Patch 1 applied: utils.py (mmap + fp16 + fp32 adapters)')\n", + "\n", + "# --- Patch 2: train_adapter.py ---\n", + "!cp {DRIVE_DIR}/adapter_training_toolkit_v26_0_0/examples/train_adapter.py \\\n", + " {WORK_DIR}/adapter_training_toolkit_v26_0_0/examples/train_adapter.py\n", + "\n", + "ta_path = f'{WORK_DIR}/adapter_training_toolkit_v26_0_0/examples/train_adapter.py'\n", + "code = open(ta_path).read()\n", + "code = code.replace(\n", + " 'return self.precision == \"f16\"',\n", + " 'return self.precision in (\"f16\", \"f16-mixed\")'\n", + ")\n", + "open(ta_path, 'w').write(code)\n", + "print('Patch 2 applied: train_adapter.py (gradient scaling for f16-mixed)')\n", + "\n", + "# --- Patch 3: tamm rms_norm ---\n", + "norm_files = glob.glob(f'{WORK_DIR}/**/tamm/layers/functional.py', recursive=True)\n", + "norm_files += glob.glob('/usr/local/lib/**/tamm/layers/functional.py', recursive=True)\n", + "for nf in norm_files:\n", + " code = open(nf).read()\n", + " if 'weight.to(tensor.dtype)' not in code:\n", + " old = ' tensor = _torch_compatibility.rms_norm(\\n tensor, normalized_shape=normalized_shape, weight=weight, eps=eps\\n )'\n", + " new = ' if weight is not None and weight.dtype != tensor.dtype:\\n weight = weight.to(tensor.dtype)\\n tensor = _torch_compatibility.rms_norm(\\n tensor, normalized_shape=normalized_shape, weight=weight, eps=eps\\n )'\n", + " code = code.replace(old, new)\n", + " open(nf, 'w').write(code)\n", + " print(f'Patch 3 applied: {nf} (rms_norm dtype fix)')\n", + " else:\n", + " print(f'Patch 3 already applied: {nf}')\n", + "\n", + "# Clear pycache\n", + "for d in glob.glob(f'{WORK_DIR}/**/tamm/**/__pycache__', recursive=True):\n", + " shutil.rmtree(d, ignore_errors=True)\n", + "for d in glob.glob('/usr/local/lib/**/tamm/**/__pycache__', recursive=True):\n", + " shutil.rmtree(d, ignore_errors=True)\n", + "print('\\nAll patches applied. Ready to train.')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 4. Train\n", + "\n", + "~40 min/epoch on T4, ~2 hours total for 3 epochs." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": "!cd {WORK_DIR}/adapter_training_toolkit_v26_0_0 && python3 -m examples.train_adapter \\\n --train-data ../train.jsonl \\\n --eval-data ../eval.jsonl \\\n --epochs 3 \\\n --learning-rate 1e-4 \\\n --batch-size 8 \\\n --precision f16-mixed \\\n --activation-checkpointing \\\n --checkpoint-dir ../fp16-lora-checkpoints/" + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 5. Save checkpoints" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": "!cp -r {WORK_DIR}/fp16-lora-checkpoints {DRIVE_DIR}/\n!echo 'Checkpoints saved to Drive'" + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 6. 
Evaluate" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": "import json, subprocess\n\ntest_prompts = [\n 'find files changed in the last hour',\n 'show disk usage',\n 'generate a random password',\n 'kill a process by name',\n 'show http headers of a url',\n 'record terminal session',\n 'find files larger than 100mb',\n 'convert image to different format',\n 'show all listening ports',\n 'find files modified in the last 7 days',\n 'find files owned by root',\n 'count lines in all python files',\n 'show all environment variables',\n 'clear the terminal',\n 'compare two files',\n]\n\nsystem = 'Output a single shell command for zsh on macOS. No explanation, no markdown, no backticks. Just the command.'\n\nwith open(f'{WORK_DIR}/test_prompts.jsonl', 'w') as f:\n for p in test_prompts:\n f.write(json.dumps([\n {'role': 'system', 'content': system},\n {'role': 'user', 'content': p}\n ]) + '\\n')\n\nresult = subprocess.run(\n ['python3', '-m', 'examples.generate',\n '--prompt', '../test_prompts.jsonl',\n '--checkpoint', '../fp16-lora-checkpoints/adapter-final.pt',\n '--precision', 'f16-mixed'],\n capture_output=True, text=True,\n cwd=f'{WORK_DIR}/adapter_training_toolkit_v26_0_0'\n)\n\nlines = (result.stdout + result.stderr).strip().split('\\n')\nidx = 0\nfor line in lines:\n if 'Response for prompt' in line:\n answer = line.split(': ', 2)[-1].replace('', '').strip()\n prompt = test_prompts[idx] if idx < len(test_prompts) else '?'\n print(f'Q: {prompt:<45} A: {answer}')\n idx += 1\n\nif idx == 0:\n print('No output. Check error:')\n print('STDERR:', result.stderr[-500:])\n print('Return code:', result.returncode)" + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 7. Export .fmadapter" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": "!cd {WORK_DIR}/adapter_training_toolkit_v26_0_0 && python3 -m export.export_fmadapter \\\n --adapter-name hunch_fp16 \\\n --checkpoint ../fp16-lora-checkpoints/adapter-final.pt \\\n --output-dir ../fp16-lora-exports/\n\n!ls -lh {WORK_DIR}/fp16-lora-exports/\n!cp -r {WORK_DIR}/fp16-lora-exports {DRIVE_DIR}/\n!echo 'Adapter exported and saved to Drive'" + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python", + "version": "3.11.0" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} \ No newline at end of file diff --git a/training/train_qlora.ipynb b/training/train_qlora.ipynb new file mode 100644 index 0000000..1f6b5c9 --- /dev/null +++ b/training/train_qlora.ipynb @@ -0,0 +1,344 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# True QLoRA: Training Apple's 3B Model with 4-bit NF4\n", + "\n", + "Uses bitsandbytes NF4 quantization on the frozen base model.\n", + "Only ~5GB GPU memory — fits on free T4 with headroom.\n", + "\n", + "This is proper QLoRA as defined by [Dettmers et al. 2023](https://arxiv.org/abs/2305.14314):\n", + "4-bit quantized base + fp32 LoRA adapters." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1. 
Setup\n", + "\n", + "Upload to `My Drive/hunch-training/`:\n", + "- `adapter_training_toolkit_v26_0_0/` (from developer.apple.com)\n", + "- `prepare_data.py`, `train_qlora_full.py`, `tldr_bank.db`, `prompts.jsonl`" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Mounted at /content/drive\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m2.3/2.3 MB\u001b[0m \u001b[31m19.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m362.6/362.6 kB\u001b[0m \u001b[31m39.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m73.1/73.1 kB\u001b[0m \u001b[31m8.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m46.0/46.0 kB\u001b[0m \u001b[31m5.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m86.8/86.8 kB\u001b[0m \u001b[31m11.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.6/1.6 MB\u001b[0m \u001b[31m58.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m60.7/60.7 MB\u001b[0m \u001b[31m12.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m:00:01\u001b[0m00:01\u001b[0m\n", + "\u001b[?25hCUDA: True\n", + "GPU: Tesla T4\n" + ] + } + ], + "source": [ + "from google.colab import drive\n", + "drive.mount('/content/drive')\n", + "\n", + "DRIVE_DIR = '/content/drive/MyDrive/hunch-training'\n", + "WORK_DIR = '/content/hunch-training'\n", + "\n", + "!mkdir -p {WORK_DIR}\n", + "!cp -r {DRIVE_DIR}/adapter_training_toolkit_v26_0_0 {WORK_DIR}/\n", + "!cp {DRIVE_DIR}/prepare_data.py {WORK_DIR}/\n", + "!cp {DRIVE_DIR}/train_qlora_full.py {WORK_DIR}/\n", + "!mkdir -p {WORK_DIR}/../bank {WORK_DIR}/../benchmark\n", + "!cp {DRIVE_DIR}/tldr_bank.db {WORK_DIR}/../bank/\n", + "!cp {DRIVE_DIR}/prompts.jsonl {WORK_DIR}/../benchmark/\n", + "\n", + "!cd {WORK_DIR}/adapter_training_toolkit_v26_0_0 && pip install -r requirements.txt -q\n", + "!pip install bitsandbytes -q\n", + "\n", + "import torch\n", + "print(f'CUDA: {torch.cuda.is_available()}')\n", + "if torch.cuda.is_available():\n", + " print(f'GPU: {torch.cuda.get_device_name(0)}')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2. 
Prepare training data" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Loaded 21478 entries from bank\n", + "Filtered to sources {'override'}: 134 entries (from 21478)\n", + " override: 134\n", + "Excluded 38 entries matching benchmark prompts\n", + "After dedup: 96 unique entries (removed 0)\n", + "Small dataset — using all 96 examples for both train and eval\n", + "Wrote 96 examples to /content/hunch-training/train.jsonl\n", + "Wrote 96 examples to /content/hunch-training/eval.jsonl\n", + "\n", + "Sample training examples:\n", + " user: show response headers\n", + " asst: curl -I https://example.com\n", + "\n", + " user: dns lookup for a domain\n", + " asst: dig example.com\n", + "\n", + " user: record shell session to file\n", + " asst: script session.log\n", + "\n" + ] + } + ], + "source": [ + "!cd {WORK_DIR} && python3 prepare_data.py --sources override" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3. Train\n", + "\n", + "No patches needed — `train_qlora_full.py` handles everything:\n", + "mmap loading, NF4 quantization, training loop.\n", + "\n", + "~5GB GPU memory. Can use large batch sizes on T4." + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Device: cuda | RAM=0.6GB GPU=0.0GB\n", + "Quantized 280 layers to NF4\n", + "Trainable: 67M params | RAM=6.3GB GPU=2.2GB\n", + "Train: 96 examples, 12 batches\n", + "Eval: 96 examples\n", + "\n", + "============================================================\n", + "Training: 20 epochs, batch 8, lr 0.0001\n", + "============================================================\n", + "\n", + "Epoch 1/20\n", + "/usr/local/lib/python3.12/dist-packages/torch/nn/functional.py:2954: UserWarning: Mismatch dtype between input and weight: input dtype = float, weight dtype = c10::Half, Cannot dispatch to fused implementation. 
(Triggered internally at /pytorch/aten/src/ATen/native/layer_norm.cpp:344.)\n", + " return torch.rms_norm(input, normalized_shape, weight, eps)\n", + "Saved checkpoint (254MB) to qlora-override-checkpoints/adapter-epoch1.pt\n", + " Train loss: 1.4963 | Eval loss: 0.7162 | RAM=1.6GB GPU=3.0GB\n", + "\n", + "Epoch 2/20\n", + "Saved checkpoint (254MB) to qlora-override-checkpoints/adapter-epoch2.pt\n", + " Train loss: 0.5486 | Eval loss: 0.2153 | RAM=1.6GB GPU=3.0GB\n", + "\n", + "Epoch 3/20\n", + "Saved checkpoint (254MB) to qlora-override-checkpoints/adapter-epoch3.pt\n", + " Train loss: 0.1835 | Eval loss: 0.0547 | RAM=1.6GB GPU=3.0GB\n", + "\n", + "Epoch 4/20\n", + "Saved checkpoint (254MB) to qlora-override-checkpoints/adapter-epoch4.pt\n", + " Train loss: 0.0840 | Eval loss: 0.0401 | RAM=1.6GB GPU=3.0GB\n", + "\n", + "Epoch 5/20\n", + "Saved checkpoint (254MB) to qlora-override-checkpoints/adapter-epoch5.pt\n", + " Train loss: 0.0463 | Eval loss: 0.0093 | RAM=1.6GB GPU=3.0GB\n", + "\n", + "Epoch 6/20\n", + "Saved checkpoint (254MB) to qlora-override-checkpoints/adapter-epoch6.pt\n", + " Train loss: 0.0166 | Eval loss: 0.0046 | RAM=1.6GB GPU=3.0GB\n", + "\n", + "Epoch 7/20\n", + "Saved checkpoint (254MB) to qlora-override-checkpoints/adapter-epoch7.pt\n", + " Train loss: 0.0043 | Eval loss: 0.0013 | RAM=1.6GB GPU=3.0GB\n", + "\n", + "Epoch 8/20\n", + "Saved checkpoint (254MB) to qlora-override-checkpoints/adapter-epoch8.pt\n", + " Train loss: 0.0013 | Eval loss: 0.0003 | RAM=1.6GB GPU=3.0GB\n", + "\n", + "Epoch 9/20\n", + "Saved checkpoint (254MB) to qlora-override-checkpoints/adapter-epoch9.pt\n", + " Train loss: 0.0003 | Eval loss: 0.0001 | RAM=1.6GB GPU=3.0GB\n", + "\n", + "Epoch 10/20\n", + "Saved checkpoint (254MB) to qlora-override-checkpoints/adapter-epoch10.pt\n", + " Train loss: 0.0001 | Eval loss: 0.0001 | RAM=1.6GB GPU=3.0GB\n", + "\n", + "Epoch 11/20\n", + "Saved checkpoint (254MB) to qlora-override-checkpoints/adapter-epoch11.pt\n", + " Train loss: 0.0001 | Eval loss: 0.0001 | RAM=1.6GB GPU=3.0GB\n", + "\n", + "Epoch 12/20\n", + "Saved checkpoint (254MB) to qlora-override-checkpoints/adapter-epoch12.pt\n", + " Train loss: 0.0000 | Eval loss: 0.0000 | RAM=1.6GB GPU=3.0GB\n", + "\n", + "Epoch 13/20\n", + "Saved checkpoint (254MB) to qlora-override-checkpoints/adapter-epoch13.pt\n", + " Train loss: 0.0000 | Eval loss: 0.0000 | RAM=1.6GB GPU=3.0GB\n", + "\n", + "Epoch 14/20\n", + "Saved checkpoint (254MB) to qlora-override-checkpoints/adapter-epoch14.pt\n", + " Train loss: 0.0000 | Eval loss: 0.0000 | RAM=1.6GB GPU=3.0GB\n", + "\n", + "Epoch 15/20\n", + "Saved checkpoint (254MB) to qlora-override-checkpoints/adapter-epoch15.pt\n", + " Train loss: 0.0000 | Eval loss: 0.0000 | RAM=1.6GB GPU=3.0GB\n", + "\n", + "Epoch 16/20\n", + "Saved checkpoint (254MB) to qlora-override-checkpoints/adapter-epoch16.pt\n", + " Train loss: 0.0000 | Eval loss: 0.0000 | RAM=1.6GB GPU=3.0GB\n", + "\n", + "Epoch 17/20\n", + "Saved checkpoint (254MB) to qlora-override-checkpoints/adapter-epoch17.pt\n", + " Train loss: 0.0000 | Eval loss: 0.0000 | RAM=1.8GB GPU=3.0GB\n", + "\n", + "Epoch 18/20\n", + "Saved checkpoint (254MB) to qlora-override-checkpoints/adapter-epoch18.pt\n", + " Train loss: 0.0000 | Eval loss: 0.0000 | RAM=1.6GB GPU=3.0GB\n", + "\n", + "Epoch 19/20\n", + "Saved checkpoint (254MB) to qlora-override-checkpoints/adapter-epoch19.pt\n", + " Train loss: 0.0000 | Eval loss: 0.0000 | RAM=1.6GB GPU=3.0GB\n", + "\n", + "Epoch 20/20\n", + "Saved checkpoint (254MB) to 
qlora-override-checkpoints/adapter-epoch20.pt\n", + " Train loss: 0.0000 | Eval loss: 0.0000 | RAM=1.8GB GPU=3.0GB\n", + "Saved checkpoint (254MB) to qlora-override-checkpoints/adapter-final.pt\n", + "\n", + "Done! Export with:\n", + " python3 -m export.export_fmadapter --adapter-name hunch_qlora --checkpoint qlora-override-checkpoints/adapter-final.pt --output-dir qlora-override-checkpoints//\n" + ] + } + ], + "source": [ + "!cd {WORK_DIR} && python3 train_qlora_full.py \\\n", + " --epochs 20 \\\n", + " --batch-size 8 \\\n", + " --learning-rate 1e-4 \\\n", + " --checkpoint-dir qlora-override-checkpoints/" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 4. Save checkpoints" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Checkpoints saved to Drive\n" + ] + } + ], + "source": [ + "!cp -r {WORK_DIR}/qlora-override-checkpoints/ {DRIVE_DIR}/qlora-override-checkpoints\n", + "!echo 'Checkpoints saved to Drive'" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 5. Export .fmadapter" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "scikit-learn version 1.6.1 is not supported. Minimum required version: 0.17. Maximum required version: 1.5.1. Disabling scikit-learn conversion API.\n", + "XGBoost version 3.2.0 has not been tested with coremltools. You may run into unexpected errors. XGBoost 1.4.2 is the most recent version that has been tested.\n", + "2026-04-18 01:46:36.123769: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n", + "WARNING: All log messages before absl::InitializeLog() is called are written to STDERR\n", + "E0000 00:00:1776476796.352370 4085 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n", + "E0000 00:00:1776476796.414439 4085 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n", + "W0000 00:00:1776476796.851699 4085 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.\n", + "W0000 00:00:1776476796.851750 4085 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.\n", + "W0000 00:00:1776476796.851754 4085 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.\n", + "W0000 00:00:1776476796.851758 4085 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.\n", + "2026-04-18 01:46:36.891354: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.\n", + "To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.\n", + "TensorFlow version 2.19.0 has not been tested with coremltools. You may run into unexpected errors. 
TensorFlow 2.12.0 is the most recent version that has been tested.\n", + "Torch version 2.10.0+cu128 has not been tested with coremltools. You may run into unexpected errors. Torch 2.5.0 is the most recent version that has been tested.\n", + "WARNING:coremltools:Failed to load _MLModelProxy: No module named 'coremltools.libcoremlpython'\n", + "WARNING:coremltools:Failed to load _MLCPUComputeDeviceProxy: No module named 'coremltools.libcoremlpython'\n", + "WARNING:coremltools:Failed to load _MLGPUComputeDeviceProxy: No module named 'coremltools.libcoremlpython'\n", + "WARNING:coremltools:Failed to load _MLNeuralEngineComputeDeviceProxy: No module named 'coremltools.libcoremlpython'\n", + "WARNING:coremltools:Failed to load _MLModelProxy: No module named 'coremltools.libcoremlpython'\n", + "WARNING:coremltools:Failed to load _MLComputePlanProxy: No module named 'coremltools.libcoremlpython'\n", + "WARNING:coremltools:Failed to load _MLModelProxy: No module named 'coremltools.libcoremlpython'\n", + "WARNING:coremltools:Failed to load _MLModelAssetProxy: No module named 'coremltools.libcoremlpython'\n", + "total 4.0K\n", + "drwxr-xr-x 2 root root 4.0K Apr 18 01:46 hunch_qlora.fmadapter\n", + "Adapter exported and saved to Drive\n" + ] + } + ], + "source": [ + "!cd {WORK_DIR}/adapter_training_toolkit_v26_0_0 && python3 -m export.export_fmadapter \\\n", + " --adapter-name hunch_qlora \\\n", + " --checkpoint ../qlora-override-checkpoints/adapter-final.pt \\\n", + " --output-dir ../qlora-override-exports/\n", + "\n", + "!ls -lh {WORK_DIR}/qlora-override-exports/\n", + "!cp -r {WORK_DIR}/qlora-override-exports {DRIVE_DIR}/qlora-override-exports\n", + "!echo 'Adapter exported and saved to Drive'" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.13" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/training/train_qlora_full.py b/training/train_qlora_full.py new file mode 100644 index 0000000..780113f --- /dev/null +++ b/training/train_qlora_full.py @@ -0,0 +1,376 @@ +#!/usr/bin/env python3 +""" +True QLoRA training: 4-bit NF4 base model + fp32 LoRA adapters. + +Uses bitsandbytes for NF4 quantization. Trains on hunch dataset. +Works on 24GB Mac (MPS) and Colab T4 (CUDA). ~5GB GPU memory. 
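For context, a back-of-envelope sketch of where the memory goes (an illustrative estimate, not part of this script; the parameter counts come from the training logs above):

```python
# Illustrative estimate only; real usage also includes activations, the CUDA
# context, and allocator overhead. Parameter counts are from the logs above.
total_params   = 3_178_001_792   # "Total parameters" in the LoRA training log
adapter_params = 66_633_728      # "Total trainable parameters"

GiB = 1024 ** 3
base_nf4      = total_params * 0.5 / GiB        # ~4 bits per frozen weight
quant_consts  = total_params / 64 * 2 / GiB     # per-block scales, rough guess
adapters_fp32 = adapter_params * 4 / GiB        # LoRA weights kept in fp32
adamw_states  = adapter_params * 2 * 4 / GiB    # two fp32 moments per param

total = base_nf4 + quant_consts + adapters_fp32 + adamw_states
print(f"weights + optimizer ≈ {total:.1f} GiB")  # roughly 2.3 GiB
```

Activations and allocator overhead account for the gap between this figure, the ~3GB peaks in the QLoRA training log, and the conservative ~5GB quoted here.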
+ +Usage: + python3 train_qlora_full.py # train 3 epochs + python3 train_qlora_full.py --epochs 1 --batch-size 4 # quick test + python3 train_qlora_full.py --eval-only --checkpoint checkpoints/adapter-final.pt + +Requirements: + pip install bitsandbytes psutil +""" + +import sys +import os +import gc +import json +import time +import argparse +import psutil +from pathlib import Path + +TOOLKIT_DIR = os.path.join(os.path.dirname(os.path.abspath(__file__)), "adapter_training_toolkit_v26_0_0") +sys.path.insert(0, TOOLKIT_DIR) + +import torch +import torch.nn as nn +from torch.utils.data import Dataset, DataLoader +import tamm.utils.json +from tamm.tokenizers.afm import AFMTokenizer + +ASSETS = Path(TOOLKIT_DIR) / "assets" +TRAINING_DIR = Path(__file__).parent + + +def patch_rms_norm(): + """Patch tamm's rms_norm to handle dtype mismatch (fp16 model + fp32 cast).""" + import glob + patterns = [ + os.path.join(TOOLKIT_DIR, "venv", "lib", "*", "site-packages", "tamm", "layers", "functional.py"), + os.path.join(sys.prefix, "lib", "*", "dist-packages", "tamm", "layers", "functional.py"), + ] + for pattern in patterns: + for path in glob.glob(pattern): + code = open(path).read() + if "weight.to(tensor.dtype)" not in code: + old = " tensor = _torch_compatibility.rms_norm(\n tensor, normalized_shape=normalized_shape, weight=weight, eps=eps\n )" + new = " if weight is not None and weight.dtype != tensor.dtype:\n weight = weight.to(tensor.dtype)\n tensor = _torch_compatibility.rms_norm(\n tensor, normalized_shape=normalized_shape, weight=weight, eps=eps\n )" + code = code.replace(old, new) + open(path, "w").write(code) + # Clear pycache + cache_dir = os.path.join(os.path.dirname(path), "__pycache__") + if os.path.exists(cache_dir): + import shutil; shutil.rmtree(cache_dir) + print(f"Patched rms_norm: {path}") + else: + print(f"rms_norm already patched: {path}") + + +def get_device(): + if torch.cuda.is_available(): + return torch.device("cuda") + if torch.backends.mps.is_available(): + return torch.device("mps") + return torch.device("cpu") + + +def mem_str(): + ram = psutil.Process().memory_info().rss / 1024**3 + if torch.cuda.is_available(): + gpu = torch.cuda.memory_allocated() / 1024**3 + elif torch.backends.mps.is_available(): + gpu = torch.mps.current_allocated_memory() / 1024**3 + else: + gpu = 0 + return f"RAM={ram:.1f}GB GPU={gpu:.1f}GB" + + +def load_model_qlora(device): + """Load base model with NF4 quantization.""" + import bitsandbytes as bnb + + # Load config and create model in fp16 (6GB instead of 12GB) + with open(ASSETS / "base-model-config.json") as f: + config = tamm.utils.json.load(f) + config.dtype = torch.float16 + model = config.create_model() + + # Load weights via mmap (minimal RAM) + sd = torch.load(str(ASSETS / "base-model.pt"), map_location="cpu", mmap=True, weights_only=False) + model.load_state_dict(sd, strict=True) + del sd; gc.collect() + + # Freeze non-adapter params + for name, param in model.named_parameters(): + param.requires_grad = "adapter" in name + + # Quantize frozen Linear layers to NF4 + replacements = [] + for name, module in model.named_modules(): + if not isinstance(module, nn.Linear): + continue + if "adapter" in name or any(p.requires_grad for p in module.parameters()): + continue + replacements.append((name, module)) + + for name, module in replacements: + new_module = bnb.nn.Linear4bit( + module.in_features, module.out_features, + bias=module.bias is not None, + compute_dtype=torch.float16, + quant_type="nf4", + ) + new_module.weight = 
bnb.nn.Params4bit( + module.weight.data, requires_grad=False, + quant_type="nf4", compress_statistics=torch.cuda.is_available(), + ) + if module.bias is not None: + new_module.bias = module.bias + + parts = name.rsplit(".", 1) + if len(parts) == 2: + parent = dict(model.named_modules())[parts[0]] + setattr(parent, parts[1], new_module) + else: + setattr(model, name, new_module) + + gc.collect() + print(f"Quantized {len(replacements)} layers to NF4") + + # Move to device + model = model.to(device) + + # Keep adapters in fp32 for stable training with gradient scaling + for name, param in model.named_parameters(): + if param.requires_grad and param.dtype != torch.float32: + param.data = param.data.float() + + trainable = sum(p.numel() for p in model.parameters() if p.requires_grad) + print(f"Trainable: {trainable/1e6:.0f}M params | {mem_str()}") + return model + + +def load_model_with_checkpoint(device, checkpoint_path): + """Load QLoRA model and restore adapter weights from checkpoint.""" + model = load_model_qlora(device) + sd = torch.load(checkpoint_path, map_location=device, weights_only=False) + # Only load adapter weights + adapter_sd = {k: v for k, v in sd.items() if "adapter" in k} + model.load_state_dict(adapter_sd, strict=False) + print(f"Loaded {len(adapter_sd)} adapter weights from {checkpoint_path}") + return model + + +class CommandDataset(Dataset): + """Load JSONL training data.""" + def __init__(self, path, tokenizer, max_length=512): + self.examples = [] + self.tokenizer = tokenizer + self.max_length = max_length + + with open(path) as f: + for line in f: + messages = json.loads(line) + # Format: system + user + assistant + prompt = "" + for msg in messages: + if msg["role"] == "system": + prompt += f"system\n{msg['content']} " + elif msg["role"] == "user": + prompt += f"user\n {msg['content']} " + response = "" + for msg in messages: + if msg["role"] == "assistant": + response = f"assistant\n {msg['content']}" + full_text = prompt + response + prompt_len = len(tokenizer.encode(prompt)) + self.examples.append((full_text, prompt_len)) + + def __len__(self): + return len(self.examples) + + def __getitem__(self, idx): + text, prompt_len = self.examples[idx] + tokens = self.tokenizer.encode(text) + tokens = tokens[:self.max_length] + prompt_len = min(prompt_len, len(tokens)) + return torch.tensor(tokens, dtype=torch.long), prompt_len + + +def collate_fn(batch): + """Pad sequences and create labels with masking for prompt and padding tokens.""" + tokens_list, prompt_lens = zip(*batch) + max_len = max(len(x) for x in tokens_list) + input_ids = torch.zeros(len(tokens_list), max_len, dtype=torch.long) + labels = torch.full((len(tokens_list), max_len), -100, dtype=torch.long) + for i, (tokens, prompt_len) in enumerate(zip(tokens_list, prompt_lens)): + input_ids[i, :len(tokens)] = tokens + # Only compute loss on assistant response tokens (after prompt) + labels[i, prompt_len:len(tokens)] = tokens[prompt_len:] + return input_ids, labels + + +def train_epoch(model, dataloader, optimizer, device, epoch, scaler=None): + model.train() + total_loss = 0 + n_batches = 0 + start = time.time() + + for i, (input_ids, labels) in enumerate(dataloader): + input_ids = input_ids.to(device) + labels = labels.to(device) + + # Forward — labels have -100 for prompt and padding tokens (ignored by CrossEntropyLoss) + if scaler: + with torch.amp.autocast(device_type=str(device), dtype=torch.float16): + output = model(input_ids) + logits = output.logits if hasattr(output, 'logits') else output + loss = 
nn.CrossEntropyLoss(ignore_index=-100)( + logits[:, :-1, :].contiguous().view(-1, logits.size(-1)), + labels[:, 1:].contiguous().view(-1) + ) + else: + output = model(input_ids) + logits = output.logits if hasattr(output, 'logits') else output + loss = nn.CrossEntropyLoss(ignore_index=-100)( + logits[:, :-1, :].contiguous().view(-1, logits.size(-1)), + labels[:, 1:].contiguous().view(-1) + ) + + # Backward + optimizer.zero_grad() + if scaler: + scaler.scale(loss).backward() + scaler.unscale_(optimizer) + torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0) + scaler.step(optimizer) + scaler.update() + else: + loss.backward() + torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0) + optimizer.step() + + total_loss += loss.item() + n_batches += 1 + + log_every = 10 if len(dataloader) < 50 else 20 + if (i + 1) % log_every == 0: + avg = total_loss / n_batches + elapsed = time.time() - start + it_s = (i + 1) / elapsed + remaining = (len(dataloader) - i - 1) / it_s / 60 + print(f" [{i+1}/{len(dataloader)}] loss={avg:.3f} {it_s:.2f}it/s ~{remaining:.0f}min left | {mem_str()}") + + return total_loss / max(n_batches, 1) + + +def evaluate(model, dataloader, device): + model.eval() + total_loss = 0 + n_batches = 0 + + with torch.no_grad(): + for input_ids, labels in dataloader: + input_ids = input_ids.to(device) + labels = labels.to(device) + with torch.amp.autocast(device_type=str(device), dtype=torch.float16): + output = model(input_ids) + logits = output.logits if hasattr(output, 'logits') else output + loss = nn.CrossEntropyLoss(ignore_index=-100)( + logits[:, :-1, :].contiguous().view(-1, logits.size(-1)), + labels[:, 1:].contiguous().view(-1) + ) + total_loss += loss.item() + n_batches += 1 + + return total_loss / max(n_batches, 1) + + +def save_adapter_checkpoint(model, path, optimizer=None, epoch=None): + """Save adapter weights as flat state dict (compatible with export_fmadapter).""" + adapter_sd = {k: v.cpu() for k, v in model.state_dict().items() if "adapter" in k} + torch.save(adapter_sd, path) + size_mb = os.path.getsize(path) / 1024**2 + print(f"Saved checkpoint ({size_mb:.0f}MB) to {path}") + + +def main(): + parser = argparse.ArgumentParser(description="QLoRA training for hunch") + parser.add_argument("--epochs", type=int, default=3) + parser.add_argument("--batch-size", type=int, default=8) + parser.add_argument("--learning-rate", type=float, default=1e-4) + parser.add_argument("--train-data", default=str(TRAINING_DIR / "train.jsonl")) + parser.add_argument("--eval-data", default=str(TRAINING_DIR / "eval.jsonl")) + parser.add_argument("--checkpoint-dir", default=str(TRAINING_DIR / "qlora-checkpoints")) + parser.add_argument("--checkpoint", type=str, help="Resume from checkpoint") + parser.add_argument("--eval-only", action="store_true") + args = parser.parse_args() + + device = get_device() + print(f"Device: {device} | {mem_str()}") + + # Patch rms_norm for fp16 compatibility + patch_rms_norm() + + # Generate training data if needed + if not os.path.exists(args.train_data): + print("Generating training data...") + os.system(f"cd {TRAINING_DIR} && python3 prepare_data.py") + + # Load tokenizer + tokenizer = AFMTokenizer(str(ASSETS / "tokenizer.model")) + + # Load model + if args.checkpoint: + model = load_model_with_checkpoint(device, args.checkpoint) + else: + model = load_model_qlora(device) + + if args.eval_only: + eval_dataset = CommandDataset(args.eval_data, tokenizer) + eval_loader = DataLoader(eval_dataset, batch_size=args.batch_size, collate_fn=collate_fn) + eval_loss = 
evaluate(model, eval_loader, device) + print(f"Eval loss: {eval_loss:.4f}") + return + + # Data + train_dataset = CommandDataset(args.train_data, tokenizer) + eval_dataset = CommandDataset(args.eval_data, tokenizer) + train_loader = DataLoader(train_dataset, batch_size=args.batch_size, shuffle=True, collate_fn=collate_fn) + eval_loader = DataLoader(eval_dataset, batch_size=args.batch_size, collate_fn=collate_fn) + + print(f"Train: {len(train_dataset)} examples, {len(train_loader)} batches") + print(f"Eval: {len(eval_dataset)} examples") + + # Optimizer + optimizer = torch.optim.AdamW( + [p for p in model.parameters() if p.requires_grad], + lr=args.learning_rate, + weight_decay=0.01 + ) + + # Gradient scaler for mixed precision + scaler = torch.amp.GradScaler(device=str(device)) if (torch.cuda.is_available() or torch.backends.mps.is_available()) else None + + # Checkpoint dir + os.makedirs(args.checkpoint_dir, exist_ok=True) + + # Training loop + print(f"\n{'='*60}") + print(f"Training: {args.epochs} epochs, batch {args.batch_size}, lr {args.learning_rate}") + print(f"{'='*60}") + + for epoch in range(args.epochs): + print(f"\nEpoch {epoch+1}/{args.epochs}") + train_loss = train_epoch(model, train_loader, optimizer, device, epoch, scaler) + + # Save checkpoint before eval (in case eval crashes) + ckpt_path = os.path.join(args.checkpoint_dir, f"adapter-epoch{epoch+1}.pt") + save_adapter_checkpoint(model, ckpt_path) + + eval_loss = evaluate(model, eval_loader, device) + print(f" Train loss: {train_loss:.4f} | Eval loss: {eval_loss:.4f} | {mem_str()}") + + # Save final + final_path = os.path.join(args.checkpoint_dir, "adapter-final.pt") + save_adapter_checkpoint(model, final_path) + print(f"\nDone! Export with:") + print(f" python3 -m export.export_fmadapter --adapter-name hunch_qlora --checkpoint {final_path} --output-dir {args.checkpoint_dir}/") + + +if __name__ == "__main__": + main() diff --git a/training/train_qlora_test.py b/training/train_qlora_test.py new file mode 100644 index 0000000..7424862 --- /dev/null +++ b/training/train_qlora_test.py @@ -0,0 +1,216 @@ +#!/usr/bin/env python3 +""" +True QLoRA training test: 4-bit NF4 base model + fp32 LoRA adapters. + +Uses bitsandbytes for NF4 quantization. Tests loading + one training step. + +Usage: + pip install bitsandbytes + python3 train_qlora_test.py + +This script: + 1. Loads the base model + 2. Replaces frozen Linear layers with 4-bit NF4 equivalents + 3. Runs one training batch to verify it works + 4. 
Reports memory usage at each step +""" + +import sys +import os +import gc +import time +import psutil + +TOOLKIT_DIR = os.path.join(os.path.dirname(__file__), "adapter_training_toolkit_v26_0_0") +sys.path.insert(0, TOOLKIT_DIR) + +import torch +import tamm.utils.json +from pathlib import Path + +ASSETS = Path(TOOLKIT_DIR) / "assets" + + +def mem(): + return psutil.Process().memory_info().rss / 1024**3 + +def gpu_mem(): + if torch.cuda.is_available(): + return torch.cuda.memory_allocated() / 1024**3 + elif torch.backends.mps.is_available(): + return torch.mps.current_allocated_memory() / 1024**3 + return 0 + +def get_device(): + if torch.cuda.is_available(): + return torch.device("cuda") + if torch.backends.mps.is_available(): + return torch.device("mps") + return torch.device("cpu") + + +def quantize_linear_to_4bit(model): + """Replace frozen nn.Linear layers with bitsandbytes 4-bit Linear.""" + try: + import bitsandbytes as bnb + except ImportError: + print("ERROR: pip install bitsandbytes") + sys.exit(1) + + quantized = 0 + skipped = 0 + + # Collect replacements (can't modify during iteration) + replacements = [] + for name, module in model.named_modules(): + if not isinstance(module, torch.nn.Linear): + continue + if "adapter" in name: + skipped += 1 + continue + if any(p.requires_grad for p in module.parameters()): + skipped += 1 + continue + replacements.append((name, module)) + + # Apply replacements + for name, module in replacements: + # Create 4-bit linear + new_module = bnb.nn.Linear4bit( + module.in_features, + module.out_features, + bias=module.bias is not None, + compute_dtype=torch.float16, + quant_type="nf4", + ) + + # Quantize weights + new_module.weight = bnb.nn.Params4bit( + module.weight.data, + requires_grad=False, + quant_type="nf4", + compress_statistics=True, + ) + if module.bias is not None: + new_module.bias = module.bias + + # Replace in parent module + parts = name.rsplit(".", 1) + if len(parts) == 2: + parent_name, child_name = parts + parent = dict(model.named_modules())[parent_name] + setattr(parent, child_name, new_module) + else: + setattr(model, name, new_module) + + quantized += 1 + + # Free memory + gc.collect() + if torch.cuda.is_available(): + torch.cuda.empty_cache() + + print(f"QLoRA: quantized {quantized} layers to NF4, skipped {skipped}") + return model + + +def main(): + device = get_device() + print(f"Device: {device}") + print(f"System RAM: {psutil.virtual_memory().total / 1024**3:.0f}GB") + print(f"Before: RAM={mem():.1f}GB, GPU={gpu_mem():.1f}GB") + + # Step 1: Load model config + with open(ASSETS / "base-model-config.json") as f: + config = tamm.utils.json.load(f) + + # Step 2: Create model on CPU + print("\n--- Creating model ---") + model = config.create_model() + print(f"After create_model: RAM={mem():.1f}GB, GPU={gpu_mem():.1f}GB") + + # Step 3: Load weights via mmap + print("\n--- Loading weights (mmap) ---") + sd = torch.load(str(ASSETS / "base-model.pt"), map_location="cpu", mmap=True, weights_only=False) + model.load_state_dict(sd, strict=True) + del sd; gc.collect() + print(f"After load+del: RAM={mem():.1f}GB, GPU={gpu_mem():.1f}GB") + + # Step 4: Freeze non-adapter params + for name, param in model.named_parameters(): + param.requires_grad = "adapter" in name + + trainable_before = sum(p.numel() for p in model.parameters() if p.requires_grad) + frozen_before = sum(p.numel() for p in model.parameters() if not p.requires_grad) + print(f"Trainable: {trainable_before/1e6:.0f}M, Frozen: {frozen_before/1e6:.0f}M") + + # Step 5: Quantize 
frozen layers to 4-bit NF4 + print("\n--- Quantizing to NF4 ---") + model = quantize_linear_to_4bit(model) + gc.collect() + print(f"After quantize: RAM={mem():.1f}GB, GPU={gpu_mem():.1f}GB") + + # Step 6: Move to device + print(f"\n--- Moving to {device} ---") + model = model.to(device) + gc.collect() + if torch.cuda.is_available(): + torch.cuda.empty_cache() + print(f"After to({device}): RAM={mem():.1f}GB, GPU={gpu_mem():.1f}GB") + + # Step 7: Verify trainable params are fp32 + for name, param in model.named_parameters(): + if param.requires_grad and "adapter" in name: + if param.dtype != torch.float32: + param.data = param.data.float() + + trainable_after = sum(p.numel() for p in model.parameters() if p.requires_grad) + print(f"Trainable params: {trainable_after/1e6:.0f}M") + + # Step 8: Test one forward + backward pass + print("\n--- Test forward/backward ---") + try: + tokenizer_path = ASSETS / "tokenizer.model" + from tamm.tokenizers.afm import AFMTokenizer + tokenizer = AFMTokenizer(str(tokenizer_path)) + + # Create a simple input + text = "Output a single shell command for zsh on macOS.\nfind files changed in the last hour" + tokens = tokenizer.encode(text) + input_ids = torch.tensor([tokens[:50]], device=device) + labels = input_ids.clone() + + # Forward pass + output = model(input_ids) + if hasattr(output, 'logits'): + logits = output.logits + else: + logits = output + + # Compute loss + loss_fn = torch.nn.CrossEntropyLoss() + shift_logits = logits[:, :-1, :].contiguous() + shift_labels = labels[:, 1:].contiguous() + loss = loss_fn(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1)) + print(f"Loss: {loss.item():.4f}") + + # Backward pass + loss.backward() + print(f"After backward: RAM={mem():.1f}GB, GPU={gpu_mem():.1f}GB") + + # Check gradients exist on adapter params + grad_count = sum(1 for p in model.parameters() if p.grad is not None) + print(f"Params with gradients: {grad_count}") + + print("\nSUCCESS: QLoRA forward + backward works!") + + except Exception as e: + print(f"\nFailed at forward/backward: {e}") + import traceback + traceback.print_exc() + + print(f"\nFinal: RAM={mem():.1f}GB, GPU={gpu_mem():.1f}GB") + + +if __name__ == "__main__": + main()
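After the smoke test passes, one optional sanity check before export (a hypothetical snippet, not part of either script; the checkpoint path is illustrative) is to confirm that a saved checkpoint contains only the flat adapter state dict written by `save_adapter_checkpoint`:

```python
# Hypothetical sanity check (not part of the toolkit); the path is illustrative.
import torch

ckpt = torch.load("qlora-override-checkpoints/adapter-final.pt",
                  map_location="cpu", weights_only=False)

# save_adapter_checkpoint writes a flat dict of fp32 adapter tensors.
assert all("adapter" in k for k in ckpt), "unexpected non-adapter keys"
n_params = sum(v.numel() for v in ckpt.values())
size_mb = sum(v.numel() * v.element_size() for v in ckpt.values()) / 1024 ** 2
print(f"{len(ckpt)} adapter tensors, {n_params / 1e6:.0f}M params, ~{size_mb:.0f}MB")
# Expect roughly 67M params and ~254MB, matching the checkpoint sizes logged above.
```

If the assertion fails or the size is far off, the checkpoint likely includes frozen or quantized base weights and `export_fmadapter` is the wrong next step.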