docs(repo-arch): teacher7b v2 experiment and adapter by bearmug · Pull Request #1 · fiale-plus/pi

bearmug · 2026-05-17T20:01:07Z

Summary

This PR records the full repo-arch experiment on fiale-plus/pi and adds the current best adapter artifact (teacher7b-v2) plus the reproducible runbook.

What changed

Added the teacher7b-v2 adapter and eval artifact
Documented the experiment outcome in docs/experiments/pi-repo-arch-experiment.md
Updated docs/experiments/ci-pipeline.md with the current pipeline
Fixed scripts/teacher_batch_v2.py so batch results preserve id / question mapping

Best result

teacher7b-v2 is the most successful version overall for the repo-insights goal:

200 teacher targets from Modal 7B
300 iters, rank 8, 4 layers
val loss: 2.19 -> 0.543
behavioral eval: 118 pkg refs, 35/45 questions with package refs, 0 deflections
head-to-head: beats teacher7b v1 (20-18), retrieval (23-16), and full-hist (21-17); ties with hybrid (19-19); FUSED v1 still edges it head-to-head (17-19) and on raw package refs (147 vs 118)

TTFS / TPS

Warm-cache benchmark on 4 representative questions, 2 repeats each, top_k=5:

cold load: 0.51s (local cache warm)
retrieval: 290-324ms
TTFS total: 615-906ms
TTFS model-only: 322-616ms
throughput: 36.7-40.0 tok/s (avg 38.14 tok/s)
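
The benchmark script itself is not part of this description. As a minimal sketch of how TTFS and throughput could be timed around any token stream: `measure_stream` and `fake_stream` are illustrative names, not scripts in this repo, and counting throughput only over the tokens after the first one is an assumption about how the numbers above were derived.

```python
import time
from typing import Iterable, Iterator


def measure_stream(stream: Iterable[str]) -> dict:
    """Time a token stream: time to first token (TTFS) and tokens per second."""
    start = time.perf_counter()
    first_token_at = None
    n_tokens = 0
    for _ in stream:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        n_tokens += 1
    end = time.perf_counter()
    if first_token_at is None:
        # Nothing was generated, so there is no TTFS to report.
        return {"ttfs_s": None, "tokens": 0, "tok_per_s": 0.0}
    decode_s = end - first_token_at
    # Assumption: throughput covers decode time only (tokens after the first).
    tps = (n_tokens - 1) / decode_s if decode_s > 0 and n_tokens > 1 else 0.0
    return {"ttfs_s": first_token_at - start, "tokens": n_tokens, "tok_per_s": tps}


def fake_stream(n: int, delay_s: float) -> Iterator[str]:
    """Stand-in for a real generator; yields n tokens with a fixed delay."""
    for i in range(n):
        time.sleep(delay_s)
        yield f"tok{i}"
```

Wrapping the real generation loop in an iterator and passing it to `measure_stream` would give the model-only TTFS; adding the retrieval time on top would give the total.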

Experiment chronology

Order	Phase	Data	Key change	Result
1	LoRA v1	92 examples	question -> answer	all "No historical warnings"
2	LoRA v2	120 examples	more cards, still no context	all "No historical warnings"
3	Retrieval baseline	cards + commits	deterministic facts	strong baseline
4	FUSED v1	cards + commits + base	fused retrieval + synthesis	147 pkg refs, 30/45, 0 deflections
5	Teacher7B v1	50 teacher targets	distilled 7B -> 1.5B	83 pkg refs, 33/45, beats FUSED v1
6	Teacher7B v2	200 teacher targets	more data, same rank/layers	118 pkg refs, 35/45, 0 deflections

How to run inference

source /opt/homebrew/var/mtplx/venv-0.1.0rc3/bin/activate
cd /Users/pavel/repos/fiale-plus/pi
python3 scripts/hybrid_answer.py --question "What should I know about agent-session.ts?" --adapter .repo-arch/adapters/teacher7b-v2

Modal usage

Deploy teacher: modal deploy scripts/modal_7b.py
Generate targets:
- modal run scripts/teacher_batch.py --limit 50
- modal run scripts/teacher_batch_v2.py --limit 150 --parallel 4
Use raw modal.Function scripts, not notebooks or web endpoints
Function.from_name("pi-7b-teacher", "generate") is the deployed entrypoint
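
A minimal sketch of calling the deployed entrypoint from a plain script. `modal.Function.from_name` and `.remote` are Modal client calls; the helper names `get_teacher` and `ask_teacher` are illustrative, and the assumption that `generate` takes a single prompt string and returns text is not confirmed by this PR.

```python
from typing import Any, Optional


def get_teacher(app_name: str = "pi-7b-teacher", fn_name: str = "generate") -> Any:
    """Look up the deployed Modal function by app name and function name."""
    import modal  # imported lazily so this module loads without Modal installed

    return modal.Function.from_name(app_name, fn_name)


def ask_teacher(prompt: str, teacher: Optional[Any] = None) -> str:
    """Send one prompt to the teacher and return its answer.

    Assumption: the deployed function accepts one prompt string.
    """
    if teacher is None:
        teacher = get_teacher()
    return teacher.remote(prompt)
```

Passing `teacher` explicitly keeps the lookup out of the hot path for batch runs and lets a stub stand in during local testing.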

Modal issues encountered

Secret.from_name(..., required=False) changed in Modal 1.4.x
Function.lookup() is no longer the right entrypoint; use Function.from_name()
container_idle_timeout was renamed to scaledown_window
Parallel batch results must carry id and question to avoid out-of-order mapping
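
The last point can be sketched as follows. This is not the code in scripts/teacher_batch_v2.py, only an illustration of the idea under the assumption that each input item is a dict with `id` and `question` keys; `run_batch` and the `generate` callable are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, List


def run_batch(
    items: List[Dict[str, str]],
    generate: Callable[[str], str],
    parallel: int = 4,
) -> List[Dict[str, str]]:
    """Run generate() over items in parallel, keeping id and question on every result."""
    results = []
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        # Map each future back to its source item instead of relying on position.
        futures = {pool.submit(generate, item["question"]): item for item in items}
        for fut in as_completed(futures):  # completion order is arbitrary
            item = futures[fut]
            results.append(
                {"id": item["id"], "question": item["question"], "answer": fut.result()}
            )
    # Sort by id so the output file is stable across runs.
    return sorted(results, key=lambda r: r["id"])
```

Because every record carries its own `id` and `question`, a slow request finishing last cannot shift answers onto the wrong question.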

Notes

Retrieval stays local; Modal is used only for teacher inference.
Full session logs and raw evals are mirrored in the Obsidian vault.

- Runbook: staged extraction, curation, dataset, and MLX LoRA training
- 4086 commits mined, 20 cards, 92 training examples
- LoRA adapter trained on Qwen2.5-Coder-1.5B (rank 8, 100 iters)
- Behavioral eval harness: 45 questions x 3 modes (base/retrieval/lora)
- Sanity run results: retrieval-only wins decisively (62 file mentions, 0 deflections) vs base (14 mentions, 5 deflects) vs lora (10/10 'no warnings found')
- Current LoRA adapter is not useful: training data too small (82 examples), negative ratio too high (39%), training observability missing
- Do NOT scale to cloud GPU until dataset improves
- Experiment report with quality notes, risks, and next actions

…RA retrain
- Lowered card min-confidence to 0.3, max-cards to 50 → 25 cards (+5)
- Curated 18 training-relevant cards (accepted code-level patterns, rejected CHANGELOG churn, design rationale, package-lock noise)
- Dataset grew from 92 to 120 examples, negative ratio 39%→30.8%
- Retrained with v2 adapter: 200 iters, val loss 3.3→0.69
- Behavioral eval result IDENTICAL to v1: all answers 'No historical warnings found'
- Conclusion: LoRA-only is fundamentally insufficient for repo Q&A. Training data format (question→answer without context) cannot teach the model to distinguish covered vs uncovered files.
- Right architecture: hybrid retrieval+LoRA, where retrieval provides card context and the adapter formats/interprets it.

- hybrid_export.py: generates context+question->answer training data from repo-arch cards. Each example includes up to 4 related cards as context.
- hybrid_answer.py: inference script that retrieves cards via repo-arch similar, formats as context, and answers with MLX LoRA adapter.
- Trained hybrid adapter (300 iters, val loss 2.39->0.35, 37 examples)
- Behavioral test: hybrid adapter correctly used card context for agent-session.ts (cited 94 fixes) and admitted ignorance for uncovered topics.
- Base model with context ALSO works well - the context itself contains the answers. Hybrid LoRA adds marginal value at this dataset size.
- Conclusion verified: retrieval is the truth engine, LoRA is optional formatting aid for larger datasets.
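
A minimal sketch of what one context+question->answer training record could look like. The card fields (`title`, `summary`), the `prompt`/`completion` keys, and the prompt layout are all assumptions for illustration; the actual format written by hybrid_export.py is not shown in this PR.

```python
import json
from typing import Dict, List

MAX_CONTEXT_CARDS = 4  # "up to 4 related cards" per the commit message


def build_example(question: str, answer: str, cards: List[Dict[str, str]]) -> str:
    """Return one JSONL line pairing retrieved card context with a question and answer."""
    context = "\n\n".join(
        f"[card] {card['title']}\n{card['summary']}"  # hypothetical card fields
        for card in cards[:MAX_CONTEXT_CARDS]
    )
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    return json.dumps({"prompt": prompt, "completion": answer})
```

Using the same prompt layout at training and inference time is the point of the hybrid format: the adapter only learns to read context it will also see when answering.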

- build_index.py: indexes all 4089 unique commits as BM25 search documents
- full_retrieve.py: search the commit index
- full_history_answer.py: full commit history retrieval + base model answer
- fused_answer.py: combined repo-arch cards + commit history retrieval + answer
- 5-way eval comparison across 45 behavioral questions

Results (147 pkg refs = best):
- FUSED: 147 pkg, 21 cmt, 33 sha, 0 defl (beats all modes head-to-head)
- Retrieval: 97 pkg, 18 cmt, 0 sha, 0 defl (strong baseline)
- Full-hist: 124 pkg, 2 cmt, 53 sha, 0 defl (broader but misses aggregation)
- Hybrid: 66 pkg, 27 cmt, 0 sha, 5 defl (deflections hurt)
- Base: 13 pkg, 0 cmt, 11 sha, 10 defl (unusable)

Head-to-head (Fused wins):
- vs Retrieval: 22-10
- vs Full-Hist: 20-17
- vs Hybrid: 27-14
- vs Base: 33-1

Conclusion: fused cards+commits retrieval is the optimal architecture for repo-specific Q&A. Cards provide pattern aggregation (co-change clusters, repeated fix counts), commits provide specific examples (SHAs, files). LoRA adds marginal value at this dataset size.
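
For readers unfamiliar with BM25, here is a self-contained sketch of Okapi BM25 scoring over commit subjects. It is not the code in build_index.py; the whitespace tokenizer, the `k1`/`b` defaults, and the class name are assumptions chosen for brevity.

```python
import math
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Naive lowercase whitespace tokenizer (an assumption, not the repo's)."""
    return text.lower().split()


class BM25:
    def __init__(self, docs: List[str], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.tfs = [Counter(tokenize(d)) for d in docs]      # term counts per doc
        self.lens = [sum(tf.values()) for tf in self.tfs]    # doc lengths
        self.avg_len = sum(self.lens) / len(docs)
        df = Counter(term for tf in self.tfs for term in tf) # document frequency
        n = len(docs)
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query: str, i: int) -> float:
        """BM25 score of document i for the query."""
        tf, dl = self.tfs[i], self.lens[i]
        total = 0.0
        for term in tokenize(query):
            f = tf.get(term, 0)
            if f == 0:
                continue
            norm = f + self.k1 * (1 - self.b + self.b * dl / self.avg_len)
            total += self.idf[term] * f * (self.k1 + 1) / norm
        return total

    def search(self, query: str, top_k: int = 5) -> List[int]:
        """Return indices of the top_k highest-scoring documents."""
        ranked = sorted(range(len(self.tfs)), key=lambda i: self.score(query, i), reverse=True)
        return ranked[:top_k]
```

A production index would normally use a tokenizer that splits paths and identifiers, since commit subjects are full of names like `agent-session.ts`.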

- gen_cluster_dataset.py: generates training data by clustering commits per file (752 files with >=3 commits, 200 clusters x 3 questions each)
- gen_large_dataset.py: synthetic dataset via base model self-instruct
- Training data produced:
  - clusters dataset: 542 examples (488 train + 54 valid)
  - large dataset: 117 examples (106 train + 11 valid)
- gen_cluster_dataset supports --dry-run mode for estimating coverage
- Each example: file commit history context + question -> aggregated answer
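
The per-file clustering step could look roughly like this. It is a sketch, not gen_cluster_dataset.py: the input shape (a SHA plus its touched files) and the rule for picking 200 clusters out of the eligible files (busiest files first) are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

MIN_COMMITS = 3  # threshold stated in the commit message


def cluster_by_file(
    commits: Iterable[Tuple[str, List[str]]],
    min_commits: int = MIN_COMMITS,
    max_clusters: int = 200,
) -> Dict[str, List[str]]:
    """Group commit SHAs by touched file and keep files with enough history."""
    by_file: Dict[str, List[str]] = defaultdict(list)
    for sha, files in commits:
        for path in set(files):  # count each file once per commit
            by_file[path].append(sha)
    eligible = {p: shas for p, shas in by_file.items() if len(shas) >= min_commits}
    # Assumption: prefer the most frequently changed files, ties broken by path.
    top = sorted(eligible, key=lambda p: (-len(eligible[p]), p))[:max_clusters]
    return {p: eligible[p] for p in top}
```

Each resulting cluster would then be turned into a context block and paired with a few questions to produce training examples.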

…esn't beat fused retrieval
- gen_distill_dataset.py: generates training data in exact inference format (280 high-quality examples from 300 candidates, filtered 20 low-quality)
- Distill adapter: 300 iters, val loss 2.3 -> 0.194 (excellent convergence)
- Full 45-question eval vs 5 other modes
- Head-to-head: Distill wins 5/45, Fused wins 27/45, Tie 13/45
- Distill adapter improves formatting (40 SHA refs, structured answers) but doesn't beat base model on factual precision with same context
- Confirms: retrieval is the truth engine. LoRA improves formatting, not factual recall.

- Query type detection: pattern/history/risk/package/test/general
- Dynamic context budget: 2-6 cards, 6-14 commits per question type
- Dedup by SHA/package/subject, group commits by package
- Retrieve 50 candidates, pack down to diverse set
- 45-question eval: v2 wins on breadth (33 vs 30 qs with pkg refs), v1 wins on depth (147 vs 111 total pkg refs)
- Head-to-head: v2 11 - v1 24 (ties: 10)
- Key improvement: Q17 went from 0->2 pkg refs (previously worst question)
- Diminishing returns: 1.5B model's 2048-token context is the ceiling
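
A rough sketch of query-type detection with a per-type context budget. Only the six type names and the budget ranges (2-6 cards, 6-14 commits) come from the commit message; the keyword cues, the concrete budget per type, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical keyword cues; checked in order, first match wins.
CUES: List[Tuple[str, Tuple[str, ...]]] = [
    ("test", ("test", "coverage", "flaky")),
    ("risk", ("risk", "break", "careful", "regress")),
    ("history", ("history", "when", "changed", "commit")),
    ("package", ("package", "dependency", "module")),
    ("pattern", ("pattern", "usually", "convention", "co-change")),
]

# Hypothetical (max_cards, max_commits) budgets inside the stated ranges.
BUDGETS: Dict[str, Tuple[int, int]] = {
    "pattern": (6, 6), "history": (2, 14), "risk": (5, 10),
    "package": (4, 10), "test": (3, 8), "general": (4, 8),
}


def detect_query_type(question: str) -> str:
    """Classify a question by simple substring cues; fall back to 'general'."""
    q = question.lower()
    for qtype, cues in CUES:
        if any(cue in q for cue in cues):
            return qtype
    return "general"


def dedup_by_sha(commits: List[dict]) -> List[dict]:
    """Keep the first occurrence of each SHA, preserving order."""
    seen, out = set(), []
    for commit in commits:
        if commit["sha"] not in seen:
            seen.add(commit["sha"])
            out.append(commit)
    return out


def pack_context(question: str, cards: List[dict], commits: List[dict]) -> dict:
    """Trim ranked candidates to the budget for the detected query type."""
    qtype = detect_query_type(question)
    max_cards, max_commits = BUDGETS[qtype]
    return {
        "type": qtype,
        "cards": cards[:max_cards],
        "commits": dedup_by_sha(commits)[:max_commits],
    }
```

The sketch only deduplicates by SHA; deduplication by package or subject and grouping by package would follow the same pattern with a different key.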

- Modal 7B setup: Qwen2.5-Coder-7B-Instruct on A10G with 32k context
- Teacher batch pipeline: local retrieval + Modal 7B inference
- Hybrid search index: BM25 + all-MiniLM-L6-v2 embeddings for 4089 commits with RRF fusion (384-dim, 6.3MB)
- Full 45-question eval on hybrid search: 31/45 qs covered, 43 SHAs, 0 deflections
- Head-to-head: hybrid beats retrieval (23-13), full-hist (22-17), ties with fused v2 (18-17), loses to fused v1 (17-22)
- First 3 teacher targets generated via Modal 7B: specific SHAs, dates, paths
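
Reciprocal rank fusion (RRF), used above to combine the BM25 and embedding rankings, can be sketched in a few lines. The constant `k = 60` is the value commonly used for RRF; whether this repo uses it is an assumption, and `rrf_fuse` is an illustrative name.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60, top_k: int = 5) -> List[str]:
    """Fuse several ranked lists of document ids with reciprocal rank fusion."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every document it ranks.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_k]
```

RRF only needs ranks, not raw scores, which is why it works for mixing BM25 scores with cosine similarities that live on different scales.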

- Qwen2.5-Coder-7B-Instruct on Modal A10G for teacher generation
- 50 teacher targets generated (0 failures)
- Trained teacher7b adapter: 300 iters, val loss 2.21->0.73
- 45-question eval: 83 pkg refs, 33/45 qs covered, 35 SHAs, 1 deflection
- Head-to-head beats FUSED v1 (20-18), RETRIEVAL (24-13), FULL-HIST (24-17)
- Ties with HYBRID (17-17)
- First time any approach beats the fused retrieval baseline
- Full experiment post-mortem written to Obsidian vault

…her + distillation

…ions
- Generated 200 teacher targets via Modal 7B (Qwen2.5-Coder-7B on A10G)
- Each target: hybrid retrieval context + question + 7B answer
- Training: 300 iters, rank 8, 4 layers, val loss 2.19->0.543
- 45-question eval: 118 pkg refs, 35/45 covered (BEST), 0 deflections
- Head-to-head: beats v1 (20-18), retrieval (23-16), full-hist (21-17), ties with hybrid (19-19), loses to fused (17-19)
- Teacher7b v2 is the most well-rounded adapter: broadest coverage, zero deflections, strong package referencing

bearmug added 13 commits May 15, 2026 23:02

docs: CI pipeline architecture for recurring repo mining + Modal teac…

b2b79bf

…her + distillation

docs(repo-arch): add teacher7b v2 runbook and fix batcher

0048382

docs(repo-arch): add ttfs tps benchmark

89622b7

bearmug wants to merge 13 commits into main from repo_arch_pi_experiment_runbook
