docs(repo-arch): teacher7b v2 experiment and adapter #1
Open
bearmug wants to merge 13 commits into
- Runbook: staged extraction, curation, dataset, and MLX LoRA training
- 4086 commits mined, 20 cards, 92 training examples
- LoRA adapter trained on Qwen2.5-Coder-1.5B (rank 8, 100 iters)
- Behavioral eval harness: 45 questions x 3 modes (base/retrieval/lora)
- Sanity run results: retrieval-only wins decisively (62 file mentions, 0 deflections) vs base (14 mentions, 5 deflections) vs lora (10/10 'no warnings found')
- Current LoRA adapter is not useful: training data too small (82 examples), negative ratio too high (39%), training observability missing
- Do NOT scale to cloud GPU until dataset improves
- Experiment report with quality notes, risks, and next actions
…RA retrain
- Lowered card min-confidence to 0.3, max-cards to 50 → 25 cards (+5)
- Curated 18 training-relevant cards (accepted code-level patterns; rejected CHANGELOG churn, design rationale, package-lock noise)
- Dataset grew from 92 to 120 examples, negative ratio 39% → 30.8%
- Retrained with v2 adapter: 200 iters, val loss 3.3 → 0.69
- Behavioral eval result IDENTICAL to v1: all answers 'No historical warnings found'
- Conclusion: LoRA-only is fundamentally insufficient for repo Q&A. The training data format (question → answer without context) cannot teach the model to distinguish covered vs uncovered files.
- Right architecture: hybrid retrieval+LoRA, where retrieval provides card context and the adapter formats/interprets it.
- hybrid_export.py: generates context+question -> answer training data from repo-arch cards. Each example includes up to 4 related cards as context.
- hybrid_answer.py: inference script that retrieves cards via repo-arch similar, formats them as context, and answers with the MLX LoRA adapter.
- Trained hybrid adapter (300 iters, val loss 2.39 -> 0.35, 37 examples)
- Behavioral test: hybrid adapter correctly used card context for agent-session.ts (cited 94 fixes) and admitted ignorance for uncovered topics.
- Base model with context ALSO works well: the context itself contains the answers. Hybrid LoRA adds marginal value at this dataset size.
- Conclusion verified: retrieval is the truth engine, LoRA is an optional formatting aid for larger datasets.
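The context+question -> answer shape described above can be sketched as follows. This is a minimal illustration, not the contents of hybrid_export.py: the card fields (`title`, `summary`), the prompt wording, and the prompt/completion JSONL layout are all assumptions. Only the cap of 4 context cards comes from the commit message.

```python
import json

MAX_CONTEXT_CARDS = 4  # from the commit message: up to 4 related cards per example


def build_example(question, answer, cards):
    """Build one training record with card context prepended to the question.

    The card fields and prompt template are hypothetical.
    """
    context = "\n\n".join(
        f"[card {i}] {card['title']}\n{card['summary']}"
        for i, card in enumerate(cards[:MAX_CONTEXT_CARDS], start=1)
    )
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    return {"prompt": prompt, "completion": answer}


def to_jsonl(examples):
    """Serialize records one JSON object per line."""
    return "\n".join(json.dumps(e, ensure_ascii=False) for e in examples)
```

Because the training prompt carries the same kind of context the model sees at inference time, the adapter can learn to answer from the cards rather than from memorized question/answer pairs.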
- build_index.py: indexes all 4089 unique commits as BM25 search documents
- full_retrieve.py: search the commit index
- full_history_answer.py: full commit history retrieval + base model answer
- fused_answer.py: combined repo-arch cards + commit history retrieval + answer
- 5-way eval comparison across 45 behavioral questions

Results (147 pkg refs = best):
  FUSED:      147 pkg, 21 cmt, 33 sha,  0 defl (beats all modes head-to-head)
  Retrieval:   97 pkg, 18 cmt,  0 sha,  0 defl (strong baseline)
  Full-hist:  124 pkg,  2 cmt, 53 sha,  0 defl (broader but misses aggregation)
  Hybrid:      66 pkg, 27 cmt,  0 sha,  5 defl (deflections hurt)
  Base:        13 pkg,  0 cmt, 11 sha, 10 defl (unusable)

Head-to-head (Fused wins):
  vs Retrieval: 22-10
  vs Full-Hist: 20-17
  vs Hybrid:    27-14
  vs Base:      33-1

Conclusion: fused cards+commits retrieval is the optimal architecture for repo-specific Q&A. Cards provide pattern aggregation (co-change clusters, repeated fix counts), commits provide specific examples (SHAs, files). LoRA adds marginal value at this dataset size.
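For readers unfamiliar with BM25, the commit index above can be approximated by a small in-memory scorer. This is a sketch of standard Okapi BM25, not the code in build_index.py: the `Bm25Index` class, whitespace tokenization, the parameters `k1=1.5` and `b=0.75`, and the sample commit subjects are all assumptions.

```python
import math
from collections import Counter


class Bm25Index:
    """Tiny in-memory Okapi BM25 index (illustrative only)."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.tokens = [d.lower().split() for d in docs]
        self.tf = [Counter(t) for t in self.tokens]
        self.avg_len = sum(len(t) for t in self.tokens) / len(self.tokens)
        # Document frequency: in how many documents each term appears.
        self.df = Counter(term for t in self.tokens for term in set(t))

    def idf(self, term):
        n, df = len(self.tokens), self.df.get(term, 0)
        return math.log(1 + (n - df + 0.5) / (df + 0.5))

    def score(self, query, i):
        length_norm = 1 - self.b + self.b * len(self.tokens[i]) / self.avg_len
        total = 0.0
        for term in query.lower().split():
            f = self.tf[i].get(term, 0)
            total += self.idf(term) * f * (self.k1 + 1) / (f + self.k1 * length_norm)
        return total

    def search(self, query, top_k=5):
        order = sorted(range(len(self.tokens)),
                       key=lambda i: self.score(query, i), reverse=True)
        return order[:top_k]


# Hypothetical commit subjects standing in for the indexed commits.
commits = [
    "fix agent session retry loop",
    "update changelog for release",
    "fix tui rendering of agent output",
]
index = Bm25Index(commits)
top = index.search("agent session", top_k=2)  # indices of the best-matching commits
```

Fusing this kind of lexical commit search with card retrieval is what the FUSED mode above refers to: the cards supply aggregated patterns, and the commit hits supply concrete SHAs and files.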
- gen_cluster_dataset.py: generates training data by clustering commits per file (752 files with >=3 commits, 200 clusters x 3 questions each)
- gen_large_dataset.py: synthetic dataset via base model self-instruct
- Training data produced:
  - clusters dataset: 542 examples (488 train + 54 valid)
  - large dataset: 117 examples (106 train + 11 valid)
- gen_cluster_dataset supports --dry-run mode for estimating coverage
- Each example: file commit history context + question -> aggregated answer
…esn't beat fused retrieval
- gen_distill_dataset.py: generates training data in exact inference format (280 high-quality examples from 300 candidates, filtered 20 low-quality)
- Distill adapter: 300 iters, val loss 2.3 -> 0.194 (excellent convergence)
- Full 45-question eval vs 5 other modes
- Head-to-head: Distill wins 5/45, Fused wins 27/45, Tie 13/45
- Distill adapter improves formatting (40 SHA refs, structured answers) but doesn't beat base model on factual precision with same context
- Confirms: retrieval is the truth engine. LoRA improves formatting, not factual recall.
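The head-to-head counts above sum to the 45 eval questions (5 + 27 + 13). The source does not say how a per-question winner is chosen, so the following is only a sketch under assumed rules: each answer is scored by the number of distinct SHA-like tokens it cites, and an answer containing a deflection phrase always loses. `SHA_RE`, `DEFLECTION_MARKERS`, and both function names are hypothetical.

```python
import re

# Assumption: a "SHA ref" is any 7-40 character lowercase hex token.
# This can over-match plain numbers or hex-looking words.
SHA_RE = re.compile(r"\b[0-9a-f]{7,40}\b")

# Phrase taken from the earlier LoRA evals; treating it as the only marker is an assumption.
DEFLECTION_MARKERS = ("no historical warnings found",)


def answer_score(answer):
    """Return -1 for a deflection, else the count of distinct SHA-like refs."""
    text = answer.lower()
    if any(marker in text for marker in DEFLECTION_MARKERS):
        return -1
    return len(set(SHA_RE.findall(text)))


def head_to_head(answers_a, answers_b):
    """Tally wins and ties over the question ids both modes answered."""
    tally = {"a": 0, "b": 0, "tie": 0}
    for qid in answers_a.keys() & answers_b.keys():
        sa, sb = answer_score(answers_a[qid]), answer_score(answers_b[qid])
        tally["a" if sa > sb else "b" if sb > sa else "tie"] += 1
    return tally
```

A real harness would likely combine several signals (package refs, commit refs, deflections). A single score is used here only to keep the tally logic visible.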
- Query type detection: pattern/history/risk/package/test/general
- Dynamic context budget: 2-6 cards, 6-14 commits per question type
- Dedup by SHA/package/subject, group commits by package
- Retrieve 50 candidates, pack down to diverse set
- 45-question eval: v2 wins on breadth (33 vs 30 qs with pkg refs), v1 wins on depth (147 vs 111 total pkg refs)
- Head-to-head: v2 11 - v1 24 (ties: 10)
- Key improvement: Q17 went from 0 -> 2 pkg refs (previously worst question)
- Diminishing returns: 1.5B model's 2048-token context is the ceiling
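The query-type routing and packing steps listed above might look roughly like this. The six type names and the 2-6 card / 6-14 commit ranges come from the commit message. The keyword rules, the specific budget per type, and the commit record fields (`sha`, `subject`) are hypothetical, and package-level dedup and grouping are omitted for brevity.

```python
import re

# Hypothetical keyword rules; the source names the types but not how they are detected.
TYPE_RULES = [
    ("risk", r"\b(risk|risky|break|regress)"),
    ("test", r"\b(test|spec|coverage)"),
    ("history", r"\b(when|history|changed|introduced)"),
    ("package", r"\b(package|packages|dependency)"),
    ("pattern", r"\b(pattern|usually|often|co-change)"),
]

# Hypothetical (cards, commits) budgets chosen inside the stated ranges.
BUDGETS = {
    "pattern": (6, 6), "history": (2, 14), "risk": (5, 8),
    "package": (4, 10), "test": (3, 10), "general": (4, 8),
}


def detect_type(question):
    """Return the first matching query type, falling back to 'general'."""
    q = question.lower()
    for name, pattern in TYPE_RULES:
        if re.search(pattern, q):
            return name
    return "general"


def pack_commits(candidates, limit):
    """Keep ranked candidates in order, dropping repeated SHAs and subjects."""
    seen_sha, seen_subject, packed = set(), set(), []
    for commit in candidates:
        subject = commit["subject"].strip().lower()
        if commit["sha"] in seen_sha or subject in seen_subject:
            continue
        seen_sha.add(commit["sha"])
        seen_subject.add(subject)
        packed.append(commit)
        if len(packed) == limit:
            break
    return packed


max_cards, max_commits = BUDGETS[detect_type("When was the retry logic introduced?")]
```

Retrieving a wide candidate set first and deduplicating before applying the budget keeps near-identical commits from consuming the small context window.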
- Modal 7B setup: Qwen2.5-Coder-7B-Instruct on A10G with 32k context
- Teacher batch pipeline: local retrieval + Modal 7B inference
- Hybrid search index: BM25 + all-MiniLM-L6-v2 embeddings for 4089 commits with RRF fusion (384-dim, 6.3MB)
- Full 45-question eval on hybrid search: 31/45 qs covered, 43 SHAs, 0 deflections
- Head-to-head: hybrid beats retrieval (23-13) and full-hist (22-17), effectively ties with fused v2 (18-17), loses to fused v1 (17-22)
- First 3 teacher targets generated via Modal 7B: specific SHAs, dates, paths
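Reciprocal rank fusion (RRF), named above as the way BM25 and embedding results are combined, only needs each retriever's ranked list of ids. A minimal sketch follows; the function name `rrf_fuse` and the smoothing constant `k=60` (a commonly used default) are assumptions, not values from the source.

```python
def rrf_fuse(rankings, k=60, top_k=None):
    """Fuse several ranked id lists with reciprocal rank fusion.

    Each id receives 1 / (k + rank) from every list it appears in, with
    ranks starting at 1. Ties are broken by id to keep output deterministic.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    fused = sorted(scores, key=lambda d: (-scores[d], d))
    return fused[:top_k] if top_k else fused


# Hypothetical ranked commit ids from the two retrievers.
bm25_hits = ["a1", "b2", "c3"]
dense_hits = ["b2", "d4", "a1"]
fused = rrf_fuse([bm25_hits, dense_hits])
```

RRF uses ranks rather than raw scores, so BM25 scores and cosine similarities never need to be put on a common scale.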
- Qwen2.5-Coder-7B-Instruct on Modal A10G for teacher generation
- 50 teacher targets generated (0 failures)
- Trained teacher7b adapter: 300 iters, val loss 2.21 -> 0.73
- 45-question eval: 83 pkg refs, 33/45 qs covered, 35 SHAs, 1 deflection
- Head-to-head: beats FUSED v1 (20-18), RETRIEVAL (24-13), FULL-HIST (24-17)
- Ties with HYBRID (17-17)
- First time any approach beats the fused retrieval baseline
- Full experiment post-mortem written to Obsidian vault
…her + distillation
…ions
- Generated 200 teacher targets via Modal 7B (Qwen2.5-Coder-7B on A10G)
- Each target: hybrid retrieval context + question + 7B answer
- Training: 300 iters, rank 8, 4 layers, val loss 2.19 -> 0.543
- 45-question eval: 118 pkg refs, 35/45 covered (BEST), 0 deflections
- Head-to-head: beats v1 (20-18), retrieval (23-16), full-hist (21-17), ties with hybrid (19-19), loses to fused (17-19)
- Teacher7b v2 is the most well-rounded adapter: broadest coverage, zero deflections, strong package referencing
Summary
This PR records the full `repo-arch` experiment on `fiale-plus/pi` and adds the current best adapter artifact (`teacher7b-v2`) plus the reproducible runbook.

What changed
- `teacher7b-v2` adapter and eval artifact
- `docs/experiments/pi-repo-arch-experiment.md`
- `docs/experiments/ci-pipeline.md` with the current pipeline
- `scripts/teacher_batch_v2.py` so batch results preserve `id`/`question` mapping

Best result
`teacher7b-v2` is the most successful version overall for the repo-insights goal:
- val loss `2.19 -> 0.543`
- `118 pkg refs`, `35/45` questions with package refs, `0` deflections
- beats `teacher7b v1` and the retrieval/card baselines; FUSED v1 still slightly edges it on raw package refs

TTFS / TPS
Warm-cache benchmark on 4 representative questions, 2 repeats each, `top_k=5`:
- `0.51s` (local cache warm)
- `290-324ms`
- `615-906ms`
- `322-616ms`
- `36.7-40.0 tok/s` (avg `38.14 tok/s`)

Experiment chronology
How to run inference
Modal usage
- `modal deploy scripts/modal_7b.py`
- `modal run scripts/teacher_batch.py --limit 50`
- `modal run scripts/teacher_batch_v2.py --limit 150 --parallel 4`
- `modal.Function` scripts, not notebooks or web endpoints
- `Function.from_name("pi-7b-teacher", "generate")` is the deployed entrypoint

Modal issues encountered
- `Secret.from_name(..., required=False)` changed in Modal 1.4.x
- `Function.lookup()` is no longer the right entrypoint; use `Function.from_name()`
- `container_idle_timeout` was renamed to `scaledown_window`
- `id` and `question` to avoid out-of-order mapping

Notes
Retrieval stays local; Modal is used only for teacher inference.
Full session logs and raw evals are mirrored in the Obsidian vault.
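The out-of-order mapping issue noted under the Modal issues can be illustrated without Modal at all: when requests run in parallel, results arrive in completion order, so each result has to carry its own `id` and `question`. A minimal sketch, assuming a thread pool in place of the remote call; `fake_teacher` and `run_batch` are hypothetical stand-ins, not the functions in `scripts/teacher_batch_v2.py`.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_teacher(question):
    """Stand-in for the remote 7B call (hypothetical); returns a canned answer."""
    return f"answer to: {question}"


def run_batch(items, generate=fake_teacher, parallel=4):
    """Run questions in parallel and attach id/question to every result.

    Results are appended in completion order, which may differ from input
    order, so callers should join on 'id' rather than on list position.
    """
    results = []
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        futures = {pool.submit(generate, item["question"]): item for item in items}
        for future in as_completed(futures):
            item = futures[future]
            results.append({
                "id": item["id"],
                "question": item["question"],
                "answer": future.result(),
            })
    return results


items = [{"id": f"q{i}", "question": f"question {i}"} for i in range(1, 6)]
by_id = {r["id"]: r for r in run_batch(items)}
```

Joining on `id` afterwards makes the batch output safe to merge with the eval questions regardless of which request finished first.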