docs(repo-arch): teacher7b v2 experiment and adapter #1

Open
bearmug wants to merge 13 commits into main from repo_arch_pi_experiment_runbook


@bearmug bearmug commented May 17, 2026

Summary

This PR records the full repo-arch experiment on fiale-plus/pi and adds the current best adapter artifact (teacher7b-v2) plus the reproducible runbook.

What changed

  • Added the teacher7b-v2 adapter and eval artifact
  • Documented the experiment outcome in docs/experiments/pi-repo-arch-experiment.md
  • Updated docs/experiments/ci-pipeline.md with the current pipeline
  • Fixed scripts/teacher_batch_v2.py so batch results preserve id / question mapping

Best result

teacher7b-v2 is the most successful version overall for the repo-insights goal:

  • 200 teacher targets from Modal 7B
  • 300 iters, rank 8, 4 layers
  • val loss: 2.19 -> 0.543
  • behavioral eval: 118 pkg refs, 35/45 questions with package refs, 0 deflections
  • head-to-head: beats teacher7b v1 (20-18), retrieval (23-16), and full-history (21-17); ties with hybrid (19-19); FUSED v1 still narrowly wins head-to-head (19-17) and leads on raw package refs (147 vs 118)
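
The behavioral eval counts above (package refs, questions with package refs, deflections) could be reproduced with a counter along these lines. This is a minimal sketch, not the repo's eval harness: the `packages/<name>` pattern, the deflection phrases, and the `score_answers` name are all assumptions made for illustration.

```python
import re

# Assumption: a "package ref" is a path-like mention such as packages/<name>.
PKG_REF = re.compile(r"\bpackages/[A-Za-z0-9_.-]+")
# Assumption: a "deflection" is an answer that declines to answer.
DEFLECTION_PHRASES = ("i don't know", "no historical warnings", "cannot determine")


def score_answers(answers):
    """Return (total pkg refs, answers with >= 1 pkg ref, deflections)."""
    total_refs = 0
    with_refs = 0
    deflections = 0
    for text in answers:
        refs = PKG_REF.findall(text)
        total_refs += len(refs)
        if refs:
            with_refs += 1
        if any(phrase in text.lower() for phrase in DEFLECTION_PHRASES):
            deflections += 1
    return total_refs, with_refs, deflections
```

Run over the 45 eval answers of one mode, the three numbers map onto the "pkg refs", "N/45", and "deflections" columns reported here.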

TTFS / TPS

Warm-cache benchmark on 4 representative questions, 2 repeats each, top_k=5:

  • cold load: 0.51s (local cache warm)
  • retrieval: 290-324ms
  • TTFS total: 615-906ms
  • TTFS model-only: 322-616ms
  • throughput: 36.7-40.0 tok/s (avg 38.14 tok/s)
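
For reference, TTFS and throughput can be measured around any token iterator roughly as follows. This is a sketch under assumptions: the benchmark script itself is not shown in this PR, `measure_stream` is a hypothetical helper, and throughput is computed over the decode phase only (after the first token), which may differ from how the numbers above were taken.

```python
import time


def measure_stream(token_stream, clock=time.perf_counter):
    """Consume a token iterator; return (ttfs_seconds, tokens_per_second, n_tokens)."""
    start = clock()
    first = last = None
    count = 0
    for _ in token_stream:
        last = clock()
        if first is None:
            first = last  # timestamp of the first token
        count += 1
    if first is None:
        return None, 0.0, 0  # the stream produced nothing
    decode_time = last - first
    # Tokens after the first one, divided by the time spent producing them.
    tps = (count - 1) / decode_time if decode_time > 0 else 0.0
    return first - start, tps, count
```

Passing an injectable `clock` keeps the helper testable without a model; in a real run the default `time.perf_counter` is used and retrieval time would be measured separately to get the "TTFS total" figure.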

Experiment chronology

| Order | Phase | Data | Key change | Result |
|---|---|---|---|---|
| 1 | LoRA v1 | 92 examples | question -> answer | all "No historical warnings" |
| 2 | LoRA v2 | 120 examples | more cards, still no context | all "No historical warnings" |
| 3 | Retrieval baseline | cards + commits | deterministic facts | strong baseline |
| 4 | FUSED v1 | cards + commits + base | fused retrieval + synthesis | 147 pkg refs, 30/45, 0 deflections |
| 5 | Teacher7B v1 | 50 teacher targets | distilled 7B -> 1.5B | 83 pkg refs, 33/45, beats FUSED v1 |
| 6 | Teacher7B v2 | 200 teacher targets | more data, same rank/layers | 118 pkg refs, 35/45, 0 deflections |

How to run inference

source /opt/homebrew/var/mtplx/venv-0.1.0rc3/bin/activate
cd /Users/pavel/repos/fiale-plus/pi
python3 scripts/hybrid_answer.py --question "What should I know about agent-session.ts?" --adapter .repo-arch/adapters/teacher7b-v2

Modal usage

  • Deploy teacher: modal deploy scripts/modal_7b.py
  • Generate targets:
    • modal run scripts/teacher_batch.py --limit 50
    • modal run scripts/teacher_batch_v2.py --limit 150 --parallel 4
  • Use raw modal.Function scripts, not notebooks or web endpoints
  • Function.from_name("pi-7b-teacher", "generate") is the deployed entrypoint

Modal issues encountered

  • Secret.from_name(..., required=False) changed in Modal 1.4.x
  • Function.lookup() is no longer the right entrypoint; use Function.from_name()
  • container_idle_timeout was renamed to scaledown_window
  • Parallel batch results must carry id and question to avoid out-of-order mapping
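
The last point is the one fixed in scripts/teacher_batch_v2.py. The idea can be sketched as below; this is not the script's actual code. `run_batch` and the item shape are hypothetical, and `generate` stands in for whatever callable performs the remote teacher call.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_batch(items, generate, parallel=4):
    """Run generate(question) per item; every result keeps its id and question.

    `items` is a list of {"id": ..., "question": ...} dicts. Futures complete
    in arbitrary order, so results are keyed by id rather than by position.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        futures = {pool.submit(generate, it["question"]): it for it in items}
        for fut in as_completed(futures):
            it = futures[fut]
            try:
                answer, error = fut.result(), None
            except Exception as exc:  # record failures instead of dropping the row
                answer, error = None, str(exc)
            results[it["id"]] = {
                "id": it["id"],
                "question": it["question"],
                "answer": answer,
                "error": error,
            }
    # Emit in input order so output files are stable across runs.
    return [results[it["id"]] for it in items]
```

Because each row carries its own `id` and `question`, a downstream join never depends on completion order, and failed rows stay visible instead of shifting later answers onto the wrong question.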

Notes

Retrieval stays local; Modal is used only for teacher inference.
Full session logs and raw evals are mirrored in the Obsidian vault.

bearmug added 13 commits May 15, 2026 23:02
- Runbook: staged extraction, curation, dataset, and MLX LoRA training
- 4086 commits mined, 20 cards, 92 training examples
- LoRA adapter trained on Qwen2.5-Coder-1.5B (rank 8, 100 iters)
- Behavioral eval harness: 45 questions x 3 modes (base/retrieval/lora)
- Sanity run results: retrieval-only wins decisively (62 file mentions, 0 deflections)
  vs base (14 mentions, 5 deflects) vs lora (10/10 'no warnings found')
- Current LoRA adapter is not useful — training data too small (82 examples),
  negative ratio too high (39%), training observability missing
- Do NOT scale to cloud GPU until dataset improves
- Experiment report with quality notes, risks, and next actions
…RA retrain

- Lowered card min-confidence to 0.3, max-cards to 50 → 25 cards (+5)
- Curated 18 training-relevant cards (accepted code-level patterns,
  rejected CHANGELOG churn, design rationale, package-lock noise)
- Dataset grew from 92 to 120 examples, negative ratio 39%→30.8%
- Retrained with v2 adapter: 200 iters, val loss 3.3→0.69
- Behavioral eval result IDENTICAL to v1: all answers 'No historical
  warnings found'
- Conclusion: LoRA-only is fundamentally insufficient for repo Q&A.
  Training data format (question→answer without context) cannot teach
  the model to distinguish covered vs uncovered files.
- Right architecture: hybrid retrieval+LoRA, where retrieval provides
  card context and the adapter formats/interprets it.
- hybrid_export.py: generates context+question->answer training data from
  repo-arch cards. Each example includes up to 4 related cards as context.
- hybrid_answer.py: inference script that retrieves cards via repo-arch
  similar, formats as context, and answers with MLX LoRA adapter.
- Trained hybrid adapter (300 iters, val loss 2.39->0.35, 37 examples)
- Behavioral test: hybrid adapter correctly used card context for
  agent-session.ts (cited 94 fixes) and admitted ignorance for uncovered topics.
- Base model with context ALSO works well - the context itself contains the
  answers. Hybrid LoRA adds marginal value at this dataset size.
- Conclusion verified: retrieval is the truth engine, LoRA is optional
  formatting aid for larger datasets.
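
A context+question -> answer record of the kind described above might be assembled like this. It is a sketch only: the card fields (`title`, `summary`), the prompt layout, and the `prompt`/`completion` keys are assumptions, not the format hybrid_export.py actually writes.

```python
import json

MAX_CARDS = 4  # from the commit message: up to 4 related cards per example


def build_example(question, answer, cards):
    """Format one context+question -> answer record as a JSON line."""
    context = "\n\n".join(
        f"[card {i + 1}] {card['title']}\n{card['summary']}"
        for i, card in enumerate(cards[:MAX_CARDS])
    )
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return json.dumps({"prompt": prompt, "completion": " " + answer})
```

The same prompt layout would need to be used at inference time so the adapter sees the format it was trained on.
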
- build_index.py: indexes all 4089 unique commits as BM25 search documents
- full_retrieve.py: search the commit index
- full_history_answer.py: full commit history retrieval + base model answer
- fused_answer.py: combined repo-arch cards + commit history retrieval + answer
- 5-way eval comparison across 45 behavioral questions

Results (147 pkg refs = best):
  FUSED:     147 pkg, 21 cmt, 33 sha, 0 defl — beats all modes head-to-head
  Retrieval:  97 pkg, 18 cmt,  0 sha, 0 defl — strong baseline
  Full-hist: 124 pkg,  2 cmt, 53 sha, 0 defl — broader but misses aggregation
  Hybrid:     66 pkg, 27 cmt,  0 sha, 5 defl — deflections hurt
  Base:       13 pkg,  0 cmt, 11 sha,10 defl — unusable

Head-to-head (Fused wins):
  vs Retrieval: 22-10  vs Full-Hist: 20-17
  vs Hybrid:    27-14  vs Base:     33-1

Conclusion: fused cards+commits retrieval is the optimal architecture for
repo-specific Q&A. Cards provide pattern aggregation (co-change clusters,
repeated fix counts), commits provide specific examples (SHAs, files).
LoRA adds marginal value at this dataset size.
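
The head-to-head tallies above can be read as per-question comparisons with ties left out, which is why a pair such as 22-10 does not sum to 45. A minimal sketch, assuming each mode has one numeric score per question (the scoring rule itself is not stated in this PR, and `head_to_head` is a hypothetical helper):

```python
def head_to_head(scores_a, scores_b):
    """Tally per-question wins, losses, and ties of mode A against mode B."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both modes must be scored on the same questions")
    wins = sum(a > b for a, b in zip(scores_a, scores_b))
    losses = sum(a < b for a, b in zip(scores_a, scores_b))
    ties = len(scores_a) - wins - losses
    return wins, losses, ties
```
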
- gen_cluster_dataset.py: generates training data by clustering commits
  per file (752 files with >=3 commits, 200 clusters x 3 questions each)
- gen_large_dataset.py: synthetic dataset via base model self-instruct
- Training data produced:
  - clusters dataset: 542 examples (488 train + 54 valid)
  - large dataset: 117 examples (106 train + 11 valid)
- gen_cluster_dataset supports --dry-run mode for estimating coverage
- Each example: file commit history context + question -> aggregated answer
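
The per-file clustering step can be sketched as follows. The commit record shape (`sha`, `files`) and the "most-changed files first" selection are assumptions; the commit message only states the >=3 commits threshold and the 200-cluster cap.

```python
from collections import defaultdict


def cluster_by_file(commits, min_commits=3, max_clusters=200):
    """Group commit SHAs by touched file; keep files with enough history."""
    by_file = defaultdict(list)
    for commit in commits:
        for path in commit["files"]:
            by_file[path].append(commit["sha"])
    eligible = {p: shas for p, shas in by_file.items() if len(shas) >= min_commits}
    # Most frequently changed files first; ties broken by path for determinism.
    ranked = sorted(eligible.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return dict(ranked[:max_clusters])
```
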
…esn't beat fused retrieval

- gen_distill_dataset.py: generates training data in exact inference format
  (280 high-quality examples from 300 candidates, filtered 20 low-quality)
- Distill adapter: 300 iters, val loss 2.3 -> 0.194 (excellent convergence)
- Full 45-question eval vs 5 other modes
- Head-to-head: Distill wins 5/45, Fused wins 27/45, Tie 13/45
- Distill adapter improves formatting (40 SHA refs, structured answers)
  but doesn't beat base model on factual precision with same context
- Confirms: retrieval is the truth engine. LoRA improves formatting,
  not factual recall.
- Query type detection: pattern/history/risk/package/test/general
- Dynamic context budget: 2-6 cards, 6-14 commits per question type
- Dedup by SHA/package/subject, group commits by package
- Retrieve 50 candidates, pack down to diverse set
- 45-question eval: v2 wins on breadth (33 vs 30 qs with pkg refs),
  v1 wins on depth (147 vs 111 total pkg refs)
- Head-to-head: v2 11 - v1 24 (ties: 10)
- Key improvement: Q17 went from 0->2 pkg refs (previously worst question)
- Diminishing returns: 1.5B model's 2048-token context is the ceiling
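
The query-type budget and dedup/packing steps from this commit can be illustrated roughly as below. Everything concrete here is hypothetical: the keyword lists, the per-type budgets (chosen inside the stated 2-6 cards and 6-14 commits ranges), and the reading of "dedup" as skipping repeated SHAs and subjects before grouping by package.

```python
# Hypothetical keyword rules; checked in order, first match wins.
KEYWORDS = [
    ("pattern", ("pattern", "recurring")),
    ("history", ("history", "when did")),
    ("risk", ("risk", "break")),
    ("package", ("package",)),
    ("test", ("test",)),
]
# Hypothetical budgets as (max_cards, max_commits).
BUDGETS = {
    "pattern": (6, 6), "history": (2, 14), "risk": (4, 10),
    "package": (4, 10), "test": (3, 8), "general": (4, 8),
}


def detect_query_type(question):
    """Classify a question by simple keyword matching."""
    q = question.lower()
    for qtype, words in KEYWORDS:
        if any(word in q for word in words):
            return qtype
    return "general"


def pack_commits(candidates, max_commits):
    """Keep best-first order, skip duplicate SHAs/subjects, group by package."""
    seen_sha, seen_subject = set(), set()
    grouped = {}
    kept = 0
    for commit in candidates:  # assumed to be sorted best-first
        subject = commit["subject"].strip().lower()
        if commit["sha"] in seen_sha or subject in seen_subject:
            continue
        seen_sha.add(commit["sha"])
        seen_subject.add(subject)
        grouped.setdefault(commit["package"], []).append(commit)
        kept += 1
        if kept == max_commits:
            break
    return grouped
```

With 50 retrieved candidates, `pack_commits(candidates, BUDGETS[detect_query_type(q)][1])` would yield the smaller, more diverse set that fits the 1.5B model's context window.
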
- Modal 7B setup: Qwen2.5-Coder-7B-Instruct on A10G with 32k context
- Teacher batch pipeline: local retrieval + Modal 7B inference
- Hybrid search index: BM25 + all-MiniLM-L6-v2 embeddings for 4089 commits
  with RRF fusion (384-dim, 6.3MB)
- Full 45-question eval on hybrid search: 31/45 qs covered, 43 SHAs, 0 deflections
- Head-to-head: hybrid beats retrieval (23-13), full-hist (22-17),
  ties with fused v2 (18-17), loses to fused v1 (17-22)
- First 3 teacher targets generated via Modal 7B: specific SHAs, dates, paths
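
For readers unfamiliar with RRF: reciprocal rank fusion combines the BM25 and embedding rankings by summing 1 / (k + rank) for each document across the lists. A minimal sketch follows; `rrf_fuse` is a hypothetical helper, and k=60 is the commonly used constant, not a value confirmed by this PR.

```python
def rrf_fuse(rankings, k=60, top_n=None):
    """Fuse ranked lists of doc ids: score(d) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    fused = sorted(scores, key=lambda d: (-scores[d], d))
    return fused[:top_n] if top_n else fused
```

Since only ranks are used, the BM25 scores and cosine similarities never need to be normalized against each other.
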
- Qwen2.5-Coder-7B-Instruct on Modal A10G for teacher generation
- 50 teacher targets generated (0 failures)
- Trained teacher7b adapter: 300 iters, val loss 2.21->0.73
- 45-question eval: 83 pkg refs, 33/45 qs covered, 35 SHAs, 1 deflection
- Head-to-head beats FUSED v1 (20-18), RETRIEVAL (24-13), FULL-HIST (24-17)
- Ties with HYBRID (17-17)
- First time any approach beats the fused retrieval baseline
- Full experiment post-mortem written to Obsidian vault
…ions

- Generated 200 teacher targets via Modal 7B (Qwen2.5-Coder-7B on A10G)
- Each target: hybrid retrieval context + question + 7B answer
- Training: 300 iters, rank 8, 4 layers, val loss 2.19->0.543
- 45-question eval: 118 pkg refs, 35/45 covered (BEST), 0 deflections
- Head-to-head: beats v1 (20-18), retrieval (23-16), full-hist (21-17),
  ties with hybrid (19-19), loses to fused (17-19)
- Teacher7b v2 is the most well-rounded adapter: broadest coverage,
  zero deflections, strong package referencing