A team of AI agents that builds software from a plain-English requirement — creating GitHub Issues, feature branches, pull requests, tests, and deployment smoke tests automatically.
Built on the GitHub Models API — the same AI backbone that powers GitHub Copilot CLI.
- 9 specialised agent types (10+ agents in parallel): PM → PM Reviewer → Architect → Arch Reviewer → Engineers ×N → Code Reviewer → QA Planner → QA Engineer → Deployment Tester
- Checkpoint / resume — interrupted runs pick up from the last successful stage
- Multi-repo routing — agents push to a target repo; tracking issues live in a central
ai-software-houserepo - Per-agent LLM config — assign any GitHub Models model to each agent independently
- Per-repo LLM config — each repo can declare its own
llm:section inrepos-available/*.yamlthat overrides the global config for that repo only (model, per-agent overrides, fallback chains, pool limits) - 📣 PR & marketing campaign pipeline — label an issue
pr-campaignto run a 3-stage pipeline (Analyst → Creative → Proposal) that researches your campaign brief and outputs a polished proposal PR with social copy and platform tactics - 💬 Multi-agent brainstorm stage — add
discuss_brainstormto any pipeline to run a moderated debate; participants can be any agent role file, inline personas, or auto-selected by the LLM from a pool; transcript and synthesis injected into all downstream stages - Actual test execution — pytest runs locally; results posted back to the PR as a comment
- Pluggable deploy backends — deployment tester generates smoke tests; each repo independently chooses
none,docker(local docker-compose), orlibvirt(remote VM via SSH + CoW overlay) for its deploy test strategy - GitHub Actions integration — label an issue to trigger the full pipeline automatically; 15-minute watcher catches pre-labelled issues too
- PR feedback loop — humans post review comments on AI-generated PRs → Engineer + Code Reviewer + QA automatically re-run, push fixes, and update the PR (up to
max_revisionsrounds) - Auto update-branch — watcher detects
update-branchPR comments and automatically merges the base branch into the PR branch, keeping it up to date without human intervention - AI conflict resolution — when a merge conflict is detected,
ConflictResolverAgentclones the repo locally, uses real 3-way git conflict markers (<<<</====/>>>>), and resolves each file with a configurable strong LLM; configurable per-repo viaconflict_resolver_modelinrepos.yaml - Tool calling built-in — Code Reviewer runs
ruff, QA Planner searches GitHub Issues; any agent can call tools viacall_with_tools() - MCP server support — connect any MCP-compatible server (stdio or SSE); tools are automatically merged and injected into tool-calling agents
- 🔍 RAG knowledge base — Engineer, Architect, and QA Engineer agents can search an indexed pgvector knowledge base (codebase, past designs, docs) via
search_codebase,search_memory, andsearch_docstools — powered by Ollama, vLLM, or OpenAI embeddings - Pluggable skill system — skills are markdown files in
skills/that inject domain-specific guidance into agent prompts; auto-detected from project context (issue body, repo languages) or always-loaded from config - 🧩 Custom pipeline stages — define any stage sequence (including review loops) in a
pipeline.yamlfile; use the built-in browser GUI (--config-builder) to build and save it without editing YAML by hand - 🏷️ Label → pipeline dispatch — each GitHub label maps to a
pipelines/<label>.yamlfile; add a new pipeline type by creating one YAML file, no Python or new workflow required - Fully customisable — add agents, skills, and tools by editing markdown role files and Python tool functions
- 🧠 Agent memory — tiered SQLite memory (run → monthly → quarterly), conversation history within each run, auto-summariser after every pipeline
- 🌙 Refactor / dream mode —
--refactorflag analyses and cleans up workspace code, opens a cleanup PR - 🤖 13 LLM backends — GitHub Models (default), Anthropic Claude, Ollama (local), OpenCode CLI, OpenCode Zen API, OpenCode Go API, Grok CLI, Grok OAuth, NVIDIA NIM, Alibaba DashScope, GitHub Copilot, OpenAI API, and Codex CLI; switch per-agent with a model prefix
- ⚡ Two-level concurrency — per-repo
parallel_issuescap + globalsettings.max_parallel; per-LLM-backend semaphore pools keep local Ollama at 1 concurrent call - Resilient checkpoints — atomic writes prevent corruption on Ctrl+C; best-checkpoint-wins logic survives bad config runs
- 🗺️ Repo context awareness — before engineering, the pipeline injects the full repo file tree into PM/Architect prompts (small repos) or auto-indexes the codebase into RAG (large repos), so agents understand what already exists before writing code
- 🔁 Pipeline self-chaining — after a run, agents automatically re-label issues for follow-up pipelines (bug fix, re-review) without human intervention; configurable rules in
config.yaml - 💰 Token usage & cost tracking — per-run token counts and USD cost per model; flushed to SQLite; optional GitHub issue comment with per-stage breakdown; configurable pricing table in
config.yaml - ⚡ Streaming for all backends — streaming responses from GitHub Models, Anthropic, OpenCode Go, and Ollama; configurable per-agent
- 🧪 TDD early-commit — in TDD pipeline mode, test files can be committed to a branch early so engineers see failing tests before implementing
- 📊 Prometheus metrics — standalone
metrics_server.pyexposesaisw_circuit_breaker_events_total,aisw_dlq_events_total, andaisw_degradation_events_totalcounters; wired viametrics_urlinwatchers.yml→ events fire-and-forget to the sink so the watcher is never blocked - 🎯 Agent Accuracy System — four-layer system to prevent, detect, learn from, and bootstrap against agent mistakes: context injection (Layer 1), validation gate before every PR (Layer 2),
LearningAgentthat writes DO NOT rules from failures (Layer 3), andBootstrapPatternsAgentthat seeds new repos with cheatsheets from day zero (Layer 4) - ✅ Validation gate — syntax check → ruff lint → pytest runs before any PR is opened; failures re-prompt the engineer with the exact error message (max 2 retries); hardened on
ai-feature.yaml,ai-fix.yaml,tdd.yaml, andai-smart-fix.yaml
The minimal setup to run the core pipeline — no Docker, no GitHub Actions, no reviewers. Just PM → Architect → Engineers → Code Reviewer → QA Engineer pushing code to a GitHub repo.
git clone https://github.com/your-username/ai-software-house
cd ai-software-house
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt- Go to github.com/settings/tokens → Tokens (classic)
- Generate new token → tick
reposcope (this also enables GitHub Models access) - Copy the token (
ghp_...)
export GITHUB_TOKEN=ghp_your_token_here
⚠️ Must be a classic PAT, not a fine-grained token. Fine-grained tokens return401 models permission required.
Create an empty public repo on GitHub (e.g. your-username/my-first-agent-app).
The pipeline will initialise it automatically.
Edit config.yaml — change just one line:
github:
repo: "your-username/my-first-agent-app" # ← your new repoDisable the optional agents to keep it fast:
team:
num_engineers: 1 # start with 1 engineer
agents:
product_manager: true
pm_reviewer: false # skip for MVP
architect: true
engineer: true
code_reviewer: true
qa_planner: false # skip for MVP
qa_engineer: true
deployment_tester: false # skip — needs Dockerpython main.py \
--requirement "Build a simple REST API for a todo list with FastAPI" \
--repo your-username/my-first-agent-appworkspace/
simple-todo-rest-api/ ← generated code saved locally
GitHub:
Issue #1 ← PRD created by Alice (PM)
Branch: feature/agent-... ← code pushed by Alex (Engineer)
PR #2 ← pull request with code review + test files
| MVP | Full | |
|---|---|---|
| Agents | 4 core agents | 9 agent types |
| Reviewers | Code Reviewer only | PM Reviewer + Arch Reviewer + Code Reviewer |
| Test planning | QA Engineer only | QA Planner → QA Engineer |
| Deployment tests | ❌ | ✅ Docker / libvirt / none (per-repo) |
| GitHub Actions | ❌ | ✅ Auto-trigger on issue labels |
| Time to first PR | ~2–3 min | ~5–10 min |
Once the MVP works, turn agents back on one by one in config.yaml.
- Python 3.11+
- A GitHub classic PAT (not fine-grained) with scopes:
repo+read:org(for GitHub Models access) - Docker (optional — for deployment smoke tests)
git clone https://github.com/your-username/ai-software-house
cd ai-software-house
python -m venv venv && source venv/bin/activate
pip install -r requirements.txtcp config.yaml config.local.yaml # optional — edit as needed
export GITHUB_TOKEN=ghp_your_classic_patUsing Anthropic Claude models? Add your Anthropic key alongside
GITHUB_TOKEN:# For Anthropic Claude models (claude-sonnet-4.6, claude-opus-4.5, etc.) export ANTHROPIC_API_KEY=sk-ant-your-key-hereModel names starting with
claude-are automatically routed to the Anthropic API.GITHUB_TOKENis still required for all GitHub operations.
Using OpenCode Zen or Go models? Export the OpenCode API key:
export OPENCODE_ZEN_API_KEY=your-opencode-api-keyBoth
opencode-zen/andopencode-go/model prefixes share this key. Get one at https://opencode.ai/auth.
Using OpenAI API (BusinessChatGPT)? Export your OpenAI API key:
export OPENAI_API_KEY=sk-your-key-hereUse the
openai/prefix:openai/gpt-4o,openai/gpt-4.1-mini, etc.
Using NVIDIA NIM? Export your NVIDIA API key:
export NVIDIA_API_KEY=nvapi-your-key-hereUse the
nvidia-nim/prefix:nvidia-nim/meta/llama-3.1-70b-instruct, etc.
Using Alibaba DashScope / Qwen? Export your DashScope API key:
export DASHSCOPE_API_KEY=sk-your-key-hereUse the
dashscope/prefix:dashscope/qwen3-plus,dashscope/qwen3-turbo, etc.
Edit config.yaml:
github:
repo: "your-username/your-repo" # where code will be pushedTo use a local Ollama server instead of GitHub Models or Anthropic:
-
Set any agent's model to an
ollama/prefixed name inconfig.yaml:llm: model: "ollama/llama3.2" overrides: engineer: "ollama/qwen2.5-coder"
-
Create
config.local.yamlalongsideconfig.yamlto set your Ollama server URL (this file is gitignored — never committed):llm: ollama_url: "http://your-ollama-host:11434"
If omitted, defaults to
http://localhost:11434. -
Pull the required model on your Ollama server:
ollama pull llama3.2
Run models through the OpenCode CLI instead of a direct API:
-
Install and authenticate OpenCode:
npm install -g opencode-ai # or follow opencode.ai instructions opencode auth login -
Set model names with the
opencode/<provider>/<model>prefix:llm: model: "opencode/anthropic/claude-sonnet-4-5" overrides: engineer: "opencode/openai/gpt-4o"
OpenCode resolves the provider from its own auth config. Override the binary path with
OPENCODE_BINif needed.
⚠️ OpenCode CLI does not support tool-calling. The Code Reviewer agent will fail if assigned anopencode/model. Useopencode-go/oropencode-zen/for tool-calling agents.
Direct HTTP access to OpenCode Zen (Claude, GPT, Gemini, and more):
-
Get an API key at https://opencode.ai/auth and export it:
export OPENCODE_ZEN_API_KEY=your-key-here -
Use the
opencode-zen/<model-id>prefix:llm: model: "opencode-zen/claude-sonnet-4-6" overrides: engineer: "opencode-zen/gpt-5.3-codex"
Claude models route to the Anthropic Messages endpoint; all others use the OpenAI-compatible endpoint and support tool-calling.
Access the OpenCode Go plan models (Kimi, Qwen, GLM, MiMo, MiniMax):
-
Same API key as Zen —
OPENCODE_ZEN_API_KEY. -
Use the
opencode-go/<model-id>prefix:llm: model: "opencode-go/kimi-k2.5" overrides: engineer: "opencode-go/qwen3.6-plus"
Model ID Endpoint Tool-calling kimi-k2.5,qwen3.6-plus,qwen3.5-plus,glm-5.1,glm-5,mimo-v2-pro,mimo-v2-omni/chat/completions✅ minimax-m2.7,minimax-m2.5Anthropic /messages❌ Override the base URL with
OPENCODE_GO_BASE_URLif needed.
Run xAI Grok models via the Grok CLI subprocess:
-
Install the Grok CLI and ensure it is authenticated.
-
Set model names with the
grok/<model>prefix:llm: model: "grok/grok-3" overrides: engineer: "grok/grok-3-mini"
Override the binary path with
GROK_BINif needed.
⚠️ Grok CLI does not support tool-calling. The Code Reviewer agent will fail if assigned agrok/model.
Direct HTTP access to the xAI API using the OAuth browser flow:
-
Use the
grok-oauth/<model>prefix — the first call opens a browser for xAI OAuth login:llm: model: "grok-oauth/grok-3" overrides: engineer: "grok-oauth/grok-3-mini"
Token is refreshed automatically. Override client ID with
XAI_OAUTH_CLIENT_IDif needed.Supports tool-calling (OpenAI-compatible endpoint).
Access NVIDIA-hosted models (Llama, Mistral, Nemotron, etc.) via the NIM inference API:
-
Get an API key from build.nvidia.com and export it:
export NVIDIA_API_KEY=nvapi-your-key-here -
Use the
nvidia-nim/<model>prefix:llm: model: "nvidia-nim/meta/llama-3.1-70b-instruct" overrides: engineer: "nvidia-nim/mistralai/mistral-7b-instruct-v0.3"
Override the endpoint with
NVIDIA_NIM_BASE_URL(default:https://integrate.api.nvidia.com/v1).Supports tool-calling.
Access Alibaba Cloud Qwen and other DashScope models:
-
Get an API key from dashscope.aliyuncs.com and export it:
export DASHSCOPE_API_KEY=sk-your-key-here -
Use the
dashscope/<model>prefix:llm: model: "dashscope/qwen3-plus" overrides: engineer: "dashscope/qwen3-turbo"
Default endpoint:
https://dashscope-intl.aliyuncs.com/compatible-mode/v1. Override withdashscope_urlin config.Supports tool-calling.
Use GitHub Copilot's internal inference API — requires an active Copilot subscription:
-
Auth is auto-discovered from
~/.copilot/config.json(set by the Copilot CLI). Or export the token directly:export COPILOT_OAUTH_TOKEN=gho_your-copilot-token -
Use the
copilot/<model>prefix:llm: model: "copilot/gpt-4o" overrides: engineer: "copilot/gpt-4.1-mini"
Supports tool-calling.
Direct access to the OpenAI API (api.openai.com) — for ChatGPT Plus/Team/Enterprise subscribers using a standard API key:
-
Export your OpenAI API key:
export OPENAI_API_KEY=sk-your-key-here -
Use the
openai/<model>prefix:llm: model: "openai/gpt-4o" overrides: engineer: "openai/gpt-4.1-mini"
Supports tool-calling. Fallbacks work as normal across all agents.
Run the OpenAI Codex CLI agent (codex exec) as a subprocess — requires a ChatGPT Plus/Pro account:
-
Install and sign in:
curl -fsSL https://chatgpt.com/codex/install.sh | sh # or: npm install -g @openai/codex codex # sign in with your ChatGPT account on first run
-
Use the
codex/<model>prefix:llm: model: "codex/codex-mini-latest" overrides: engineer: "codex/o4-mini"
Override the binary path with
CODEX_BINif needed.
⚠️ Codex CLI does not support tool-calling. The Code Reviewer agent will fail if assigned acodex/model.
# From a requirement file
python main.py --file requirements/my-app.txt --repo owner/target-repo
# From a string
python main.py --requirement "Build a REST API for a todo app" --repo owner/target-repo
# Resume an interrupted run (default — checkpoint auto-detected)
python main.py --file requirements/my-app.txt --repo owner/target-repo
# Start fresh (ignore checkpoint)
python main.py --file requirements/my-app.txt --repo owner/target-repo --no-resumeResume behaviour: checkpoints are written atomically after each stage completes, so a Ctrl+C or crash mid-stage never corrupts the saved state. If the same requirement has been run before with different config (e.g. a wrong model name), the pipeline automatically picks up the checkpoint with the most completed stages — a failed partial run can never roll back progress.
1. 📋 Product Manager — requirement → PRD + GitHub Issue
2. 📝 PM Reviewer — reviews PRD; optionally revises before architecture
3. 🏗️ Architect — PRD → system design + module list
4. 🔎 Arch Reviewer — reviews design; optionally revises before engineering
5. 🗺️ Repo Indexer — injects repo tree into prompts (small repos) or auto-indexes codebase into RAG (large repos)
6. 💻 Engineers ×N — parallel code generation → feature branch + PR
7. 🔍 Code Reviewer — reviews code → PR comment with verdict
8. 📋 QA Planner — PRD + design + code → structured test plan + acceptance criteria
9. 🧪 QA Engineer — implements tests guided by QA Planner's test plan → PR
10. 🏃 Test Runner — runs pytest locally → PR comment with results
11. 🚀 Deployment Tester — generates docker-compose.test.yml + smoke tests → PR
12. 🐳 Deploy Test Runner — runs deploy smoke tests via the repo's configured backend (docker / libvirt / none) → PR comment
13. 🧠 Summariser — writes compact memory entry (what was built, decisions, feedback, tech debt)
**Bug-fix stages** (used in `ai-fix` pipeline):
14. 🔬 Diagnose — reads issue + codebase, pinpoints root cause
15. 🐛 Bug Fix — applies targeted fix → branch + PR
**Documentation stages** (used in `ai-docs` pipeline):
16. 📝 Doc Generate — reads existing docs + source, writes/updates documentation files
17. 📤 Doc Commit/PR — commits doc files to a branch and opens a PR
By default the pipeline runs all stages in the order shown above. You can replace this with any custom stage sequence by creating a pipeline.yaml in the project root.
stages:
- pm
- pm_reviewer
- architect
- architect_reviewer
- loop:
max: 3
until: APPROVED
stages:
- engineer
- code_reviewer
- qa_planner
- qa_engineer
- test_runner
- deployment_tester
- deploy_test_runner
- summariserPlain stages are stage names (strings) — any of the stages listed in the pipeline table above.
Loop blocks repeat an inner stage sequence until a reviewer verdict matches until (or max iterations are reached):
| Field | Description |
|---|---|
max |
Maximum iterations before moving on (required) |
until |
Verdict that exits the loop: APPROVED, NEEDS_REVISION, or CHANGES REQUESTED |
stages |
Inner stages to repeat (list of stage names) |
Typical loop pattern: wrap engineer + code_reviewer so the reviewer can push the engineer to fix issues before QA runs.
Instead of editing YAML by hand, launch the browser-based GUI:
python main.py --config-builderThis opens a local web server (URL printed to console) with a drag-and-drop palette:
- Palette — lists every available stage, colour-coded by category
- Pipeline canvas — drag stages from the palette to build your sequence; drag to reorder
- Loop blocks — drag the "Loop" stage into the canvas and configure
max/until/ inner stages - Save — writes
pipeline.yamlnext to yourconfig.yaml
No GitHub token needed — --config-builder exits before any network calls.
If pipeline.yaml exists, it overrides the pipeline.mode setting in config.yaml. To restore default mode, delete (or rename) pipeline.yaml.
Each GitHub label can trigger its own pipeline. The watcher picks the pipeline file based on the label name.
Built-in pipelines:
| Label | Pipeline File | Purpose |
|---|---|---|
ai-feature |
pipelines/ai-feature.yaml |
Full feature build (PM → Architect → Engineer → QA) |
ai-fix |
pipelines/ai-fix.yaml |
Bug-fix flow (diagnose → fix → review → test) |
ai-docs |
pipelines/ai-docs.yaml |
Generate documentation and open a PR |
Custom pipelines: Create pipelines/<your-label>.yaml with a stages: list and add the label to your repo entry in repos.yaml. See Custom Pipeline (pipeline.yaml) for the full format.
Per-project override: A pipeline.yaml at the project's root takes precedence over the built-in pipelines/<label>.yaml.
Two independent layers control parallelism:
- Per-repo:
parallel_issues: Ninrepos.yaml— how many issues from one tracker repo run at once. Default:1. - Per-LLM-backend:
llm.pools.<backend>: Ninconfig.yaml— how many simultaneous calls to that backend across all running pipelines. Default:ollama: 1, others5.
This means you can run feature pipelines in parallel against multiple repos but still keep your local Ollama instance at one call at a time.
Every pipeline run contributes to a tiered, persistent memory store so the system learns from past work on a repo.
Each run → run summary (SummaryAgent writes after pipeline)
│
▼ (after 10 run entries)
monthly snapshot (MemoryConsolidatorAgent)
│
▼ (after 3 monthly entries)
quarterly index (MemoryConsolidatorAgent)
recall() always returns: all quarterly entries + all monthly entries + last 3 run summaries — capped at ~2 200 words regardless of total run count.
| File | Location | Purpose |
|---|---|---|
memory.db |
workspace/<repo>/memory.db |
SQLite store (all tiers) |
memory.md |
workspace/<repo>/memory.md |
Human-readable log of all entries |
Long-term memory is loaded at the start of each run and injected as a ## 📚 Memory block into every agent's system prompt.
from orchestrator import Orchestrator
orch = Orchestrator(model="gpt-4.1", github_token="ghp_...", target_repo="owner/repo")
# View stats for a repo
stats = orch.memory.stats("owner/repo")
# → {"runs": 7, "monthly": 1, "quarterly": 0}
# Keyword search across memory entries
results = orch.memory.search("owner/repo", ["auth", "JWT"])
# → list of matching memory entriesA standalone cleanup pass that scans existing workspace code, identifies code smells and tech debt, rewrites flagged files, and opens a cleanup PR. It does not run the normal build pipeline.
CLI:
python main.py --refactor --repo owner/target-repoPython API:
result = orch.refactor()
# Returns:
# {
# "plan": "...", # identified smells & refactor plan
# "changes": {"file.py": "new content", ...}, # rewritten files
# "pr_url": "https://github.com/..."
# }The Memory Bank gives Copilot CLI persistent context across sessions. Six structured Markdown files are committed to each target repo and auto-read by Copilot at session start.
| File | Updated | Purpose |
|---|---|---|
memory-bank/projectbrief.md |
Rarely | Goals, scope, core requirements |
memory-bank/productContext.md |
Rarely | Why it exists, user problems, UX goals |
memory-bank/systemPatterns.md |
On design change | Architecture, patterns, conventions |
memory-bank/techContext.md |
On stack change | Tech stack, dependencies, environment |
memory-bank/activeContext.md |
Every run | Current focus, recent changes, next steps |
memory-bank/progress.md |
Every run | Done / in-progress / blockers |
After every successful pipeline run, the MemoryBankUpdaterAgent reads the current bank files via the GitHub API and commits updated activeContext.md and progress.md (and optionally systemPatterns.md / techContext.md) to the feature branch. No manual action required.
To add a Memory Bank to any project, use the templates from copilot-agent-setting:
cd /path/to/copilot-agent-setting
./deploy-memory-bank.sh /path/to/your-projectThen fill in memory-bank/projectbrief.md and memory-bank/productContext.md. The pipeline will keep the other files up to date automatically.
| Mode | How |
|---|---|
| Fully automatic | Pipeline updates bank after every run — no action needed |
| Semi-automatic | ./install-memory-bank-hook.sh /path/to/project — git post-commit hook updates bank after every commit |
| Manual | ./update-memory-bank.sh "summary" — run from inside the project |
| Agent | Name | Input | Output | GitHub Artifact |
|---|---|---|---|---|
| Product Manager | Alice | Raw requirement | PRD markdown | GitHub Issue |
| PM Reviewer | Grace | PRD + requirement | Review + revised PRD (if needed) | Issue comment |
| Architect | Bob | PRD | System design + modules | Issue comment |
| Arch Reviewer | Frank | Design + PRD | Review + revised design (if needed) | Issue comment |
| Engineer ×N | Alex ×N | System design | Source code files | Feature branch + PR |
| Code Reviewer | Carol | Code + PRD | Review verdict | PR comment |
| QA Planner | Henry | PRD + design + code | Test plan + acceptance criteria | Issue/PR comment |
| QA Engineer | Edward | Code + PRD + test plan | Test files + conftest + requirements-test.txt | PR comment + branch |
| Deployment Tester | Diana | Code + Dockerfile | docker-compose.test.yml + smoke tests + deploy script | PR comment + branch |
| Conflict Resolver | — | PR branch + base branch + PR context | Resolved branch (committed + pushed) | PR comment |
python main.py [options]
Input (one required):
--file PATH Path to a .txt file containing the requirement
--requirement TEXT Requirement as a command-line string
Routing:
--repo OWNER/REPO Target repository for code (overrides config.yaml)
Model:
--model MODEL Override model for ALL agents
--model-override AGENT=MODEL Override model for one agent (repeatable)
Agent names: product_manager, pm_reviewer, architect,
architect_reviewer, engineer, code_reviewer,
qa_planner, qa_engineer, deployment_tester
Team:
--engineers N Number of parallel Engineer agents (default: 2)
Pipeline:
--no-resume Ignore checkpoint and start from scratch
--stop-on-review Halt pipeline if Code Reviewer requests changes
--refactor Dream mode: analyse workspace code and open a cleanup PR
--mode {build,revise} 'build' (default) runs full pipeline; 'revise' processes PR feedback
--pr PR_NUMBER PR number to revise — required when --mode=revise
--config-builder Launch browser-based GUI to build/edit pipeline.yaml, then exit
--pipeline NAME Run a named pipeline (matches pipelines/<name>.yaml or built-ins: ai-feature, ai-fix, ai-docs)
--list-pipelines Print all available pipeline names (project + built-in) and exit
watcher.py options:
--once Process all pending issues once, then exit (used by GitHub Actions)
--dry-run Show what would run; make no GitHub changes
--config PATH Use a different repos.yaml file (default: repos.yaml)
from orchestrator import Orchestrator
orch = Orchestrator(
model="gpt-4.1",
github_token="ghp_...",
target_repo="owner/my-app",
num_engineers=3,
)
result = orch.run("Build a REST API for patient questionnaires")
print(result.prd) # PRD markdown
print(result.prd_verdict) # PRD APPROVED / NEEDS REVISION
print(result.design) # System design markdown
print(result.design_verdict) # DESIGN APPROVED / NEEDS REVISION
print(result.qa_plan) # Full test plan from QA Planner
print(result.qa_acceptance_criteria) # ['AC-01', 'AC-02', ...]
print(result.pr_url) # GitHub PR URL
print(result.tests_passed) # True / False / None
print(result.run_id) # UUID for this pipeline run
print(result.total_cost_usd) # estimated USD cost (requires cost_tracking.enabled)
print(result.token_usage) # dict with by_stage and by_model breakdownsProcess human review feedback on an AI-generated PR and push an updated commit.
from orchestrator import Orchestrator
orch = Orchestrator.from_config("config.yaml")
result = orch.run_revision(pr_number=42)
# result["status"] → "approved" | "changes_requested" | "max_revisions_reached" | "error"Via CLI:
python main.py --mode revise --pr 42 --repo owner/target-repollm:
# Default model for all agents
# ── GitHub Models (default) ─────────────────────────────────────────────
# gpt-4.1, gpt-4.1-mini, gpt-4.1-nano, gpt-4o, gpt-4o-mini, o4-mini, o3
# claude-3.5-sonnet, claude-3.7-sonnet, claude-3-haiku
# meta-llama-3.3-70b-instruct, mistral-large-2411
# deepseek-r1, deepseek-v3, cohere-command-r-plus
#
# ── Anthropic Claude API (ANTHROPIC_API_KEY required) ──────────────────
# claude-sonnet-4-6, claude-opus-4-5, claude-3-5-sonnet-20241022, ...
# Any model starting with "claude-" is auto-routed to Anthropic.
# ⚠️ Does NOT support tool-calling (Code Reviewer will fail).
#
# ── Ollama (local, ollama_url below) ───────────────────────────────────
# ollama/llama3.2, ollama/qwen2.5-coder, ollama/mistral, ...
# Tool-calling ✅
#
# ── OpenCode CLI (opencode must be installed + authenticated) ───────────
# opencode/<provider>/<model>
# e.g. opencode/anthropic/claude-sonnet-4-5
# opencode/openai/gpt-4o
# ⚠️ Does NOT support tool-calling (Code Reviewer will fail).
#
# ── OpenCode Zen API (OPENCODE_ZEN_API_KEY required) ───────────────────
# opencode-zen/<model-id>
# e.g. opencode-zen/claude-sonnet-4-6 → Anthropic endpoint
# opencode-zen/gpt-5.3-codex → OpenAI endpoint (tool-calling ✅)
#
# ── OpenCode Go API (OPENCODE_ZEN_API_KEY required) ────────────────────
# opencode-go/<model-id>
# e.g. opencode-go/kimi-k2.5, opencode-go/qwen3.6-plus (tool-calling ✅)
# opencode-go/minimax-m2.7 → Anthropic endpoint (no tool-calling)
#
# ── Grok CLI (grok must be installed + authenticated) ──────────────────
# grok/<model-id>
# e.g. grok/grok-3, grok/grok-3-mini
# ⚠️ Does NOT support tool-calling (Code Reviewer will fail).
# Override binary path: GROK_BIN env var.
#
# ── Grok OAuth / xAI API (browser OAuth on first use) ──────────────────
# grok-oauth/<model-id>
# e.g. grok-oauth/grok-3, grok-oauth/grok-3-mini
# Tool-calling ✅ Override client ID: XAI_OAUTH_CLIENT_ID env var.
#
# ── NVIDIA NIM (NVIDIA_API_KEY required) ────────────────────────────────
# nvidia-nim/<model-id>
# e.g. nvidia-nim/meta/llama-3.1-70b-instruct
# nvidia-nim/mistralai/mistral-7b-instruct-v0.3
# Tool-calling ✅ Override endpoint: NVIDIA_NIM_BASE_URL env var.
#
# ── Alibaba DashScope / Qwen (DASHSCOPE_API_KEY required) ───────────────
# dashscope/<model-id>
# e.g. dashscope/qwen3-plus, dashscope/qwen3-turbo
# Tool-calling ✅ Override endpoint: dashscope_url in config.
#
# ── GitHub Copilot (COPILOT_OAUTH_TOKEN or ~/.copilot/config.json) ──────
# copilot/<model-id>
# e.g. copilot/gpt-4o, copilot/gpt-4.1-mini
# Tool-calling ✅ Requires active GitHub Copilot subscription.
#
# ── OpenAI API / BusinessChatGPT (OPENAI_API_KEY required) ─────────────
# openai/<model-id>
# e.g. openai/gpt-4o, openai/gpt-4.1-mini
# Tool-calling ✅
#
# ── OpenAI Codex CLI (ChatGPT Plus/Pro account required) ────────────────
# codex/<model-id>
# e.g. codex/codex-mini-latest, codex/o4-mini
# Install: curl -fsSL https://chatgpt.com/codex/install.sh | sh
# ⚠️ Does NOT support tool-calling (Code Reviewer will fail).
# Override binary path: CODEX_BIN env var.
model: "gpt-4.1"
# Per-agent model overrides
overrides:
product_manager: "gpt-4.1" # reasoning-heavy
pm_reviewer: "gpt-4.1"
architect: "gpt-4.1"
architect_reviewer: "gpt-4.1"
engineer: "gpt-4.1-mini" # runs many times — use cheaper model
code_reviewer: "gpt-4.1"
qa_planner: "gpt-4.1" # test planning needs strong reasoning
qa_engineer: "gpt-4.1-mini" # repetitive test writing — cheaper
deployment_tester: "gpt-4.1-mini"
github:
repo: "owner/repo" # default target repo
branch_prefix: "feature/agent"
team:
num_engineers: 2
agents: # enable / disable individual agents
product_manager: true
pm_reviewer: true
architect: true
engineer: true
code_reviewer: true
qa_planner: true
qa_engineer: true
deployment_tester: true
pipeline:
workspace_dir: "./workspace"
stop_on_review_issues: false
max_retries: 2
max_revisions: 3 # max automated PR revision rounds (0 = disabled)
mode: "standard" # "standard" | "tdd" | "blocky"
tdd_commit_tests: false # (TDD mode) commit test files to branch before implementation
# Note: if pipeline.yaml exists in the project root, it overrides pipeline.mode.
skills:
always_load: [] # e.g. [security-audit] to always apply
marketplace_repo: "" # e.g. "myorg/ai-software-house-skills"
cache_dir: "" # defaults to ~/.ai-software-house/skills/
fetch_timeout: 5
mcp:
servers: [] # see "Using MCP Servers" section below
cost_tracking:
enabled: false # set true to enable tracking
db_path: "./token_usage.db" # SQLite file path (relative to project root)
post_to_github: false # post usage summary comment to the GitHub issue
# Pricing per 1M tokens: [input_price_usd, output_price_usd]
# Set to [0.00, 0.00] for local/free models (Ollama, etc.)
# Unlisted models fall back to "default".
pricing:
gpt-4.1: [2.00, 8.00]
gpt-4.1-mini: [0.40, 1.60]
gpt-4o: [2.50, 10.00]
qwen3.6-plus: [0.50, 1.50]
qwen3.5-plus: [0.30, 1.20]
thinker: [0.00, 0.00]
thinker-best: [0.00, 0.00]
coder: [0.00, 0.00]
fast: [0.00, 0.00]
chat: [0.00, 0.00]
default: [2.00, 8.00] # fallback for any unlisted modelEvery agent's behaviour is controlled entirely by its role file (roles/<agent>.md). This file becomes the LLM's system prompt — change the markdown, change the agent.
# Agent Name
## Role
One or two sentences: who this agent is and what their job is.
Give them a name and a personality.
## Responsibilities
- Bullet list of what this agent does — these are the agent's "skills"
- Be specific: "Write a conftest.py with shared pytest fixtures"
- Not vague: "Write tests"
## Critical Rules
- Hard constraints that must never be violated
- e.g. "Never hardcode credentials — use environment variables"
- e.g. "Always use Given/When/Then format for acceptance tests"
- e.g. "Do NOT write test code — that is Edward's job"
## Output Format
The exact markdown/code structure the agent must produce.
Downstream parsers in the Python agent class look for specific markers.
Use code blocks showing the exact template.
## Quality Rules
- What makes a good output vs a bad one
- End with: `MY KEYWORD COMPLETE` ← used by the parser to detect success| Agent | Role File | Key Skills Defined |
|---|---|---|
| Product Manager | roles/product_manager.md |
Requirements analysis, user story writing, PRD structure |
| PM Reviewer | roles/pm_reviewer.md |
PRD completeness check, acceptance criteria quality, revision |
| Architect | roles/architect.md |
System design, module decomposition, tech stack selection |
| Arch Reviewer | roles/architect_reviewer.md |
Design critique, scalability review, revision |
| Engineer | roles/engineer.md |
Code generation, PEP 8, type hints, error handling |
| Code Reviewer | roles/code_reviewer.md |
Code quality, security, performance, verdict |
| QA Planner | roles/qa_planner.md |
Acceptance criteria, test strategy, module scenarios, Given/When/Then |
| QA Engineer | roles/qa_engineer.md |
pytest writing, mocking, conftest, runnable tests |
| Deployment Tester | roles/deployment_tester.md |
Docker compose, health checks, smoke tests |
Edit the role file — no code changes needed:
# Add security skills to the Engineer
nano roles/engineer.md## Security Skills
- Never hardcode credentials — always use environment variables
- Validate and sanitise all user input before processing
- Use parameterised queries — never concatenate SQL strings
- Set secure cookie flags; prefer HTTPS-only endpoints
- Flag any third-party packages with known CVEs in a commentStep 1 — Create the role file:
cat > roles/security_reviewer.md << 'EOF'
# Security Reviewer Agent
## Role
You are **Sam**, a Security Reviewer specialising in OWASP Top 10 vulnerabilities.
## Skills
- OWASP Top 10 vulnerability detection
- Secrets / credential leak detection
- SQL injection and XSS pattern recognition
- Dependency audit (flag known-vulnerable packages)
## Output Format
### SECURITY VERDICT: [PASS | WARN | FAIL]
#### Findings
| Severity | File | Line | Issue | Recommendation |
...
End with: `SECURITY REVIEW COMPLETE`
EOFStep 2 — Create the agent class:
# agents/security_reviewer.py
from .base_agent import BaseAgent
class SecurityReviewerAgent(BaseAgent):
role_name = "security_reviewer" # maps to roles/security_reviewer.md
def run(self, files: dict[str, str], prd: str) -> dict:
truncated = self.truncate_files(files, max_chars=10_000)
code = "\n\n".join(
f"### {path}\n```\n{content}\n```"
for path, content in truncated.items()
)
response = self.call(f"Review this code for security issues:\n\n{code}")
verdict = "FAIL" if "FAIL" in response else "WARN" if "WARN" in response else "PASS"
return {"review": response, "verdict": verdict}Step 3 — Register & wire in:
agents/__init__.py— add import and__all__entryorchestrator.py— instantiate, add stage, add field toPipelineResultconfig.yaml— add toagents.overridesmain.py— add toagent_map
| Goal | Where |
|---|---|
| Change personality / tone | roles/*.md — Role section |
| Add a new skill or check | roles/*.md — Responsibilities section |
| Make a rule stricter | roles/*.md — Critical Rules section |
| Change output structure | roles/*.md — Output Format section |
| Use a smarter/cheaper model | config.yaml → llm.overrides.<agent> |
| Change LLM temperature | agents/base_agent.py → temperature=0.3 in call() |
Run the pipeline automatically on this machine — no GitHub Actions required.
watcher.py polls GitHub hourly, finds unprocessed issues, and dispatches pipelines in parallel.
Every hour (or on --once for GitHub Actions):
For each repo in repos.yaml
→ Find open issues with a mapped label (e.g. ai-feature, ai-fix, ai-docs)
(that don't already have an agent-* state label)
→ Look up the pipeline YAML for that label
→ Label issue agent-queued
→ Run the pipeline in a thread (bounded by max_parallel + per-repo parallel_issues)
→ On success: inspect result for chaining conditions (see below)
→ If tests failed → apply chaining label (e.g. ai-fix) instead of agent-complete
→ If code reviewer requested changes → apply chaining label
→ Otherwise → label agent-complete
→ On failure: label agent-failed + post error comment
State labels (auto-created in your repo):
| Label | Meaning |
|---|---|
agent-queued |
Picked up this run, pipeline starting |
agent-running |
Pipeline actively running |
agent-complete |
✅ Pipeline finished successfully — no follow-up needed |
agent-failed |
❌ Pipeline failed — remove label to retry |
The watcher also monitors open pull requests for update-branch directives. When a human (or bot) posts a PR comment containing the phrase update branch (or similar), the watcher automatically merges the base branch into the PR head branch.
Flow:
PR comment detected → "update branch"
→ watcher calls GitHub merge-base API
→ If 200/204 (clean merge): posts ✅ comment on PR
→ If 409 (conflict detected):
→ ConflictResolverAgent clones repo locally
→ git merge origin/<base> writes real conflict markers
→ LLM resolves each conflicting file using PR title/body as context
→ resolved branch committed + pushed
→ GitHub merge API retried (now succeeds)
→ posts ✅ resolved comment on PR
→ If resolution fails: posts ❌ comment listing unresolved files
Configure the conflict resolver model in repos.yaml:
watchers:
- tracker_repo: wanleung/my-app
conflict_resolver_model: "gpt-4o" # optional; falls back to senior_model then modelA stronger model (e.g.
gpt-4o,claude-3-opus) is recommended for conflict resolution — the agent must understand the PR's intent to resolve ambiguous conflicts correctly.
When a run completes with issues (failing tests, reviewer changes requested), the watcher automatically swaps the completion label for a follow-up trigger label — no human needed to re-queue.
Issue #42 (ai-feature)
→ tests fail
→ watcher adds ai-fix instead of agent-complete
→ posts comment explaining the chain
→ next watcher cycle picks up ai-fix → runs fix pipeline
→ tests pass → adds agent-complete ← done
Configure in config.yaml:
pipeline:
chaining:
on_test_failure: "ai-fix" # label to apply when tests fail
on_review_issues: "ai-fix" # label to apply when reviewer requests changes
# set to ~ (null) to disable a rule📖 See
docs/operations-guide.md§ 6 for full details, priority rules, and how to setnext_labelfrom a custom stage.
watchers:
- tracker_repo: wanleung/ai-software-house # where issues are filed
default_target: wanleung/my-app # default target repo for code
parallel_issues: 2 # max simultaneous issues for this repo
labels:
ai-feature: ai-feature # label → pipeline name (matches pipelines/ai-feature.yaml)
ai-fix: ai-fix
ai-docs: ai-docs
enabled: true
- tracker_repo: wanleung/another-project # watch a second repo
default_target: ~ # null = same repo as tracker
enabled: true
settings:
max_parallel: 3 # global cap across all repos
num_engineers: 2
model: "gpt-4.1"
log_dir: ./logs/watcherLegacy
feature_label/bug_label/doc_labelfields are still supported — they are automatically mapped toai-feature,ai-fix, andai-docspipelines respectively.
Use
**Target repo:** owner/repoin the issue body to route code to a different repo thandefault_target.
Each repo entry can declare its own llm: block that deep-merges on top of the global config.yaml LLM config. This lets different projects use different models, fallbacks, or concurrency limits — with no change to the global config.
# repos-available/my-ml-project.yaml
tracker_repo: wanleung/my-ml-project
labels:
ai-feature: ai-feature
llm:
model: "claude-3-5-sonnet-20241022" # use Claude for this repo only# repos-available/my-app.yaml
tracker_repo: wanleung/my-app
labels:
ai-feature: ai-feature
ai-fix: ai-fix
llm:
model: "openai/gpt-4.1" # default model for this repo
overrides:
architect: "openai/gpt-4.1" # strong model for design work
engineer: "openai/gpt-4.1-mini" # cheaper model for repetitive coding
qa_engineer: "ollama/qwen2.5-coder" # run QA locally
fallbacks:
- model: "openai/gpt-4.1-mini" # if primary fails, fall back here
- model: "ollama/llama3.2" # final fallback: local Ollama
pools:
openai: 3 # allow 3 concurrent calls to OpenAI for this repo
ollama: 1 # keep local Ollama at 1 (default)| Key | Behaviour |
|---|---|
model |
Repo value replaces global |
overrides |
Key-by-key merge — repo agent wins, others keep global |
pools |
Key-by-key merge — repo backend wins, others keep global |
fallbacks |
Repo list replaces global list entirely |
The global config.yaml is never mutated — each repo gets its own deep copy.
You can store each repo's config in a separate file and symlink or drop them into repos-available/:
repos-available/
my-app.yaml # enabled
my-ml-project.yaml # enabled
old-service.yaml # disabled (enabled: false)
repos.yaml auto-discovers all .yaml files in repos-available/ and merges them with any inline watchers: entries. The llm: block in each file is scoped to that repo.
Note: The
--onceCLI path and DLQ retries always use the global LLM config — per-repollm:only applies to watcher-dispatched pipelines.
Label an issue ai-docs to trigger a lightweight doc-update pipeline — no PM, Architect, or Engineers involved.
The built-in pipeline is defined in pipelines/ai-docs.yaml and runs two stages:
doc_generate— reads existing docs + source from the target repo, writes/updates documentation filesdoc_commit_pr— commits the files to a branchdoc/<issue-number>-<slug>and opens a PR referencing and closing the issue
Issue body format:
Update the README installation section and add a troubleshooting guide.
**Docs:** README.md, docs/troubleshooting.md
**Target repo:** owner/my-app
**Docs:**is optional — if omitted, the agent auto-discovers.mdfiles in the repo (up to 5)**Target repo:**is optional — if omitted, the watcher's repo is used
Label an issue pr-campaign to run a 3-stage content creation pipeline that produces a fully formatted campaign proposal and opens a GitHub PR for human review.
Built-in pipeline: pipelines/pr-campaign.yaml
| Stage | Agent | Output |
|---|---|---|
pr_analyst |
Alex — PR Analyst | Structured research: opportunity, audience, angle, channels, risks |
pr_creative |
Casey — PR Creative | 3–5 campaign concepts with platform tactics (LinkedIn, Instagram, TikTok, X) and ready-to-post social copy |
pr_proposal |
Jordan — PR Proposal | Polished Markdown proposal + PR metadata (title, body) |
The watcher monitors wanleung/pr-campaigns for issues labelled pr-campaign.
Create an issue there with your campaign brief and apply the label:
## Campaign Brief
**Product / Feature:** ai-dev-team v0.10.0 — per-repo LLM config
**Key message:** Every project can now pick its own AI model without touching global config.
**Target audience:** Indie hackers, small dev teams, local-LLM enthusiasts.
**Goal:** GitHub stars and community awareness.
**Preferred channels:** X, LinkedIn, Reddit (r/LocalLLaMA, r/MachineLearning)The pipeline runs automatically. When it finishes, a new PR appears on wanleung/pr-campaigns containing the full proposal document.
python main.py --pipeline pr-campaign \
"Launch campaign for v0.10.0: per-repo LLM config for indie devs." \
--repo wanleung/pr-campaignsOr from a brief file:
python main.py --pipeline pr-campaign --file brief.txt --repo wanleung/pr-campaignswatchers:
- tracker_repo: your-org/pr-campaigns
default_target: your-org/pr-campaigns
parallel_issues: 1
labels:
pr-campaign: pr-campaign
enabled: trueAny pipeline can include a brainstorm discussion stage that runs a moderated multi-agent debate before engineering work begins. Three personas — Analyst, Skeptic, and Optimist — independently think through the problem, then debate across up to 2 rounds, with the Moderator synthesising the outcome.
Preset file: discussions/brainstorm.yaml
Auto-registered stage name: discuss_brainstorm (all discussions/*.yaml files are auto-discovered)
# pipelines/my-feature.yaml
stages:
- pm
- pm_reviewer
- architect
- discuss_brainstorm # ← insert here, after design, before engineering
- reviewer
- junior_engineer
- senior_engineer
- validation_gate
- qa_planner
- qa_engineer| Phase | What happens |
|---|---|
| Homework | Each participant thinks independently and writes their initial analysis |
| Rounds 1–2 | Open discussion — participants can @mention each other to respond directly |
| Early exit | Stops before max rounds if moderator signals CONSENSUS_REACHED |
| Output | discussion_transcript + discussion_synthesis injected into all downstream stages |
Downstream agents (reviewer, engineers, QA) receive the full transcript and synthesis, so they understand the reasoning behind design choices without re-deriving them.
Create an issue with your feature or architecture brief and apply whatever label maps to your pipeline:
## Feature Brief
Add webhook support to the notification service so external systems can subscribe
to job completion events.
**Acceptance criteria:**
- POST /webhooks to register an endpoint
- Events fired on job status change (queued, running, done, failed)
- Retry on delivery failure (3 attempts, exponential backoff)The brainstorm stage will debate the approach before the architect and engineers write any code.
Drop a YAML file into discussions/ and it's automatically available as a stage:
# discussions/architecture-review.yaml
participants:
- role: backend_expert
persona_file: roles/senior_engineer.md
- role: security_reviewer
persona_file: roles/code_reviewer.md
homework_round: true
max_rounds: 3
early_exit: CONSENSUS_REACHED
moderator:
persona_file: roles/moderator.md
output_mode: both # transcript + synthesis passed downstream
context_fields:
- spec
- design
- issue_bodyUse it in any pipeline as discuss_architecture_review.
Participants are resolved from any role file in roles/, an inline persona string, or chosen automatically by the LLM. All three approaches can be mixed in the same preset.
1 — Fixed list (any existing agent role file):
participants:
- role: architect
persona_file: roles/architect.md # any roles/*.md file works
- role: code_reviewer
persona_file: roles/code_reviewer.md
- role: qa_engineer
persona_file: roles/qa_engineer.md2 — Auto-select from a pool (LLM picks the best fit for each issue):
auto_participants:
pool:
- architect
- senior_engineer
- code_reviewer
- qa_engineer
- security_reviewer
select: 3 # LLM reads the issue and picks the 3 most relevant rolesThe LLM sees the issue body and chooses who adds the most value — e.g. for a DB schema change it might pick architect, senior_engineer, qa_engineer and skip security_reviewer.
Fallback: if the LLM response contains no valid role names, participants is used instead.
3 — Inline persona (no role file needed):
participants:
- role: domain_expert
persona: |
You are a domain expert in financial regulations.
Focus on compliance risks and regulatory constraints.Per-participant model override — give a specific participant a different LLM:
participants:
- role: architect
persona_file: roles/architect.md
llm: "openai/gpt-4.1" # strong model for design reasoning
- role: junior_engineer
persona_file: roles/junior_engineer.md
llm: "ollama/qwen2.5-coder" # cheap local model for implementation perspectiveTwo-model split: fast discussion + slow homework with tool calling
Discussion agents do not call tools during debate rounds — they only reason. But the homework round is different: participants need to research the codebase and prior decisions before forming an opinion. To support this, each participant can use a separate, more capable model for homework that has access to the RAG tool_registry.
Use homework_llm to give each participant a different model for the research phase vs the debate phase:
participants:
- role: analyst
persona_file: roles/analyst.md
llm: "opencode-go/qwen3.6-plus" # fast — discussion rounds (no tools needed)
homework_llm: "opencode-go/qwen3.5-plus" # slow+tools — homework research round| Phase | Model used | Tools available |
|---|---|---|
| Homework | homework_llm (slow, capable) |
✅ search_codebase, search_memory, search_docs |
| Discussion rounds | llm (fast) |
❌ pure reasoning only |
- If
homework_llmis not set,llmis used for all phases (no tools) - If
homework_llmis set but the model doesn't support tool calling, it falls back to a plain call automatically - This pattern is useful when you have a fast model without tool support and a slower model with tool support in the same LLM pool
chmod +x setup_cron.sh
./setup_cron.shOr manually:
crontab -e
# Add this line (use the venv python directly — 'source activate' breaks in cron's /bin/sh):
0 * * * * cd /home/you/ai-software-house && venv/bin/python watcher.py >> logs/watcher/cron.log 2>&1# Dry run — shows what would run, makes no GitHub changes
python watcher.py --dry-run
# Run once immediately (same as GitHub Actions mode)
python watcher.py --once
# Keep running (polls in a loop — use for cron replacement)
python watcher.py
# Use a different config file
python watcher.py --config my-other-repos.yamllogs/watcher/
cron.log ← all cron runs (appended)
watcher-YYYYMMDD.log ← daily watcher log
issue-42-20260322-140000.log ← per-issue pipeline output
A lock file (.watcher.lock) is created at startup and removed on exit.
If a run is still active when the next cron fires, the new run exits immediately.
Stale locks (>1 hour old) are cleared automatically.
Start the standalone metrics server, then set metrics_url in watchers.yml:
# Start the metrics server (default port 9091)
METRICS_PORT=9091 python3 metrics_server.py# watchers.yml
settings:
metrics_url: http://localhost:9091 # watcher will POST events hereThe server exposes counters at GET /metrics for Prometheus scraping:
| Metric | Labels | Description |
|---|---|---|
aisw_circuit_breaker_events_total |
name, state |
Circuit-breaker state transitions |
aisw_dlq_events_total |
action, backend |
Dead-letter-queue operations |
aisw_degradation_events_total |
trigger |
Degradation policy activations |
Events are posted fire-and-forget from a daemon thread — the watcher is never blocked by a slow or unavailable metrics server.
The pipeline runs automatically when you label a GitHub Issue.
# 1. Add GH_TOKEN secret (classic PAT, NOT fine-grained, NOT GITHUB_TOKEN)
# Go to: Settings → Secrets → Actions → New repository secret
# Name: GH_TOKEN Value: ghp_your_classic_pat
# 2. Set up labels
gh workflow run setup-labels.yml
# 3. (Optional) Set target repo for cross-repo builds
# Add secret: TARGET_REPO = owner/target-repo-nameCreate a GitHub Issue, then add the ai-feature label:
Title: Patient questionnaire mobile app
## Description
Build iOS and Android apps for rectal cancer patient questionnaires.
**Target repo:** wanleung/my-mobile-app
## Acceptance Criteria
- Patient can complete a questionnaire offline
- Data syncs when connectivity is restored
- Clinician dashboard shows aggregated resultsThe
**Target repo:** owner/repoline routes the code to a different repository. Tracking issues (PRD, reviews) stay in theai-software-houserepo.
Create an issue with the ai-fix label:
Title: Login fails for users with special characters in email
Steps to reproduce:
1. Register with email: user+test@example.com
2. Attempt to login
3. Error: 500 Internal Server Error
Expected: Successful loginAfter humans post review comments on an AI-generated PR, trigger the revision pipeline:
Option A — GitHub UI:
Go to Actions → 🔄 AI PR Feedback Loop → Run workflow, enter the PR number and target repo.
Option B — CLI:
python main.py --mode revise --pr 42 --repo owner/target-repoOption C — API (repository_dispatch):
gh api repos/owner/ai-software-house/dispatches \
--method POST \
-f event_type=ai-pr-revise \
-f client_payload[pr_number]=42 \
-f client_payload[target_repo]=owner/target-repoThe pipeline reads all non-bot review comments, re-runs Engineer → Code Reviewer → QA, commits the updated code to the same branch, and posts a ✅ Revision N complete comment on the PR.
Maximum revision rounds is controlled by pipeline.max_revisions in config.yaml (default: 3).
issue-watcher.yml runs every 15 minutes and picks up any open issues with ai-feature or ai-fix labels that haven't been queued yet (lack the ai-queued label). This catches issues created programmatically with labels already attached, where the native label-trigger doesn't fire.
Deduplication: once an issue is picked up, the watcher adds the ai-queued label — preventing double-triggering by both the watcher and the native trigger.
| Workflow | Trigger | What it does |
|---|---|---|
feature-build.yml |
Issue labelled ai-feature |
Full feature pipeline (PM → QA) via watcher.py --once |
bug-fix.yml |
Issue labelled ai-fix |
Bug fix pipeline (Diagnose → Bug Fix) via watcher.py --once |
pr-feedback.yml |
Manual / repository_dispatch |
Engineer → Code Reviewer → QA revision loop |
issue-watcher.yml |
Cron every 15 min | Finds unqueued issues and triggers the above |
run-tests.yml |
PR opened/updated | Runs pytest + docker smoke tests |
setup-labels.yml |
Manual dispatch | Creates required labels |
ai-software-house/
├── main.py # CLI entry point — full pipeline + --pipeline / --list-pipelines
├── watcher.py # Hourly cron poller + GitHub Actions entry point (--once mode)
├── orchestrator.py # Full pipeline (13 stages)
├── github_client.py # GitHub API wrapper (Issues, PRs, commits)
├── memory_store.py # Tiered SQLite memory store (run/monthly/quarterly)
├── repo_context.py # RepoContextLoader (tree injection) + RepoAutoIndexer (RAG auto-index)
├── skills_loader.py # SkillLoader — detects + injects role-scoped skills per agent
├── watcher.py # Hourly cron poller — dispatches pipelines for new issues
├── repos.yaml # Repos to watch + parallel/model settings
├── setup_cron.sh # One-command cron job installer
├── config.yaml # LLM models, team size, pipeline settings
├── requirements.txt
│
├── agents/
│ ├── base_agent.py # BaseAgent: call(), call_with_tools(), retry, truncation
│ ├── product_manager.py # Alice — PRD writer
│ ├── pm_reviewer.py # Grace — PRD reviewer
│ ├── architect.py # Bob — system designer
│ ├── architect_reviewer.py # Frank — design reviewer
│ ├── engineer.py # Alex — code writer (parallel)
│ ├── code_reviewer.py # Carol — code reviewer [tools: run_linter]
│ ├── qa_planner.py # Henry — test planner [tools: search_github_issues]
│ ├── qa_engineer.py # Edward — test writer
│ ├── deployment_tester.py # Diana — deployment tester
│ ├── summariser.py # Writes compact memory entries after each run
│ ├── refactor_agent.py # Analyses and rewrites code in dream mode
│ └── memory_consolidator.py # Consolidates N run summaries into snapshots
│ └── conflict_resolver.py # ConflictResolverAgent — git-clone, 3-way merge, LLM resolution
│
├── roles/ # Agent skills & guides (system prompts)
│ ├── product_manager.md
│ ├── pm_reviewer.md
│ ├── architect.md
│ ├── architect_reviewer.md
│ ├── engineer.md
│ ├── code_reviewer.md
│ ├── qa_planner.md
│ ├── qa_engineer.md
│ ├── deployment_tester.md
│ ├── summariser.md
│ ├── refactor_agent.md
│ └── memory_consolidator.md
│ └── conflict_resolver.md # System prompt for ConflictResolverAgent
│
├── tools/ # Tool calling — Option A (MCP-ready)
│ ├── registry.py # ToolRegistry ABC + LocalToolRegistry (@tool decorator)
│ ├── builtin.py # Built-in tools: run_linter, run_shell_command,
│ │ # search_github_issues, get_github_file
│ └── __init__.py
│
├── .github/workflows/
│ ├── feature-build.yml # Auto-trigger on 'feature-request' label
│ ├── bug-fix.yml # Auto-trigger on 'bug' label
│ ├── run-tests.yml # Run pytest + docker on PRs
│ └── setup-labels.yml # Create required issue labels
│
└── workspace/ # Generated code written here locally
└── <project-name>/
├── checkpoint.json # Resume state
├── memory.db # SQLite memory store (run/monthly/quarterly tiers)
├── memory.md # Human-readable memory log
├── src/ # Generated source files
└── tests/ # Generated test files
Skills are markdown files in skills/ that inject domain-specific guidance into agent prompts at runtime. Each skill targets specific roles (e.g. Architect gets architecture guidance, Engineer gets implementation rules) and is auto-detected from context or always-loaded from config.
- Detection — At the start of each run,
SkillLoaderscans the issue body and repo languages for tag matches (e.g. issue mentions "flutter" →skills/flutter.mdis loaded) - Role scoping — Each skill file has a section per role:
## For Engineers,## For Architects, etc. Only the relevant section is injected per agent - Injection — Matched skill blocks are prepended to each agent's system prompt as a
## Skills Loadedblock
Tech-stack skills — auto-detected from repo languages and issue keywords:
| Skill | File | Auto-detects on |
|---|---|---|
| Flutter | skills/flutter.md |
flutter, dart, mobile, riverpod, drift |
| FastAPI | skills/fastapi.md |
fastapi, python, api, pydantic, sqlalchemy |
| React | skills/react.md |
react, typescript, frontend, nextjs, vite |
| Security Audit | skills/security-audit.md |
security, auth, jwt, oauth |
| Docker | skills/docker.md |
docker, container, kubernetes, helm |
Process skills — distilled engineering best-practices; auto-detected or always-loaded:
| Skill | File | Roles | Auto-detects on |
|---|---|---|---|
| TDD | skills/tdd.md |
Engineer, Code Reviewer, QA Engineer | tdd, testing, pytest, jest |
| Debugging | skills/debugging.md |
Engineer, QA Engineer | debugging, bug-fix, triage |
| API Design | skills/api-design.md |
Architect, Engineer, Code Reviewer, Arch Reviewer | api, rest, interface, contract |
| Incremental Implementation | skills/incremental-implementation.md |
Engineer, Code Reviewer | implementation, slicing, incremental |
| Code Review Quality | skills/code-review-quality.md |
Code Reviewer, Arch Reviewer | code-review, quality |
| Source-Driven Dev | skills/source-driven.md |
Architect, Engineer, Code Reviewer, Arch Reviewer | documentation, frameworks, sources |
| Architecture Decision Records | skills/adrs.md |
Architect, Arch Reviewer | adr, architecture, decisions |
---
name: my-skill
description: Brief description
version: 1.0.0
roles:
architect: true
engineer: true
code_reviewer: true
qa_engineer: true
product_manager: false
architect_reviewer: false
pm_reviewer: false
tags: [my-tag, another-tag] # matched against issue body + repo languages
source: local
---
# My Skill
## For Architects
Architecture-level guidance here.
## For Engineers
Implementation rules here.
## For Code Reviewers
What to look for in code reviews.
## For QA Engineers
What to test.Save it as skills/my-skill.md. It will be auto-detected whenever my-tag appears in the issue body.
skills:
# Always load these skills regardless of project context
always_load: [security-audit]
# Remote marketplace repo (leave empty to use local only)
marketplace_repo: ""
# Marketplace cache dir (defaults to ~/.ai-software-house/skills/)
cache_dir: ""
fetch_timeout: 5Add a line in the GitHub issue:
skills: docker, security-audit
python main.py --update-skillsAgents can call tools during their reasoning — not just produce text. The tool-call loop runs automatically inside BaseAgent.call_with_tools().
Agent prompt
↓
LLM decides to call a tool → tool executes → result appended to messages
↓ (repeat until no more tool calls)
Final text response
| Tool | Used by | What it does |
|---|---|---|
run_linter |
Code Reviewer | Runs ruff on Python files — concrete lint errors in the review |
run_shell_command |
Any agent | Runs a safe shell command (pytest, syntax check, etc.) |
search_github_issues |
QA Planner | Searches GitHub issues for existing ACs / related bugs |
get_github_file |
Any agent | Reads a file from a GitHub repo at runtime |
from tools import LocalToolRegistry
my_tools = LocalToolRegistry()
@my_tools.tool(
name="check_dependencies",
description="Check if a Python package exists on PyPI",
parameters={
"type": "object",
"properties": {
"package": {"type": "string", "description": "Package name"},
},
"required": ["package"],
},
)
def check_dependencies(package: str) -> str:
import requests
r = requests.get(f"https://pypi.org/pypi/{package}/json", timeout=5)
return f"Found: {r.json()['info']['version']}" if r.ok else "Not found"
# Use in any agent
response = agent.call_with_tools("Check if fastapi exists", tools=my_tools)Configure MCP servers in config.yaml under the mcp.servers key. Tools from all configured servers are automatically merged with the built-in tools and passed to the Code Reviewer and QA Planner agents.
mcp:
servers:
- name: github
type: stdio
command: npx
args: ["-y", "@modelcontextprotocol/server-github"]
env:
GITHUB_TOKEN: "${GITHUB_TOKEN}" # expanded from env at runtime
- name: my-search
type: sse
url: "https://mcp.example.com/sse"
headers:
Authorization: "Bearer ${MCP_API_KEY}"Server types:
| Type | Key fields | Notes |
|---|---|---|
stdio |
command, args, env |
Spawns a local subprocess (e.g. npx, python) |
sse |
url, headers |
Connects to a remote HTTP/SSE endpoint |
${VAR} expansion — any value in env or headers can reference an environment variable as ${MY_VAR}. Unknown variables are left unexpanded.
Name collisions — if two servers expose a tool with the same name, the second is prefixed: servername__toolname.
Install: pip install mcp (or add mcp>=1.0.0 to requirements.txt — already included).
⚠️ MCP tool-calling requires a tool-calling-capable backend. TheopencodeCLI backend does not support tool calls. Usegithub_models,anthropic,opencode-zen/(non-Claude), oropencode-go/(non-MiniMax) backends.
Before the Engineer stage runs, the pipeline gives agents awareness of the existing codebase so they build on what's already there rather than re-inventing it.
| Repo size | Strategy | What agents see |
|---|---|---|
| Small (RAG not configured) | Tree injection | Full git ls-tree file listing injected into PM, Architect, PM Reviewer, and Arch Reviewer prompts |
| Large (RAG enabled) | Auto-index | RepoAutoIndexer downloads the repo zip, indexes it into the RAG codebase collection, then Engineer/QA agents query it via search_codebase |
RepoContextLoader fetches the repository file tree from GitHub and injects it as a fenced block into the four planning-stage agents. The tree is idempotent — re-running the pipeline will not inject it twice.
RepoAutoIndexer runs a _stage_repo_index step immediately before the Engineer stage:
- If a local
repo_dirpath is configured → uses it directly - Otherwise → downloads the GitHub repo zip → extracts to a temp dir
- Runs
rag-mcp/indexer.py --source codebase --path <extracted> --clean - The index stage is checkpoint-guarded — it will not re-index on pipeline resume
This stage is skipped entirely when rag_registry is not configured in config.yaml.
The RAG MCP server gives Engineer, Architect, and QA Engineer agents the ability to search an indexed pgvector knowledge base before generating code, designs, or tests. This improves consistency with existing patterns and surfaces relevant documentation at the right moment.
User requirement
↓
Architect/Engineer/QA agent receives task
↓
Agent calls search_memory / search_codebase / search_docs → relevant chunks returned
↓
Agent incorporates retrieved context into its response
Prerequisites: PostgreSQL with the pgvector extension, and one of: Ollama (local), vLLM, or an OpenAI-compatible embedding endpoint.
Step 1 — Start the RAG server
cd rag-mcp
# Copy and edit the env file (set your Ollama host at minimum)
cp .env.example .env
# edit .env
docker compose up -dThe bundled pgvector/pgvector:pg16 container starts automatically. Data is stored in a named Docker volume (pgdata) so it survives restarts and docker compose down.
Step 2 — Apply the migration
Runs once on first start (or after wiping the volume):
docker compose exec postgres psql -U rag rag \
-f /dev/stdin < migrations/001_create_rag_embeddings.sqlStep 3 — Index your codebase
source venv/bin/activate
cd rag-mcp
# Index the codebase (Python files)
DATABASE_URL=postgresql://rag:ragpassword@localhost:5432/rag \
EMBED_BACKEND=ollama OLLAMA_BASE_URL=http://your-ollama:11434 OLLAMA_MODEL=nomic-embed-text \
python indexer.py codebase /path/to/your/repo --clean
# Index docs (markdown / text files)
python indexer.py docs /path/to/docs/
# Index agent memory files
python indexer.py memory /path/to/memory/Step 4 — Enable in config.yaml
Uncomment the RAG entry in the mcp.servers section:
mcp:
servers:
- name: rag
type: http
url: "http://localhost:8001/mcp"That's it — Engineer, Architect, and QA Engineer will automatically use RAG search tools when responding.
| Backend | EMBED_BACKEND |
Required env vars | Notes |
|---|---|---|---|
| Ollama (local) | ollama |
OLLAMA_BASE_URL, OLLAMA_MODEL |
Recommended for local setup; use nomic-embed-text |
| vLLM | vllm |
VLLM_BASE_URL, VLLM_MODEL |
Faster than Ollama; same API shape |
| OpenAI | openai |
OPENAI_API_KEY, OPENAI_EMBED_MODEL |
Requires internet; defaults to text-embedding-3-small |
| Tool | Used by | Searches |
|---|---|---|
search_codebase |
Engineer, QA Engineer | Source code chunks — finds existing implementations, patterns |
search_memory |
Architect | Past designs, summaries — avoids repeating past decisions |
search_docs |
Engineer, Architect | Documentation, markdown files |
search_standards |
Architect | Coding standards, design patterns, guidelines |
Re-run indexer.py any time your codebase changes. Use --clean to remove stale embeddings for deleted files:
python indexer.py codebase /your/repo --clean
# Index coding standards / architectural guidelines
python indexer.py --source standards --path ./standards/ --cleanUse docs for all library and API documentation — semantic search naturally surfaces the right language/library from the query, so you don't need a separate source type per language:
# All lib docs go into the same 'docs' index
python indexer.py docs /docs/cpp-stl/
python indexer.py docs /docs/python-stdlib/
python indexer.py docs /docs/react-native/
python indexer.py docs /docs/rust-std/
python indexer.py docs /docs/java-sdk/Use standards for specifications, protocols, and architectural guidelines — content agents reference when making design decisions rather than implementation lookups:
# Specs and standards
python indexer.py standards /docs/rfcs/
python indexer.py standards /docs/ercs/
python indexer.py standards /docs/company-guidelines/💡 Rule of thumb: only create a new source type if you need a distinct MCP tool so agents can explicitly choose to search that collection (e.g. a
search_api_specstool for a dedicated API-spec agent). For general reference material,docsorstandardscovers the vast majority of use cases.
curl http://localhost:8001/health
# {"status": "ok"}The postgres data lives in a named Docker volume (pgdata). To move to a new machine:
# On old machine — dump
docker compose exec postgres pg_dump -U rag rag > rag_backup.sql
# On new machine — start fresh stack then restore
docker compose up -d
cat rag_backup.sql | docker compose exec -T postgres psql -U rag rag💡 Alternatively, just re-run
indexer.pyon the new machine — embeddings are deterministic for the same model, so re-indexing is often simpler than migrating.
⚠️ RAG tool calls usecall_with_tools()internally. TheopencodeCLI backend does not support tool calls — RAG will silently fall back to non-RAG mode for agents using that backend.
A source type is a named collection in the knowledge base (e.g. codebase, docs, standards). Each source type has:
- An indexer function in
rag-mcp/indexer.pythat chunks and stores content - A search tool in
rag-mcp/main.pythat agents call at runtime
Copy the pattern from index_standards() or index_docs(). Change source_type= to your new name:
def index_mytype(path: str, embedder: Embedder, extensions=None, clean=False):
exts = {f".{e.lstrip('.')}" for e in (extensions or ["md", "txt"])}
root = Path(path)
live_ids = []
for fpath in sorted(root.rglob("*")):
if fpath.suffix not in exts or not fpath.is_file():
continue
source_id = str(fpath)
live_ids.append(source_id)
try:
text = fpath.read_text(errors="replace")
except OSError as exc:
log.warning("Skipping %s: %s", fpath, exc)
live_ids.pop()
continue
for i, chunk in enumerate(chunk_text(text)):
try:
embedding = embedder.embed(chunk)
except EmbedderError as exc:
log.warning("Skipping %s chunk %d: %s", source_id, i, exc)
continue
upsert_chunk(
source_type="mytype", # ← your new name
source_id=source_id,
chunk_index=i,
content=chunk,
embedding=embedding,
metadata={"path": source_id},
)Then add "mytype" to --source choices in main() and call your function.
@mcp.tool()
async def search_mytype(query: str, top_k: int = _TOP_K) -> str:
"""Search <describe what this collection contains and when to use it>.
The description is what the agent reads to decide whether to call this tool.
"""
try:
top_k = max(1, min(top_k, _MAX_TOP_K))
embedding = await asyncio.to_thread(_embedder.embed, query)
results = await asyncio.to_thread(search_chunks, "mytype", embedding, top_k)
return json.dumps({"results": [r.model_dump() for r in results]})
except EmbedderError as exc:
return json.dumps({"error": str(exc), "results": []})
except Exception as exc:
return json.dumps({"error": str(exc), "results": []})Agents automatically see all tools on the RAG MCP server. Control access by which agents receive the tool_registry:
- All RAG-enabled agents (Engineer, Architect, QA Engineer) get it automatically if the RAG server is enabled in
config.yaml - To restrict to specific agents only, create a separate MCP server entry in
config.yamlpointing to a second RAG instance indexed with only that source type
cd rag-mcp
DATABASE_URL=... EMBED_BACKEND=ollama OLLAMA_BASE_URL=... OLLAMA_MODEL=nomic-embed-text \
python indexer.py --source mytype --path /path/to/content --cleancd rag-mcp && docker compose up -d --buildThe new search_mytype tool is immediately available to all RAG-enabled agents.
This project uses the same AI backend as GitHub Copilot CLI:
| GitHub Copilot CLI | AI Software House | |
|---|---|---|
| AI Model | GitHub Models API | GitHub Models API |
| Authentication | Classic PAT (ghp_…) |
Classic PAT (ghp_…) |
| API Endpoint | models.inference.ai.azure.com |
models.inference.ai.azure.com |
| Usage | Interactive terminal assistant | Automated multi-agent pipeline |
| Token scope | copilot |
repo (classic PAT) |
This project demonstrates how GitHub's infrastructure — Models API, Issues, Pull Requests, Actions — can be wired together into a fully automated software development team. Each agent is a thin Python wrapper around a single LLM call; the orchestrator handles sequencing, checkpointing, and GitHub integration.
The role files (roles/*.md) are the heart of the system. They encode domain knowledge, output contracts, and quality rules — making it easy to specialise, tune, or extend any agent without touching Python code.
The engineer agent automatically receives framework-specific documentation in its prompt based on the project it's working on.
-
AGENTS.md / CLAUDE.md detection: Before writing code, the engineer walks up the project directory tree looking for
AGENTS.md(preferred) orCLAUDE.md. If found, its content is prepended to the prompt. -
Framework detection: The orchestrator checks
config.yaml'sframework_docs.frameworkslist. Each entry defines glob patterns to detect the framework. If any pattern matches a file in the project directory, the framework's summary is included. -
Bundled docs: For frameworks that ship bundled docs (e.g., Next.js ships docs in
node_modules/next/dist/docs/), those are also read and included (up to a character cap).
| Framework | Detection | Notes |
|---|---|---|
| Next.js | package.json, next.config.* |
Also reads bundled docs from node_modules/next/dist/docs/ |
| Nuxt 3 | nuxt.config.* |
|
| React Native | app.json, metro.config.* |
|
| Flutter | pubspec.yaml |
|
| FastAPI | requirements*.txt, pyproject.toml |
|
| Django | manage.py |
Add an entry to framework_docs.frameworks in config.yaml:
framework_docs:
frameworks:
- name: my-framework
detect:
- "my-framework.config.*"
summary: |
Key conventions for my-framework...
bundled_docs_path: "node_modules/my-framework/docs" # optionalTo disable framework doc injection entirely, remove the framework_docs key from config.yaml, or set:
framework_docs:
check_agents_md: false
frameworks: []Each repo can independently choose how deployment smoke tests run by adding a deploy: block to its repos-available/*.yaml file:
# repos-available/my-repo.yaml
# Local docker-compose smoke tests (default if deploy: block is absent)
deploy:
mode: docker
compose_file: docker-compose.test.yml # optional, default shown
timeout_s: 300 # optional, default shown
# Remote VM via libvirt (SSH + virt-install + CoW overlay)
deploy:
mode: libvirt
virt_host: ubuntu@192.168.1.10 # required: SSH address of libvirt host
base_image: /var/lib/libvirt/images/ubuntu-24.04.qcow2 # required: read-only base image
vm_user: ubuntu # default: ubuntu
ssh_key: ~/.ssh/id_ed25519 # default: SSH agent
vcpus: 2 # default: 2
ram_mb: 2048 # default: 2048
teardown: always # always | on_pass | keep (default: always)
timeout_s: 600 # default: 600
# Skip deployment testing entirely
deploy:
mode: nonemode: libvirt provisions a fresh VM from a CoW overlay of base_image (the base image is never modified), rsyncs the project into /opt/app/, runs tests/test_deployment.py via SSH ProxyJump through virt_host, then tears down based on teardown. Multiple repos safely share the same base_image — each run gets its own isolated overlay.
Teardown modes:
always— destroy VM after every run (default, safest)on_pass— keep VM alive when tests fail (useful for debugging via SSH)keep— never destroy (manual cleanup required)
aisw_server.py exposes the pipeline as a REST API and MCP server, so Copilot CLI, Claude Code, OpenCode, web UIs, and curl can trigger and monitor pipelines without touching GitHub labels.
# Start the server
python aisw_server.py
# Submit a requirement via curl
curl -X POST http://localhost:8765/runs \
-H "X-API-Key: your-key" \
-H "Content-Type: application/json" \
-d '{"requirement": "Build a bookmark manager REST API", "repo": "me/my-repo"}'
# Stream live logs
curl -N http://localhost:8765/runs/{run_id}/stream -H "X-API-Key: your-key"Connect from MCP tools (Copilot CLI, Claude Code, OpenCode):
# ~/.copilot/config.yaml (Copilot CLI)
mcp_servers:
- name: ai-software-house
url: http://localhost:8765/mcp
headers:
X-API-Key: "your-key"Configure in aisw_server.yaml. See docs/superpowers/specs/2026-05-14-integration-layer-design.md for the full API reference.
When agents write broken code — calling methods that don't exist, wiping config files, using wrong YAML formats — the root cause is almost always structural: agents fill gaps with plausible guesses from other codebases.
The accuracy system provides four layers of defence:
| Layer | Name | What it does |
|---|---|---|
| 1 | Prevention | Auto-injects real API signatures into role files; attaches relevant source files to engineer prompts; wires RAG for all agents |
| 2 | Detection | validation_gate stage: syntax → lint → tests before every PR; re-prompts engineer with exact error on failure (max 2 retries) |
| 3 | Learning | LearningAgent writes a "DO NOT" rule to the failing role file; the failure becomes permanent system prompt context |
| 4 | Bootstrap | BootstrapPatternsAgent reads a new repo's codebase and generates .github/copilot-instructions.md with Layer 1 cheatsheets from day zero |
Layer 1 alone prevents ~57% of structural bugs. Layers 2+3 catch and remember the rest. Layer 4 means new repos start protected.
The validation_gate is wired into all code-producing pipelines (ai-feature, ai-fix, tdd, ai-smart-fix). To add it to a custom pipeline:
stages:
- pm
- architect
- senior_engineer
- validation_gate # ← add this
- qa_engineerTo bootstrap a new repo:
python main.py --bootstrap --repo owner/new-repoGNU General Public License v3.0 or later (GPL-3.0-or-later). See LICENSE for details.
For operational topics not covered above, see docs/operations-guide.md:
| Section | Topic |
|---|---|
| §1 | Connecting to local Ollama (localhost, LAN, per-agent override) |
| §2 | Multi-Ollama pool: smart machine + low-end cluster |
| §3 | LiteLLM proxy for per-host concurrency control |
| §4 | RAG MCP — migration to a new machine or full rebuild |
| §5 | Reading GitHub issues, PRs, and comments (3 methods) |
| §6 | Pipeline self-chaining — auto re-label for follow-up runs |
| §7 | Per-repo deploy backends — docker, libvirt VM, or none |
| §8 | Agent Accuracy System — validation gate, LearningAgent, BootstrapPatternsAgent |
| Quick Start | Four scenarios: MVP, bug fix, new features, existing repo onboarding |