capabilities for Claude Code, GitHub Copilot, Gemini CLI, and any compliant agent framework.
Recent milestones: v1.3 — Hardened SQLite control plane (May 2026) · v1.4 — MAF synthesis & hybrid runtime strategy (May 31, 2026) · v1.5 — CLI Agents major update (June 2026)
Replaced fragile markdown-based state with a transactional SQLite control plane (state_engine.py), added strong process sandboxing (sandbox_runner.py), HMAC-signed envelopes, approval gating, and WAL concurrency safety. Implementation is stdlib-only (sqlite3, hmac, hashlib, subprocess, os, secrets) — no framework dependencies. This made the custom Python kernel production-grade and laid the foundation for the v1.4 hybrid strategy.
After extensive MAF research and 12 hands-on C# experiments (including full loading of real exploration-cycle-plugin manifests), we pivoted from "do not adopt MAF" to a hybrid architecture:
Manifest-first. Multiple certified runtime adapters second.
Key outcomes:
- Kept the hardened Python control plane as the authoritative kernel
- Adopted AGT (Agent Governance Toolkit) for deterministic policy enforcement
- Ported 4 high-value patterns from MAF: alias resolution, standardized handoff envelopes, per-agent skill scoping, per-phase premium call budgets
- MAF is now a certified optional runtime adapter alongside Claude Code, Copilot CLI, and Gemini CLI (ADR-007)
- All
.mdagent manifests andSKILL.mdfiles remain fully portable
This hybrid approach gives us the best of both worlds: battle-tested custom safety primitives + selective leverage of Microsoft's well-engineered patterns.
References: ADR-001 · ADR-002 · ADR-007
cli-agents plugin promoted from a basic CLI dispatcher to a full multi-LLM task routing suite with adversarial agent pattern support.
Key outcomes:
run_agent.pytask router: 6 backends, argparse v2,--isolatedsecurity contract, codex stdin pattern. 76 TDD tests across 3 files.- ~2s wall clock for
--cli llamadirect HTTP to llama-server (measured: 1.977s). 20–30x faster than Mode A proxy path. - 11 expert agent personas with structured analytical frameworks: OWASP, C4, SOLID, Big-O, TOGAF-level depth. Adversarial pattern family: red-team-reviewer, debate-synthesizer, output-validator, self-critic.
local-llm-setupskill with scripts/ symlinks: Day 1 bootstrap for macOS Metal / Windows CUDA/Vulkan / Linux CUDA/ROCm.- KV Cache Orchestrator (P0 collision fix):
_extract_cache_key()returnsNonefor system-prompt-free requests. 8 new proxy tests. - Plugin manifests (
plugin.yaml,plugin.json,marketplace.json) fully corrected and aligned.
A strictly cross-platform (Windows, Mac, Ubuntu) library — the universal upstream source for reusable AI agent plugins and skills across multiple IDEs and agent frameworks: Claude Code, GitHub Copilot, Gemini CLI, Antigravity, Roo Code, Windsurf, Cursor, and other compliant integrations.
All plugins deploy to the single .agents/ folder standard — no duplicate copies needed for .github, .gemini, .agent, etc.
Important
Start here — fresh clone or first-time setup. The single .agents/ environment directory is not committed to your repo. It will be empty by default.
All installation methods (uvx, bootstrap.py, npx skills, and Marketplace / Extension CLI) are now consolidated in a single authoritative guide:
Quick install (all plugins):
uvx --from git+https://github.com/richfrem/agent-plugins-skills plugin-add richfrem/agent-plugins-skillsv1.4 note: If upgrading from v1.3, run
uv sync(orpip install -r requirements.txt) after pulling latest — the per-phase budget enforcement and AGT governance patterns add new dependencies toexploration-cycle-plugin.
This repository is built on a pragmatic acceptance of the current AI engineering landscape: the ecosystem changes weekly, and workflows that were revolutionary six months ago are obsolete today.
Frameworks like agent-agentic-os and spec-kitty are treated as Transitional Architectures — bridges between what agents need to do today and what native SDKs will eventually handle. When Anthropic, Google, and GitHub harden native memory persistence, execution safety, and multi-agent orchestration, large swaths of this tooling will be happily discarded.
The MAF research (May 2026) reinforced this view: instead of choosing between a custom kernel and a framework, we now deliberately pursue a hybrid model:
- Portable
.mdmanifests andSKILL.mdfiles remain the source of truth across all runtimes - Multiple runtime adapters (Claude Code, Copilot CLI, Gemini CLI, MAF) are supported side-by-side
- Strong custom control plane for safety and governance that no hosted framework currently matches
- Selective adoption of excellent patterns from frontier frameworks (e.g. MAF's typed handoffs and AGT governance)
Skills are Applications; the SDK is the OS. Individual skills must function in complete isolation — no hard dependencies on sibling plugins, no assumptions about which framework is running.
The OS implements an eval-gated improvement pipeline for autonomous skill evolution:
os-architect ← intent classifier + ecosystem router
↓
os-improvement-loop ← learning engine: orchestrates multi-iteration improvement
↓
os-eval-runner ← inner gate: KEEP/DISCARD per iteration (evaluate.py)
↓
os-eval-backport ← human gate: review before lab winner → production
↓
os-experiment-log ← scientific backbone: longitudinal tracking + synthesis
Entry point: /os-architect — describe what you want in plain language. The agent classifies intent, audits the ecosystem, proposes Path A/B/C, and dispatches via your available CLI tools. os-evolution-planner writes the task plan + delegation prompt. os-architect-tester validates after any changes.
Skills that score HIGH on the autoresearch viability rubric (objectivity + speed + frequency + utility) can run fully autonomous self-improvement loops:
mutate SKILL.md → evaluate.py → exit 0 (KEEP) or exit 1 (DISCARD) → repeat
Not all skills are good candidates — use eval-autoresearch-fit to score a skill before running a loop.
Live example — convert-mermaid skill, 26 iterations across 2 rounds: 0.61 → 1.00
Each blue diamond is a baseline anchor (one per session). Green = new best score. Amber = kept but not a record. The two-segment shape shows a fresh re-baseline for round 2.
Monitor a live run: python plugins/agent-agentic-os/scripts/plot_eval_progress.py --tsv <lab>/evals/ --live
Flywheel layers:
- OUTER flywheel (
os-improvement-loop): improves OS-level protocols and session ledgers between sessions - INNER flywheel (
os-eval-runner): evaluate.py KEEP/DISCARD gate per iteration within a session
5 composable primitives used as the execution substrate by the Improvement OS and standalone by any agent workflow:
learning-loop · dual-loop · agent-swarm · red-team-review · triple-loop-learning
O(1) RLM keyword → O(log N) vector semantic → wiki concept nodes.
Super-RAG stack: rlm-factory (O(1) keyword) + vector-db (O(log N) semantic) + obsidian-wiki-engine (full concept nodes)
Each plugin works standalone (Mode A) or combined for full Super-RAG power. Init agents detect what is installed in .agents/skills/ and configure only the available layers.
All shared scripts live once at plugins/<plugin>/scripts/. Skills reference them via file-level symlinks (skills/<skill>/scripts/script.py → ../../../scripts/script.py). Directory-level symlinks are forbidden — npx drops them on install.
The flagship operational framework. Eval-gated improvement loops, memory management, session lifecycle, and ecosystem evolution orchestration.
Skills (17): os-architect · os-evolution-planner · os-guide · os-improvement-loop · os-eval-lab-setup · os-eval-runner · os-eval-backport · os-environment-probe · os-evolution-verifier · os-experiment-log · os-memory-manager · os-improvement-report · os-init · os-clean-locks · todo-check · optimize-agent-instructions · self-evolution
Agents (5): os-architect-agent · os-architect-tester-agent · improvement-intake-agent · os-health-check · agentic-os-setup
Enterprise-grade Spec → Plan → Tasks → Implement → Review → Merge pipeline.
Skills (19): spec-kitty-specify · spec-kitty-plan · spec-kitty-tasks · spec-kitty-implement · spec-kitty-review · spec-kitty-merge · spec-kitty-analyze · spec-kitty-accept · spec-kitty-clarify · spec-kitty-research · spec-kitty-dashboard · spec-kitty-status · spec-kitty-checklist · spec-kitty-constitution · spec-kitty-tasks-outline · spec-kitty-tasks-finalize · spec-kitty-tasks-packages · spec-kitty-workflow · spec-kitty-sync-plugin
Agents: spec-kitty-agent · spec-kitty-setup
Autonomous discovery loop: idea framing → business requirements → user stories → prototype → handoff into formal engineering specs.
Skills (19): exploration-workflow · exploration-session-brief · discovery-planning · business-requirements-capture · business-workflow-doc · user-story-capture · exploration-handoff · exploration-optimizer · prototype-builder · visual-companion · subagent-driven-prototyping · vibe-browser-audit · vibe-behavioral-test-capture · vibe-domain-extractor · vibe-slice-migrator · vibe-reengineer · vibe-spec-packager · vibe-togaf-architect · vibe-to-speckit-superpowers
Agents (17): business-rule-audit-agent · certification-verifier · discovery-planning-agent · domain-purity-auditor · exploration-cycle-orchestrator-agent · handoff-preparer-agent · intake-agent · planning-doc-agent · problem-framing-agent · prototype-builder-agent · prototype-companion-agent · requirements-doc-agent · requirements-scribe-agent · runtime-observer-agent · semantic-drift-auditor · vibe-orchestrator-agent · subagent-driven-prototyping-agent
5 execution primitives used as the substrate for the Improvement OS and standalone agent workflows.
Skills (6): orchestrator · learning-loop · dual-loop · agent-swarm · red-team-review · triple-loop-learning
Agents: orchestrator
Interactive creators for exact file hierarchies + structured audit framework for plugin architectural maturity.
Scaffolding skills: create-plugin · create-skill · create-sub-agent · create-command · create-hook · create-github-action · create-agentic-workflow · create-azure-agent · create-docker-skill · create-mcp-integration · create-stateful-skill
Audit & analysis skills: audit-plugin · audit-plugin-l5 · l5-red-team-auditor · analyze-plugin · self-audit · mine-skill · mine-plugins · path-reference-auditor · fix-plugin-paths · synthesize-learnings · eval-autoresearch-fit · manage-marketplace · ecosystem-standards · ecosystem-authoritative-sources
run_agent.py dispatches bounded tasks to 6 backends. Measured: ~2s wall clock for --cli llama (direct HTTP to llama-server, no proxy, no 29K system prompt overhead).
Skills (12):
local-llm-bridge—--cli llama: direct Gemma 4 12B, ~2s, no proxylocal-llm-setup— cross-platform setup wizard; scripts/ symlinks for Day 1 bootstrap + Mode B configcodex-cli-agent—--cli codex: Codex/OpenAI-compatible, prompt piped via stdinagy-cli-agent—--cli agy: Antigravity CLI, frontier Gemini modelsclaude-cli-agent—--cli claude: Claude CLI, Haiku 4.5 defaultcopilot-cli-agent—--cli copilot: GitHub Copilot CLI, gpt-5-mini⚠️ AI Credits June 2026gemini-cli-agent—--cli gemini: Gemini CLI, gemini-3-flash-previewclaude-project-setup·antigravity-project-setup·project-setup·maf-adapter·agt-security
11 Expert Agent Personas (flat agents/ directory, shared across all backends):
| Persona | Role | Pattern Family |
|---|---|---|
refactor-expert |
Code quality — SOLID/DRY smell taxonomy | Code Review |
security-auditor |
OWASP vulnerability audit | Code Review |
architect-review |
C4/SOLID structural review, layer violations | Code Review |
compliance-reviewer |
Coding standards drift detection | Code Review |
pr-reviewer |
Diff review — ship/hold decision | Code Review |
test-writer |
Unit test generation — all path types | Code Review |
performance-analyst |
Bottleneck analysis — Big-O, I/O amplification | Code Review |
red-team-reviewer |
Adversarial exploit analysis, attack surface | Adversarial |
debate-synthesizer |
Dialectical synthesis, conflict resolution | Adversarial |
output-validator |
Output guardrail — hallucination/schema/policy | Adversarial |
self-critic |
Reflection loop — task-fit, completeness check | Adversarial |
KV Cache Orchestrator: kv_cache_orchestrator.py — SHA-256 keyed slot save/restore, 4 GiB budget, 31 TDD tests. Proxy integration wired. Eviction scoring inspired by antirez/ds4.
What changed in v2.0.0 (June 2026):
- 12 duplicate agent files (3 personas × 4 backends) → 11 deep flat personas with OWASP/C4/SOLID analytical frameworks
- Added adversarial pattern family: red-team-reviewer, debate-synthesizer, output-validator, self-critic
run_agent.pyargparse v2:--cli,--model,--max-tokens,--isolated+ legacy positional compat- Security contract:
--isolatedsuppresses--yolo/--dangerously-skip-permissionsper backend - Codex stdin:
codex exec --model M -(avoids ARG_MAX + process listing exposure) local-llm-setupskill with scripts/ symlinks for Day 1 bootstrapplugin.yamlstale skills list corrected (4 non-existentlocal-llm-bridge-*removed; all 12 real skills listed)
Behavioural guardrails enforcing best practices on every coding session. These skills come from obra/superpowers — install that plugin to get them.
Install: uvx --from git+https://github.com/richfrem/agent-plugins-skills plugin-add obra/superpowers
Skills available via superpowers: verification-before-completion · test-driven-development · using-git-worktrees · systematic-debugging · finishing-a-development-branch · requesting-code-review
Three standalone plugins consolidated: rlm-factory (O(1) keyword search) + vector-db (semantic search) + memory-management (session tiering). Works standalone per layer or combined as a full Super-RAG stack.
RLM skills (6): rlm-init · rlm-curator · rlm-search · rlm-distill-agent · rlm-cleanup-agent · rlm-audit
Vector DB skills (6): vector-db-init · vector-db-launch · vector-db-ingest · vector-db-search · vector-db-cleanup · vector-db-audit
Session memory (1): memory-management — multi-tiered cognition and context caching
Agents (9): rlm-cleanup-agent · rlm-curator · rlm-distill-agent · rlm-factory-init-agent · rlm-init · rlm-search · vector-db-cleanup · vector-db-ingest · vector-db-init-agent
Karpathy-style LLM wiki with cross-source concept synthesis. Transforms raw markdown into structured, queryable concept nodes. Full Obsidian vault CRUD, canvas, and graph traversal. Pairs with agent-memory as Phase 3 of the Super-RAG stack.
Wiki skills: obsidian-wiki-builder · obsidian-rlm-distiller · obsidian-query-agent · obsidian-wiki-linter
Vault skills: obsidian-init · obsidian-vault-crud · obsidian-canvas-architect · obsidian-graph-traversal · obsidian-markdown-mastery · obsidian-bases-manager
Setup agents: wiki-init-agent · wiki-build-agent · wiki-distill-agent · wiki-lint-agent · wiki-query-agent · super-rag-setup-agent
Nine standalone plugins consolidated into one. All tools are stateless and self-contained.
Skills (12): adr-management · coding-conventions-agent · context-bundler · convert-mermaid · hf-init · hf-upload · humanize · link-checker-agent · optimize-context · red-team-bundler · symlink-manager · task-agent
Agents (3): coding-conventions-agent · link-checker-agent · rsvp-comprehension-agent
Skills (3): plugin-installer · plugin-remover · plugin-syncer
Cross-platform pip-compile with strict .in → .txt lockfile discipline.
Skills (1): dependency-management
Scored all 116/120 production skills for Karpathy autoresearch loop viability using GPT-5 mini via Copilot CLI. Each skill scored on: objectivity (can a shell command measure it?), execution speed, frequency of use, and potential utility (max 40).
Top HIGH candidates:
| Rank | Skill | Score | Loop |
|---|---|---|---|
| 1 | superpowers/verification-before-completion | 35/40 | LLM_IN_LOOP |
| 2 | superpowers/test-driven-development | 35/40 | LLM_IN_LOOP |
| 3 | coding-conventions/coding-conventions-agent | 34/40 | HYBRID |
| 4 | superpowers/using-git-worktrees | 33/40 | DETERMINISTIC |
| 5 | spec-kitty-plugin/spec-kitty-status | 33/40 | DETERMINISTIC |
| 6 | agent-agentic-os/os-eval-runner | 32/40 | DETERMINISTIC |
Full ranked results: summary-ranked-skills.json
Top 20 opportunities with metrics + blockers: autoresearch-opportunities-report.md
Regenerate report:
python plugin-research/experiments/analyze-candidates-for-auto-reseaarch/skills/eval-autoresearch-fit/scripts/update_ranked_skills.py \
--json-path plugin-research/experiments/analyze-candidates-for-auto-reseaarch/skills/eval-autoresearch-fit/assets/resources/summary-ranked-skills.json \
--morning-reportplugins/ ← upstream source (11 plugins, 137 skills)
<plugin>/
plugin.yaml ← plugin manifest
.claude-plugin/plugin.json
skills/<skill>/
SKILL.md ← skill definition (mutation target for autoresearch loops)
evals/evals.json ← routing evaluation suite (should_trigger boolean schema)
evals/results.tsv ← per-experiment score history
scripts/ ← file-level symlinks → ../../scripts/
scripts/ ← canonical scripts (shared via symlinks, never duplicated)
agents/ ← sub-agent .md definitions
commands/ ← slash commands
assets/diagrams/ ← architecture diagrams
.agents/ ← deployed skill copies (bridge installer output)
skills/
agents/
plugin-research/ ← experiments and autoresearch infrastructure
experiments/
analyze-candidates-for-auto-reseaarch/
temp/ ← local scratch (gitignored except scripts)
ecosystem-fitness-sweep-v1/
137 skills · 11 plugins · Improvement OS (os-architect) · Karpathy autoresearch loops · Super-RAG 3-tier retrieval
