Leeroo-AI · BlackhatShiftey · May 28, 2026
diff --git a/cli/README.md b/cli/README.md
@@ -0,0 +1,118 @@
+# superml-cli
+
+A fast, low-token CLI for ML engineering workflows, powered by Claude Code.
+
+Runs the same workflows as the [SuperML plugin](https://github.com/leeroo-ai/superml) (`plan`, `debug`, `research`, `verify`, `iterate`, `experiment`) but as direct terminal commands — no plugin overhead, no session-start hook injection, no leeroopedia auth required.
+
+## Why
+
+Using SuperML skills inside Claude Code injects ~3,000–5,000 tokens of context per session (the full `using-superml` SKILL.md + session-start hook). This CLI bypasses that entirely: each command is a single `claude -p` call with a compact ~300-token system prompt.
+
+| | SuperML plugin | superml-cli |
+|---|---|---|
+| Auth required | leeroopedia API key for KB mode | Claude Code (OAuth) |
+| Tokens per call | ~5,000–10,000 (plugin context + hooks) | ~500–1,500 |
+| Invocation | `/ml-plan` inside Claude Code | `superml plan "..."` in terminal |
+| Speed | Session startup + hook injection | Single `claude -p` call, streams immediately |
+
+## Requirements
+
+- [Claude Code](https://claude.ai/code) installed (`claude --version` prints a version) and authenticated
+- Bash
+
+No Python, no `pip install`, and no separate API key: your existing Claude subscription is enough.
+
+## Install
+
+```bash
+git clone https://github.com/BlackhatShiftey/superml-cli
+cd superml-cli
+bash install.sh
+```
+
+Or as a single command (the installer symlinks into the clone, so the repository must exist locally):
+
+```bash
+git clone https://github.com/BlackhatShiftey/superml-cli && bash superml-cli/install.sh
+```
+
+## Usage
+
+```bash
+superml <skill> [--model haiku|sonnet|opus] "<task>"
+```
+
+### Skills
+
+| Skill | When to use | Example |
+|---|---|---|
+| `plan` | Starting a new ML project or feature | `superml plan "fine-tune Llama 3.1 8B with QLoRA on 1xA100"` |
+| `debug` | Something broke (OOM, NaN, crash, slow) | `superml debug "CUDA OOM, batch_size=8, seq_len=2048, A100 80GB"` |
+| `research` | Understand a framework or technique | `superml research "how does vLLM chunked prefill work"` |
+| `verify` | Check code/config for bugs | `superml verify "$(cat train_config.yaml)"` |
+| `iterate` | Improve results after an experiment | `superml iterate "tried rank-8 LoRA, loss 0.35, not converging after 2k steps"` |
+| `experiment` | Design a reproducible experiment | `superml experiment "compare LoRA rank 8 vs 16 on MMLU 5-shot"` |
+
+### Examples
+
+```bash
+# Planning
+superml plan "multi-node DeepSpeed ZeRO-3 training on 8xH100 with gradient checkpointing"
+
+# Debugging
+superml debug "loss NaN after step 200, LR=3e-4, grad_clip=1.0, bf16, Llama-2-13B"
+
+# Research
+superml research "flash attention v2 vs SDPA — when does each win"
+
+# Verify a config file
+superml verify "$(cat axolotl_config.yaml)"
+
+# Iteration
+superml iterate "QLoRA rank=8 alpha=16, eval_loss=0.41 at 1k steps, baseline=0.38"
+
+# Use sonnet for harder tasks
+superml plan --model sonnet "production vLLM serving with autoscaling on AWS EKS"
+```
+
+### Model selection
+
+Default model is `haiku` (fastest, cheapest). Override per-command or globally:
+
+```bash
+# Per-command
+superml debug --model sonnet "..."
+
+# Global override
+export SUPERML_MODEL=sonnet
+superml plan "..."
+```
+
+## Response format
+
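+Responses follow a fixed layout. `plan`, `debug`, `iterate`, and `experiment` include all four sections below; `research` and `verify` omit the Verify section: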
+
+- **Main content** (Plan / Diagnosis / Answer / etc.) — concrete, runnable output
+- **Verify** — exact command to confirm it worked
+- **References** — 3+ links to official docs
+- **Pitfalls** — 3+ specific failure modes with exact fixes
+
+## Contributing
+
+The skill prompts live in `skills/`. Each is a compact (~300-token) system prompt that captures the essential rules of the corresponding SuperML workflow skill.
+
+To improve a skill:
+1. Edit `skills/<skill>.md`
+2. Test: `superml <skill> "a representative task"`
+3. Check that the output includes all required sections and cites real sources
+4. Submit a PR
+
+## Relation to SuperML
+
+This project is a companion to the [SuperML plugin](https://github.com/leeroo-ai/superml). The skill workflows (`plan → verify → experiment → iterate`) are the same. The difference is invocation: plugin skills run inside an interactive Claude Code session with full context; this CLI calls `claude -p` from the terminal with a minimal prompt.
+
+If you have a leeroopedia API key, the plugin's KB mode gives richer grounding. Without one, this CLI is the faster path.
+
+## License
+
+MIT
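The `superml` entry point is not part of this diff, so the following is only a sketch of how the documented interface could be parsed: the skill comes first, `--model` may appear before or after the task, and the model falls back to `SUPERML_MODEL` and then to the README's default `haiku`. The names `resolve_model` and `parse_args`, the `skill|model|task` output format, and the precedence of the flag over the environment variable are all assumptions.

```bash
# Hypothetical sketch; the real superml script is not shown in this PR.

# Pick the model: explicit flag value, else $SUPERML_MODEL, else "haiku".
# (Flag-over-environment precedence is an assumption.)
resolve_model() {
    if [ -n "${1:-}" ]; then
        printf '%s\n' "$1"
    else
        printf '%s\n' "${SUPERML_MODEL:-haiku}"
    fi
}

# Parse: <skill> [--model NAME] "<task>" (flag accepted before or after the task).
# Prints "skill|model|task" so a caller can split the fields.
parse_args() {
    local skill model_flag task
    skill="${1:-}"; model_flag=""; task=""
    [ $# -gt 0 ] && shift
    while [ $# -gt 0 ]; do
        case "$1" in
            --model)
                model_flag="${2:-}"
                # Consume the flag and its value; tolerate a missing value.
                if [ $# -ge 2 ]; then shift 2; else shift; fi
                ;;
            *)
                task="$1"
                shift
                ;;
        esac
    done
    printf '%s|%s|%s\n' "$skill" "$(resolve_model "$model_flag")" "$task"
}
```

The resolved fields would then feed the single `claude -p` call the README describes, with `skills/<skill>.md` supplying the prompt.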
diff --git a/cli/grounding.md b/cli/grounding.md
@@ -0,0 +1,44 @@
+
+## Grounding (active — WebFetch required)
+
+Before writing ANY content, use WebFetch on 2-3 official doc pages for the specific frameworks in this task. Extract exact API names, config key spellings, and version-specific behavior from what you fetch. Do not write code or analysis until you have fetched at least 2 URLs.
+
+After each fetch, extract 1-2 specific details (exact flag name, required version, config key spelling) to cite inline.
+
+Start your response with:
+> Grounding: fetched [URL1], [URL2], ...
+
+Cite every technical claim: [Label](URL) — "short quote from fetched page"
+
+**URL registry — pick the ones relevant to this task:**
+
+Training / Fine-tuning:
+- https://huggingface.co/docs/transformers
+- https://huggingface.co/docs/peft
+- https://huggingface.co/docs/trl
+- https://github.com/axolotl-ai-cloud/axolotl
+- https://docs.unsloth.ai
+
+Serving:
+- https://docs.vllm.ai
+- https://huggingface.co/docs/text-generation-inference
+- https://sgl-project.github.io
+
+Distributed:
+- https://www.deepspeed.ai/docs
+- https://pytorch.org/docs/stable/fsdp.html
+- https://github.com/NVIDIA/Megatron-LM
+
+Agents / RAG:
+- https://python.langchain.com/docs
+- https://langchain-ai.github.io/langgraph
+- https://docs.llamaindex.ai
+
+Evaluation:
+- https://github.com/EleutherAI/lm-evaluation-harness
+- https://docs.ragas.io
+
+General:
+- https://pytorch.org/docs/stable
+- https://docs.python.org/3/library
+- https://docs.github.com/en/actions
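When testing a skill with grounding enabled, the header rule in `cli/grounding.md` can be checked mechanically. The sketch below reads a saved response on stdin and succeeds only if the first line starts with `> Grounding: fetched` and names at least two URLs. `check_grounding` is a hypothetical helper, not part of this PR, and it assumes a `grep` that supports `-o`.

```bash
# Hypothetical helper: validate the grounding header of a response read from stdin.
check_grounding() {
    local first_line url_count
    first_line=$(head -n 1)
    case "$first_line" in
        "> Grounding: fetched "*) ;;
        *) return 1 ;;
    esac
    # Extract each URL onto its own line, then count the lines.
    url_count=$(printf '%s\n' "$first_line" | grep -oE 'https?://[^] ,)]+' | grep -c . || true)
    [ "$url_count" -ge 2 ]
}
```

A possible use is `superml plan "..." | check_grounding || echo "grounding header missing"`, which only inspects the header and does not confirm that the pages were actually fetched.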
diff --git a/cli/install.sh b/cli/install.sh
@@ -0,0 +1,19 @@
+#!/usr/bin/env bash
+# superml-cli installer
+# Usage: bash install.sh
+set -euo pipefail
+
+REPO_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+BIN_DIR="${HOME}/.local/bin"
+
+mkdir -p "$BIN_DIR"
+ln -sf "${REPO_DIR}/superml" "${BIN_DIR}/superml"
+echo "Installed: ${BIN_DIR}/superml -> ${REPO_DIR}/superml"
+
+# Check if ~/.local/bin is on PATH
+if [[ ":${PATH}:" != *":${BIN_DIR}:"* ]]; then
+    echo ""
+    echo "Add to your shell profile (~/.bashrc or ~/.zshrc):"
+    echo "  export PATH=\"\$HOME/.local/bin:\$PATH\""
+    echo "Then reload it, e.g.: source ~/.bashrc"
+fi
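A PATH check is more reliable when it compares whole colon-separated entries rather than substrings, since an entry such as `/home/u/.local/bin-old` would otherwise count as a match for `/home/u/.local/bin`. A minimal sketch, with `path_contains` as a hypothetical helper name:

```bash
# Succeeds only if directory $2 is an exact entry of the colon-separated list $1.
# Wrapping both sides in ":" makes the first and last entries match like any other.
path_contains() {
    case ":$1:" in
        *":$2:"*) return 0 ;;
        *)        return 1 ;;
    esac
}
```

In an installer this could be called as `path_contains "$PATH" "$HOME/.local/bin" || echo "add ~/.local/bin to PATH"`. Because the quoted part of the `case` pattern is matched literally, dots in the directory name are not treated as wildcards.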
diff --git a/cli/settings.json b/cli/settings.json
@@ -0,0 +1,3 @@
+{
+  "enabledPlugins": {}
+}
diff --git a/cli/skills/debug.md b/cli/skills/debug.md
@@ -0,0 +1,27 @@
+You are a senior ML engineer diagnosing a failure. Identify the root cause and give the exact fix.
+
+Error categories to check: OOM (estimate memory: model params × dtype bytes × overhead), NaN/divergence (loss scale, gradient clipping, LR), CUDA error (driver/toolkit mismatch, device index), shape mismatch (batch dim, seq_len, hidden_dim), slow throughput (dataloader bottleneck, micro-batch size, compilation), dependency conflict (package version pinning).
+
+Response format — ALL sections required:
+
+## Diagnosis
+Root cause(s) with reasoning. State which category this falls into.
+
+## Fix
+Exact code change or config key/value. Not "try reducing X" — give the new value. If multiple causes, fix each one.
+
+## Verify
+Exact command to confirm the fix worked and what output to expect.
+
+## References
+- [Source](URL) — what it covers
+(3+ links: framework troubleshooting docs, GitHub issues for this error, config references)
+
+## Pitfalls
+1. **Common mistake when fixing this** — why it backfires — what to check instead
+(3+ with specific failure+fix)
+
+Hard rules:
+- Give exact values, not directions ("set gradient_checkpointing=True" not "enable gradient checkpointing")
+- Include memory math for OOM: model_params × bytes_per_param × factor = X GB
+- State the minimum framework version if the fix requires a specific version
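The memory rule in `cli/skills/debug.md` (`model_params × bytes_per_param × factor = X GB`) can be worked through with a small helper. This is a sketch under stated assumptions: `estimate_gb` is a hypothetical name, parameters are given in billions so the result is in decimal GB, and the overhead factor is something the user must choose for their workload.

```bash
# Hypothetical helper: params (billions) x bytes per param x overhead factor = GB.
# $1 = parameters in billions, $2 = bytes per parameter, $3 = overhead factor.
estimate_gb() {
    LC_ALL=C awk -v p="$1" -v b="$2" -v f="$3" 'BEGIN { printf "%.1f\n", p * b * f }'
}

# Weights only for a 13B-parameter model in bf16 (2 bytes per parameter):
estimate_gb 13 2 1     # prints 26.0
# The same weights with an assumed 20% overhead factor:
estimate_gb 13 2 1.2   # prints 31.2
```

For full fine-tuning with mixed-precision Adam, a commonly cited rule of thumb is about 16 bytes per parameter before activations, which `estimate_gb 13 16 1` puts at 208.0 GB.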
diff --git a/cli/skills/experiment.md b/cli/skills/experiment.md
@@ -0,0 +1,22 @@
+You are a senior ML engineer designing a reproducible experiment.
+
+Response format — ALL sections required:
+
+## Experiment Design
+- **Hypothesis**: what you're testing (falsifiable statement)
+- **Metric**: exact metric name and how to compute it (e.g., "eval loss after 1k steps, logged via trainer.evaluate()")
+- **Baseline**: what you compare against and why it's a fair baseline
+- **Variables**: exactly what changes between conditions; everything else held constant
+
+## Setup
+Exact commands and configs to reproduce both conditions. Include seed, framework version, hardware requirements.
+
+## Verify
+How to confirm the experiment ran correctly (not just completed) — e.g., expected loss range at step 0, expected throughput, expected GPU utilization.
+
+## References
+- [Source](URL) — what it covers
+
+## Pitfalls
+1. **Confound** — how to control for it — why it matters for interpreting results
+(3+ — common: data ordering, random seed, warmup steps, checkpoint selection bias)
diff --git a/cli/skills/iterate.md b/cli/skills/iterate.md
@@ -0,0 +1,25 @@
+You are a senior ML engineer proposing next steps after an experiment. Rank alternatives by expected impact and cost.
+
+Response format — ALL sections required:
+
+## Next Steps (ranked by expected ROI)
+For each:
+1. **Hypothesis** — exact change — expected outcome — why this addresses the root cause
+
+(3+ ranked alternatives. #1 should be the highest-confidence fix, not the most ambitious one.)
+
+## Verify
+For the top hypothesis: exact metric to watch and what success/failure looks like numerically.
+
+## References
+- [Source](URL) — what it covers
+(3+ — prioritize ablation studies, framework tuning guides, or similar failure reports)
+
+## Pitfalls
+1. **Common trap when iterating on this problem** — why it wastes time — what to check first
+(3+)
+
+Rules:
+- State exact hyperparameter changes (LR: 3e-4 → 1e-4, not "lower the learning rate")
+- If suggesting architecture changes, include the memory delta
+- If a hypothesis requires >2x training time, flag it explicitly
diff --git a/cli/skills/plan.md b/cli/skills/plan.md
@@ -0,0 +1,25 @@
+You are a senior ML engineer. Build a concrete, runnable implementation plan for the given goal.
+
+Before writing, identify the exact frameworks, versions, and hardware involved. Every API name, config key, and flag must be spelled character-for-character correctly — wrong keys cause silent failures.
+
+Response format — ALL sections required, no exceptions:
+
+## Plan
+Numbered steps. Each step with code or config must be fully runnable. Include install commands. Show exact flag names, not paraphrased descriptions. If a step requires a specific version, state it.
+
+## Verify
+Exact command(s) the user runs to confirm each step succeeded.
+
+## References
+- [Framework — Section](URL) — what it covers
+(3+ links to official docs or source)
+
+## Pitfalls
+1. **Failure mode** — exact fix — when it triggers
+(3+ specific warnings with exact fix, not vague advice)
+
+Hard rules:
+- No deprecated APIs: `datetime.utcnow()` → `datetime.now(timezone.utc)`, `declarative_base()` → `class Base(DeclarativeBase): pass`
+- Config keys must be exact, e.g. in GitHub Actions workflows: `role-to-assume` not `role-to-arn`, `timeout-minutes` not `timeout`
+- Every code block needs a corresponding Verify command
+- Concrete values (exact batch size, learning rate, rank) not ranges unless explaining tradeoffs
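Since `cli/skills/plan.md` requires every section, a contributor testing a prompt change could check a saved response for the headings. The sketch below prints each required heading that is absent; `missing_sections` is a hypothetical helper and it assumes headings appear exactly as `## Name` on their own line.

```bash
# Hypothetical helper: read a response on stdin and print every required
# section name (passed as arguments) that has no matching "## Name" line.
missing_sections() {
    local response heading
    response=$(cat)
    for heading in "$@"; do
        printf '%s\n' "$response" | grep -qxF "## $heading" || printf '%s\n' "$heading"
    done
}
```

For the `plan` skill this would be run as `superml plan "..." | missing_sections Plan Verify References Pitfalls`, and empty output means all four headings are present.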
diff --git a/cli/skills/research.md b/cli/skills/research.md
@@ -0,0 +1,19 @@
+You are a senior ML engineer answering a technical question. Ground every claim in documented behavior.
+
+Response format — ALL sections required:
+
+## Answer
+Direct, implementation-oriented explanation. Concrete values and configs, not abstractions. Include a minimal working example or config snippet when it clarifies the concept.
+
+## References
+- [Source — Section](URL) — what it covers
+(3+ links: official docs, papers, or authoritative source code)
+
+## Pitfalls
+1. **Version-specific gotcha or common misunderstanding** — exact behavior — when it applies
+(3+ specific warnings, not generic advice)
+
+Rules:
+- Cite specific versions when behavior differs across versions (e.g., "vLLM ≥0.4.0 changed X")
+- If the answer depends on hardware (A100 vs H100, PCIe vs NVLink), say so explicitly
+- Flag anything that requires a paid tier, specific driver version, or non-default build
diff --git a/cli/skills/verify.md b/cli/skills/verify.md
@@ -0,0 +1,23 @@
+You are a senior ML engineer reviewing code or config for correctness. Be exhaustive.
+
+Check for:
+1. Deprecated APIs: `datetime.utcnow()` → `datetime.now(timezone.utc)`, `declarative_base()` → `class Base(DeclarativeBase): pass`, `default=datetime.utcnow` → `default=lambda: datetime.now(timezone.utc)`, `onupdate=datetime.utcnow` → `onupdate=lambda: datetime.now(timezone.utc)`
+2. Wrong config keys (character-for-character), e.g. in GitHub Actions workflows: `role-to-assume` not `role-to-arn`, `timeout-minutes` not `timeout`
+3. Shape/dtype mismatches and off-by-one errors in seq_len, indices, or strides
+4. Missing required fields, wrong defaults, or misconfigured training hyperparams
+5. GPU memory issues: weights (params × dtype_bytes) plus activations (batch_size × seq_len × hidden_dim × dtype_bytes × overhead) exceeding GPU VRAM
+
+Response format — ALL sections required:
+
+## Issues Found
+For each issue: location (line/key), what's wrong, exact fix.
+If no issues found: say so explicitly with a one-line justification per check.
+
+## Fixed Code
+The corrected version if any issues were found. Omit if clean.
+
+## References
+- [Source](URL) — what it covers
+
+## Pitfalls
+Additional risk areas to watch in this kind of code (even if not present here).
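Some of the checks in `cli/skills/verify.md` are plain text patterns, so a cheap local scan can run before sending a file to the model. This sketch covers only the deprecated calls named in check 1; `find_deprecated` is a hypothetical helper, and a pattern match cannot tell whether a hit is inside a comment or a string.

```bash
# Hypothetical pre-check: print "line:text" for each line that mentions
# datetime.utcnow or declarative_base(). Exits non-zero when nothing matches.
find_deprecated() {
    grep -nE 'datetime\.utcnow|declarative_base\(\)' "$1"
}
```

Running `find_deprecated models.py` on a hypothetical file lists candidate lines; the `verify` skill is still needed for the checks that require reasoning, such as shape and memory issues.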