Local-first, multi-agent equity research crew for long-term fundamental investment decisions. Runs entirely on your machine; no cloud LLM API spend.
Usage guide: 한국어 · English · 日本語 · 简体中文
Design doc: design-v2.2.md · Formal MVP evaluation: docs/MVP_EVALUATION.md
Feed a ticker in. A 6-agent crew produces a cited research note in ~15–20 minutes:
Economist → Analyst → Valuer → Skeptic → Defender → Steward
│ │
(debate round) ─┘ └─ BUY / HOLD / PASS + conviction
Every number in the report traces to a Python-computed source
(Finnhub/FRED/DART/SEC EDGAR). A deterministic Python audit downgrades
any verdict that violates the discipline matrix — the LLM can't
overclaim. A separate citation-grounding audit flags any [Source: edgar.*] citation whose number doesn't actually appear in the
retrieved 10-K passage.
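The citation-grounding idea described above can be sketched in a few lines. This is a minimal illustration, not the code in `quality/citation_audit.py`: the function names, the number regex, and the `edgar.mda` tag in the usage example are assumptions made for the sketch.

```python
import re

# Assumed formats: numbers like 60,922 or 12.5, and tags like [Source: edgar.xyz].
NUMBER = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")
CITATION = re.compile(r"\[Source:\s*edgar\.[^\]]*\]")


def normalize(num: str) -> str:
    """Drop thousands separators so '60,922' and '60922' compare equal."""
    return num.replace(",", "")


def ungrounded_numbers(sentence: str, passage: str) -> list[str]:
    """Return numbers in an edgar-cited sentence that are absent from the passage."""
    if not CITATION.search(sentence):
        return []  # no edgar citation on this sentence, nothing to audit
    claim = CITATION.sub("", sentence)  # ignore any digits inside the tag itself
    available = {normalize(n) for n in NUMBER.findall(passage)}
    return [n for n in NUMBER.findall(claim) if normalize(n) not in available]


passage = "Revenue was $60,922 million in fiscal 2024, up 126%."
flagged = ungrounded_numbers(
    "Revenue reached 60,922 million, up 130% [Source: edgar.mda]", passage
)
print(flagged)  # ['130']
```

A real audit would also need to handle unit changes (millions vs billions) and rounding, which this sketch deliberately ignores.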
Supported markets:
- US equities via Finnhub + SEC EDGAR (via direct ChromaDB RAG)
- Korean equities via OpenDART (KRX 6-digit codes;
.KS/.KQ suffixes stripped automatically)
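As a rough picture of the symbol handling described for Korean tickers, the sketch below strips the `.KS`/`.KQ` suffix and treats any remaining 6-digit code as a KRX symbol. The function name `normalize_symbol` and the `"KR"`/`"US"` labels are hypothetical; the actual dispatcher may differ.

```python
import re

KRX_CODE = re.compile(r"^\d{6}$")  # KRX tickers are 6-digit codes


def normalize_symbol(raw: str) -> tuple[str, str]:
    """Return (symbol, market), where market is 'KR' or 'US' (assumed labels)."""
    symbol = raw.strip().upper()
    for suffix in (".KS", ".KQ"):
        if symbol.endswith(suffix):
            symbol = symbol[: -len(suffix)]
            break
    market = "KR" if KRX_CODE.match(symbol) else "US"
    return symbol, market


print(normalize_symbol("005930.KS"))  # ('005930', 'KR')
print(normalize_symbol("nvda"))       # ('NVDA', 'US')
```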
The system was rebuilt around constitution v2.0
(docs/constitution.md) — a
universe-driven 6-stage pipeline that fights user bias instead of
amplifying it. Six commitments are enforced in code:
- User preferences must not influence universe membership. Tip ingestion is decoupled from analysis triggering.
- System has PASS authority. No human override of axis verdicts.
- Precision over recall. Missing data routes to NEED_LLM, never silent PASS.
- No Dreamer module. Optimism is not a separate agent.
- Fixed hierarchy. Three axes (moat / new_frontier / bottleneck) evaluated per ticker; the gate requires 2+ axes including a growth axis.
- Binary output. BUY or PASS; no HOLD-as-fence-sitting.
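The gate and the NEED_LLM rule from the commitments above can be illustrated with a small deterministic function. This is a sketch under stated assumptions: the README does not say which axes count as growth axes, so the set is passed in by the caller, and the return labels `"PROCEED"`, `"NEED_LLM"`, and `"PASS"` as well as the name `hierarchy_gate` are hypothetical.

```python
from typing import Optional

ALL_AXES = ("moat", "new_frontier", "bottleneck")


def hierarchy_gate(axes: dict[str, Optional[bool]], growth_axes: set[str]) -> str:
    """Apply the '2+ axes including a growth axis' rule.

    True = axis met, False = axis not met, None or absent = data missing.
    """
    met = {a for a in ALL_AXES if axes.get(a) is True}
    unknown = {a for a in ALL_AXES if axes.get(a) is None}
    if len(met) >= 2 and met & growth_axes:
        return "PROCEED"
    if unknown:
        return "NEED_LLM"  # missing data is escalated, never a silent PASS
    return "PASS"


# ASSUMPTION for illustration only: treat new_frontier as the growth axis.
growth = {"new_frontier"}
print(hierarchy_gate({"moat": True, "new_frontier": True, "bottleneck": False}, growth))
print(hierarchy_gate({"moat": True, "new_frontier": None, "bottleneck": False}, growth))
```

Keeping the gate as plain Python means the same inputs always produce the same routing, independent of any LLM sampling.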
Six-stage pipeline:
Stage 1: Universe → Stage 2: quant prefilter → Stage 3: light LLM screen
│
(axis-aligned debate) ←──────┤
│ ▼
Stage 4: Skeptic → Defender → Steward + audit ──→ Stage 5: value chain
positioning
│
▼
Stage 6: HRP portfolio
+ 1%-30% bounds
+ cluster trim
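The 1%-30% bounds step in Stage 6 can be pictured as follows. This sketch covers only the bounding of already-computed weights (HRP itself is not shown), and it uses a bisection on a common scale factor, which is a technique chosen for this illustration and not necessarily what `run_portfolio_construction.py` does. The name `apply_bounds` is hypothetical.

```python
def apply_bounds(weights: dict[str, float], lo: float = 0.01, hi: float = 0.30) -> dict[str, float]:
    """Clamp weights into [lo, hi] while keeping the total at 1.

    Finds a scale factor t such that sum(clip(t * w, lo, hi)) == 1, which
    preserves the relative sizes of all positions that are not clamped.
    """
    n = len(weights)
    if n == 0 or n * lo > 1.0 or n * hi < 1.0:
        raise ValueError("bounds are infeasible for this number of positions")
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative")

    def total(t: float) -> float:
        return sum(min(hi, max(lo, t * w)) for w in weights.values())

    low_t, high_t = 0.0, 1.0
    while total(high_t) < 1.0 - 1e-12:  # grow the bracket until the total reaches 1
        high_t *= 2.0
        if high_t > 1e12:
            raise ValueError("cannot reach a total of 1 (too many zero weights)")
    for _ in range(100):  # bisection; total(t) is non-decreasing in t
        mid = (low_t + high_t) / 2.0
        if total(mid) < 1.0:
            low_t = mid
        else:
            high_t = mid
    return {k: min(hi, max(lo, high_t * w)) for k, w in weights.items()}


bounded = apply_bounds({"A": 0.7, "B": 0.1, "C": 0.1, "D": 0.1})
print(bounded["A"])  # 0.3
```

In the example, the oversized position is capped at 30% and the freed weight is shared equally among the three equal remaining positions.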
Calibration is back-validation against historical outcomes
(run_back_validation.py → ledger entry → run_ledger_analysis.py
for confusion-matrix metrics). User intuition does NOT enter the
calibration loop.
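A minimal sketch of the confusion-matrix bookkeeping behind the ledger analysis is shown below. It assumes each back-validated record reduces to a verdict (`"BUY"` or `"PASS"`) plus a boolean "the stock outperformed" flag; how outperformance is defined, and the names `Confusion` and `tally`, are assumptions of this sketch rather than details taken from `run_ledger_analysis.py`.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Confusion:
    tp: int = 0  # BUY, and the stock outperformed
    fp: int = 0  # BUY, but it did not
    tn: int = 0  # PASS, and it did not outperform
    fn: int = 0  # PASS, but it outperformed

    @property
    def precision(self) -> Optional[float]:
        denom = self.tp + self.fp
        return self.tp / denom if denom else None

    @property
    def recall(self) -> Optional[float]:
        denom = self.tp + self.fn
        return self.tp / denom if denom else None


def tally(records: Iterable[tuple[str, bool]]) -> Confusion:
    """Count TP/FP/TN/FN over (verdict, outperformed) pairs."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for verdict, outperformed in records:
        if verdict == "BUY":
            counts["tp" if outperformed else "fp"] += 1
        else:
            counts["fn" if outperformed else "tn"] += 1
    return Confusion(**counts)


c = tally([("BUY", True), ("BUY", False), ("PASS", False), ("PASS", True), ("BUY", True)])
print(c)  # Confusion(tp=2, fp=1, tn=1, fn=1)
```

Under the "precision over recall" commitment, the figure to watch in such a tally would be `precision`, since a false BUY is treated as costlier than a missed one.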
| Layer | Status | What's in it |
|---|---|---|
| Stage 2 quant prefilter | ✅ | Per-axis evaluate_ticker with NEED_LLM routing for missing data |
| Stage 3 light LLM screen | ✅ | Constitution §18 prompt, deterministic hierarchy gate |
| Stage 4 v2 prompts | ✅ | Axis-aligned attack distribution, strict-concede Defender, 4-rule Steward audit |
| Stage 5 value chain positioning | ✅ | NetworkX clustering + over/under-representation flags |
| Stage 6 HRP portfolio | ✅ | López de Prado HRP + 1%/30% bounds + cluster collision adjustment |
| Live screening (US Finnhub + KR DART) | ✅ | Symbol dispatcher, 7/7 KR tickers verified |
| Back-validation (Finnhub historical) | ✅ | Exact filed_date filter, 5-year horizon, 17/30 manifest tickers |
| Calibration ledger + analysis | ✅ | TP/FP/TN/FN classifier metrics across constitution versions |
| Tip annotation surface | ✅ | Read-only "user mentioned N days ago" metadata; never enters LLM context |
The legacy v1 6-agent crew runner (scripts/run_crew.py) remains
available for single-ticker deep-dives; the v2 path
(scripts/run_crew_v2.py) gates on Stage 2 + Stage 3 first and uses
constitution-§19/§20/§21 prompts.
# Live screening (Stage 2 + optional Stage 3)
python scripts/run_screening.py NVDA --with-stage3
python scripts/run_screening.py --universe data/calibration/manifest.yaml \
--with-peers --with-rag-signals --with-tip-annotations
# Stage 4 v2 crew (gated by Stage 2/3)
python scripts/run_crew_v2.py NVDA
# Back-validation (uses Finnhub historical by default)
python scripts/run_back_validation.py
python scripts/run_back_validation.py --limit 3 --no-write # dry-run
# Calibration ledger analysis
python scripts/run_ledger_analysis.py # latest entry
python scripts/run_ledger_analysis.py --list
python scripts/run_ledger_analysis.py --compare a.json b.json
# Stage 5 value chain positioning
python scripts/run_stage5_positioning.py NVDA AMD INTC
# Stage 6 portfolio construction (HRP + bounds + cluster trim)
python scripts/run_portfolio_construction.py NVDA AMD INTC \
--graph data/value_chain.graph.json --positions positions.yaml
# Tip log gap analysis
python scripts/run_tip_gap_analysis.py NVDA AAPL MSFT --window-days 90

| Phase | Status | What's in it |
|---|---|---|
| Phase 1 MVP | ✅ Complete | 5 agents + quality metrics — formal GO verdict |
| Phase 2 | ✅ 98% | Economist + Steward + Defender + debate round + 3-layer audit + portfolio SQLite + auto-onboarding |
| Phase 3 | ✅ 98% | 3-Tier registry + SEC EDGAR RAG + DART + chain alerts + pre-filter stages 1–3 + dedup ledger |
| Phase 4 | 🟡 75% | Paper trading ledger + auto-record on crew completion + regression-diff tool |
| LLM backend abstraction | ✅ | Pluggable backends (Ollama / OpenAI-compat / MLX / llama.cpp) + per-agent model+sampling routing via config/agent_models.yaml |
Sampling policy: each agent uses its model's published recommended
sampling (e.g. Qwen 2.5 → temperature=0.7/top_p=0.8). Same prompt
twice can produce different outputs. See
docs/llm_backends.md for the full policy
discussion and how to opt back into deterministic mode per-agent.
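The per-model sampling idea can be sketched as a lookup keyed on the model tag. The profile values are the ones quoted in this README; everything else is an assumption of the sketch, including the function name `resolve_sampling`, the prefix matching, and the reading of "deterministic mode" as temperature 0. The real schema of `config/agent_models.yaml` is not shown here.

```python
# Values as quoted in this README (temperature / top_p).
SAMPLING_PROFILES = {
    "qwen2.5": {"temperature": 0.7, "top_p": 0.8},
    "llama3.": {"temperature": 0.7, "top_p": 0.9},
}


def resolve_sampling(model_tag: str, deterministic: bool = False) -> dict[str, float]:
    """Pick sampling parameters for a model tag such as 'qwen2.5:7b-16k'."""
    if deterministic:
        # ASSUMPTION: deterministic mode approximated as greedy decoding.
        return {"temperature": 0.0}
    for prefix, profile in SAMPLING_PROFILES.items():
        if model_tag.startswith(prefix):
            return dict(profile)  # copy so callers cannot mutate the table
    raise KeyError(f"no sampling profile for {model_tag!r}")


print(resolve_sampling("qwen2.5:7b-16k"))   # {'temperature': 0.7, 'top_p': 0.8}
print(resolve_sampling("llama3.1:8b-16k"))  # {'temperature': 0.7, 'top_p': 0.9}
```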
- Python 3.13+
- Ollama running locally
- Finnhub API key (free; finnhub.io)
- FRED API key (free; fredaccount.stlouisfed.org)
- OpenDART API key (free, for Korean stocks; opendart.fss.or.kr)
- Telegram bot (optional, for push notifications)
curl -fsSL https://ollama.com/install.sh | sh
ollama serve &
# 16k context variants used by the crew. The :7b-16k / :8b-16k tags
# are custom Modelfile aliases that raise num_ctx from the 4096
# default — Stage 3 + Defender prompts exceed 4K. Build with:
bash scripts/create_qwen_16k.sh # qwen2.5:7b-16k (Stage 3, Analyst, Valuer, Steward)
bash scripts/create_llama_16k.sh # llama3.1:8b-16k (Skeptic)

Want a different backend? See Alternative LLM backends. MAFIS supports Apple Silicon (MLX), GGUF (llama.cpp), and any OpenAI-compatible server (vLLM, LM Studio, mlx_lm.server, …) without changing the agent code.
Copy .env.example to .env and fill in:
FINNHUB_API_KEY=...
FRED_API_KEY=...
DART_API_KEY=... # required only for Korean tickers
TELEGRAM_BOT_TOKEN=... # optional
TELEGRAM_CHAT_ID=... # optional

uv venv
source .venv/bin/activate
uv pip install -e ".[dev]"

python scripts/verify_env.py # API keys + Ollama reachable
pytest # should report 1150+ passed (constitution v2.0)

python scripts/onboard_ticker.py AMD --tier 2 --notes "GPU peer of NVDA"

This pulls Finnhub profile + peers, downloads the latest 10-K,
indexes it into ChromaDB, drafts a value chain brief via Qwen, and
registers the ticker in config/tickers.yaml. Output:
docs/value_chains/AMD.draft.md — review the Vulnerable links
section, then:
mv docs/value_chains/AMD.draft.md docs/value_chains/AMD.md

Korean tickers work the same way. The dispatcher detects 6-digit codes and routes them through DART:

python scripts/onboard_ticker.py 005930 --tier 1 # Samsung Electronics

python scripts/run_crew.py NVDA # US
python scripts/run_crew.py 005930 # Korean

Output:
- reports/<SYMBOL>_YYYYMMDD_HHMM.crew.md: six-section report + audit block
- reports/<SYMBOL>_...meta.txt: timing / char counts / models used
- Auto-inserted row in the data/portfolio.sqlite paper-trades table
- Optional Telegram push of the Korean summary

python scripts/portfolio_cli.py add NVDA --shares 10 --cost 5000 --tier 1
python scripts/portfolio_cli.py weights # live Finnhub quotes
python scripts/portfolio_cli.py gap NVDA --low 3 --high 5

python scripts/paper_ledger.py list # all recorded verdicts
python scripts/paper_ledger.py returns # mark-to-market
python scripts/paper_ledger.py summary # win rate, audit effect

# One-off scan (prints alerts; won't fire duplicates when --dedup)
python scripts/scan_chain_alerts.py --dedup --hops 2
# Cron-friendly (with Telegram push)
0 9-16 * * 1-5 cd ~/MAFIS && /path/to/.venv/bin/python \
scripts/scan_chain_alerts.py --dedup --telegram \
>> /var/log/mafis_alerts.log 2>&1

python scripts/prefilter_scan.py --graph-context --semantic

Runs Stages 1 (keyword), 2 (value-chain context), and 3 (Qwen materiality filter) against the news pool and recommends promotions.

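One way to picture how a hop limit such as `--hops 2` could restrict alerts to tickers near a news-affected node in the value-chain graph is a bounded breadth-first search. The sketch uses a plain adjacency dict instead of the project's NetworkX graph, and the node names and the function `within_hops` are placeholders, not real supply relationships or project code.

```python
from collections import deque


def within_hops(graph: dict[str, set[str]], source: str, max_hops: int) -> dict[str, int]:
    """Map every node reachable from source within max_hops to its hop distance."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in sorted(graph.get(node, ())):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    del dist[source]  # the source itself is not an alert target
    return dist


# Placeholder directed edges (upstream -> downstream); purely illustrative.
chain = {
    "SUPPLIER": {"CHIP_A", "CHIP_B"},
    "CHIP_A": {"SERVER_X"},
    "SERVER_X": {"CLOUD_Y"},
}
print(within_hops(chain, "SUPPLIER", 2))
# {'CHIP_A': 1, 'CHIP_B': 1, 'SERVER_X': 2}
```

With a limit of 2, `CLOUD_Y` (three hops away) is left out, which is the kind of cut-off a hop parameter implies.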
python scripts/regression_compare.py \
reports/NVDA_20260424_1557.crew.md \
reports/NVDA_20260425_0900.crew.md \
--fail-on-regression

src/wise_investor/
├── agents/ # crew: analyst, valuer, skeptic, defender, steward, economist
│ ├── steward_audit.py # discipline matrix + speculative-language + Defender-aware
│ └── runner.py # pre_gather_facts dispatcher (US → Finnhub, KR → DART)
├── data/
│ ├── finnhub.py # US fundamentals
│ ├── dart.py # Korean fundamentals (OpenDART)
│ ├── dart_facts.py # KR → crew facts adapter (KRW→USD via FRED)
│ ├── fred.py # macro snapshot (Economist)
│ └── cross_validate.py
├── rag/
│ ├── edgar.py # SEC EDGAR downloader + cache
│ ├── sections.py # Business / Risk Factors / MD&A / Quant Market Risk extractor
│ ├── index.py # ChromaDB persistent store
│ └── integration.py # crew pre_gather hook
├── geopolitics/
│ ├── gdelt.py # GDELT DOC 2.0 client
│ ├── google_news.py # RSS parser
│ └── snapshot.py # per-symbol geopolitical context
├── alerts/
│ ├── chain_alerts.py # value-chain graph × news → target alerts
│ └── ledger.py # SQLite dedup + cooldown
├── filters/
│ ├── pre_filter.py # Stages 1 (keyword) + 2 (graph context)
│ └── semantic.py # Stage 3 Qwen materiality filter
├── onboarding/
│ ├── brief_generator.py # Finnhub + 10-K + geo → Qwen-drafted value chain brief
│ └── tickers_yaml.py # 3-Tier registry CRUD
├── portfolio/
│ └── store.py # positions + sizing-gap helper
├── paper_trading/
│ ├── ledger.py # paper_trades table + performance metrics
│ └── report_parser.py # parse Steward verdict + audit flag from crew report
├── regression/
│ └── compare.py # structured crew-report diff tool
├── value_chain/
│ ├── graph.py # NetworkX-backed typed DiGraph
│ └── parser.py # docs/value_chains/*.md → graph
├── quality/
│ ├── metrics.py # 6 automated quality scores
│ └── citation_audit.py # edgar.* grounding + Skeptic mandate audit
└── notify/
└── telegram.py
scripts/ # CLI entry points for every component above
docs/value_chains/ # hand-curated + auto-drafted briefs (*.md vs *.draft.md)
data/ # portfolio.sqlite, chroma/, edgar_cache/, facts_cache/
tests/ # 750+ tests (offline; live ones marked -m network)
- Local-first, API-last: Phase 1 runs with $0 LLM spend. Finnhub / FRED / GDELT / DART are free public APIs.
- LLM is judgment, Python is calculation: every dollar value, ratio,
and growth rate is computed by src/wise_investor/tools/ or data/ and fed to the LLM as prepared facts. The LLM synthesizes narrative, never arithmetic.
- Sampling follows model recommendations: each agent uses the
sampling profile published by its model author (Qwen 2.5: 0.7/0.8;
Llama 3.x: 0.7/0.9; Qwen3 thinking: 0.6/0.95/min_p=0). Two runs of
the same crew may differ; the audit + citation system enforce
within-run consistency, not run-to-run reproducibility. Opt back
into deterministic mode per agent in config/agent_models.yaml.
- Multi-layer audit: discipline matrix (verdict vs labels) + speculative-language detector + Defender-aware correction + edgar citation grounding + Skeptic mandate compliance. The LLM can emit any narrative; Python enforces the rules.
- Paper trading before real trading: every Steward verdict is
automatically recorded with entry price. paper_ledger.py summary tells you whether BUY verdicts actually outperform PASS verdicts over time, the only objective answer to "is this system useful?".
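The BUY-versus-PASS comparison in the last principle boils down to a mark-to-market average per verdict. The sketch below assumes each recorded trade carries a verdict, an entry price, and a latest price; the field names and the function `mean_return_by_verdict` are hypothetical, and the real `paper_ledger.py summary` may compute more than this.

```python
from statistics import mean


def mean_return_by_verdict(trades: list[dict]) -> dict[str, float]:
    """Average simple return (last / entry - 1) for each verdict label."""
    buckets: dict[str, list[float]] = {}
    for trade in trades:
        ret = trade["last_price"] / trade["entry_price"] - 1.0
        buckets.setdefault(trade["verdict"], []).append(ret)
    return {verdict: mean(rets) for verdict, rets in buckets.items()}


# Made-up prices, for illustration only.
trades = [
    {"verdict": "BUY", "entry_price": 100.0, "last_price": 120.0},
    {"verdict": "BUY", "entry_price": 50.0, "last_price": 45.0},
    {"verdict": "PASS", "entry_price": 200.0, "last_price": 190.0},
]
summary = mean_return_by_verdict(trades)
print(summary["BUY"] > summary["PASS"])  # True
```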
- Create a bot with @BotFather → copy the token.
- Send any message to your bot (creates the chat).
- Visit https://api.telegram.org/bot<TOKEN>/getUpdates → copy the chat.id.
- Add to .env:
TELEGRAM_BOT_TOKEN=...
TELEGRAM_CHAT_ID=...

run_crew.py auto-pushes a Korean summary; scan_chain_alerts.py --telegram pushes chain alerts.
No configuration → silent skip, no errors.
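For the `getUpdates` step, the chat id sits at `result[].message.chat.id` in the JSON that the Telegram Bot API returns. The helper below pulls it out of an already-decoded response, and a second helper mirrors the "no configuration, silent skip" behaviour. Both function names are hypothetical and are not taken from `notify/telegram.py`.

```python
import os
from typing import Mapping


def extract_chat_ids(updates: dict) -> list[int]:
    """Collect distinct chat ids from a decoded getUpdates response."""
    ids: list[int] = []
    for update in updates.get("result", []):
        chat = update.get("message", {}).get("chat", {})
        if "id" in chat and chat["id"] not in ids:
            ids.append(chat["id"])
    return ids


def telegram_configured(env: Mapping[str, str] = os.environ) -> bool:
    """True only when both variables are set; callers skip the push otherwise."""
    return bool(env.get("TELEGRAM_BOT_TOKEN") and env.get("TELEGRAM_CHAT_ID"))


# Trimmed example payload with a made-up chat id.
sample = {
    "ok": True,
    "result": [
        {"update_id": 1, "message": {"chat": {"id": 123456789, "type": "private"}, "text": "hi"}},
    ],
}
print(extract_chat_ids(sample))  # [123456789]
```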
- Korean-ticker crew runs share the English agent prompts; the Analyst will produce English analysis of Korean financials. A follow-up will branch the prompts by source country.
- Value chain graph auto-update from 10-K text is not yet implemented. Briefs are either hand-curated or onboarding-drafted and then hand-reviewed.
- No paper-trade position sizing — the ledger records Steward verdicts only; actual position sizing per trade is manual.
- No OpenClaw integration (design §8.1); Telegram covers the equivalent role.
See docs/MVP_EVALUATION.md for the Phase 1 formal evaluation and Phase 2+ priorities.