ccy5123/mafis

MAFIS — Wise Investor System

Local-first, multi-agent equity research crew for long-term fundamental investment decisions. Runs entirely on your machine; no cloud LLM API spend.

Usage guide: 한국어 · English · 日本語 · 简体中文

Design doc: design-v2.2.md · Formal MVP evaluation: docs/MVP_EVALUATION.md


What it does

Feed a ticker in. A 6-agent crew produces a cited research note in ~15–20 minutes:

Economist → Analyst → Valuer → Skeptic → Defender → Steward
                                          │           │
                          (debate round) ─┘           └─ BUY / HOLD / PASS + conviction

Every number in the report traces to a Python-computed source (Finnhub/FRED/DART/SEC EDGAR). A deterministic Python audit downgrades any verdict that violates the discipline matrix — the LLM can't overclaim. A separate citation-grounding audit flags any [Source: edgar.*] citation whose number doesn't actually appear in the retrieved 10-K passage.
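As a rough illustration of the citation-grounding idea, the sketch below flags any cited number that does not literally appear in the retrieved passage. It is not the project's `citation_audit.py`; the function names, the `passages` mapping, and the assumption that each number sits immediately before its `[Source: edgar.*]` tag are all hypothetical.

```python
import re

# Assumed layout: a number immediately followed by its citation tag,
# e.g. "$60,922 [Source: edgar.mdna]". The real report format may differ.
CITED_NUMBER = re.compile(
    r"([-+]?\$?\d[\d,]*(?:\.\d+)?%?)\s*\[Source:\s*(edgar\.[\w.]+)\]"
)
PLAIN_NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")


def normalize(number: str) -> str:
    """Strip sign, currency symbol, percent sign, and thousands separators."""
    return number.lstrip("+-").replace("$", "").replace(",", "").rstrip("%")


def ungrounded_citations(report: str, passages: dict) -> list:
    """Return (number, source) pairs whose number is absent from the passage.

    `passages` maps a source id such as "edgar.mdna" to the retrieved text
    (a hypothetical structure chosen for this sketch).
    """
    flagged = []
    for number, source in CITED_NUMBER.findall(report):
        passage = passages.get(source, "")
        seen = {normalize(n) for n in PLAIN_NUMBER.findall(passage)}
        if normalize(number) not in seen:
            flagged.append((number, source))
    return flagged
```

A literal match like this is deliberately strict: a figure that was rounded or rescaled relative to the filing would also be flagged, which may or may not match how MAFIS treats such cases.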

Supported markets:

  • US equities via Finnhub + SEC EDGAR (via direct ChromaDB RAG)
  • Korean equities via OpenDART (KRX 6-digit codes; .KS / .KQ suffixes stripped automatically)
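The market routing described above can be sketched as follows. This is a minimal illustration, assuming a hypothetical `route_symbol` helper and the labels `"dart"` / `"finnhub"`; the project's actual dispatcher may be organized differently.

```python
import re

KR_CODE = re.compile(r"\d{6}")


def route_symbol(raw: str) -> tuple:
    """Normalize a user-supplied symbol and pick a data source.

    Returns (symbol, source) where source is "dart" for KRX 6-digit codes
    and "finnhub" otherwise. Both labels are hypothetical names.
    """
    symbol = raw.strip().upper()
    # Strip the exchange suffixes mentioned in the README.
    for suffix in (".KS", ".KQ"):
        if symbol.endswith(suffix):
            symbol = symbol[: -len(suffix)]
            break
    if KR_CODE.fullmatch(symbol):
        return symbol, "dart"
    return symbol, "finnhub"
```

With this sketch, `route_symbol("005930.KS")` yields `("005930", "dart")` and `route_symbol("nvda")` yields `("NVDA", "finnhub")`.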

Constitution v2.0 — universe-driven discovery (current)

The system was rebuilt around constitution v2.0 (docs/constitution.md) — a universe-driven 6-stage pipeline that fights user bias instead of amplifying it. Six commitments are enforced in code:

  1. User preferences must not influence universe membership. Tip ingestion is decoupled from analysis triggering.
  2. System has PASS authority. No human override of axis verdicts.
  3. Precision over recall. Missing data routes to NEED_LLM, never silent PASS.
  4. No Dreamer module. Optimism is not a separate agent.
  5. Fixed hierarchy. Three axes (moat / new_frontier / bottleneck) evaluated per ticker; the gate requires 2+ axes including a growth axis.
  6. Binary output. BUY or PASS; no HOLD-as-fence-sitting.
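Commitments 3 and 5 can be pictured as a small deterministic gate. The sketch below is an assumption-laden illustration, not the code in the repository: the status names, the `ADVANCE` / `REJECT` return labels, and the choice of `new_frontier` and `bottleneck` as the "growth" axes are all hypothetical readings of the list above.

```python
from enum import Enum


class AxisStatus(Enum):
    MET = "MET"
    NOT_MET = "NOT_MET"
    NEED_LLM = "NEED_LLM"  # data was missing, so defer instead of guessing


# Assumption: which axes count as growth axes is not stated in the README.
GROWTH_AXES = {"new_frontier", "bottleneck"}


def hierarchy_gate(axes: dict) -> str:
    """Require 2+ met axes, at least one of them a growth axis."""
    met = {name for name, status in axes.items() if status is AxisStatus.MET}
    unknown = {name for name, status in axes.items() if status is AxisStatus.NEED_LLM}
    if len(met) >= 2 and met & GROWTH_AXES:
        return "ADVANCE"
    # If resolving the unknown axes could still satisfy the gate, defer
    # to the LLM screen rather than deciding silently.
    possible = met | unknown
    if len(possible) >= 2 and possible & GROWTH_AXES:
        return "NEED_LLM"
    return "REJECT"
```

The point of the middle branch is that a ticker with missing data is never decided by default in either direction; it is only rejected once no resolution of the unknown axes could satisfy the gate.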

Six-stage pipeline:

Stage 1: Universe → Stage 2: quant prefilter → Stage 3: light LLM screen
                                                          │
                            (axis-aligned debate) ←──────┤
                                  │                       ▼
        Stage 4: Skeptic → Defender → Steward + audit ──→ Stage 5: value chain
                                                                  positioning
                                                                  │
                                                                  ▼
                                                         Stage 6: HRP portfolio
                                                                  + 1%-30% bounds
                                                                  + cluster trim

Calibration is back-validation against historical outcomes (run_back_validation.py → ledger entry → run_ledger_analysis.py for confusion-matrix metrics). User intuition does NOT enter the calibration loop.
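For readers unfamiliar with confusion-matrix metrics, here is a minimal sketch of how back-validated calls could be tallied. It assumes each historical call reduces to a pair `(predicted_buy, outperformed)`; how MAFIS actually defines a positive outcome is not stated here, and the `Confusion` and `tally` names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Confusion:
    tp: int = 0  # BUY call, stock outperformed
    fp: int = 0  # BUY call, stock did not outperform
    tn: int = 0  # PASS call, stock did not outperform
    fn: int = 0  # PASS call, stock outperformed

    @property
    def precision(self) -> Optional[float]:
        called = self.tp + self.fp
        return self.tp / called if called else None

    @property
    def recall(self) -> Optional[float]:
        actual = self.tp + self.fn
        return self.tp / actual if actual else None


def tally(outcomes: Iterable) -> Confusion:
    """Count TP/FP/TN/FN from (predicted_buy, outperformed) pairs."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for predicted_buy, outperformed in outcomes:
        if predicted_buy:
            counts["tp" if outperformed else "fp"] += 1
        else:
            counts["fn" if outperformed else "tn"] += 1
    return Confusion(**counts)
```

Under the "precision over recall" commitment, `precision` is the number to watch: it measures how many BUY calls were right, while `recall` measures how many winners were caught.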

Layer | Status | What's in it
Stage 2 quant prefilter | | Per-axis evaluate_ticker with NEED_LLM routing for missing data
Stage 3 light LLM screen | | Constitution §18 prompt, deterministic hierarchy gate
Stage 4 v2 prompts | | Axis-aligned attack distribution, strict-concede Defender, 4-rule Steward audit
Stage 5 value chain positioning | | NetworkX clustering + over/under-representation flags
Stage 6 HRP portfolio | | López de Prado HRP + 1%/30% bounds + cluster collision adjustment
Live screening (US Finnhub + KR DART) | | Symbol dispatcher, 7/7 KR tickers verified
Back-validation (Finnhub historical) | | Exact filed_date filter, 5-year horizon, 17/30 manifest tickers
Calibration ledger + analysis | | TP/FP/TN/FN classifier metrics across constitution versions
Tip annotation surface | | Read-only "user mentioned N days ago" metadata; never enters LLM context
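The 1%/30% bounds in Stage 6 raise a practical question: how to clamp weights while keeping them summing to 100%. The sketch below shows one possible approach (bisection on a common scale factor); it is not HRP itself and not necessarily what the repository does. The `apply_bounds` name and its behavior on infeasible inputs are assumptions.

```python
def apply_bounds(weights: dict, lo: float = 0.01, hi: float = 0.30) -> dict:
    """Clamp positive weights into [lo, hi] so that they still sum to 1.

    Finds a scale factor t with sum(clip(t * w, lo, hi)) == 1 by bisection,
    which keeps unclamped weights proportional to their raw values.
    """
    n = len(weights)
    if n == 0 or n * lo > 1.0 or n * hi < 1.0:
        raise ValueError("bounds cannot sum to 1 for this number of names")
    if any(w <= 0 for w in weights.values()):
        raise ValueError("weights must be strictly positive")

    def clipped_total(t: float) -> float:
        return sum(min(max(t * w, lo), hi) for w in weights.values())

    # At t_hi every weight is capped at hi, so the total is n * hi >= 1.
    t_lo, t_hi = 0.0, hi / min(weights.values())
    for _ in range(200):
        mid = (t_lo + t_hi) / 2.0
        if clipped_total(mid) < 1.0:
            t_lo = mid
        else:
            t_hi = mid
    return {k: min(max(t_hi * w, lo), hi) for k, w in weights.items()}
```

Note that a 30% cap cannot be satisfied with fewer than four names if the portfolio must be fully invested; this sketch raises `ValueError` in that case, and how MAFIS handles it (for example, by holding cash) is not described in this README.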

The legacy v1 6-agent crew runner (scripts/run_crew.py) remains available for single-ticker deep-dives; the v2 path (scripts/run_crew_v2.py) gates on Stage 2 + Stage 3 first and uses constitution-§19/§20/§21 prompts.

v2.0 CLI reference

# Live screening (Stage 2 + optional Stage 3)
python scripts/run_screening.py NVDA --with-stage3
python scripts/run_screening.py --universe data/calibration/manifest.yaml \
    --with-peers --with-rag-signals --with-tip-annotations

# Stage 4 v2 crew (gated by Stage 2/3)
python scripts/run_crew_v2.py NVDA

# Back-validation (uses Finnhub historical by default)
python scripts/run_back_validation.py
python scripts/run_back_validation.py --limit 3 --no-write   # dry-run

# Calibration ledger analysis
python scripts/run_ledger_analysis.py                  # latest entry
python scripts/run_ledger_analysis.py --list
python scripts/run_ledger_analysis.py --compare a.json b.json

# Stage 5 value chain positioning
python scripts/run_stage5_positioning.py NVDA AMD INTC

# Stage 6 portfolio construction (HRP + bounds + cluster trim)
python scripts/run_portfolio_construction.py NVDA AMD INTC \
    --graph data/value_chain.graph.json --positions positions.yaml

# Tip log gap analysis
python scripts/run_tip_gap_analysis.py NVDA AAPL MSFT --window-days 90

Legacy state (Phases 1–4 — pre-constitution-v2.0)

Phase | Status | What's in it
Phase 1 MVP | ✅ Complete | 5 agents + quality metrics (formal GO verdict)
Phase 2 | ✅ 98% | Economist + Steward + Defender + debate round + 3-layer audit + portfolio SQLite + auto-onboarding
Phase 3 | ✅ 98% | 3-Tier registry + SEC EDGAR RAG + DART + chain alerts + pre-filter stages 1–3 + dedup ledger
Phase 4 | 🟡 75% | Paper trading ledger + auto-record on crew completion + regression-diff tool
LLM backend abstraction | | Pluggable backends (Ollama / OpenAI-compat / MLX / llama.cpp) + per-agent model+sampling routing via config/agent_models.yaml

Sampling policy: each agent uses its model's published recommended sampling (e.g. Qwen 2.5 → temperature=0.7/top_p=0.8), so the same prompt run twice can produce different outputs. See docs/llm_backends.md for the full policy discussion and how to opt back into deterministic mode per agent.
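To make the policy concrete, the sketch below resolves a sampling profile from a model tag, with a per-agent deterministic override. The Qwen 2.5 and Llama 3.x values are the ones quoted in this README; the dictionary layout, the `sampling_for` helper, and the choice of `temperature=0` plus a fixed seed as "deterministic mode" are assumptions and do not reflect the real schema of config/agent_models.yaml.

```python
# Recommended profiles keyed by model-family prefix (values from this README).
RECOMMENDED = {
    "qwen2.5": {"temperature": 0.7, "top_p": 0.8},
    "llama3": {"temperature": 0.7, "top_p": 0.9},
}

# Assumption: greedy decoding with a fixed seed stands in for "deterministic".
DETERMINISTIC = {"temperature": 0.0, "top_p": 1.0, "seed": 0}


def sampling_for(model: str, deterministic: bool = False) -> dict:
    """Return sampling options for a model tag such as "qwen2.5:7b-16k"."""
    if deterministic:
        return dict(DETERMINISTIC)
    for prefix, profile in RECOMMENDED.items():
        if model.startswith(prefix):
            return dict(profile)
    raise KeyError(f"no recommended sampling profile for {model!r}")
```

Raising on an unknown model, rather than falling back to a default, keeps a typo in a model tag from silently changing sampling behavior.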


Setup

Requirements

1. Install Ollama + models (default backend)

curl -fsSL https://ollama.com/install.sh | sh
ollama serve &

# 16k context variants used by the crew. The :7b-16k / :8b-16k tags
# are custom Modelfile aliases that raise num_ctx from the 4096
# default — Stage 3 + Defender prompts exceed 4K. Build with:
bash scripts/create_qwen_16k.sh    # qwen2.5:7b-16k (Stage 3, Analyst, Valuer, Steward)
bash scripts/create_llama_16k.sh   # llama3.1:8b-16k (Skeptic)

Want a different backend? See Alternative LLM backends — MAFIS supports Apple Silicon (MLX), GGUF (llama.cpp), and any OpenAI-compatible server (vLLM, LM Studio, mlx_lm.server, …) without changing the agent code.

2. Configure .env

Copy .env.example to .env and fill in:

FINNHUB_API_KEY=...
FRED_API_KEY=...
DART_API_KEY=...              # required only for Korean tickers
TELEGRAM_BOT_TOKEN=...        # optional
TELEGRAM_CHAT_ID=...          # optional
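A quick way to reason about which keys are mandatory is sketched below. It reads the comments above as: Finnhub and FRED are always needed, DART only for Korean tickers, Telegram never. That reading, and the `missing_keys` helper itself, are assumptions; scripts/verify_env.py is the project's actual check.

```python
REQUIRED = ("FINNHUB_API_KEY", "FRED_API_KEY")
KOREAN_ONLY = ("DART_API_KEY",)


def missing_keys(env: dict, korean: bool = False) -> list:
    """List required keys that are absent or blank in `env`."""
    needed = REQUIRED + (KOREAN_ONLY if korean else ())
    return [key for key in needed if not env.get(key, "").strip()]
```

In practice this would be called with `dict(os.environ)` after the .env file has been loaded.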

3. Python env

uv venv
source .venv/bin/activate
uv pip install -e ".[dev]"

4. Verify

python scripts/verify_env.py      # API keys + Ollama reachable
pytest                            # should report 1150+ passed (constitution v2.0)

Daily workflow

Add a new ticker (60-min hand-authoring → 3-min auto-draft)

python scripts/onboard_ticker.py AMD --tier 2 --notes "GPU peer of NVDA"

This pulls Finnhub profile + peers, downloads the latest 10-K, indexes it into ChromaDB, drafts a value chain brief via Qwen, and registers the ticker in config/tickers.yaml. Output: docs/value_chains/AMD.draft.md — review the Vulnerable links section, then:

mv docs/value_chains/AMD.draft.md docs/value_chains/AMD.md

Korean tickers work the same way — the dispatcher detects 6-digit codes and routes through DART:

python scripts/onboard_ticker.py 005930 --tier 1  # Samsung Electronics

Run the full crew

python scripts/run_crew.py NVDA                   # US
python scripts/run_crew.py 005930                 # Korean

Output:

  • reports/<SYMBOL>_YYYYMMDD_HHMM.crew.md — six-section report + audit block
  • reports/<SYMBOL>_...meta.txt — timing / char counts / models used
  • Auto-inserted row in data/portfolio.sqlite paper-trades table
  • Optional Telegram push of the Korean summary

Inspect the portfolio

python scripts/portfolio_cli.py add NVDA --shares 10 --cost 5000 --tier 1
python scripts/portfolio_cli.py weights                  # live Finnhub quotes
python scripts/portfolio_cli.py gap NVDA --low 3 --high 5
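The `gap` command's arithmetic is presumably along these lines: compare a position's weight with a target band and report how much to buy or trim. The sketch below is a guess at that calculation, assuming the band is expressed in percent and that total portfolio value stays fixed (a rebalance funded from within the portfolio); `sizing_gap` is a hypothetical name.

```python
def sizing_gap(position_value: float, portfolio_value: float,
               low_pct: float, high_pct: float) -> float:
    """Return the value to buy (+) or trim (-) to land inside the band.

    Returns 0.0 when the current weight is already within [low_pct, high_pct].
    """
    if portfolio_value <= 0:
        raise ValueError("portfolio_value must be positive")
    weight_pct = 100.0 * position_value / portfolio_value
    if weight_pct < low_pct:
        return low_pct * portfolio_value / 100.0 - position_value
    if weight_pct > high_pct:
        return high_pct * portfolio_value / 100.0 - position_value
    return 0.0
```

For example, a 2,000 position in a 100,000 portfolio with a 3–5% band is 1,000 short of the lower bound, so the sketch returns `1000.0`.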

Track paper-trade P&L over time

python scripts/paper_ledger.py list                      # all recorded verdicts
python scripts/paper_ledger.py returns                   # mark-to-market
python scripts/paper_ledger.py summary                   # win rate, audit effect
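The kind of comparison `summary` reports can be sketched as a per-verdict mark-to-market aggregate. The record fields (`verdict`, `entry_price`, `last_price`) and the `summarize` function are hypothetical; the real ledger schema lives in the paper_trades table.

```python
from collections import defaultdict
from statistics import mean


def summarize(trades: list) -> dict:
    """Group simple returns by verdict and report count, mean, and win rate."""
    returns = defaultdict(list)
    for trade in trades:
        simple_return = (trade["last_price"] - trade["entry_price"]) / trade["entry_price"]
        returns[trade["verdict"]].append(simple_return)
    return {
        verdict: {
            "count": len(values),
            "mean_return": mean(values),
            "win_rate": sum(r > 0 for r in values) / len(values),
        }
        for verdict, values in returns.items()
    }
```

If BUY verdicts do not show a higher mean return than PASS verdicts over a meaningful sample, that is evidence the crew's judgment is not adding value.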

Monitor news → chain alerts

# One-off scan (prints alerts; with --dedup, already-sent alerts are not fired again)
python scripts/scan_chain_alerts.py --dedup --hops 2

# Crontab entry (with Telegram push). cron does not support backslash
# line continuation, so the entry must stay on a single line.
0 9-16 * * 1-5  cd ~/MAFIS && /path/to/.venv/bin/python scripts/scan_chain_alerts.py --dedup --telegram >> /var/log/mafis_alerts.log 2>&1
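Deduplication with a cooldown, as used by `--dedup`, can be sketched with a small SQLite table. This is only an illustration of the idea: the table name, the `alert_key` format, and the 24-hour default are assumptions, not the schema of the project's alerts/ledger.py.

```python
import sqlite3
import time


def open_ledger(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the dedup ledger."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sent_alerts ("
        "alert_key TEXT PRIMARY KEY, last_sent REAL NOT NULL)"
    )
    return conn


def should_fire(conn: sqlite3.Connection, alert_key: str,
                cooldown_s: float = 86400.0, now: float = None) -> bool:
    """Return True and record the send unless the key is still cooling down."""
    now = time.time() if now is None else now
    row = conn.execute(
        "SELECT last_sent FROM sent_alerts WHERE alert_key = ?", (alert_key,)
    ).fetchone()
    if row is not None and now - row[0] < cooldown_s:
        return False
    conn.execute(
        "INSERT OR REPLACE INTO sent_alerts (alert_key, last_sent) VALUES (?, ?)",
        (alert_key, now),
    )
    conn.commit()
    return True
```

Because the ledger persists on disk when given a file path, an hourly cron run would not resend the same alert until the cooldown expires.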

Promote Tier 3 → Tier 2 based on news activity

python scripts/prefilter_scan.py --graph-context --semantic

Runs Stages 1 (keyword), 2 (value-chain context), and 3 (Qwen materiality filter) against the news pool and recommends promotions.

Validate prompt / model tweaks didn't regress quality

python scripts/regression_compare.py \
    reports/NVDA_20260424_1557.crew.md \
    reports/NVDA_20260425_0900.crew.md \
    --fail-on-regression

Architecture

src/wise_investor/
├── agents/                  # crew: analyst, valuer, skeptic, defender, steward, economist
│   ├── steward_audit.py     # discipline matrix + speculative-language + Defender-aware
│   └── runner.py            # pre_gather_facts dispatcher (US → Finnhub, KR → DART)
├── data/
│   ├── finnhub.py           # US fundamentals
│   ├── dart.py              # Korean fundamentals (OpenDART)
│   ├── dart_facts.py        # KR → crew facts adapter (KRW→USD via FRED)
│   ├── fred.py              # macro snapshot (Economist)
│   └── cross_validate.py
├── rag/
│   ├── edgar.py             # SEC EDGAR downloader + cache
│   ├── sections.py          # Business / Risk Factors / MD&A / Quant Market Risk extractor
│   ├── index.py             # ChromaDB persistent store
│   └── integration.py       # crew pre_gather hook
├── geopolitics/
│   ├── gdelt.py             # GDELT DOC 2.0 client
│   ├── google_news.py       # RSS parser
│   └── snapshot.py          # per-symbol geopolitical context
├── alerts/
│   ├── chain_alerts.py      # value-chain graph × news → target alerts
│   └── ledger.py            # SQLite dedup + cooldown
├── filters/
│   ├── pre_filter.py        # Stages 1 (keyword) + 2 (graph context)
│   └── semantic.py          # Stage 3 Qwen materiality filter
├── onboarding/
│   ├── brief_generator.py   # Finnhub + 10-K + geo → Qwen-drafted value chain brief
│   └── tickers_yaml.py      # 3-Tier registry CRUD
├── portfolio/
│   └── store.py             # positions + sizing-gap helper
├── paper_trading/
│   ├── ledger.py            # paper_trades table + performance metrics
│   └── report_parser.py     # parse Steward verdict + audit flag from crew report
├── regression/
│   └── compare.py           # structured crew-report diff tool
├── value_chain/
│   ├── graph.py             # NetworkX-backed typed DiGraph
│   └── parser.py            # docs/value_chains/*.md → graph
├── quality/
│   ├── metrics.py           # 6 automated quality scores
│   └── citation_audit.py    # edgar.* grounding + Skeptic mandate audit
└── notify/
    └── telegram.py

scripts/                     # CLI entry points for every component above
docs/value_chains/           # hand-curated + auto-drafted briefs (*.md vs *.draft.md)
data/                        # portfolio.sqlite, chroma/, edgar_cache/, facts_cache/
tests/                       # 750+ tests (offline; live ones marked -m network)

Core principles

  • Local-first, API-last: Phase 1 runs with $0 LLM spend. Finnhub / FRED / GDELT / DART are free public APIs.
  • LLM is judgment, Python is calculation: every dollar value, ratio, and growth rate is computed by src/wise_investor/tools/ or data/ and fed to the LLM as prepared facts. The LLM synthesizes narrative, never arithmetic.
  • Sampling follows model recommendations: each agent uses the sampling profile published by its model author (Qwen 2.5: 0.7/0.8; Llama 3.x: 0.7/0.9; Qwen3 thinking: 0.6/0.95/min_p=0). Two runs of the same crew may differ; the audit + citation system enforce within-run consistency, not run-to-run reproducibility. Opt back into deterministic mode per agent in config/agent_models.yaml.
  • Multi-layer audit: discipline matrix (verdict vs labels) + speculative-language detector + Defender-aware correction + edgar citation grounding + Skeptic mandate compliance. The LLM can emit any narrative; Python enforces the rules.
  • Paper trading before real trading: every Steward verdict is automatically recorded with entry price. paper_ledger.py summary tells you whether BUY verdicts actually outperform PASS verdicts over time — the only objective answer to "is this system useful?".
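The "LLM is judgment, Python is calculation" principle can be illustrated with a tiny fact-preparation step: ratios are computed and formatted in Python, and only the finished strings reach the prompt. The `prepare_facts` function, its inputs, and the "M" (millions) unit are hypothetical and do not mirror the project's tools/ modules.

```python
def prepare_facts(revenue: float, prior_revenue: float, net_income: float) -> list:
    """Compute growth and margin in Python and return prompt-ready lines.

    Inputs are assumed to be in millions of USD for this sketch.
    """
    growth = (revenue - prior_revenue) / prior_revenue
    margin = net_income / revenue
    return [
        f"Revenue: ${revenue:,.0f}M",
        f"Revenue growth YoY: {growth:+.1%}",
        f"Net margin: {margin:.1%}",
    ]


# The joined block is what an agent prompt would quote from; the model is
# never asked to derive these numbers itself.
facts_block = "\n".join(prepare_facts(120.0, 100.0, 30.0))
```

Because every figure in the block has a Python origin, a later audit can check report numbers against it mechanically.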

Telegram push (optional)

  1. Create a bot with @BotFather → copy the token.
  2. Send any message to your bot (creates the chat).
  3. Visit https://api.telegram.org/bot<TOKEN>/getUpdates → copy the chat.id.
  4. Add to .env:
    TELEGRAM_BOT_TOKEN=...
    TELEGRAM_CHAT_ID=...
    
  5. run_crew.py auto-pushes a Korean summary; scan_chain_alerts.py --telegram pushes chain alerts.

No configuration → silent skip, no errors.
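The silent-skip behavior can be sketched as below, using the Telegram Bot API `sendMessage` method. The `push_telegram` helper and its injectable `env` argument are hypothetical; the project's notify/telegram.py may differ.

```python
import json
import os
import urllib.parse
import urllib.request


def push_telegram(text: str, env: dict = None) -> bool:
    """Send `text` to the configured chat; return False if not configured."""
    env = os.environ if env is None else env
    token = env.get("TELEGRAM_BOT_TOKEN")
    chat_id = env.get("TELEGRAM_CHAT_ID")
    if not token or not chat_id:
        return False  # no configuration: skip silently, no error
    url = f"https://api.telegram.org/bot{token}/sendMessage"
    payload = urllib.parse.urlencode({"chat_id": chat_id, "text": text}).encode()
    request = urllib.request.Request(url, data=payload)
    with urllib.request.urlopen(request, timeout=10) as response:
        return bool(json.load(response).get("ok", False))
```

Returning `False` rather than raising when either variable is missing is what lets the crew run unchanged on machines without a bot.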


Limitations

  • Korean-ticker crew runs share the English agent prompts; the Analyst will produce English analysis of Korean financials. A follow-up will branch the prompts by source country.
  • Value chain graph auto-update from 10-K text is not yet implemented. Briefs are either hand-curated or onboarding-drafted and then hand-reviewed.
  • No paper-trade position sizing — the ledger records Steward verdicts only; actual position sizing per trade is manual.
  • No OpenClaw integration (design §8.1); Telegram covers the equivalent role.

See docs/MVP_EVALUATION.md for the Phase 1 formal evaluation and Phase 2+ priorities.

About

A value-investor's AI research assistant. Local LLMs do the judgment, Python does the math. Every number in every report is traceable to source.
