CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

See .claude/CLAUDE.md for full workflow conventions, commit protocol, privacy rules, and pipeline run commands.

Ecosystem

Part of the Korean forensic accounting toolkit.

Hub: ../forensic-accounting-toolkit/ | GitHub
Task board: https://github.com/users/pon00050/projects/1
Role: Platform (integrates all foundation and analysis libraries)
Depends on: kr-trading-calendar, kr-forensic-core, kr-dart-pipeline, kr-anomaly-scoring, kr-stat-tests
Consumed by: end users (CLI, FastAPI, MCP server)

Common Commands

# Install / sync
uv sync --extra dev

# Run tests
uv run pytest tests/ -v

# Run a single test
uv run pytest tests/test_pipeline_invariants.py::test_function_name -v

# Full pipeline (smoke test)
uv run python 02_Pipeline/pipeline.py --market KOSDAQ --start 2021 --end 2023 --sample 5 --sleep 0.1

# Re-run everything including analysis
krff refresh --sample 5

# Status / audit
krff status -v
krff audit --verbose

`krff/` Module Map

Module	Purpose
`_paths.py`	`PROJECT_ROOT`, `PROCESSED_DIR` constants; overridable via `KRFF_PROJECT_ROOT` / `KRFF_DATA_DIR` env vars
`constants.py`	Shared thresholds: `BENEISH_THRESHOLD = -1.78`, CB/BW flag names, `PRICE_WINDOW_TRADING_DAYS = 60` (used by `03_Analysis/_scoring.py` only, not CLI-exposed)
`db.py`	In-memory DuckDB over parquet files; `query()`, `read_table()` — no persistent `.duckdb` file
`data_access.py`	`load_parquet()`, `load_csv()`, `load_company_name()` — reusable loaders extracted from `report.py`
`models.py`	Pydantic models documenting dict-returning function contracts; not yet enforced at runtime
`audit.py`	DAG-based freshness checker; `get_audit()` → used by `krff audit`
`stats_runner.py`	14-node stats DAG; `get_stats_audit()` → used by `krff stats`
`analysis.py`	Thin wrapper: loads `beneish_scores.parquet` for `krff analyze`
`status.py`	Artifact inventory formatter for `krff status`
`quality.py`	Null-rate / coverage gap checker for `krff quality`
`report.py`	Per-company HTML report generator; calls Claude API for narrative synthesis
`mcp_server.py`	FastMCP server (10 tools); mounted at `/mcp/` by `app.py`; all tools must return via `mcp_utils.sanitize_for_json()`
`mcp_utils.py`	JSON serialization helpers: converts numpy types, `NaN`, `pd.Timestamp` — required for all MCP tool returns
`charts.py`	Beneish visualizations → `beneish_viz.html`
`pipeline.py`	Lazy proxy to `02_Pipeline/pipeline.py`; avoids heavy imports at CLI startup
`review.py`	Review queue CRUD (SQLite); used by `krff queue/surface/hide/assess`

Known Gaps

Gap	Why	Status
Phase 3 stubs (`monitor/dart_rss.py`, `cli.py monitor/alerts`, `app.py alerts/monitor`)	Phase 3 not started	Deferred — Phase 3
Pydantic models in `krff/models.py` not enforced at runtime	Functions return raw dicts; FastAPI validates at serialization boundary only	Deferred — Phase 3
`extract_bondholder_register.py:91` — pagination missing	Only fetches first page of CB filings (>100 not handled)	Unblocked — low priority
3 standalone extractors not wired into `pipeline.py`	`extract_bondholder_register`, `extract_depreciation_schedule`, `extract_revenue_schedule` produce parquets but can't be triggered via `krff refresh`	By design — standalone scripts
WICS sector data has no historical snapshots (`extract_dart.py:157`)	WICS serves recent dates only; joins to historical financials use current sector	By design — upstream limitation
`tests/conftest.py` appends kr_dart_pipeline dir to sys.path for bare `import transform`	`test_pipeline_invariants.py` uses bare module names (4 locations); updating to absolute imports requires test refactor	Unblocked — low priority
`02_Pipeline/` and `03_Analysis/` are monolith copies of kr-dart-pipeline / kr-anomaly-scoring	Needed for `krff/pipeline.py` proxy and test bare imports; divergence from canonical packages is a silent risk	Deferred — remove after test refactor

Architecture Shape

Two separate entry points both read from the same parquet layer:

02_Pipeline/pipeline.py  →  01_Data/processed/*.parquet  ←  krff/db.py
                                                          ←  krff/data_access.py
03_Analysis/run_*.py     →  03_Analysis/*.csv

app.py (FastAPI)  →  krff/mcp_server.py  →  10 tools over parquets + CSVs
cli.py (krff)     →  krff/* thin wrappers

krff/db.py is the canonical query layer. Do not read parquets directly with pd.read_parquet in new code — use db.read_table() or data_access.load_parquet() so path resolution and DuckDB registration are consistent.

Working notes (regulatory analysis, legal compliance research, or anything else not appropriate for this public repo) belong in the gitignored working directory of the coordination hub. Engineering docs (API patterns, test strategies, run logs) stay here.

NEVER commit to this repo

This repository is public. Before staging or writing any new file, check the list below. If the content matches any item, route it to the gitignored working directory of the coordination hub instead, NOT to this repo.

Hard NO list:

Any API key, token, or credential — even a truncated fingerprint. This includes Anthropic key fingerprints (sk-ant-...), AWS keys (AKIA...), GitHub tokens (ghp_...), DART/SEIBRO/KFTC API keys, FRED keys. Even partial / display-truncated keys (e.g. "sk-ant-api03-...XXXX") leak the org-to-key linkage and must not be committed.
Payment / billing data of any kind. Card numbers (full or last-four), invoice IDs, receipt numbers, order numbers, billing-portal URLs, Stripe/Anthropic/PayPal account states, monthly-spend caps, credit balances.
Vendor support correspondence. Subject lines, body text, ticket IDs, or summaries of correspondence with Anthropic / GitHub / Vercel / DART / any vendor's support team.
Named third-party outreach targets. Specific company names, hedge-fund names, audit-firm names, regulator-individual names appearing in a planning, pitch, or outreach context. Engineering content discussing Korean financial institutions in a neutral domain context (e.g. "DART is the FSS disclosure system") is fine; planning text naming them as a sales target is not.
Commercial-positioning memos. Documents discussing buyer segments, monetization models, pricing strategy, competitor analysis, market positioning, or go-to-market plans. Research methodology and technical roadmaps are fine; commercial strategy is not.
Files matching the leak-prevention .gitignore patterns (*_prep.md, _billing, _outreach, _strategy, _positioning, _pricing, _buyer, _pitch, product_direction.md, etc.). If you find yourself wanting to write a file with one of these names, that is a signal that the content belongs in the hub working directory.

When in doubt: put the content in the hub working directory (gitignored), not this repo. It is always safe to add later. It is expensive to remove after force-pushing — orphaned commits remain resolvable on GitHub for weeks.

GitHub Push Protection is enabled on this repo and will reject pushes containing well-known credential patterns. That is a backstop, not the primary defense — write-time discipline is.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CLAUDE.md

Ecosystem

Common Commands

`krff/` Module Map

Known Gaps

Architecture Shape

NEVER commit to this repo

FilesExpand file tree

CLAUDE.md

Latest commit

History

CLAUDE.md

File metadata and controls

CLAUDE.md

Ecosystem

Common Commands

krff/ Module Map

Known Gaps

Architecture Shape

NEVER commit to this repo

`krff/` Module Map