
# Roadmap

## Milestones

| # | Milestone | Status | Script |
| --- | --- | --- | --- |
| 1 | Beneish M-Score screen | Complete | beneish_screen.py |
| 2 | CB/BW timelines | Implemented, runnable — output quality improved (session 61) | cb_bw_timelines.py |
| 3 | Timing anomalies | Implemented, runnable — output quality improved (session 61) | timing_anomalies.py |
| 4 | Officer network graph | Implemented, runnable — output quality improved (session 61) | officer_network.py |

## Phase 2 Data (extracted)

| Table | Description |
| --- | --- |
| cb_bw_events.parquet | CB/BW issuance events from DART DS005; 11 cols including issue_amount, maturity_date, refixing_floor, board_date, warrant_separable |
| price_volume.parquet | OHLCV ±60 day windows around events |
| corp_ticker_map.parquet | corp_code ↔ ticker mapping |
| officer_holdings.parquet | Officer holding changes |
| disclosures.parquet | DART filing listings — 921 corps / 271,504 rows (expanded from 58 corps in session 31) |
| major_holders.parquet | 5%+ ownership threshold filings |
| bondholder_register.parquet | CB bondholder names from 사채권자명부 |
| revenue_schedule.parquet | Revenue by customer/segment from 매출명세서 |
| bond_isin_map.parquet | 1,859 validated bond ISINs / 656 corp_codes via FSC API (dataset 15043421); required by SEIBRO StockSvc extractor |

## Codebase Cleanup — Completed (Sessions 34–35)

Session 34: 22 issues across 3 phases (bugs/security, performance, consolidation).

| Phase | Scope | Key changes |
| --- | --- | --- |
| A (bugs/security) | 4 items | ServiceKey casing fix (KI-022), DuckDB SQL escaping (KI-023), narrowed except blocks, file handle leak |
| B (performance) | 5 items | Pre-grouped price lookups (~900M comparisons eliminated), lazy WICS probe, DataFrame concat, cached parquet reads, lazy plotly import |
| C (consolidation) | 5 items | 4 duplicate functions → _pipeline_helpers.py, DART status constants, src/constants.py, src/_paths.py, removed 13 redundant sys.path.insert |

Session 35: 3 phases addressing remaining structural issues.

| Phase | Scope | Key changes |
| --- | --- | --- |
| D (constants adoption) | 3 files | 8 flag literals + 3 threshold literals → src/constants.py imports |
| E (scoring extraction) | 3 files | ~150-line scoring logic deduplicated into 03_Analysis/_scoring.py; fixed Marimo missing flag_count + conditional peak_date (KI-025) |
| F (loader consolidation) | 2 files | 7 report loaders → 2 generic + 2 special; removed dead _load_financials(); fixed double beneish parquet read |

168 tests pass. See CHANGELOG.md and KNOWN_ISSUES.md (KI-022 through KI-025) for full details.

## What's Next

  1. SEIBRO repricing data — API key registered but inactive (resultCode=99, KI-012). Blocking permutation_repricing_peak.py and survival_repricing.py. SEIBRO provides the only structured source of per-event repricing dates/prices and exercise batches (Flags 1 & 2). For commercial use, KSD 별도이용허락 required (call 051-519-1420). See doc 54 (54_DART_SEIBRO_Technical_Comparison.md) for verified signal-by-signal assessment — DART cannot replace SEIBRO for these signals.
  2. Populate paid-tier tables — run paid-tier extractors at scale for flagged companies
  3. Statistical analysis layer — 10 ISL-grade scripts written; S1–S5 complete (session 24); findings in FINDINGS.md

## Statistical Analysis — Completed (Session 25)

| ID | Description | Outcome |
| --- | --- | --- |
| S9 | Cross-screen: PC3 top-decile × flagged CB/BW events | 170 double-flagged company-years; 143 unique secondary companies (updated session 33 with holdings_flag live); 8 high-priority secondaries (PC3≥95th AND flag_count≥2) — was 0 in all prior runs; top lead: 캔버스엔 (00550082, PC3_rank 0.9984) |
| S8 | Run extract_depreciation_schedule.py for 5 Tier 1 leads | All 15 rows = parse_error or no_filing; DART sub_docs keyword matching returns wrong table type for these companies; Category 20 tests flip from 8 skipped → 8 passed; FINDINGS.md §4 updated with root cause |
| S10a | Extract disclosures for 50 unflagged control companies | disclosures.parquet expanded from 8 → 58 corp_codes (3,581 → 27,486 rows; +23,905 control rows) |
| S10b | Rebuild FDR null from control disclosures × price data | Control null: 2,000 quiet events; 687/687 test events trivially survive BH — KI-021 diagnosed: pre-filtering makes any clean null give p≈0; valid test requires unfiltered input → S11 |
| S11 | Proper FDR disclosure leakage test (fixes KI-021) | fdr_disclosure_leakage.py written; 2/822 events survive BH at q=0.043 — 피씨엘 2021-01-18 (+287%) and 프로브잇 2021-06-14 (+143%); p-value distribution shows mild enrichment near 0; KI-021 RESOLVED. Revised in session 31 (disclosures expanded 58→921 corps): 0/822 survivors — previous 2-survivor result was artifact of weak null (50 corps); with 811 control corps the signal doesn't survive BH; mild p-value enrichment near 0 persists (72 vs 41 expected) |

## Statistical Analysis — Completed (Session 24)

| ID | Description | Outcome |
| --- | --- | --- |
| S1 | Fix cluster_peers.py z-score contamination (KI-020) | 50 cluster-relative flags (was 0); KI-020 resolved |
| S2 | Investigate 김형석 and 박정우 | Confirmed 4 and 2 flagged companies respectively; no Tier 1 lead overlap; 박정우 confirmed as 전무이사 at 우리기술 with CB acquisition; see FINDINGS.md §5a |
| S3 | Redesign FDR null distribution | timing_anomalies.csv pre-filtered (all extreme events); clean null requires full disclosures.parquet join — new blocker documented |
| S4 | PC3 as alternative manipulation screen | 531 top-decile company-years; 6 of 18 Tier 1 lead company-years in top decile; pca_pc3_scores.csv output added |
| S5 | Depreciation extractor for Tier 1 leads | extract_depreciation_schedule.py written; Category 20 schema test added; ready to run |

## Statistical Analysis — Remaining Action Items

### Completed (Session 38)

| ID | Description | Outcome |
| --- | --- | --- |
| | Session 39: label expansion + blind spot docs | 아스트 (FSC 22억 fines) + 휴림로봇 (검찰 기소) fraud=1; labels 28→30 (17 fraud=1); bootstrap −1.85 (CI [−2.85, −0.90]); RF AUC 0.738±0.201; FINDINGS.md §10 blind spots detailed; KI-026 (refresh --sample destructive) |
| | krff reports for 8 high-priority secondaries | Generated 8 HTML reports (캔버스엔, 스피어, 알티캐스트, 아스트, 라닉스, 휴림로봇, 엑시온그룹, 유일에너테크); all dual-flagged (holdings_decrease + volume_surge); SEIBRO still resultCode=99 (day 4) |
| | Label coverage analysis | label_coverage_analysis.py written; 13/14 Beneish (93%); 10/14 CB/BW (71%); 6/14 dual (43%); 에코앤드림 Beneish blind spot; §10 in FINDINGS.md |
| | Label expansion — 알티캐스트 | Web search confirmed CEO 서정규 배임 기소 2023-12-19 (특경법); added as fraud=1; labels 27→28 (15 fraud=1); bootstrap −1.75 stable; RF AUC 0.740; TATA −0.101 |
| A1 | Automate recurring data refresh | krff refresh command added to cli.py; 6-stage wrapper; --sample + --skip-analysis flags; 168 tests pass |
| A2 | Pipeline freshness checker | krff audit command added; DAG encodes 6 stages; detects stale outputs via mtime comparison; propagates staleness downstream; 7 new tests; 230 total pass |
| A3 | Statistical test orchestrator | krff stats command added; STATS_DAG (14 nodes); --dry-run + --verbose; 8 new tests; 250 total pass — Complete (Session 62) |

### Completed (Session 36)

| ID | Description | Outcome |
| --- | --- | --- |
| S13 | Expand labels.csv with confirmed Korean fraud cases | 15→27 labels; 4 new fraud=1 (초록뱀그룹 CB배임 cases + 셀리버리); 8 new fraud=0. Bootstrap threshold: −0.75→−1.75 (near US −1.78). RF AUC: 0.670→0.786 ± 0.182. TATA negative coefficient confirmed as stable KOSDAQ pattern. FINDINGS.md §9 added. |
| S14 | Pipeline validation against confirmed fraud companies | All 4 confirmed fraud companies caught by M-score (≥1 year above −1.78). 초록뱀그룹: flag_count=1 (volume_surge). 셀리버리: flag_count=0 (disclosure fraud, not CB abuse). Cross-Script Synthesis updated. SEIBRO still resultCode=99. |

### Ready now (no external dependencies)

(none — all non-blocked items complete)

### Completed (Session 28)

| ID | Description | Outcome |
| --- | --- | --- |
| S6a | Run build_isin_map.py --sample 50 | 0 ISINs found — DART CB/BW filings don't contain bond ISINs; approach invalid; need KRX/SEIBRO alternative |
| S7 | Expand labels.csv to ≥10 rows; run 3 blocked scripts | 15 labels (10 fraud=1, 5 fraud=0); bootstrap median=-0.75 (CI [-2.55,-0.50], US -1.78 inside); Lasso: DSRI/TATA/SGI/GMI active; RF AUC=0.670 |

### Completed (Session 26)

| ID | Description | Outcome |
| --- | --- | --- |
| S12 | Fix extract_seibro_repricing.py (4 endpoint/param bugs + ISIN join key); write build_isin_map.py | extractor now uses StockSvc/getXrcStkOptionXrcInfoN1 + getXrcStkStatInfoN1 with bondIsin param; build_isin_map.py extracts ISINs from DART CB filings via regex |
| S12b | Probe extract_seibro.py websquare endpoints | All 4 return HTML shell (545 chars, JS redirect) — WebSquare requires browser session; superseded by data.go.kr REST API |

### Completed (Session 29)

| ID | Description | Outcome |
| --- | --- | --- |
| S6a | Populate bond_isin_map.parquet | RESOLVED. DART approach failed (ISINs not in filings); switched to FSC 금융위원회 채권발행정보 API (dataset 15043421, getIssuIssuItemStat). Full run (session 30): 2,718 ISINs across 685 corp_codes (of 919 queried). All 5 Tier 1 leads have ISINs. |

### Blocked (external dependencies)

| ID | Description | Blocked by |
| --- | --- | --- |
| S6 | Run extract_seibro_repricing.py → re-run permutation_repricing_peak.py + survival_repricing.py | SEIBRO API key activation only |

## Statistical Layer Methodological Fixes — Session 64

Three methodological problems in the supervised statistical layer fixed (session 64, Mar 9 2026). Findings remain directional until scripts are re-run with new CI values recorded.

| # | Problem | Fix | Scripts |
| --- | --- | --- | --- |
| 1 | Row-level bootstrap inflated effective n, narrowed CI artificially | cluster_bootstrap_sample() helper — resamples at company level | bootstrap_threshold.py |
| 2 | auto_controls() used m_score < -2.5 to pick controls — same metric being calibrated | External criteria only: no CB/BW events, ≥3 scoreable years, neutral sort | all 3 scripts |
| 3 | Standard k-fold / LOO mixed same-company years across train/test | GroupKFold(corp_code) with groups= argument | lasso_beneish.py, rf_feature_importance.py |

8 new invariant tests added (261 total). Re-run scripts to update calibration values in MEMORY.md.
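Fixes 1 and 3 can be sketched as follows. `cluster_bootstrap_sample()` here is a hypothetical re-implementation of the helper named above (the real one lives in `bootstrap_threshold.py` and may differ); the `GroupKFold` usage mirrors fix 3, keeping all years of one company on the same side of the split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold

def cluster_bootstrap_sample(df, group_col="corp_code", rng=None):
    # Fix 1: resample at the company level. Draw companies with
    # replacement, then keep every company-year row of each drawn
    # company, so the effective n reflects companies, not rows.
    rng = rng if rng is not None else np.random.default_rng(0)
    codes = df[group_col].unique()
    drawn = rng.choice(codes, size=len(codes), replace=True)
    return pd.concat([df[df[group_col] == c] for c in drawn],
                     ignore_index=True)

# Toy panel: 7 companies, some with multiple years (values illustrative).
panel = pd.DataFrame({
    "corp_code": list("AABBCCDEFG"),
    "m_score":   np.linspace(-3.0, 0.0, 10),
    "fraud":     [1, 1, 0, 0, 1, 1, 0, 0, 1, 0],
})

# Fix 3: GroupKFold keyed on corp_code never mixes a company's years
# across train and test.
for tr, te in GroupKFold(n_splits=3).split(panel, panel["fraud"],
                                           groups=panel["corp_code"]):
    assert not set(panel["corp_code"].iloc[tr]) & set(panel["corp_code"].iloc[te])
```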

## Chapter 6 Methodological Additions — Session 65

Two Chapter 6 (ISL) outputs added to lasso_beneish.py (session 65, Mar 9 2026).

| # | Addition | Output | Detail |
| --- | --- | --- | --- |
| 4 | Regularization path | lasso_path.csv + lasso_path.png | compute_lasso_path() helper using sklearn.lasso_path(); 100-alpha × 8-component matrix; Plotly line chart; shows entry order as penalty decreases |
| 5 | EPV check | EPV printed at startup; epv column in lasso_coefficients.csv | events_per_variable() helper; 17/8 = 2.1 (below accepted minimum 10–15); WARNING printed when EPV < 10 |

6 new invariant tests added (267 total). lasso_beneish.py re-run required to generate lasso_path.csv / lasso_path.png.
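A minimal sketch of what addition 4 computes, assuming `compute_lasso_path()` wraps `sklearn.lasso_path()` as described above (the real helper also writes `lasso_path.csv` and a Plotly chart, omitted here; the data is synthetic):

```python
import numpy as np
from sklearn.linear_model import lasso_path
from sklearn.preprocessing import StandardScaler

def compute_lasso_path(X, y, n_alphas=100):
    # Standardize, then trace coefficients over a decreasing penalty grid.
    Xs = StandardScaler().fit_transform(X)
    alphas, coefs, _ = lasso_path(Xs, y, n_alphas=n_alphas)
    return alphas, coefs.T  # shape: (n_alphas, n_features)

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 8))              # stand-in for 8 Beneish components
y = 0.8 * X[:, 0] + rng.normal(scale=0.5, size=40)
alphas, path = compute_lasso_path(X, y)

# Entry order: features join the model one by one as the penalty shrinks,
# so the largest alpha (row 0) has no more active features than the smallest.
assert (path[0] != 0).sum() <= (path[-1] != 0).sum()
```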

## Chapter 8 Methodological Additions — Session 66

Two Chapter 8 (ISL — Tree-Based Methods) outputs added to rf_feature_importance.py (session 66, Mar 9 2026).

| # | Addition | Output | Detail |
| --- | --- | --- | --- |
| 6 | Impurity vs. permutation importance comparison | rf_importance_comparison.csv + rf_importance_comparison.png | compare_importance_methods() helper; returns DataFrame with rf_rank, perm_rank, rank_divergence; scatter plot (rf_rank vs perm_rank) shows method agreement/disagreement; pure function — takes precomputed arrays |
| 7 | RF EPV check | EPV printed at startup; epv column in rf_importance.csv | rf_events_per_variable() helper; denominator is dynamic feature count (not fixed 8); WARNING when EPV < 10 (expected: 17/~15 ≈ 1.1) |

9 new invariant tests added (276 total). rf_feature_importance.py re-run required to generate rf_importance_comparison.csv / rf_importance_comparison.png.
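A sketch of the comparison in addition 6. Only the column names `rf_rank` / `perm_rank` / `rank_divergence` come from the roadmap; the helper body and the synthetic data below are illustrative, not the project's actual implementation:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

def compare_importance_methods(rf, X, y, feature_names, seed=0):
    # Impurity importance is computed on training structure and can be
    # biased; permutation importance measures actual score degradation.
    perm = permutation_importance(rf, X, y, n_repeats=10, random_state=seed)
    out = pd.DataFrame({
        "feature": feature_names,
        "rf_importance": rf.feature_importances_,
        "perm_importance": perm.importances_mean,
    })
    out["rf_rank"] = out["rf_importance"].rank(ascending=False)
    out["perm_rank"] = out["perm_importance"].rank(ascending=False)
    out["rank_divergence"] = (out["rf_rank"] - out["perm_rank"]).abs()
    return out.sort_values("rank_divergence", ascending=False)

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = (X[:, 0] + 0.1 * rng.normal(size=60) > 0).astype(int)
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
cmp = compare_importance_methods(rf, X, y, [f"f{i}" for i in range(5)])
```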

Deferred Ch. 8 additions (future session):

  • OOB score as bootstrap-consistent internal estimate (oob_score=True in RandomForestClassifier)
  • Importance stability across seeds (Spearman rank correlation across N random_state values)

## Chapter 12 Methodological Additions — Session 67

Two Chapter 12 (ISL — Unsupervised Learning) outputs added (session 67, Mar 9 2026).

| # | Addition | Script | Output | Detail |
| --- | --- | --- | --- | --- |
| 8 | PCA loading interpretation | pca_beneish.py | pca_top_loadings.csv | pca_top_loadings() helper; pure function; top-3 features per PC by absolute loading |
| 9 | GMM AIC/BIC vs k-means | cluster_peers.py | cluster_gmm_fit.csv | gmm_aic_bic() helper; fits GMM at K_VALUES=[6,8,10]; reports AIC/BIC per k; directly addresses Problem 6 (k-means assumes hyperspherical clusters) |

6 new invariant tests added (282 total). Both scripts need re-run to generate new outputs.
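Addition 9 can be sketched like this, assuming `gmm_aic_bic()` simply fits a `GaussianMixture` per k and tabulates both criteria. The real helper uses `K_VALUES=[6,8,10]`; the smaller k values and synthetic data here just keep the example fast:

```python
import numpy as np
import pandas as pd
from sklearn.mixture import GaussianMixture

def gmm_aic_bic(X, k_values=(6, 8, 10), seed=0):
    # Lower AIC/BIC = better fit, penalized for model size. Unlike
    # k-means, a GMM with full covariances does not force clusters to be
    # hyperspherical with equal variance.
    rows = []
    for k in k_values:
        gmm = GaussianMixture(n_components=k, random_state=seed).fit(X)
        rows.append({"k": k, "aic": gmm.aic(X), "bic": gmm.bic(X)})
    return pd.DataFrame(rows)

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))   # stand-in for the 8 Beneish components
fit = gmm_aic_bic(X, k_values=(2, 3, 4))
```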

## Chapter 13 Methodological Additions — Session 68

Two Chapter 13 (ISL — Multiple Testing) outputs added (session 68, Mar 9 2026).

| # | Addition | Script | Output | Detail |
| --- | --- | --- | --- | --- |
| 10 | Storey π₀ estimator | fdr_timing_anomalies.py | fdr_timing_summary.csv | pi0_estimate() helper; Storey (2002) method; #{p>λ}/(m·(1−λ)); clipped to [0,1]; reports proportion of true nulls and expected false discoveries; 1-row summary CSV |
| 11 | Bonferroni vs BH comparison | fdr_disclosure_leakage.py | fdr_bonferroni_compare.csv | bonferroni_compare() helper; ISL §13.3 (FWER) vs §13.4 (FDR); per-test rejection flags for both methods; agreement column; BH guaranteed ≥ Bonferroni rejections |

6 new invariant tests added (288 total). Both scripts need re-run to generate new outputs.
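The Storey estimator in addition 10 is small enough to show inline. This is a sketch of the formula quoted in the table, π₀ = #{p > λ} / (m·(1 − λ)) clipped to [0, 1], not necessarily the project's exact helper:

```python
import numpy as np

def pi0_estimate(p_values, lam=0.5):
    # Storey (2002): p-values above lambda are assumed to come mostly
    # from true nulls, whose p-values are uniform on [0, 1].
    p = np.asarray(p_values, dtype=float)
    m = p.size
    pi0 = (p > lam).sum() / (m * (1.0 - lam))
    return float(np.clip(pi0, 0.0, 1.0))

# Under a pure null, p-values are uniform and pi0 should be near 1.
rng = np.random.default_rng(0)
assert 0.8 < pi0_estimate(rng.uniform(size=10_000)) <= 1.0
```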

## Output Quality Issues — Session 62 Review

Identified by /review-pipeline on 2026-03-08. Address before next statistical test run.

| Priority | ID | Issue | Impact | Status |
| --- | --- | --- | --- | --- |
| High | OQ1 | SEIBRO still inactive (resultCode=99, KI-012) — Flags 1 & 2 (repricing_below_market, exercise_at_peak) are effectively off; near-zero True counts in repricing_flag and exercise_cluster_flag | CB/BW screen ~50% blind; 756 flagged events are almost entirely single-flag volume surges | Blocked on API key activation |
| High | OQ2 | FDR timing p-values all = 0.0001 (floor) — fdr_timing_anomalies.csv shows p_value.nunique() == 1; the FDR computation is not producing meaningfully different values across events | fdr_timing_anomalies.py result is statistically invalid | Resolved (session 63) — removed quiet_mask filter; null now uses all control events; p-value range restored |
| Medium | OQ3 | gap_hours = 2.5 for all rows in timing_anomalies.csv — fixed constant, not computed per-event | Timing anomaly methodology understates precision; all events treated as identical lag | Resolved (session 63) — gap_hours now 2.5 for same_day, 15.0 for prior_day |
| Medium | OQ4 | Bootstrap F1 plateau — F1 range ≈ 0.04 across the full −3.5 to 0.0 threshold sweep; no sharp elbow | Threshold selection (−2.45 / −1.78) is weakly determined by the data; note as a calibration limitation in any writeup | Acknowledged limitation — SEIBRO activation marginally narrows the F1 CI; expanding labels (OQ6) partially helps; with only 30 labels and 1,700+ companies no threshold produces a sharp peak. Not a code fix. |
| Low | OQ5 | Cluster silhouette ≈ 0.25 (k=6,8,10) — no natural cluster structure in KOSDAQ Beneish data | k-means may not be appropriate; consider DBSCAN or hierarchical clustering, or report as null result | Acknowledged finding — try DBSCAN or hierarchical clustering; if they confirm weak structure, report as a market-specific null result. Not a data quality issue. |
| Low | OQ6 | Label set tiny (30 labeled + 20 auto-controls) — RF AUC CI = ±0.192 | Model reliability caveat; expand labels before publishing RF/Lasso findings | Open |

SEIBRO dependency matrix:

| Issue | SEIBRO fixes it? | Can fix now? |
| --- | --- | --- |
| OQ1 | Yes, directly | No — external |
| OQ2 | No | Yes ✓ |
| OQ3 | No | Yes ✓ |
| OQ4 | Marginally | Partially (via more labels) |
| OQ5 | No | Partially (try other algorithms) |
| OQ6 | No | Yes, but labor-intensive |

## Phase 3 — Continuous Monitoring

| ID | Description | Status |
| --- | --- | --- |
| M1 | Event-driven re-scoring on new regulatory filings | Planned |
| M2 | Market surveillance signal integration | Planned |
| M3 | Regulatory enforcement feed and automated evidence staging | Planned |

Phase 3 extends the pipeline from periodic batch processing to continuous monitoring. Detection runs incrementally as new data arrives rather than on a fixed schedule, reducing time-to-signal from weeks to hours. Full specification in internal documentation.

### Phase 3 prerequisites

| ID | Description | Status |
| --- | --- | --- |
| P1 | DuckDB analytics layer (src/db.py): connection factory over existing parquet files | Complete (Session 46) |
| P2 | Pydantic models for alerts/monitoring (AlertEvent, MonitorStatus, AlertList) | Complete (Session 46) |
| P3 | Monitor package skeleton (02_Pipeline/monitor/) | Complete (Session 46) |
| P4 | CLI stubs (krff monitor, krff alerts) | Complete (Session 46) |
| P5 | API stubs (/api/alerts, /api/monitor/status) | Complete (Session 46) |
| P6 | Alert schema + SQLite operational state | Planned (deferred until M1 needs persistent state) |
| P7 | Label candidates schema for automated staging | Planned |

## MCP Server — Complete (Session 80, Mar 11 2026)

10-tool MCP server exposing all pipeline data to AI clients (Claude Code, Claude Desktop, any MCP-compatible agent). Implemented as an additive layer on the existing FastAPI app — mounts at /mcp/, same process, same port.

| Tool | Data source | Description |
| --- | --- | --- |
| lookup_corp_code | corp_ticker_map.parquet | Name/ticker → corp_code (always first) |
| get_company_summary | all parquets + CSVs | All signals aggregated for one company |
| get_beneish_scores | beneish_scores.parquet | M-Score history with 8 components |
| get_cb_bw_events | cb_bw_summary.csv | CB/BW events with flags |
| get_price_volume | price_volume.parquet | OHLCV with pagination |
| get_officer_holdings | officer_holdings.parquet | DART holding changes |
| get_timing_anomalies | timing_anomalies.csv | Disclosure timing anomalies |
| get_major_holders | major_holders.parquet | 5%+ block-holding filings |
| get_officer_network | centrality_report.csv | Cross-company officer centrality |
| search_flagged_companies | beneish_scores.parquet | Ranked anomaly screen with pagination |

  • Files added: src/mcp_utils.py, src/mcp_server.py, .mcp.json, tests/test_mcp_server.py
  • Files modified: app.py (3-line MCP mount), pyproject.toml (fastmcp>=3.1.0, pytest-asyncio>=0.23.0)
  • Tests: 301 pass
  • Connect via: claude mcp add --transport http kr-financial-statements http://localhost:8000/mcp/

## Phase 3 engineering prerequisites (before Phase 4 website)

  • FastAPI readiness refactoring — Complete (Session 43). src/data_access.py (reusable loaders), src/models.py (Pydantic response shapes), env var config overrides, public API functions (get_company_summary, get_report_html). All scoring constants consolidated in src/constants.py. A developer can now write from src.report import get_company_summary in a FastAPI endpoint.
  • FastAPI HTTP layer — Complete (Session 44). app.py (6 endpoints: /api/status, /api/quality, /api/companies/{corp_code}/summary, /api/companies/{corp_code}/report, /api/alerts, /api/monitor/status); krff serve CLI command (uvicorn-backed); Typer input validation on all commands (run, report, refresh); try/except error wrapping on all commands; fastapi>=0.115.0 + uvicorn[standard]>=0.30.0 added to deps. Start with krff serve → Swagger UI at http://127.0.0.1:8000/docs.
  • DuckDB integration — Complete (Session 46). src/db.py (connection factory, parameterized queries over parquet); data_access.py and quality.py migrated to DuckDB internals; no data migration needed.
  • Minimal orchestrator: Poll → Normalize → Dedup → Dispatch → Execute → Publish → Log
  • SQLite operational state (deferred until M1 needs persistent job/alert state). Analytics stays in parquet/DuckDB — operational state in SQLite when activated.

## Multi-user readiness gate (deliberately deferred until DB is integrated)

The current app.py is correct and complete for single-analyst use. The following issues are not bugs today but must be resolved before the API serves multiple simultaneous users. They are structurally solved by the DuckDB + SQLite integration above — listed here for reference.

Typer CLI — minimal changes needed:

  • Concurrent krff run writes to shared 01_Data/processed/ will race. Each user must set their own KRFF_DATA_DIR env var (already supported via src/_paths.py).
  • DART API key exhaustion (20K req/day) is per key. Multiple users on one key will collide. Fix: separate keys per user, or a shared rate-limiting wrapper.

FastAPI — resolved by DB integration:

  • get_quality() loads every parquet in full on every /api/quality request. Acceptable for one analyst; under concurrent load, multiple full-DataFrame reads spike memory and latency. Fix: cache with TTL, or restructure to read only PyArrow parquet footer statistics (no full DataFrame load needed for null counts).
  • All routes are sync def, run in FastAPI's default thread pool (min(32, cpu_count + 4) threads). Under concurrent disk-heavy requests, threads queue. Fix: switch to async def + asyncio.to_thread() for disk reads, or set explicit threadpool_size in uvicorn config.
  • No authentication. Any host that can reach the port can call any endpoint. Fix: API key header middleware (simple), or OAuth (public-facing).
  • Once SQLite is the operational layer, get_status() and get_quality() can read from DuckDB views + SQLite state instead of live parquet scans, eliminating both the latency and the caching problem.

## Phase 4 — Public Website (ultimate goal)

Institutions consume signals and reports in a familiar web interface. No code execution required from end users — they read, not operate.

| ID | Description | Status |
| --- | --- | --- |
| W1 | FastAPI backend + frontend shell + legal pages + deploy configs | Implementation complete (Session 79) — public/paid tier routing added — pending go-live |
| W2 | Static or server-rendered public website | Planned |
| W3 | Company pages with signal history and report links | Planned |
| W4 | Alert feed with severity levels and source links | Planned |
| W5 | Admin review layer (false-positive flagging, label staging) | Planned |
| W6 | Natural language query interface (DuckDB + LLM → SQL) | Planned |
| W7 | AI-agent-only content model (agents post signals/summaries autonomously) | Planned |

### W1 — Public Demo MVP

Goal: credibility signal for institutional discovery, not traffic. Success KPI: 1 institutional conversation (자산운용사 리스크팀, 증권사 리서치팀, or 회계법인 감사팀).

Implementation complete (Session 78–79). Files created/modified:

  • app.py — rewritten: CORS, Jinja2, StaticFiles, TTL cache, async routes, lifespan preload, 8 new web routes; Session 79: _public_corps, _classify_corp(), index SQL filter, report_clean branch, demo corp_name fix
  • static/css/main.css — extracted from mockup.html
  • templates/ — base, index, demo, about, contact, report_shell, privacy, terms; Session 79: report_clean.html (new)
  • deploy/Dockerfile, deploy/Caddyfile, deploy/krff-api.service
  • pyproject.toml — added jinja2>=3.1.0, cachetools>=5.3.0, httpx>=0.24.0 (dev); uv.lock updated

Public/paid tier (Session 79):

  • PUBLIC_CORPS env var (.env) — comma-separated corp_codes for public allowlist; empty = cold-start mode (all flagged companies accessible)
  • _classify_corp() — routes /report/{corp_code} to report_shell.html (flagged) or report_clean.html (clean) or 404 (not in allowlist)
  • User fills in PUBLIC_CORPS after curating sample; paid-tier auth middleware is additive over existing _flagged_corps universe
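The `_classify_corp()` routing rule described above reduces to something like this sketch. This is assumed logic for illustration; the real function in `app.py` may differ in details such as cold-start handling:

```python
def classify_corp(corp_code: str, public_corps: set,
                  flagged_corps: set) -> str:
    # Empty allowlist = cold-start mode: the allowlist check is skipped
    # and all flagged companies are reachable.
    if public_corps and corp_code not in public_corps:
        return "not_found"   # -> 404
    if corp_code in flagged_corps:
        return "flagged"     # -> report_shell.html
    return "clean"           # -> report_clean.html
```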

Before go-live checklist (blocking):

  • Web3Forms access key — get from web3forms.com; set as WEB3FORMS_KEY env var on server (currently placeholder YOUR_ACCESS_KEY in template)
  • Operator name — replace [OPERATOR_NAME] in templates/privacy.html (개인정보 보호책임자 성명)
  • Contact email — replace [CONTACT_EMAIL] in templates/base.html, privacy.html, terms.html, contact.html
  • Domain name — replace krff.example.com in deploy/Caddyfile; update CORS allow_origins=["*"] → real domain in app.py
  • Demo companies B and C — set DEMO_CORPS env var with additional corp_codes once legal review is complete (currently only 00550082 캔버스엔)
  • Hetzner CX33 provisioning — create server, install Caddy + systemd, copy deploy/krff-api.service and deploy/Caddyfile, mount parquet volume
  • Smoke test: curl https://krff.example.com/ returns HTML; /api/status returns JSON; /demo/00550082/report renders iframe; /privacy and /terms render

Non-blocking (post-launch):

  • Restrict CORS origin from ["*"] to real domain
  • KSD 별도이용허락 (051-519-1420) — required for commercial SEIBRO use; activates Flags 1 & 2

### W2 — Content Marketing

Platform sequencing: website → video → LinkedIn → Substack/브런치 → institutional outreach.

  • YouTube demo video (KR primary: "대한민국 상장사 1700개를 자동으로 감시하는 시스템"; EN secondary)
  • LinkedIn post with website + video links (institutional discovery)
  • Substack or 브런치 methodology article ("KOSDAQ Accounting Anomaly Study 2019–2024")
  • "Top 20 Anomaly Companies" free sample report (drives inbound; excludes Tier 1 leads)
  • Pilot offer: 3-month free/discounted license → reference case → case study

Design principles:

  • Frontend reads only published state from operational DB (atomic publish pattern)
  • Public language: "signal", "anomaly", "pattern" — never "fraud confirmed" or "criminal"
  • Infrastructure works without AI; AI enhances triage and summarization but does not gate the pipeline

### W6 — Natural Language Query Interface

Goal: Let users query the corpus in plain Korean or English without knowing DuckDB or the parquet schema.

Architecture: User types "Show me everything suspicious about 피씨엘 in 2021" → LLM translates to DuckDB SQL against parquet files → results returned as structured JSON → rendered as evidence packet (signals + citations).

Implementation path:

  • DuckDB connection factory already exists in src/db.py
  • LLM routing: claude-haiku-4-5 for query classification + SQL generation; claude-sonnet-4-6 for result synthesis
  • Schema context: Pass parquet column names + sample rows in system prompt with cache_control: ephemeral
  • Safety: SQL is read-only; no write access to parquet; all queries parameterized
  • Fallback: If LLM-generated SQL fails validation, return a structured error (not a crash)
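The read-only safety gate can be sketched as a validator run before any LLM-generated SQL reaches DuckDB. Illustrative only: the keyword list and rules here are assumptions, not the project's actual checks, and returning a `(ok, reason)` pair is one way to make failed validation a structured error rather than a crash:

```python
import re

_READ_ONLY = re.compile(r"^\s*(select|with)\b", re.IGNORECASE)
_FORBIDDEN = re.compile(
    r"\b(insert|update|delete|create|drop|alter|copy|attach|install|load|pragma)\b",
    re.IGNORECASE,
)

def validate_sql(sql: str):
    # Gate LLM-generated SQL: single read-only statement or nothing.
    if not _READ_ONLY.match(sql):
        return False, "query must start with SELECT or WITH"
    if _FORBIDDEN.search(sql):
        return False, "write/DDL keyword rejected"
    if ";" in sql.rstrip("; \n"):
        return False, "multiple statements rejected"
    return True, "ok"
```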

### W7 — AI-Agent-Only Content Model

Goal: The public website generates its own content autonomously. Human posts nothing — agents post signals, summaries, and anomaly alerts on a schedule.

Concept: The website is not a blog. It is a continuously updated surveillance feed. A publisher agent runs weekly, checks for new flags (new Beneish threshold crossings, new CB/BW events, new timing anomalies), generates a 200-word summary per new signal in safe language (no fraud allegations, source citations only), and posts it to the website without human review.

Human role: Review and suppress (not review and approve). The adversary/refutation agent runs first — it tries to find benign explanations. If no benign explanation exists, the publisher agent posts. If the adversary agent finds a benign explanation, the signal is held for human review.

Why this matters: Collapses the analyst team requirement to zero for routine updates. One person maintains the infrastructure; agents maintain the content. This is the "self-updating intelligence platform" model — the system generates its own distribution.

Multi-agent design (Phase 4 target):

  • Ingestion/triage agent — classify relevance, identify corp_code
  • Analysis operator agent — call existing scripts via fixed action menu
  • QA/validation agent — check output completeness and publish-safety
  • Publisher agent — generate website-ready summaries (signal language only)
  • Adversary/refutation agent — actively find benign explanations; challenge severity before publication

## Phase 5 — Data Expansion and Spatial Analysis

Identified March 2026. Not sequenced — depends on Phase 3/4 completion and SEIBRO activation.

### 5A — Additional Korean Open Datasets

Eight Korean public datasets identified as high-value for pipeline enrichment. See 00_Reference/2_Data/60_Additional_Korean_Open_Datasets.md for full detail, access paths, and integration priority.

| Dataset | Signal added | Blocking dependency |
| --- | --- | --- |
| FSC enforcement actions | Ground-truth label expansion | None — scraping |
| KONEPS government procurement | Revenue quality, political connection | ID bridge (사업자등록번호) |
| Court insolvency filings | Model validation (ex-post) | None |
| KoTaP tax avoidance panel | Earnings management enrichment | Academic access |
| Korea Customs trade data | Fake export revenue detection | Commercial contract required — public APIs aggregate only; no company-level data |
| KRED macro panel | Cyclicality research | None |
| KOSIS regional statistics | Spatial analysis (prerequisite) | Geocoding |
| KOFIA bond data | CB/BW complement | None |

Key infrastructure prerequisite: Unified corporate identifier table linking corp_code ↔ stock_code ↔ 사업자등록번호 ↔ 법인등록번호 ↔ ISIN. The 사업자등록번호 is available in DART company profiles (company.json) and can be extracted in bulk.
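Since the 사업자등록번호 already comes back from DART's company.json profile endpoint, building the bridge rows is mostly parsing. A sketch, using field names from the DART open API (`status` "000" = success, `bizr_no` = 사업자등록번호, `jurir_no` = 법인등록번호); the sample response below is fabricated for illustration:

```python
def identifier_bridge_row(profile):
    # Pull the identifier-bridge fields from one company.json response.
    # ISIN joins come separately from bond_isin_map.parquet.
    if profile.get("status") != "000":   # DART success code
        return None
    return {
        "corp_code": profile.get("corp_code"),
        "stock_code": profile.get("stock_code"),
        "bizr_no": profile.get("bizr_no"),
        "jurir_no": profile.get("jurir_no"),
    }

# Fabricated example response (not real identifiers):
sample = {"status": "000", "corp_code": "00000000",
          "stock_code": "000000", "bizr_no": "0000000000",
          "jurir_no": "0000000000000"}
```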

### 5C — Officer Network Dataset (Standalone)

Publish officer_network_panel.parquet as a standalone open dataset: all DART officer holding disclosures, normalized and deduplicated, with entity resolution across name variants. Enables any researcher to reconstruct the officer-company graph without running the full pipeline.

Value: No equivalent dataset exists publicly for Korean markets. Cross-company officer tracking is the data primitive underlying the NTS ₩260B manipulation network recovery (March 2026).

### 5D — Historical Ticker Change Dataset

Publish a full history of KOSDAQ/KOSPI ticker changes: relisting events, SPAC mergers, name changes with effective dates. Currently corp_ticker_map.parquet has point-in-time data; this extends it to a temporal table.

Value: Without ticker change history, any time-series analysis that joins on ticker will silently misattribute data across corporate events. No free Korean equivalent exists. Essential for any researcher doing multi-year event studies on KOSDAQ.

### 5E — Corporate Action Timeline

Publish a normalized corporate action table: CB/BW issuances, rights offerings, stock splits, reverse splits, tender offers — all from DART filings, joined to price/volume data with ±60 day windows. One row per event, normalized event type, consistent date formatting.

Value: Currently cb_bw_events.parquet covers CB/BW only. A full corporate action timeline enables any capital markets researcher to run event studies without building the DART extraction layer from scratch.

### 5F — KONEPS Procurement Exposure Dataset

For all KOSDAQ companies with a BRN bridge (extractable from DART company.json), query KONEPS (나라장터) procurement records and publish the results: total government contract value per company per year, contract types, counterparty agencies.

Value: Enables fake government revenue detection. A company reporting rapidly growing government revenue that does not appear in KONEPS records is a high-priority anomaly. Requires BRN extraction (one field from DART) + KONEPS API integration.

Feasibility: CONFIRMED (session 72 research). Four OpenAPI datasets on data.go.kr, all from 조달청:

  • Dataset 15129466 (사용자정보서비스): Supplier registry — BRN (사업자등록번호) is first-class query field
  • Dataset 15129427 (계약정보서비스): Contract records
  • Dataset 15129397 (낙찰정보서비스): Bid results
  • Dataset 15129394 (입찰공고정보서비스): Bid announcements

Blocking dependency: BRN extraction from DART company.json (zero marginal cost — bizr_no field already returned by existing API calls). Verify whether contract/bid endpoints accept BRN as input filter before building extractor.

Note on customs data: Korea Customs APIs (data.go.kr) return aggregate statistics only — not company-level records. Company-level trade data requires a commercial contract with the Korea Trade Statistics Promotion Institute. Customs integration is deferred indefinitely.

### 5B — Spatial Analysis Layer

Map accounting anomaly density geographically across Korea.

Concept: Geocode company headquarters (from DART registration data), join to KOSIS regional economic indicators, render Beneish flag density by region. Reveals whether manipulation clusters by geography, industry zone, or economic condition — a systemic insight unavailable from company-level analysis alone.

Stack: geopandas / shapely / pydeck or kepler.gl on top of existing parquet outputs.

Research questions:

  • Do Seoul tech firms, Busan manufacturing, and Incheon logistics show distinct manipulation profiles?
  • Does regional economic stress (unemployment, credit conditions) predict local flag density?
  • Are CB/BW anomaly clusters geographically co-located with known regulatory blind spots?

Prerequisites: KOSIS regional data integration (Phase 5A), geocoding of DART company addresses.

## Open Backlog

| ID | Description | Phase | Effort |
| --- | --- | --- | --- |
| PR5 | Historical backfill 2014–2018. Partial-Complete (Session 50): 2017–2018 backfill done; 2014–2016 deferred (most issuers resolved). company_financials.parquet: 7,042 → 9,310 rows (2019–2023 → 2017–2023). beneish_scores.parquet: 7,447 rows, 2018–2023, flagged 1,250. Beneish early-return Marimo bug fixed. | 4 | Medium |
| A1 | Automate recurring data refresh. Complete (Session 38): krff refresh command added to cli.py; runs 6 stages in sequence (DART → transform → beneish_screen → cb_bw → timing → network); --sample N and --skip-analysis flags | 2 | Low |
| I1 | Verify PyKRX from hosted IPs. Infrastructure ready (Session 49): --backend option added to krff run/krff refresh; finance-datareader+yfinance as [hosted] optional deps; test-hosted-backends.yml workflow_dispatch CI workflow; trigger from GitHub Actions UI to verify | 5 | Low |
| DQ1 | XBRL unit-scale corrections. Complete (Session 48): frmtrm_amount cross-check confirmed no unit errors. Session 51: BENEISH_EXTREME_OUTLIERS constant deprecated — replaced by component-level winsorization (Beneish 1999 methodology). Inf values replaced with NaN; 1%/99% per-year winsorization applied to all 8 components. | 1 | Low |

DQ1 outcome (Session 48): Cross-checked both flagged companies via DART frmtrm_amount. Neither is a unit-scale error:

  • 피씨엘 (01051092) 2020: Genuine COVID-19 diagnostics revenue explosion. SGI=1,499, M-score=1,335 (now winsorized to bounded values).
  • 프레스티지바이오로직스 (01258428) 2022/2023: Genuine revenue volatility. No correction needed.

Extreme values are now handled by per-year 1%/99% component winsorization in beneish_screen.py, not by a manual exclusion list.
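The winsorization described above reduces to a small transform. A sketch under the stated assumptions (per-year grouping, 1%/99% bounds, Inf → NaN first); the real implementation in beneish_screen.py may differ in column names and details:

```python
import numpy as np
import pandas as pd

def winsorize_by_year(df, cols, lower=0.01, upper=0.99):
    # Inf -> NaN first (quantiles ignore NaN), then clip each component
    # to its within-year 1%/99% bounds so a single extreme company-year
    # (e.g. SGI = 1,499) cannot dominate the M-score.
    out = df.replace([np.inf, -np.inf], np.nan)
    for col in cols:
        out[col] = out.groupby("year")[col].transform(
            lambda s: s.clip(s.quantile(lower), s.quantile(upper))
        )
    return out

# Demo: one synthetic 피씨엘-style outlier year gets pulled to the
# within-year 99th percentile instead of being manually excluded.
rng = np.random.default_rng(0)
demo = pd.DataFrame({"year": [2020] * 100, "SGI": rng.normal(1.0, 0.2, 100)})
demo.loc[0, "SGI"] = 1_499.0
w = winsorize_by_year(demo, ["SGI"])
```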