Paper 3a: population-scale Deadline-ILS pipeline + Phase B revision#10
Merged
…/2 fixes

Phase 0:
- fflow/scoring/resolution_type.py: deadline regex classifier (v2) with classify_resolution_type() and classify_resolution_type_detailed()
- scripts/phase0_typology_audit.py: v1/v2 comparison + full corpus scan
- reports/TASK_03_TYPOLOGY_REFINEMENT.md: 6.18% deadline_resolved in corpus

Phase 1:
- fflow/scoring/ils.py: compute_ils_deadline() (paper §7 ILS_dl formula) + _DEADLINE_LOOKBACK constant + multi-window variants
- fflow/scoring/pipeline.py: branches on resolution_type == "deadline_resolved"; deadline path skips NewsTimestamp, uses synthetic t_news = t_resolve - 1h
- fflow/models.py: resolution_type column on Market + MarketLabel
- fflow/taxonomy/classifier.py: classify_type_batch() with bulk IN-clause updates
- fflow/cli.py: fflow taxonomy classify-type command; score batch includes deadline markets
- alembic/versions/0005_stub.py: no-op stub anchoring missing 0003-0005 chain
- alembic/versions/0006_market_labels_resolution_type.py: adds market_labels.resolution_type VARCHAR(30) + index
- fflow/config.py: extra="ignore" for pydantic-settings
- tests/test_ils_deadline.py: 27 tests (ILS_dl regimes, classifier, CLOB lag)

Post-Phase-1 fixes:
- _lookup_price: forward-only [t_open, t_open+30min] window for t_open lookups; accommodates CLOB indexing lag (~20 min typical); ±5 min symmetric for all other lookups
- classify_type_batch: bulk IN-clause UPDATEs grouped by resolution_type instead of per-row; 900K backfill in 2m41s
- Backfill run: 911,237 markets classified (56,316 deadline_resolved, 1,145 event_resolved, 853,776 unclassifiable)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
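The shape of a regex-based resolution-type classifier like classify_resolution_type() can be sketched as follows. The patterns here are illustrative only, not the actual v2 patterns in fflow/scoring/resolution_type.py:

```python
import re

# Illustrative patterns only — the real v2 pattern set is larger and
# lives in fflow/scoring/resolution_type.py.
_DEADLINE_PATTERNS = [
    re.compile(r"\bby\s+(january|february|march|april|may|june|july|august"
               r"|september|october|november|december)\b", re.I),
    re.compile(r"\bbefore\s+\w+\s+\d{1,2}\b", re.I),
    re.compile(r"\bby\s+(end\s+of|eod|eoy)\b", re.I),
]
_EVENT_PATTERNS = [
    re.compile(r"\bwho\s+will\s+win\b", re.I),
    re.compile(r"\bnext\s+(president|pope|ceo)\b", re.I),
]

def classify_resolution_type(question: str) -> str:
    """Return 'deadline_resolved', 'event_resolved', or 'unclassifiable'."""
    if any(p.search(question) for p in _DEADLINE_PATTERNS):
        return "deadline_resolved"
    if any(p.search(question) for p in _EVENT_PATTERNS):
        return "event_resolved"
    return "unclassifiable"
```

Defaulting to "unclassifiable" is consistent with the backfill numbers above, where the large majority of markets match neither pattern family.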
ITEM A — Tier 3 web search enabled:
- fflow/news/llm_match.py: add tools=[{type: web_search_20250305}] to API call
- Dual prompts: _SYSTEM_T_NEWS (event_resolved) vs _SYSTEM_T_EVENT (deadline YES)
- _MAX_TOKENS: 300 → 1024 (web search synthesis needs more tokens)
- Response parsing: concatenate all text blocks (web search response arrives in
22+ interleaved server_tool_use + text fragments)
- Date parser: remove incorrect raw_date[:len(fmt)] slicing (format len ≠ output len)
- LLMTimestamp gains sources: tuple[str, ...] field
- Confidence: 0.80 when sources found (web search), 0.60 without
ITEM B — CLI tier3 branching by resolution_type (paper §7.2):
- deadline_resolved YES → recovery_mode=t_event ("when did event happen?")
- deadline_resolved NO → skip (T_resolve is authoritative, no event occurred)
- event_resolved / unclassifiable → recovery_mode=t_news (existing behavior)
- Echo label, sources, notes to stdout
ITEM C — compute_ils_deadline accepts recovered T_event (paper §7.2):
- New param: t_event: datetime | None = None
- When provided: t_event_minus = t_event - 1 min → p(T_event^-) per paper
- When None: falls back to legacy proxy t_resolve - lookback (backward compat)
- Adds 't_event_recovered' flag when t_event is used
- pipeline.py: deadline YES markets look up NewsTimestamp for T_event and pass it;
deadline NO markets stay on proxy path
Sanity test result (Iran Apr30 / US forces enter Iran):
T_event recovered: 2026-04-03T00:00:00Z (F-15E rescue op, 8 sources)
ILS_dl (paper §7.2): 0.113 vs legacy proxy -0.331 — materially different
Cost: $0.0897/call (108K input + 777 output tokens, Haiku 4.5)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Fits S(τ)=exp(-λτ) per category over 50 Tier-3 T_event recoveries:
- military_geopolitics: λ=0.306/d, T½=2.3d (KS adequate, n=9)
- regulatory_decision: λ=0.035/d, T½=19.9d (KS rejects exponential, n=15)
- corporate_disclosure: λ=0.156/d, T½=4.5d (KS adequate, n=5)

New: fflow/scoring/hazard_fit.py, scripts/phase2_hazard_estimation.py, reports/TASK_03_HAZARD_ESTIMATION.md

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
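For an exponential survival model S(τ)=exp(-λτ), the maximum-likelihood estimate is λ̂ = n / Στᵢ and the half-life is T½ = ln 2 / λ̂. A minimal sketch (the actual fit in hazard_fit.py may handle censoring and KS testing on top of this):

```python
import math

def fit_exponential_hazard(taus_days: list[float]) -> tuple[float, float]:
    """MLE fit of S(tau) = exp(-lambda * tau) to event delays in days.

    For i.i.d. exponential delays, lambda_hat = n / sum(tau) and the
    half-life is T_half = ln(2) / lambda_hat. Negative tau (event
    before market open) is excluded, matching the paper's filtering.
    """
    positive = [t for t in taus_days if t > 0]
    lam = len(positive) / sum(positive)
    return lam, math.log(2) / lam
```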
Recovered T_event for 16/18 FFIC (Iran/ceasefire) deadline-YES markets.

Key findings: most "strike by date X" markets opened after strikes had already begun (negative τ); positive-τ markets (JD Vance meeting τ=0.4d, Pipeline strike τ=15.9d, Iran×US strike τ=9.5d) are candidates for ILS_dl.

Parser fix: strip trailing '**' markdown junk; add %Y-%m-%dT%H:%MZ format.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Computed ILS_dl for 2 FFIC markets with price data:
- US forces enter Iran Apr30: ILS_dl=+0.113, p_open=0.25, p_news=0.335 (t_event=Apr3)
- US x Iran ceasefire Apr7: ILS=None (low_information_market, p_open=0.975)

Wallet analysis (post-resolution window):
- Iran Apr30: HHI_top10=0.0573, $9.78M total notional
- 332 wallets active in both FFIC markets (cross-market coordination)
- Top cross-market wallet: $1.96M across both (0x7072dd52)

All 16/18 FFIC Tier-3 T_event recovered; trades are only available post-T_event (resolution window), so pre-event wallet HHI is unavailable for these markets.

Also includes: classifier.py force flag (from Phase 1, previously uncommitted)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
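One plausible reading of the HHI_top10 concentration metric above — the Herfindahl-Hirschman index restricted to the ten largest wallets, with shares taken against total notional — can be sketched as:

```python
def hhi_top10(notional_by_wallet: dict[str, float]) -> float:
    """HHI over the 10 largest wallets by notional.

    Shares are computed against total notional across all wallets, so
    the value is low when flow is dispersed and approaches 1.0 when a
    single wallet dominates. (This exact definition of "HHI_top10" is
    an assumption; the analysis code may differ.)
    """
    total = sum(notional_by_wallet.values())
    top = sorted(notional_by_wallet.values(), reverse=True)[:10]
    return sum((v / total) ** 2 for v in top)
```

Under this definition, ten equal wallets give 0.1 and a single wallet gives 1.0, which makes the observed 0.0573 read as fairly dispersed flow.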
…lysis

Synthesizes all Phase 1-4 findings:
- Deadline-ILS formula, T_event recovery methodology, cost accounting ($7.47/83 calls)
- Hazard fits: λ_military=0.306/d (T½=2.3d), λ_regulatory=0.035/d (T½=19.9d, bimodal)
- FFIC Iran Apr30: ILS_dl=+0.113 (mild pre-event drift, no last-minute spike)
- FFIC ceasefire Apr7: ILS=None (low_information_market, p_open=0.975)
- 332 cross-market wallets (resolution arbitrage, not pre-event informed trading)
- Paper v1.0 recommendations: exclude negative-τ markets, split regulatory bimodal, expand CLOB collection, set informed-trading threshold at ILS_dl>0.25 + short-window>0.10

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…script, phase05 fixture

- reports/MADURO_VERIFICATION.md: full DOJ market-mapping analysis confirming all fficd-004 resolution outcomes are correct; no DB changes required
- reports/MADURO_DATA_VERIFICATION.md: preliminary verification (superseded by above)
- data/fficd-004-inventory.jsonl: 11-market structured inventory for Van Dyke/Venezuela cluster (7 YES + 4 NO), annotated with DOJ market names and fficd roles
- data/fixture_phase05.jsonl: 100-market fixture (34 corporate_disclosure, 35 regulatory_decision, 31 military_geopolitics) for phase 05 pipeline testing
- scripts/build_typology_dataset.py: DB → typology-v1.parquet + jsonl.gz extraction script; pyarrow dependency added to pyproject.toml
- .gitignore: add datasets/ (large binary outputs, not for git)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
New modules:
- fflow/scoring/bootstrap.py: bootstrap CI for ILS^dl (B=500, seed=20260430, resamples YES trades in [T_open, T_event], CI=NULL when <50 trades)
- fflow/news/t_event_recovery_v2.py: optimized T_event recovery — JSON output, granular confidence (0.9/0.8/0.7/0.5/0.0), no call cap, event-description cache (cheap Haiku), Haiku→Sonnet cascade at conf<0.7, cost alert at $40
- fflow/taxonomy/regulatory_split.py: keyword classifier splitting regulatory_decision into _announcement and _formal subtypes (10/10 on smoke-test cases)
- scripts/paper3a_phase1.py: full population pipeline driver — pre-filter (12,708→3,514 without LLM), Step 0 hard assert on Iran-Apr30 (T_event=2026-04-03, ILS^dl=0.113±0.02), async T_event recovery (cap=20), ILS^dl via existing compute_ils_deadline, Tasks 1.2-1.8 post-processing; dry run passes end-to-end

Sample size discrepancy noted: parquet gives 12,708 (cat+vol≥50K), Paper 1 reports 11,263. Logged at runtime; does not affect the pipeline.

Usage: uv run python scripts/paper3a_phase1.py --confirm [--skip-step0]
Budget: ~$28 (optimized) vs ~$120-150 (original estimate)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
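The percentile-bootstrap-with-minimum-sample rule described for bootstrap.py can be sketched as below; the parameter defaults mirror the commit message (B=500, seed=20260430, NULL under 50 trades), but the function shape is illustrative, not the library's API:

```python
import random
import statistics

def bootstrap_ci(values, stat=statistics.median, B=500, seed=20260430,
                 min_n=50, alpha=0.05):
    """Percentile bootstrap CI for a statistic of `values`.

    Returns None when the sample is too small to resample
    meaningfully, mirroring the CI=NULL rule for markets with fewer
    than 50 trades.
    """
    if len(values) < min_n:
        return None
    rng = random.Random(seed)  # fixed seed for reproducible CIs
    stats = sorted(
        stat([rng.choice(values) for _ in range(len(values))])
        for _ in range(B)
    )
    lo = stats[int((alpha / 2) * B)]
    hi = stats[int((1 - alpha / 2) * B) - 1]
    return lo, hi
```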
- Edge filter: |p_open-0.5|>=0.4 (was strict >); 2 markets with p_open=0.9
now correctly excluded as near-certain (n_ils: 90->88)
- FFIC localization (task_1_6): fixed market_id_prefix lookup (was always
returning empty string via m.get("market_id")); prefix match via
str.startswith(); enriched notes from pop_df exclusion_reason and source
typology for pre-filtered markets; added FFIC_JSONL_ALT2 path fallback
- ils_compute_error exclusion_reason now includes exception type
(e.g. "ils_compute_error: PriceLookupError") for all 230 failures
- Updated FFIC_JSONL_ALT2 constant for third known FFIC path
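The edge-filter boundary fix above (>= instead of strict >) reduces to a one-line predicate; a minimal sketch, with the function name hypothetical:

```python
def passes_edge_filter(p_open: float) -> bool:
    """Keep only markets whose opening price is informative.

    Markets with |p_open - 0.5| >= 0.4 are near-certain and excluded;
    using >= (rather than strict >) is what now catches p_open = 0.9.
    """
    return abs(p_open - 0.5) < 0.4
```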
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tputs

scripts/paper3a_revb.py implements all 8 reviewer-response tasks:
- B7: anchor sensitivity recomputed on 88-market post-fix sample
- B6: FFIC Bitcoin ETF T_event exact-match verification
- B5: parametric bootstrap KS (exp p=0.224, weibull p=0.043, lognormal p=0.083)
- B2: hazard-adjusted ILS with exponential decay model (D=T_resolve proxy)
- B3: bootstrap CIs on medians and fraction-positive for 6 cells
- B8: three-sample distribution reporting (all_computed, reg_ann, anchor_robust)
- B4: Haiku classification of 20 tail markets (all plausible_leakage)
- B1: T_event second-pass validation — 57.8% exact, 68.9% within-24h, $2.77

All 11 output CSVs/JSONs written to data/paper3a/revision1/.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
t_event_recovery_v2:
- Add provider field to TEventResult (anthropic/gemini/openai)
- Improve prompt: ask for physical event time, not press-report time
- Replace bare try/except with retry loop: 5 attempts with exponential backoff (10/20/40/60/60s), honouring Retry-After header on 429s
- Import anthropic at module level for RateLimitError handling

bootstrap.py:
- Fix tz_convert bug: use tz_convert instead of the tz= constructor kwarg for already-timezone-aware timestamps (was crashing on tz-aware input)

ils.py:
- Widen T_open forward price lookup window from 30 min to 24 h to cover illiquid/historical markets where the first trade arrives well after market creation (pre-CLOB era subgraph fallback)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
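The retry-with-backoff pattern described above can be sketched as follows. The schedule comes from the commit message; the injectable `sleep` hook and catching bare Exception are simplifications for testability — the real code catches anthropic.RateLimitError specifically:

```python
import time

BACKOFF_S = [10, 20, 40, 60, 60]  # per-attempt schedule from the commit

def call_with_retry(fn, *, sleep=time.sleep):
    """Call fn() with up to five attempts and exponential backoff.

    If the raised exception carries a retry_after value (e.g. parsed
    from a 429 Retry-After header), that delay is honoured instead of
    the scheduled one. The final failure is re-raised.
    """
    for attempt, delay in enumerate(BACKOFF_S):
        try:
            return fn()
        except Exception as exc:
            if attempt == len(BACKOFF_S) - 1:
                raise
            sleep(getattr(exc, "retry_after", None) or delay)
```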
paper3a_haiku_tevent.py: two-stage Haiku T_event recovery pipeline — Stage 1 (no tools, training knowledge, ~$0.0005/market) then Stage 2 (web_search for nulls, ~$0.05/market). Writes checkpoint JSONL per market.

backfill_clob_phase3a.py: backfills 1-minute CLOB prices for in-scope markets that have T_event but no CLOB coverage. Resumable via checkpoint.

synthesize_prices_from_trades.py: computes per-minute VWAP from YES-outcome subgraph trades as a CLOB fallback. Inserts only rows not covered by CLOB.

test_haiku_fast.py: diagnostic script — samples 20 markets, runs Haiku without web_search, reports hit-rate/confidence/cost. Used to benchmark Stage 1 before committing to the full run.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
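The per-minute VWAP synthesis idea can be sketched as below: bucket trades by minute and take the size-weighted mean price per bucket. The tuple layout is an assumption; the real script reads YES-outcome subgraph trades:

```python
from collections import defaultdict
from datetime import datetime

def minute_vwap(trades):
    """Per-minute VWAP from (timestamp, price, size) trade tuples.

    Each trade is assigned to its minute bucket; VWAP per bucket is
    total notional (price * size) divided by total size.
    """
    notional = defaultdict(float)
    volume = defaultdict(float)
    for ts, price, size in trades:
        minute = ts.replace(second=0, microsecond=0)
        notional[minute] += price * size
        volume[minute] += size
    return {m: notional[m] / volume[m] for m in notional}
```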
population_ils_dl.parquet / .csv (2,375 rows): full population dataset with ILS^dl scores (88 markets), T_event, exclusion chain, anchor window variants, and B2 hazard-adjusted columns (expected_decay_price, ils_dl_adj, b2_method).

Pipeline analysis outputs:
- filter_chain_attrition.csv — 6-stage attrition from 12,708 → 88 markets
- hazard_rates.csv — exponential hazard λ by category
- functional_form_comparison.csv / winners.csv — KS goodness-of-fit
- distribution_summary.csv / v2.csv — ILS^dl distribution by category×period
- detection_thresholds.csv — threshold analysis
- anchor_sensitivity_summary.csv — anchor-robust fraction (9-17% per category)
- ffic_localization.csv — FFIC case mapping to population
- ffic_concordance_test.csv — concordance test (1/32 cases with ILS^dl)
- ffic_classification_breakdown.csv — FFIC unclassifiable root cause
- regulatory_validation_sample.csv — 20-market regulatory random sample
- unclassifiable_sample.csv — 20-market unclassifiable sample

Excluded from commit (operational): t_event_checkpoint.jsonl, clob_backfill_checkpoint.jsonl, phase1_log.jsonl.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
# Conflicts:
#   .gitignore
#   fflow/cli.py
#   fflow/models.py
#   fflow/news/llm_match.py
#   fflow/scoring/pipeline.py
#   fflow/scoring/resolution_type.py
#   pyproject.toml
Summary
Full implementation of Paper 3a — population-scale Deadline-ILS (ILS^dl) detection on Polymarket, plus all Phase B revision computations for the reviewer response.
Core pipeline (Task 03 + Phase 1):
- Typology classification, including `unclassifiable` markets

`fflow` library fixes:
- `t_event_recovery_v2`: retry loop with exponential backoff + Retry-After, improved event-vs-press-date prompt, `provider` field on `TEventResult`
- `bootstrap.py`: tz_convert bug fix for timezone-aware timestamps
- `ils.py`: T_open price lookup window widened 30 min → 24 h for pre-CLOB markets
- `llm_providers.py`: new multi-tier Gemini/OpenAI/Sonnet cascade

Phase B revision computations (`scripts/paper3a_revb.py`):
- `expected_decay_price`, `ils_dl_adj`, `b2_method` columns

Data outputs:
- `data/paper3a/population_ils_dl.parquet` — 2,375-row population dataset (with B2 columns)
- `data/paper3a/revision1/` — 11 CSVs/JSONs for all B tasks

Dataset release: `polymarket-deadline-ils-v3` tagged and pushed to ForesightFlow/datasets.

Test plan
- `uv run pytest` — existing test suite passes
- `uv run python scripts/paper3a_revb.py --skip-llm` — all pure-computation tasks (B2–B8) complete without error
- `data/paper3a/population_ils_dl.parquet` loads and has expected 2,375 rows, 28 columns
- `data/paper3a/revision1/` contains all 11 output files

🤖 Generated with Claude Code