Paper 3a: population-scale Deadline-ILS pipeline + Phase B revision#10
Merged
…/2 fixes

Phase 0:
- fflow/scoring/resolution_type.py: deadline regex classifier (v2) with classify_resolution_type() and classify_resolution_type_detailed()
- scripts/phase0_typology_audit.py: v1/v2 comparison + full corpus scan
- reports/TASK_03_TYPOLOGY_REFINEMENT.md: 6.18% deadline_resolved in corpus

Phase 1:
- fflow/scoring/ils.py: compute_ils_deadline() (paper §7 ILS_dl formula) + _DEADLINE_LOOKBACK constant + multi-window variants
- fflow/scoring/pipeline.py: branches on resolution_type == "deadline_resolved"; deadline path skips NewsTimestamp, uses synthetic t_news = t_resolve - 1h
- fflow/models.py: resolution_type column on Market + MarketLabel
- fflow/taxonomy/classifier.py: classify_type_batch() with bulk IN-clause updates
- fflow/cli.py: fflow taxonomy classify-type command; score batch includes deadline markets
- alembic/versions/0005_stub.py: no-op stub anchoring missing 0003-0005 chain
- alembic/versions/0006_market_labels_resolution_type.py: adds market_labels.resolution_type VARCHAR(30) + index
- fflow/config.py: extra="ignore" for pydantic-settings
- tests/test_ils_deadline.py: 27 tests (ILS_dl regimes, classifier, CLOB lag)

Post-Phase-1 fixes:
- _lookup_price: forward-only [t_open, t_open+30min] window for t_open lookups; accommodates CLOB indexing lag (~20 min typical); ±5 min symmetric for all other lookups
- classify_type_batch: bulk IN-clause UPDATEs grouped by resolution_type instead of per-row; 900K backfill in 2m41s
- Backfill run: 911,237 markets classified (56,316 deadline_resolved, 1,145 event_resolved, 853,776 unclassifiable)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
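The shape of a regex-based resolution-type classifier like classify_resolution_type() can be sketched as follows. The patterns here are illustrative only, not the actual v2 patterns in fflow/scoring/resolution_type.py:

```python
import re

# Illustrative patterns only — the real v2 pattern set is larger and
# lives in fflow/scoring/resolution_type.py.
_DEADLINE_PATTERNS = [
    re.compile(r"\bby\s+(january|february|march|april|may|june|july|august"
               r"|september|october|november|december)\b", re.I),
    re.compile(r"\bbefore\s+\w+\s+\d{1,2}\b", re.I),
    re.compile(r"\bby\s+(end\s+of|eod|eoy)\b", re.I),
]
_EVENT_PATTERNS = [
    re.compile(r"\bwho\s+will\s+win\b", re.I),
    re.compile(r"\bnext\s+(president|pope|ceo)\b", re.I),
]

def classify_resolution_type(question: str) -> str:
    """Return 'deadline_resolved', 'event_resolved', or 'unclassifiable'."""
    if any(p.search(question) for p in _DEADLINE_PATTERNS):
        return "deadline_resolved"
    if any(p.search(question) for p in _EVENT_PATTERNS):
        return "event_resolved"
    return "unclassifiable"
```

Defaulting to "unclassifiable" is consistent with the backfill numbers above, where the large majority of markets match neither pattern family.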
ITEM A — Tier 3 web search enabled:
- fflow/news/llm_match.py: add tools=[{type: web_search_20250305}] to API call
- Dual prompts: _SYSTEM_T_NEWS (event_resolved) vs _SYSTEM_T_EVENT (deadline YES)
- _MAX_TOKENS: 300 → 1024 (web search synthesis needs more tokens)
- Response parsing: concatenate all text blocks (web search response arrives in
22+ interleaved server_tool_use + text fragments)
- Date parser: remove incorrect raw_date[:len(fmt)] slicing (format len ≠ output len)
- LLMTimestamp gains sources: tuple[str, ...] field
- Confidence: 0.80 when sources found (web search), 0.60 without
ITEM B — CLI tier3 branching by resolution_type (paper §7.2):
- deadline_resolved YES → recovery_mode=t_event ("when did event happen?")
- deadline_resolved NO → skip (T_resolve is authoritative, no event occurred)
- event_resolved / unclassifiable → recovery_mode=t_news (existing behavior)
- Echo label, sources, notes to stdout
ITEM C — compute_ils_deadline accepts recovered T_event (paper §7.2):
- New param: t_event: datetime | None = None
- When provided: t_event_minus = t_event - 1 min → p(T_event^-) per paper
- When None: falls back to legacy proxy t_resolve - lookback (backward compat)
- Adds 't_event_recovered' flag when t_event is used
- pipeline.py: deadline YES markets look up NewsTimestamp for T_event and pass it;
deadline NO markets stay on proxy path
Sanity test result (Iran Apr30 / US forces enter Iran):
T_event recovered: 2026-04-03T00:00:00Z (F-15E rescue op, 8 sources)
ILS_dl (paper §7.2): 0.113 vs legacy proxy -0.331 — materially different
Cost: $0.0897/call (108K input + 777 output tokens, Haiku 4.5)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Fits S(τ)=exp(-λτ) per category over 50 Tier-3 T_event recoveries:
- military_geopolitics: λ=0.306/d, T½=2.3d (KS adequate, n=9)
- regulatory_decision: λ=0.035/d, T½=19.9d (KS rejects exponential, n=15)
- corporate_disclosure: λ=0.156/d, T½=4.5d (KS adequate, n=5)

New: fflow/scoring/hazard_fit.py, scripts/phase2_hazard_estimation.py, reports/TASK_03_HAZARD_ESTIMATION.md

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
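For an exponential survival model S(τ)=exp(-λτ), the maximum-likelihood estimate is λ̂ = n / Στᵢ and the half-life is T½ = ln 2 / λ̂. A minimal sketch (the actual fit in hazard_fit.py may handle censoring and KS testing on top of this):

```python
import math

def fit_exponential_hazard(taus_days: list[float]) -> tuple[float, float]:
    """MLE fit of S(tau) = exp(-lambda * tau) to event delays in days.

    For i.i.d. exponential delays, lambda_hat = n / sum(tau) and the
    half-life is T_half = ln(2) / lambda_hat. Negative tau (event
    before market open) is excluded, matching the paper's filtering.
    """
    positive = [t for t in taus_days if t > 0]
    lam = len(positive) / sum(positive)
    return lam, math.log(2) / lam
```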
Recovered T_event for 16/18 FFIC (Iran/ceasefire) deadline-YES markets.

Key findings: most "strike by date X" markets opened after strikes had already begun (negative τ); positive-τ markets (JD Vance meeting τ=0.4d, Pipeline strike τ=15.9d, Iran×US strike τ=9.5d) are candidates for ILS_dl.

Parser fix: strip trailing '**' markdown junk; add %Y-%m-%dT%H:%MZ format.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Computed ILS_dl for 2 FFIC markets with price data:
- US forces enter Iran Apr30: ILS_dl=+0.113, p_open=0.25, p_news=0.335 (t_event=Apr3)
- US x Iran ceasefire Apr7: ILS=None (low_information_market, p_open=0.975)

Wallet analysis (post-resolution window):
- Iran Apr30: HHI_top10=0.0573, $9.78M total notional
- 332 wallets active in both FFIC markets (cross-market coordination)
- Top cross-market wallet: $1.96M across both (0x7072dd52)

All 16/18 FFIC Tier-3 T_event recovered; trades are only available post-T_event (resolution window), so pre-event wallet HHI is unavailable for these markets.

Also includes: classifier.py force flag (from Phase 1, previously uncommitted)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
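One plausible reading of the HHI_top10 concentration metric above — the Herfindahl-Hirschman index restricted to the ten largest wallets, with shares taken against total notional — can be sketched as:

```python
def hhi_top10(notional_by_wallet: dict[str, float]) -> float:
    """HHI over the 10 largest wallets by notional.

    Shares are computed against total notional across all wallets, so
    the value is low when flow is dispersed and approaches 1.0 when a
    single wallet dominates. (This exact definition of "HHI_top10" is
    an assumption; the analysis code may differ.)
    """
    total = sum(notional_by_wallet.values())
    top = sorted(notional_by_wallet.values(), reverse=True)[:10]
    return sum((v / total) ** 2 for v in top)
```

Under this definition, ten equal wallets give 0.1 and a single wallet gives 1.0, which makes the observed 0.0573 read as fairly dispersed flow.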
…lysis

Synthesizes all Phase 1-4 findings:
- Deadline-ILS formula, T_event recovery methodology, cost accounting ($7.47/83 calls)
- Hazard fits: λ_military=0.306/d (T½=2.3d), λ_regulatory=0.035/d (T½=19.9d, bimodal)
- FFIC Iran Apr30: ILS_dl=+0.113 (mild pre-event drift, no last-minute spike)
- FFIC ceasefire Apr7: ILS=None (low_information_market, p_open=0.975)
- 332 cross-market wallets (resolution arbitrage, not pre-event informed trading)
- Paper v1.0 recommendations: exclude negative-τ markets, split regulatory bimodal, expand CLOB collection, set informed-trading threshold at ILS_dl>0.25 + short-window>0.10

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…script, phase05 fixture

- reports/MADURO_VERIFICATION.md: full DOJ market-mapping analysis confirming all fficd-004 resolution outcomes are correct; no DB changes required
- reports/MADURO_DATA_VERIFICATION.md: preliminary verification (superseded by above)
- data/fficd-004-inventory.jsonl: 11-market structured inventory for Van Dyke/Venezuela cluster (7 YES + 4 NO), annotated with DOJ market names and fficd roles
- data/fixture_phase05.jsonl: 100-market fixture (34 corporate_disclosure, 35 regulatory_decision, 31 military_geopolitics) for phase 05 pipeline testing
- scripts/build_typology_dataset.py: DB → typology-v1.parquet + jsonl.gz extraction script; pyarrow dependency added to pyproject.toml
- .gitignore: add datasets/ (large binary outputs, not for git)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
New modules:
- fflow/scoring/bootstrap.py: bootstrap CI for ILS^dl (B=500, seed=20260430, resamples YES trades in [T_open, T_event], CI=NULL when <50 trades)
- fflow/news/t_event_recovery_v2.py: optimized T_event recovery — JSON output, granular confidence (0.9/0.8/0.7/0.5/0.0), no call cap, event-description cache (cheap Haiku), Haiku→Sonnet cascade at conf<0.7, cost alert at $40
- fflow/taxonomy/regulatory_split.py: keyword classifier splitting regulatory_decision into _announcement and _formal subtypes (10/10 on smoke-test cases)
- scripts/paper3a_phase1.py: full population pipeline driver — pre-filter (12,708→3,514 without LLM), Step 0 hard assert on Iran-Apr30 (T_event=2026-04-03, ILS^dl=0.113±0.02), async T_event recovery (cap=20), ILS^dl via existing compute_ils_deadline, Tasks 1.2-1.8 post-processing; dry run passes end-to-end

Sample size discrepancy noted: parquet gives 12,708 (cat+vol≥50K), Paper 1 reports 11,263. Logged at runtime; does not affect the pipeline.

Usage: uv run python scripts/paper3a_phase1.py --confirm [--skip-step0]
Budget: ~$28 (optimized) vs ~$120-150 (original estimate)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
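The percentile-bootstrap-with-minimum-sample rule described for bootstrap.py can be sketched as below; the parameter defaults mirror the commit message (B=500, seed=20260430, NULL under 50 trades), but the function shape is illustrative, not the library's API:

```python
import random
import statistics

def bootstrap_ci(values, stat=statistics.median, B=500, seed=20260430,
                 min_n=50, alpha=0.05):
    """Percentile bootstrap CI for a statistic of `values`.

    Returns None when the sample is too small to resample
    meaningfully, mirroring the CI=NULL rule for markets with fewer
    than 50 trades.
    """
    if len(values) < min_n:
        return None
    rng = random.Random(seed)  # fixed seed for reproducible CIs
    stats = sorted(
        stat([rng.choice(values) for _ in range(len(values))])
        for _ in range(B)
    )
    lo = stats[int((alpha / 2) * B)]
    hi = stats[int((1 - alpha / 2) * B) - 1]
    return lo, hi
```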
- Edge filter: |p_open-0.5|>=0.4 (was strict >); 2 markets with p_open=0.9
now correctly excluded as near-certain (n_ils: 90->88)
- FFIC localization (task_1_6): fixed market_id_prefix lookup (was always
returning empty string via m.get("market_id")); prefix match via
str.startswith(); enriched notes from pop_df exclusion_reason and source
typology for pre-filtered markets; added FFIC_JSONL_ALT2 path fallback
- ils_compute_error exclusion_reason now includes exception type
(e.g. "ils_compute_error: PriceLookupError") for all 230 failures
- Updated FFIC_JSONL_ALT2 constant for third known FFIC path
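The edge-filter boundary fix above (>= instead of strict >) reduces to a one-line predicate; a minimal sketch, with the function name hypothetical:

```python
def passes_edge_filter(p_open: float) -> bool:
    """Keep only markets whose opening price is informative.

    Markets with |p_open - 0.5| >= 0.4 are near-certain and excluded;
    using >= (rather than strict >) is what now catches p_open = 0.9.
    """
    return abs(p_open - 0.5) < 0.4
```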
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tputs

scripts/paper3a_revb.py implements all 8 reviewer-response tasks:
- B7: anchor sensitivity recomputed on 88-market post-fix sample
- B6: FFIC Bitcoin ETF T_event exact-match verification
- B5: parametric bootstrap KS (exp p=0.224, weibull p=0.043, lognormal p=0.083)
- B2: hazard-adjusted ILS with exponential decay model (D=T_resolve proxy)
- B3: bootstrap CIs on medians and fraction-positive for 6 cells
- B8: three-sample distribution reporting (all_computed, reg_ann, anchor_robust)
- B4: Haiku classification of 20 tail markets (all plausible_leakage)
- B1: T_event second-pass validation — 57.8% exact, 68.9% within-24h, $2.77

All 11 output CSVs/JSONs written to data/paper3a/revision1/.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
t_event_recovery_v2:
- Add provider field to TEventResult (anthropic/gemini/openai)
- Improve prompt: ask for physical event time, not press-report time
- Replace bare try/except with retry loop: 5 attempts with exponential backoff (10/20/40/60/60s), honouring Retry-After header on 429s
- Import anthropic at module level for RateLimitError handling

bootstrap.py:
- Fix tz_convert bug: use tz_convert instead of the tz= constructor kwarg for already-timezone-aware timestamps (was crashing on tz-aware input)

ils.py:
- Widen T_open forward price lookup window from 30 min to 24 h to cover illiquid/historical markets where the first trade arrives well after market creation (pre-CLOB era subgraph fallback)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
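The retry-with-backoff pattern described above can be sketched as follows. The schedule comes from the commit message; the injectable `sleep` hook and catching bare Exception are simplifications for testability — the real code catches anthropic.RateLimitError specifically:

```python
import time

BACKOFF_S = [10, 20, 40, 60, 60]  # per-attempt schedule from the commit

def call_with_retry(fn, *, sleep=time.sleep):
    """Call fn() with up to five attempts and exponential backoff.

    If the raised exception carries a retry_after value (e.g. parsed
    from a 429 Retry-After header), that delay is honoured instead of
    the scheduled one. The final failure is re-raised.
    """
    for attempt, delay in enumerate(BACKOFF_S):
        try:
            return fn()
        except Exception as exc:
            if attempt == len(BACKOFF_S) - 1:
                raise
            sleep(getattr(exc, "retry_after", None) or delay)
```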
paper3a_haiku_tevent.py: two-stage Haiku T_event recovery pipeline — Stage 1 (no tools, training knowledge, ~$0.0005/market) then Stage 2 (web_search for nulls, ~$0.05/market). Writes checkpoint JSONL per market.

backfill_clob_phase3a.py: backfills 1-minute CLOB prices for in-scope markets that have T_event but no CLOB coverage. Resumable via checkpoint.

synthesize_prices_from_trades.py: computes per-minute VWAP from YES-outcome subgraph trades as a CLOB fallback. Inserts only rows not covered by CLOB.

test_haiku_fast.py: diagnostic script — samples 20 markets, runs Haiku without web_search, reports hit-rate/confidence/cost. Used to benchmark Stage 1 before committing to the full run.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
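The per-minute VWAP synthesis idea can be sketched as below: bucket trades by minute and take the size-weighted mean price per bucket. The tuple layout is an assumption; the real script reads YES-outcome subgraph trades:

```python
from collections import defaultdict
from datetime import datetime

def minute_vwap(trades):
    """Per-minute VWAP from (timestamp, price, size) trade tuples.

    Each trade is assigned to its minute bucket; VWAP per bucket is
    total notional (price * size) divided by total size.
    """
    notional = defaultdict(float)
    volume = defaultdict(float)
    for ts, price, size in trades:
        minute = ts.replace(second=0, microsecond=0)
        notional[minute] += price * size
        volume[minute] += size
    return {m: notional[m] / volume[m] for m in notional}
```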
population_ils_dl.parquet / .csv (2,375 rows): full population dataset with ILS^dl scores (88 markets), T_event, exclusion chain, anchor window variants, and B2 hazard-adjusted columns (expected_decay_price, ils_dl_adj, b2_method).

Pipeline analysis outputs:
- filter_chain_attrition.csv — 6-stage attrition from 12,708 → 88 markets
- hazard_rates.csv — exponential hazard λ by category
- functional_form_comparison.csv / winners.csv — KS goodness-of-fit
- distribution_summary.csv / v2.csv — ILS^dl distribution by category×period
- detection_thresholds.csv — threshold analysis
- anchor_sensitivity_summary.csv — anchor-robust fraction (9-17% per category)
- ffic_localization.csv — FFIC case mapping to population
- ffic_concordance_test.csv — concordance test (1/32 cases with ILS^dl)
- ffic_classification_breakdown.csv — FFIC unclassifiable root cause
- regulatory_validation_sample.csv — 20-market regulatory random sample
- unclassifiable_sample.csv — 20-market unclassifiable sample

Excluded from commit (operational): t_event_checkpoint.jsonl, clob_backfill_checkpoint.jsonl, phase1_log.jsonl.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
# Conflicts:
#   .gitignore
#   fflow/cli.py
#   fflow/models.py
#   fflow/news/llm_match.py
#   fflow/scoring/pipeline.py
#   fflow/scoring/resolution_type.py
#   pyproject.toml
Summary
Full implementation of Paper 3a — population-scale Deadline-ILS (ILS^dl) detection on Polymarket, plus all Phase B revision computations for the reviewer response.
Core pipeline (Task 03 + Phase 1):
- Typology classification, including `unclassifiable` markets

`fflow` library fixes:
- `t_event_recovery_v2`: retry loop with exponential backoff + Retry-After, improved event-vs-press-date prompt, `provider` field on `TEventResult`
- `bootstrap.py`: tz_convert bug fix for timezone-aware timestamps
- `ils.py`: T_open price lookup window widened 30 min → 24 h for pre-CLOB markets
- `llm_providers.py`: new multi-tier Gemini/OpenAI/Sonnet cascade

Phase B revision computations (`scripts/paper3a_revb.py`):
- `expected_decay_price`, `ils_dl_adj`, `b2_method` columns

Data outputs:
- `data/paper3a/population_ils_dl.parquet` — 2,375-row population dataset (with B2 columns)
- `data/paper3a/revision1/` — 11 CSVs/JSONs for all B tasks

Dataset release: `polymarket-deadline-ils-v3` tagged and pushed to ForesightFlow/datasets.

Test plan
- `uv run pytest` — existing test suite passes
- `uv run python scripts/paper3a_revb.py --skip-llm` — all pure-computation tasks (B2–B8) complete without error
- `data/paper3a/population_ils_dl.parquet` loads and has expected 2,375 rows, 28 columns
- `data/paper3a/revision1/` contains all 11 output files

🤖 Generated with Claude Code