Paper 3a: population-scale Deadline-ILS pipeline + Phase B revision #10

Merged
MaksymDS merged 15 commits into master from feat/typology-dataset-v1
Apr 30, 2026

@MaksymDS
Contributor

Summary

Full implementation of Paper 3a — population-scale Deadline-ILS (ILS^dl) detection on Polymarket, plus all Phase B revision computations for the reviewer response.

Core pipeline (Task 03 + Phase 1):

  • Rule-based resolution typology classifier (deadline_resolved / event_resolved / unclassifiable) applied to 911,237 markets
  • Three-tier T_event recovery: Claude Haiku (no-tools) → Haiku+web_search → Sonnet+web_search; 442/2,375 markets recovered with confidence ≥ 0.7
  • ILS^dl computed for 88 markets (price history coverage bottleneck); bootstrap CI B=500
  • FFIC localization: 1/32 canonical cases with ILS^dl (Bitcoin ETF, score=0.012); 12 key cases blocked by unclassifiable typology

fflow library fixes:

  • t_event_recovery_v2: retry loop with exponential backoff + Retry-After, improved event-vs-press-date prompt, provider field on TEventResult
  • bootstrap.py: tz_convert bug fix for timezone-aware timestamps
  • ils.py: T_open price lookup window widened 30min → 24h for pre-CLOB markets
  • llm_providers.py: new multi-tier Gemini/OpenAI/Sonnet cascade

Phase B revision computations (scripts/paper3a_revb.py):

  • B7: anchor robustness recomputed on 88-market sample (9–17% per category)
  • B6: FFIC Bitcoin ETF T_event exact-match verification
  • B5: parametric bootstrap KS — exp p=0.224, Weibull p=0.043, lognormal p=0.083
  • B2: hazard-adjusted ILS^dl via exponential decay (D=T_resolve proxy); adds expected_decay_price, ils_dl_adj, b2_method columns
  • B3: bootstrap CIs on medians + fraction-positive for all 6 category×period cells
  • B8: three-sample distribution reporting (all_computed / reg_announcement / anchor_robust)
  • B4: Haiku classification of 20 tail markets — all classified as plausible_leakage
  • B1: T_event second-pass validation on 50 stratified markets — 57.8% exact match, 68.9% within-24h (regulatory_formal lowest at 35.7%)

Data outputs:

  • data/paper3a/population_ils_dl.parquet — 2,375-row population dataset (with B2 columns)
  • data/paper3a/revision1/ — 11 CSVs/JSONs for all B tasks
  • Supporting CSVs: attrition chain, hazard rates, FFIC localization, distribution summaries

Dataset release: polymarket-deadline-ils-v3 tagged and pushed to ForesightFlow/datasets.

Test plan

  • uv run pytest — existing test suite passes
  • uv run python scripts/paper3a_revb.py --skip-llm — all pure-computation tasks (B2–B8) complete without error
  • data/paper3a/population_ils_dl.parquet loads and has expected 2,375 rows, 28 columns
  • data/paper3a/revision1/ contains all 11 output files

🤖 Generated with Claude Code

MaksymDS and others added 14 commits April 28, 2026 00:52
…/2 fixes

Phase 0:
- fflow/scoring/resolution_type.py: deadline regex classifier (v2) with
  classify_resolution_type() and classify_resolution_type_detailed()
- scripts/phase0_typology_audit.py: v1/v2 comparison + full corpus scan
- reports/TASK_03_TYPOLOGY_REFINEMENT.md: 6.18% deadline_resolved in corpus
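
A minimal sketch of how a rule-based deadline classifier of this kind could look. The patterns and the function name `classify_resolution_type_sketch` are hypothetical; the actual v2 rules in fflow/scoring/resolution_type.py are not reproduced here. Only the three output labels come from this PR.

```python
import re

# Hypothetical patterns for illustration only; not the v2 rules from the repo.
_MONTH = r"(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]*"
_DEADLINE_RE = re.compile(
    rf"\b(?:by|before|on or before)\s+(?:end of\s+)?"
    rf"(?:{_MONTH}\s+\d{{1,2}}|\d{{4}}|{_MONTH})\b",
    re.IGNORECASE,
)
_EVENT_RE = re.compile(r"\b(?:who will win|winner of)\b", re.IGNORECASE)


def classify_resolution_type_sketch(question: str) -> str:
    """Return one of the three typology labels for a market question."""
    if _DEADLINE_RE.search(question):
        return "deadline_resolved"
    if _EVENT_RE.search(question):
        return "event_resolved"
    return "unclassifiable"


print(classify_resolution_type_sketch("Will US forces enter Iran by April 30?"))
```

Checking the deadline pattern first is a design assumption: a question that names a cut-off date is treated as deadline-resolved even if it also mentions an outcome event.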

Phase 1:
- fflow/scoring/ils.py: compute_ils_deadline() (paper §7 ILS_dl formula)
  + _DEADLINE_LOOKBACK constant + multi-window variants
- fflow/scoring/pipeline.py: branches on resolution_type == "deadline_resolved";
  deadline path skips NewsTimestamp, uses synthetic t_news = t_resolve - 1h
- fflow/models.py: resolution_type column on Market + MarketLabel
- fflow/taxonomy/classifier.py: classify_type_batch() with bulk IN-clause updates
- fflow/cli.py: fflow taxonomy classify-type command; score batch includes
  deadline markets
- alembic/versions/0005_stub.py: no-op stub anchoring missing 0003-0005 chain
- alembic/versions/0006_market_labels_resolution_type.py: adds
  market_labels.resolution_type VARCHAR(30) + index
- fflow/config.py: extra="ignore" for pydantic-settings
- tests/test_ils_deadline.py: 27 tests (ILS_dl regimes, classifier, CLOB lag)

Post-Phase-1 fixes:
- _lookup_price: forward-only [t_open, t_open+30min] window for t_open lookups;
  accommodates CLOB indexing lag (~20 min typical); ±5 min symmetric for all
  other lookups
- classify_type_batch: bulk IN-clause UPDATEs grouped by resolution_type
  instead of per-row; 900K backfill in 2m41s
- Backfill run: 911,237 markets classified (56,316 deadline_resolved,
  1,145 event_resolved, 853,776 unclassifiable)
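
The bulk IN-clause update described above can be sketched as follows. This is an illustration under assumptions: it uses an in-memory SQLite database so it runs standalone, the `market_id` column name and the `chunk_size` value are hypothetical, and the real backfill presumably runs through the project's own database layer.

```python
import sqlite3
from collections import defaultdict
from typing import Dict


def bulk_update_resolution_type(
    conn: sqlite3.Connection, labels: Dict[str, str], chunk_size: int = 500
) -> int:
    """Issue one UPDATE per (resolution_type, chunk) instead of one per row."""
    by_type = defaultdict(list)
    for market_id, rtype in labels.items():
        by_type[rtype].append(market_id)

    n_statements = 0
    for rtype, ids in by_type.items():
        for start in range(0, len(ids), chunk_size):
            chunk = ids[start:start + chunk_size]
            placeholders = ",".join("?" * len(chunk))
            conn.execute(
                "UPDATE market_labels SET resolution_type = ? "
                f"WHERE market_id IN ({placeholders})",
                [rtype, *chunk],
            )
            n_statements += 1
    conn.commit()
    return n_statements


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE market_labels (market_id TEXT PRIMARY KEY, resolution_type TEXT)"
)
conn.executemany(
    "INSERT INTO market_labels (market_id) VALUES (?)", [("m1",), ("m2",), ("m3",)]
)
n = bulk_update_resolution_type(
    conn,
    {"m1": "deadline_resolved", "m2": "unclassifiable", "m3": "deadline_resolved"},
)
print(n)  # two statements for three rows: one per resolution_type
```

Chunking keeps each statement under the database's bound-parameter limit, which matters at the 900K-row scale mentioned above.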

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ITEM A — Tier 3 web search enabled:
- fflow/news/llm_match.py: add tools=[{type: web_search_20250305}] to API call
- Dual prompts: _SYSTEM_T_NEWS (event_resolved) vs _SYSTEM_T_EVENT (deadline YES)
- _MAX_TOKENS: 300 → 1024 (web search synthesis needs more tokens)
- Response parsing: concatenate all text blocks (web search response arrives in
  22+ interleaved server_tool_use + text fragments)
- Date parser: remove incorrect raw_date[:len(fmt)] slicing (format len ≠ output len)
- LLMTimestamp gains sources: tuple[str, ...] field
- Confidence: 0.80 when sources found (web search), 0.60 without
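
The response-parsing change in ITEM A can be illustrated with a short sketch. It assumes the response content is available as a list of dict-like blocks with a `type` key, which is a simplification of the SDK's block objects; `concat_text_blocks` is a hypothetical helper name.

```python
from typing import Any, Dict, List


def concat_text_blocks(content: List[Dict[str, Any]]) -> str:
    """Join the text of every 'text' block, skipping tool-use and result blocks."""
    return "".join(
        block.get("text", "") for block in content if block.get("type") == "text"
    )


# Simplified stand-in for an interleaved web-search response.
content = [
    {"type": "text", "text": "The event occurred on "},
    {"type": "server_tool_use", "name": "web_search", "input": {"query": "example"}},
    {"type": "text", "text": "2026-04-03."},
]
print(concat_text_blocks(content))
```

Reading only the first text block would return the fragment before the search call, which is why the fragments are joined before any date parsing.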

ITEM B — CLI tier3 branching by resolution_type (paper §7.2):
- deadline_resolved YES → recovery_mode=t_event ("when did event happen?")
- deadline_resolved NO  → skip (T_resolve is authoritative, no event occurred)
- event_resolved / unclassifiable → recovery_mode=t_news (existing behavior)
- Echo label, sources, notes to stdout
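
The branching in ITEM B amounts to a small decision function. The sketch below is illustrative: `recovery_mode_for` is a hypothetical name, and treating an unknown outcome on a deadline market as a skip is an assumption not stated above.

```python
from typing import Optional


def recovery_mode_for(resolution_type: str, outcome_yes: Optional[bool]) -> Optional[str]:
    """Return 't_event', 't_news', or None (skip) for a market."""
    if resolution_type == "deadline_resolved":
        # YES: ask when the event happened. NO: T_resolve is authoritative, so skip.
        return "t_event" if outcome_yes else None
    # event_resolved and unclassifiable keep the existing t_news behaviour.
    return "t_news"


print(recovery_mode_for("deadline_resolved", True))
```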

ITEM C — compute_ils_deadline accepts recovered T_event (paper §7.2):
- New param: t_event: datetime | None = None
- When provided: t_event_minus = t_event - 1 min → p(T_event^-) per paper
- When None: falls back to legacy proxy t_resolve - lookback (backward compat)
- Adds 't_event_recovered' flag when t_event is used
- pipeline.py: deadline YES markets look up NewsTimestamp for T_event and pass it;
  deadline NO markets stay on proxy path

Sanity test result (Iran Apr30 / US forces enter Iran):
  T_event recovered: 2026-04-03T00:00:00Z (F-15E rescue op, 8 sources)
  ILS_dl (paper §7.2): 0.113 vs legacy proxy -0.331 — materially different
  Cost: $0.0897/call (108K input + 777 output tokens, Haiku 4.5)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Fits S(τ)=exp(-λτ) per category over 50 Tier-3 T_event recoveries:
- military_geopolitics: λ=0.306/d, T½=2.3d (KS adequate, n=9)
- regulatory_decision:  λ=0.035/d, T½=19.9d (KS rejects exponential, n=15)
- corporate_disclosure: λ=0.156/d, T½=4.5d (KS adequate, n=5)
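
For reference, the relation between λ and T½ under S(τ)=exp(-λτ) is T½ = ln 2 / λ. The sketch below also shows the uncensored maximum-likelihood estimate λ = n / Στ; whether hazard_fit.py uses this estimator is an assumption, and both function names are hypothetical.

```python
import math
from typing import Iterable


def fit_exponential_rate(taus_days: Iterable[float]) -> float:
    """MLE of lambda for S(tau) = exp(-lambda * tau), assuming no censoring."""
    taus = [t for t in taus_days if t > 0]  # negative-tau markets carry no decay information
    if not taus:
        raise ValueError("need at least one positive tau")
    return len(taus) / sum(taus)


def half_life_days(lam: float) -> float:
    """Time at which S(tau) falls to 0.5."""
    return math.log(2) / lam


print(round(half_life_days(0.306), 1))  # military_geopolitics rate from the fit above
```

Recomputing T½ from the rounded rates can differ in the last digit from the values listed above, presumably because the reported λ values are themselves rounded.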

New: fflow/scoring/hazard_fit.py, scripts/phase2_hazard_estimation.py,
reports/TASK_03_HAZARD_ESTIMATION.md

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Recovered T_event for 16/18 FFIC (Iran/ceasefire) deadline-YES markets.
Key findings: most 'strike by date X' markets opened after strikes had
already begun (negative τ); positive-τ markets (JD Vance meeting τ=0.4d,
Pipeline strike τ=15.9d, Iran×US strike τ=9.5d) are candidates for ILS_dl.

Parser fix: strip trailing '**' markdown junk; add %Y-%m-%dT%H:%MZ format.
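
A sketch of a date parser along the lines of this fix. Only the `%Y-%m-%dT%H:%MZ` format and the trailing `**` stripping come from the commit message; the other formats, the UTC assumption, and the name `parse_llm_date` are illustrative.

```python
from datetime import datetime, timezone
from typing import Optional

# Formats tried in order; all but the second are assumptions.
_FORMATS = ("%Y-%m-%dT%H:%M:%SZ", "%Y-%m-%dT%H:%MZ", "%Y-%m-%d")


def parse_llm_date(raw: str) -> Optional[datetime]:
    """Parse an LLM-emitted date string, tolerating trailing markdown asterisks."""
    cleaned = raw.strip().rstrip("*").strip()
    for fmt in _FORMATS:
        try:
            # Parse the whole string: slicing it to len(fmt) would be wrong,
            # because a format string and its output differ in length.
            return datetime.strptime(cleaned, fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            continue
    return None


print(parse_llm_date("2026-04-03T00:00Z**"))
```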

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Computed ILS_dl for 2 FFIC markets with price data:
- US forces enter Iran Apr30: ILS_dl=+0.113, p_open=0.25, p_news=0.335 (t_event=Apr3)
- US x Iran ceasefire Apr7: ILS=None (low_information_market, p_open=0.975)
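
As a worked check, the Iran Apr30 figures above are consistent with a score of the form (p(T_event^-) - p_open) / (outcome - p_open) for a YES resolution. This form is inferred from the reported numbers, not taken from paper §7, and `ils_deadline_sketch` is a hypothetical name rather than the signature of compute_ils_deadline().

```python
from typing import Optional


def ils_deadline_sketch(
    p_open: float, p_event_minus: float, outcome: float = 1.0
) -> Optional[float]:
    """Share of the open-to-resolution price move already realised just before T_event."""
    denom = outcome - p_open
    if denom == 0:
        return None  # no room for the price to move
    return (p_event_minus - p_open) / denom


# Iran Apr30: p_open=0.25, p_news=0.335, YES outcome
print(round(ils_deadline_sketch(0.25, 0.335), 3))
```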

Wallet analysis (post-resolution window):
- Iran Apr30: HHI_top10=0.0573, $9.78M total notional
- 332 wallets active in both FFIC markets (cross-market coordination)
- Top cross-market wallet: $1.96M across both (0x7072dd52)

Tier-3 T_event recovered for 16/18 FFIC markets; trades are only available
post-T_event (resolution window), so pre-event wallet HHI is unavailable for these markets.

Also includes: classifier.py force flag (from Phase 1, previously uncommitted)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…lysis

Synthesizes all Phase 1-4 findings:
- Deadline-ILS formula, T_event recovery methodology, cost accounting ($7.47/83 calls)
- Hazard fits: λ_military=0.306/d (T½=2.3d), λ_regulatory=0.035/d (T½=19.9d, bimodal)
- FFIC Iran Apr30: ILS_dl=+0.113 (mild pre-event drift, no last-minute spike)
- FFIC ceasefire Apr7: ILS=None (low_information_market, p_open=0.975)
- 332 cross-market wallets (resolution arbitrage, not pre-event informed trading)
- Paper v1.0 recommendations: exclude negative-τ markets, split regulatory bimodal,
  expand CLOB collection, set informed-trading threshold at ILS_dl>0.25 + short-window>0.10

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…script, phase05 fixture

- reports/MADURO_VERIFICATION.md: full DOJ market-mapping analysis confirming all
  fficd-004 resolution outcomes are correct; no DB changes required
- reports/MADURO_DATA_VERIFICATION.md: preliminary verification (superseded by above)
- data/fficd-004-inventory.jsonl: 11-market structured inventory for Van Dyke/Venezuela
  cluster (7 YES + 4 NO), annotated with DOJ market names and fficd roles
- data/fixture_phase05.jsonl: 100-market fixture (34 corporate_disclosure,
  35 regulatory_decision, 31 military_geopolitics) for phase 05 pipeline testing
- scripts/build_typology_dataset.py: DB → typology-v1.parquet + jsonl.gz extraction
  script; pyarrow dependency added to pyproject.toml
- .gitignore: add datasets/ (large binary outputs, not for git)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
New modules:
- fflow/scoring/bootstrap.py: bootstrap CI for ILS^dl (B=500, seed=20260430,
  resamples YES trades in [T_open, T_event], CI=NULL when <50 trades)
- fflow/news/t_event_recovery_v2.py: optimized T_event recovery — JSON output,
  granular confidence (0.9/0.8/0.7/0.5/0.0), no call cap, event-description
  cache (cheap Haiku), Haiku→Sonnet cascade at conf<0.7, cost alert at $40
- fflow/taxonomy/regulatory_split.py: keyword classifier splitting
  regulatory_decision into _announcement and _formal subtypes (10/10 on
  smoke-test cases)
- scripts/paper3a_phase1.py: full population pipeline driver — pre-filter
  (12,708→3,514 without LLM), Step 0 hard assert on Iran-Apr30
  (T_event=2026-04-03, ILS^dl=0.113±0.02), async T_event recovery (cap=20),
  ILS^dl via existing compute_ils_deadline, Tasks 1.2-1.8 post-processing;
  dry-run passes end-to-end
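
A generic percentile-bootstrap sketch matching the parameters listed for bootstrap.py (B=500, seed=20260430, no CI below 50 observations). The statistic used here (the mean of the resampled values), the 95% level, and the name `bootstrap_ci` are assumptions; the module itself recomputes ILS^dl from resampled YES trades, which is not reproduced.

```python
import random
from statistics import mean
from typing import Callable, Optional, Sequence, Tuple


def bootstrap_ci(
    values: Sequence[float],
    statistic: Callable[[Sequence[float]], float] = mean,
    n_boot: int = 500,
    seed: int = 20260430,
    min_n: int = 50,
    alpha: float = 0.05,
) -> Optional[Tuple[float, float]]:
    """Percentile bootstrap CI; returns None when there are too few observations."""
    if len(values) < min_n:
        return None
    rng = random.Random(seed)  # fixed seed makes the interval reproducible
    n = len(values)
    stats = sorted(statistic(rng.choices(values, k=n)) for _ in range(n_boot))
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


print(bootstrap_ci([0.5] * 49))  # None: fewer than 50 observations
```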

Sample size discrepancy noted: parquet gives 12,708 (cat+vol≥50K),
Paper 1 reports 11,263. Logged at runtime; does not affect pipeline.

Usage: uv run python scripts/paper3a_phase1.py --confirm [--skip-step0]
Budget: ~$28 (optimized) vs ~$120-150 (original estimate)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Edge filter: |p_open-0.5|>=0.4 (was strict >); 2 markets with p_open=0.9
  now correctly excluded as near-certain (n_ils: 90->88)
- FFIC localization (task_1_6): fixed market_id_prefix lookup (was always
  returning empty string via m.get("market_id")); prefix match via
  str.startswith(); enriched notes from pop_df exclusion_reason and source
  typology for pre-filtered markets; added FFIC_JSONL_ALT2 path fallback
- ils_compute_error exclusion_reason now includes exception type
  (e.g. "ils_compute_error: PriceLookupError") for all 230 failures
- Updated FFIC_JSONL_ALT2 constant for third known FFIC path
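
The edge-filter boundary change can be sketched as below. The rounding step is a precaution added for this illustration (exact boundary values such as 0.1 are sensitive to floating-point representation) and is not claimed to be part of the repository code; `is_near_certain` is a hypothetical name.

```python
def is_near_certain(p_open: float, threshold: float = 0.4) -> bool:
    """True when the opening price is too close to 0 or 1 to be informative."""
    # Inclusive comparison: p_open = 0.9 is excluded (it passed under the strict '>').
    return round(abs(p_open - 0.5), 9) >= threshold


print(is_near_certain(0.9), is_near_certain(0.25))
```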

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tputs

scripts/paper3a_revb.py implements all 8 reviewer-response tasks:
- B7: anchor sensitivity recomputed on 88-market post-fix sample
- B6: FFIC Bitcoin ETF T_event exact-match verification
- B5: parametric bootstrap KS (exp p=0.224, weibull p=0.043, lognormal p=0.083)
- B2: hazard-adjusted ILS with exponential decay model (D=T_resolve proxy)
- B3: bootstrap CIs on medians and fraction-positive for 6 cells
- B8: three-sample distribution reporting (all_computed, reg_ann, anchor_robust)
- B4: Haiku classification of 20 tail markets (all plausible_leakage)
- B1: T_event second-pass validation — 57.8% exact, 68.9% within-24h, $2.77

All 11 output CSVs/JSONs written to data/paper3a/revision1/.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
t_event_recovery_v2:
- Add provider field to TEventResult (anthropic/gemini/openai)
- Improve prompt: ask for physical event time, not press-report time
- Replace bare try/except with retry loop: 5 attempts with exponential
  backoff (10/20/40/60/60s), honouring Retry-After header on 429s
- Import anthropic at module level for RateLimitError handling
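
The retry behaviour can be sketched as follows. `RateLimited` is a hypothetical stand-in for the provider's rate-limit exception, `call_with_retry` is an illustrative name, and not sleeping after the final attempt is an assumption about how the 10/20/40/60/60 s schedule is applied.

```python
import time
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical stand-in for a provider rate-limit error (HTTP 429)."""

    def __init__(self, retry_after: Optional[float] = None):
        super().__init__("rate limited")
        self.retry_after = retry_after  # seconds from a Retry-After header, if any


BACKOFF_SECONDS = (10, 20, 40, 60, 60)


def call_with_retry(
    fn: Callable[[], T],
    backoff: Sequence[float] = BACKOFF_SECONDS,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn up to len(backoff) times, waiting between rate-limited attempts."""
    for attempt, delay in enumerate(backoff, start=1):
        try:
            return fn()
        except RateLimited as exc:
            if attempt == len(backoff):
                raise  # out of attempts
            # Prefer the server's Retry-After hint over the local schedule.
            sleep(exc.retry_after if exc.retry_after is not None else delay)
    raise RuntimeError("backoff schedule is empty")


# Demo with a fake sleep so nothing actually waits.
calls = {"n": 0}
slept = []


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] == 1:
        raise RateLimited()
    if calls["n"] == 2:
        raise RateLimited(retry_after=3)
    return "ok"


result = call_with_retry(flaky, sleep=slept.append)
print(result, slept)
```

Injecting `sleep` keeps the loop testable without real delays.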

bootstrap.py:
- Fix timezone bug: use tz_convert instead of the tz= constructor kwarg
  for already timezone-aware timestamps (was crashing on tz-aware input)

ils.py:
- Widen T_open forward price lookup window from 30 min to 24 h to
  cover illiquid/historical markets where the first trade arrives well
  after market creation (pre-CLOB era subgraph fallback)
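
A sketch of the two lookup windows: forward-only for T_open (now 24 h) and symmetric ±5 min for other lookups. The function name, the sorted-list input, and the choice of "first observation in window" versus "nearest observation" are assumptions; the real _lookup_price may differ, for example by raising instead of returning None.

```python
from bisect import bisect_left
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


def lookup_price_sketch(
    prices: List[Tuple[datetime, float]],  # sorted by timestamp
    t: datetime,
    *,
    is_t_open: bool,
    forward_window: timedelta = timedelta(hours=24),
    symmetric_window: timedelta = timedelta(minutes=5),
) -> Optional[float]:
    times = [ts for ts, _ in prices]
    i = bisect_left(times, t)  # index of the first observation at or after t
    if is_t_open:
        # Forward-only: first observation inside [t, t + forward_window].
        if i < len(times) and times[i] - t <= forward_window:
            return prices[i][1]
        return None
    # Symmetric: nearest observation on either side, within the window.
    neighbours = [j for j in (i - 1, i) if 0 <= j < len(times)]
    if not neighbours:
        return None
    best = min(neighbours, key=lambda j: abs(times[j] - t))
    return prices[best][1] if abs(times[best] - t) <= symmetric_window else None


t_open = datetime(2021, 1, 1)
prices = [(t_open + timedelta(hours=3), 0.25), (t_open + timedelta(hours=5), 0.30)]
print(lookup_price_sketch(prices, t_open, is_t_open=True))
print(lookup_price_sketch(prices, t_open, is_t_open=True,
                          forward_window=timedelta(minutes=30)))
```

With the first trade three hours after creation, the 30-minute window finds nothing while the 24-hour window returns the first price, which is the case the widening targets.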

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
paper3a_haiku_tevent.py: two-stage Haiku T_event recovery pipeline —
  Stage 1 (no tools, training knowledge, ~$0.0005/market) then Stage 2
  (web_search for nulls, ~$0.05/market). Writes checkpoint JSONL per market.

backfill_clob_phase3a.py: backfills 1-minute CLOB prices for in-scope
  markets that have T_event but no CLOB coverage. Resumable via checkpoint.

synthesize_prices_from_trades.py: computes per-minute VWAP from YES-outcome
  subgraph trades as CLOB fallback. Inserts only rows not covered by CLOB.
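
The per-minute VWAP fallback can be sketched as below. The trade tuple layout `(timestamp, price, size)` and both function names are assumptions; the script itself reads from and inserts into the database rather than working on in-memory dicts.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, Tuple

Trade = Tuple[datetime, float, float]  # (timestamp, price, size), YES outcome only


def per_minute_vwap(trades: Iterable[Trade]) -> Dict[datetime, float]:
    """Volume-weighted average price per minute bucket."""
    notional: Dict[datetime, float] = defaultdict(float)
    volume: Dict[datetime, float] = defaultdict(float)
    for ts, price, size in trades:
        minute = ts.replace(second=0, microsecond=0)
        notional[minute] += price * size
        volume[minute] += size
    return {m: notional[m] / volume[m] for m in sorted(volume) if volume[m] > 0}


def fill_gaps(
    clob: Dict[datetime, float], synthetic: Dict[datetime, float]
) -> Dict[datetime, float]:
    """Keep every CLOB row; add synthetic rows only for minutes CLOB lacks."""
    return {**synthetic, **clob}


trades = [
    (datetime(2024, 1, 10, 12, 0, 10), 0.25, 100.0),
    (datetime(2024, 1, 10, 12, 0, 50), 0.50, 300.0),
    (datetime(2024, 1, 10, 12, 1, 5), 0.50, 10.0),
]
vwap = per_minute_vwap(trades)
print(vwap[datetime(2024, 1, 10, 12, 0)])  # (0.25*100 + 0.50*300) / 400
```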

test_haiku_fast.py: diagnostic script — samples 20 markets, runs Haiku
  without web_search, reports hit-rate/confidence/cost. Used to benchmark
  Stage 1 before committing to full run.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
population_ils_dl.parquet / .csv (2,375 rows):
  Full population dataset with ILS^dl scores (88 markets), T_event, exclusion
  chain, anchor window variants, and B2 hazard-adjusted columns
  (expected_decay_price, ils_dl_adj, b2_method).

Pipeline analysis outputs:
  filter_chain_attrition.csv   — 6-stage attrition from 12,708 → 88 markets
  hazard_rates.csv             — exponential hazard λ by category
  functional_form_comparison.csv / winners.csv — KS goodness-of-fit
  distribution_summary.csv / v2.csv — ILS^dl distribution by category×period
  detection_thresholds.csv     — threshold analysis
  anchor_sensitivity_summary.csv — anchor-robust fraction (9-17% per category)
  ffic_localization.csv        — FFIC case mapping to population
  ffic_concordance_test.csv    — concordance test (1/32 cases with ILS^dl)
  ffic_classification_breakdown.csv — FFIC unclassifiable root cause
  regulatory_validation_sample.csv — 20-market regulatory random sample
  unclassifiable_sample.csv    — 20-market unclassifiable sample

Excluded from commit (operational): t_event_checkpoint.jsonl,
clob_backfill_checkpoint.jsonl, phase1_log.jsonl.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@MaksymDS MaksymDS closed this Apr 30, 2026
@MaksymDS MaksymDS reopened this Apr 30, 2026
# Conflicts:
#	.gitignore
#	fflow/cli.py
#	fflow/models.py
#	fflow/news/llm_match.py
#	fflow/scoring/pipeline.py
#	fflow/scoring/resolution_type.py
#	pyproject.toml
@MaksymDS MaksymDS merged commit a2d2c59 into master Apr 30, 2026
1 check failed
@MaksymDS MaksymDS deleted the feat/typology-dataset-v1 branch May 1, 2026 14:14