S1 Backtesting Case C: PILOT-ARG-2025-Q1 evaluation + critical pipeline fixes by LucasErcolano · Pull Request #16 · LucasErcolano/MiroFish

LucasErcolano · 2026-05-26T20:41:22Z

Linked issue

Closes #12

What this PR does

Completes S1 of Backtesting Case C (Argentina 2025) with a successful MiroFish run, an evaluation against a 1-5 rubric, and critical pipeline patches that allowed the run to finish.

Contents

1. Successful PILOT-ARG-2025-Q1 run (TOP1 DeepInfra)

Flow: headless → backend API → Graphiti graph build → OASIS simulation → Report generation (complete)
Input: POLL_01_CB_Consultora_Diciembre_2024.pdf (TOP1)
Result: 40/40 OASIS epochs, 121 twitter actions, 18 agents
Files: runs/headless/top1-deepinfra-pilot-arg-2025-q1-poll/
- run_manifest.json — provenance, status, IDs
- verdict_raw.json — full report data
- mirofish_report_raw.md — structured prediction (14,319 chars)
- request_trace.json — full API call trace
- run_config.json, run_hashes.json

2. S1 evaluation with rubric

File: cases/PILOT-ARG-2025-Q1/S1_evaluation.md

Criterion	Score
Specificity	3/5
Plausibility	4/5
Coverage	3/5
Causal consistency	4/5
No post-cutoff information	4/5
Strategic utility	3/5
Total	21/30 (70%)

Main finding: MiroFish correctly predicted the electoral scenario (LLA 35-42%, actual ~40.7%) and the inflation range (30-40%, actual ~30-35%), but it did not assign numerical probabilities or estimate the legislative impact, as the prompt requested.
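As a quick check of the rubric arithmetic, the total and percentage can be recomputed from the six per-criterion scores. This is only an illustrative sketch; the dictionary keys are shorthand labels chosen here, not identifiers from S1_evaluation.md.

```python
# Per-criterion scores as reported in the rubric table (each out of 5).
# Key names are illustrative shorthand, not taken from the repository.
scores = {
    "specificity": 3,
    "plausibility": 4,
    "coverage": 3,
    "causal_consistency": 4,
    "no_post_cutoff_info": 4,
    "strategic_utility": 3,
}

MAX_PER_CRITERION = 5

total = sum(scores.values())
maximum = MAX_PER_CRITERION * len(scores)
percent = 100 * total / maximum

print(f"{total}/{maximum} ({percent:.0f}%)")  # prints 21/30 (70%)
```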

3. Critical pipeline patches

Three bugs that prevented any run from completing the full flow:

a) backend/scripts/run_twitter_simulation.py: IPC actions.jsonl gap

run_twitter_simulation.py did not generate twitter/actions.jsonl, the file that simulation_runner.py uses to detect OASIS progress and completion
Result: the simulation completed 40 rounds but the backend never detected it (current_round=0 indefinitely)
Fix: add PlatformActionLogger to record simulation_start, round_start, round_end, per-agent actions, and simulation_end
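The IPC mechanism described in patch a) can be illustrated with a minimal sketch: a writer appends one JSON object per line to actions.jsonl, and a reader derives the current round from the round_end events. Everything here is an assumption made for illustration, including the class name ActionLogSketch, the field names event_type and round, and the current_round helper; the real PlatformActionLogger and simulation_runner.py may use a different schema.

```python
import json
import time
from pathlib import Path


class ActionLogSketch:
    """Hypothetical stand-in for PlatformActionLogger (schema is assumed)."""

    def __init__(self, sim_dir, platform="twitter"):
        self.path = Path(sim_dir) / platform / "actions.jsonl"
        self.path.parent.mkdir(parents=True, exist_ok=True)

    def log(self, event_type, **fields):
        record = {"event_type": event_type, "timestamp": time.time(), **fields}
        # Open in append mode per event; closing the file flushes the line,
        # so a process polling the file can see it right away.
        with self.path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def current_round(actions_path):
    """Return the highest round closed by a round_end event (0 if none)."""
    latest = 0
    path = Path(actions_path)
    if not path.exists():
        return latest
    for line in path.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        event = json.loads(line)
        if event.get("event_type") == "round_end":
            latest = max(latest, event.get("round", 0))
    return latest
```

With a writer like this in place, a reader that previously saw no file (and therefore reported round 0 forever) can advance as soon as the first round_end line is appended.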

b) tools/mirofish_headless.py: report not saved on success

The headless runner saved mirofish_report_raw.md and verdict_raw.json only on failure (the except path)
On success, it issued a GET for the report but discarded the content
Fix: write the markdown and verdict data to disk on success as well
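The shape of the fix in patch b) can be sketched as a single save helper that is called on both the success and the failure path, so the artifacts never depend on which branch ran. The function names save_report_artifacts and run_and_persist, the fetch_report callable, and the contents of the failure verdict are hypothetical; only the two output file names come from this PR.

```python
import json
from pathlib import Path


def save_report_artifacts(out_dir, report_markdown, verdict):
    """Write the report and verdict to disk; return the two paths written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    md_path = out / "mirofish_report_raw.md"
    verdict_path = out / "verdict_raw.json"
    md_path.write_text(report_markdown, encoding="utf-8")
    verdict_path.write_text(
        json.dumps(verdict, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return md_path, verdict_path


def run_and_persist(fetch_report, out_dir):
    """Call fetch_report() and persist its result whether it succeeds or fails.

    fetch_report is a hypothetical callable returning (markdown, verdict_dict).
    """
    try:
        markdown, verdict = fetch_report()
    except Exception as exc:
        # Failure path: still leave a trace on disk, then re-raise.
        save_report_artifacts(out_dir, "", {"status": "failed", "error": str(exc)})
        raise
    # Success path: the content fetched is written, not discarded.
    return save_report_artifacts(out_dir, markdown, verdict)
```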

c) backend/app/graph/graphiti_backend.py: parallel chunk processing

The original sequential version was extremely slow when calling the API
Fix: asyncio.gather with configurable GRAPHITI_CHUNK_PARALLELISM
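One common way to combine asyncio.gather with a configurable parallelism limit is to guard each task with a semaphore. The sketch below assumes that GRAPHITI_CHUNK_PARALLELISM is read from the environment and falls back to 4 when unset; the default value, the function name process_chunks, and the process_one callable are assumptions, and the patched graphiti_backend.py may bound concurrency differently.

```python
import asyncio
import os


async def process_chunks(chunks, process_one, parallelism=None):
    """Process chunks concurrently with at most `parallelism` in flight.

    process_one is a hypothetical async callable that handles one chunk.
    """
    if parallelism is None:
        # Assumed convention: env var with a fallback default of 4.
        parallelism = int(os.environ.get("GRAPHITI_CHUNK_PARALLELISM", "4"))
    semaphore = asyncio.Semaphore(max(1, parallelism))

    async def guarded(chunk):
        # The semaphore caps how many process_one calls run at once.
        async with semaphore:
            return await process_one(chunk)

    # asyncio.gather returns results in the same order as its inputs.
    return await asyncio.gather(*(guarded(c) for c in chunks))
```

Setting the limit to 1 reproduces sequential behaviour, which makes it easy to compare against the original version when debugging rate-limit errors.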

4. Acceptance checklist for #12

There is a case sheet with domain, question, x, delta, and actual outcome → in S1_evaluation.md and the issue comment
The input documents are listed and temporally controlled → all dated ≤ 31/01/2025
A 1-5 rubric with defined criteria exists → 6 criteria in S1_evaluation.md
The system output is saved → runs/headless/top1-deepinfra-pilot-arg-2025-q1-poll/
At least two people can apply the rubric, or the plan to do so in S2 is documented → notes in the evaluation

How to test

No automated runtime test applies. This PR adds evaluation artifacts and pipeline patches, which require the following manual checks:

Evaluation review: Read cases/PILOT-ARG-2025-Q1/S1_evaluation.md and verify the rubric scores are justified by the report content in runs/headless/top1-deepinfra-pilot-arg-2025-q1-poll/mirofish_report_raw.md.
Pipeline patches (manual verification only — requires Neo4j + DeepInfra API key):
- Start Neo4j + backend with DeepInfra env vars
- Run tools/mirofish_headless.py with any case input and --platform twitter --max-rounds 5
- Verify twitter/actions.jsonl is generated in the simulation directory (patch a)
- Verify mirofish_report_raw.md and verdict_raw.json appear in the output dir after completion (patch b)
Code review: Inspect the 3 patched files for correctness:
- backend/scripts/run_twitter_simulation.py — PlatformActionLogger integration
- tools/mirofish_headless.py — report save on success
- backend/app/graph/graphiti_backend.py — parallel chunk processing
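The two file checks for patches a) and b) could be scripted along these lines. This is a sketch only: the function name check_run_outputs is hypothetical, the simulation and output directories must be supplied by the reviewer, and the only content check is that each actions.jsonl line parses as JSON, since the event schema is not specified in this PR.

```python
import json
from pathlib import Path


def check_run_outputs(sim_dir, out_dir):
    """Return a list of problems found; an empty list means both checks passed."""
    problems = []

    # Patch a): twitter/actions.jsonl exists and holds at least one JSON line.
    actions = Path(sim_dir) / "twitter" / "actions.jsonl"
    if not actions.is_file():
        problems.append(f"missing {actions}")
    else:
        lines = [l for l in actions.read_text(encoding="utf-8").splitlines() if l.strip()]
        if not lines:
            problems.append(f"{actions} is empty")
        for number, line in enumerate(lines, start=1):
            try:
                json.loads(line)
            except json.JSONDecodeError:
                problems.append(f"{actions} line {number} is not valid JSON")

    # Patch b): both report artifacts exist and are non-empty.
    for name in ("mirofish_report_raw.md", "verdict_raw.json"):
        artifact = Path(out_dir) / name
        if not artifact.is_file() or artifact.stat().st_size == 0:
            problems.append(f"missing or empty {artifact}")

    return problems
```

A reviewer would call check_run_outputs with the simulation directory and the headless output directory after a short run and expect an empty list.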

Notes for S2

Run with TOP3 (POLL_01 + BCRA REM + BBVA Outlook) to improve macro coverage
Adjust the prompt/report template to force numerical probabilities
Add a seat-estimate section to the template
Inter-rater evaluation with at least 2 independent evaluators

Successful run: 40/40 OASIS epochs, 121 twitter actions, 18 agents.
Report: structured Argentine 2025 political-economic prediction.
Patches:
- graphiti_backend.py: parallel chunk processing with asyncio.gather
- run_twitter_simulation.py: add PlatformActionLogger for actions.jsonl IPC (backend could not detect OASIS progress without this)
- mirofish_headless.py: save report + verdict files on success (previously only saved on BLOCKED/failed runs)

Applied 6-criterion rubric (1-5) to the TOP1 DeepInfra run output:
- Specificity: 3 (scenarios defined but no numerical probabilities)
- Plausibility: 4 (electoral Scenario B confirmed at ~40.7%)
- Coverage: 3 (missing seat estimates and explicit probabilities)
- Causal consistency: 4 (inflation-perception-vote chain is correct)
- No post-cutoff information: 4 (no data leakage detected)
- Strategic utility: 3 (actionable alerts but no timelines/probabilities)
Key finding: MiroFish correctly identified the electoral scenario (35-42%) and inflation range (30-40%), but fell short on quantified probabilities and legislative impact estimates that the prompt specifically requested.
Closes #12

LucasErcolano added 6 commits May 22, 2026 17:04

chore: add Argentina 2025 pilot case artifacts

bb779c1

chore: keep pilot artifacts isolated to cases

f9a7596

Merge remote-tracking branch 'origin/main' into chore/pilot-arg-2025-…

baa2003

…q1-artifacts

refactor report agent localization guards

9c8fd0c

LucasErcolano self-assigned this May 26, 2026

LucasErcolano mentioned this pull request May 27, 2026

S2 - Researcher 1: Qualitative case (Issue #12) + Line 1: Temporal update with post-cutoff evidence #17

Open

16 tasks

LucasErcolano wants to merge 6 commits into main from chore/pilot-arg-2025-q1-artifacts
