S2 temporal backtest: Bolivia 2025 runoff by BrunoDC-dev · Pull Request #24 · LucasErcolano/MiroFish

BrunoDC-dev · 2026-06-03T20:28:10Z

Summary

Addresses #17 by adding a political-social S2 temporal backtesting case for the 2025 Bolivia presidential runoff.

What changed

Added backtesting/case-b-s2-bolivia-2025-runoff/ with case card, manifest, temporal packages, question, rubric, private ground truth, evaluator, run notes, reports, and scored outputs.
Added four cumulative evidence packages: T0, T1, T2, T3.
Added RESULTS.md and ISSUE_RESPONSE.md with the full interpretation of each temporal run.
Added report-agent stability fixes needed to complete the runs reliably.
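
A minimal sketch of how the "cumulative" property of the T0 to T3 packages could be checked, assuming each package can be reduced to a set of document identifiers. The function name `check_cumulative` and the document IDs are hypothetical and are not taken from the repository.

```python
from typing import Dict, List, Set


def check_cumulative(packages: Dict[str, Set[str]], order: List[str]) -> List[str]:
    """Return one message per step where a later package drops earlier documents."""
    problems = []
    # Compare each package with the one that follows it in temporal order.
    for earlier, later in zip(order, order[1:]):
        missing = packages[earlier] - packages[later]
        if missing:
            problems.append(
                f"{later} is missing documents from {earlier}: {sorted(missing)}"
            )
    return problems


# Hypothetical document IDs, for illustration only.
example_packages = {
    "T0": {"doc-a"},
    "T1": {"doc-a", "doc-b"},
    "T2": {"doc-a", "doc-b", "doc-c"},
    "T3": {"doc-a", "doc-b", "doc-c", "doc-d"},
}
print(check_cumulative(example_packages, ["T0", "T1", "T2", "T3"]))
```

An empty list means every package contains all documents of the previous one; any message points at the first step where evidence was lost.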

Main findings

T0 failed to identify the correct runoff field with only early evidence.
T1 was the best run: after first-round surprise evidence, MiroFish predicted Rodrigo Paz correctly and nearly matched the final margin.
T2 and T3 shifted incorrectly toward Quiroga, showing salience/recency bias from platform framing and a late poll.
The T3 football-noise document did not materially affect the structured forecast.
No direct final-result leakage was found in the intended input packages.
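
The leakage finding above could be supported by a simple phrase scan over the input packages. The sketch below is an assumption about how such a check might look, not the method used in this PR: `find_leaks`, the package texts, and the forbidden phrase are all hypothetical placeholders. A literal phrase scan only catches direct leakage, so paraphrased hints would still need manual review.

```python
import re
from typing import Dict, Iterable, List


def _normalize(text: str) -> str:
    """Collapse whitespace and ignore case so line breaks do not hide a match."""
    return re.sub(r"\s+", " ", text).casefold()


def find_leaks(
    package_texts: Dict[str, str], forbidden_phrases: Iterable[str]
) -> Dict[str, List[str]]:
    """Map each package name to the forbidden phrases found in its text."""
    phrases = list(forbidden_phrases)  # allow generators to be reused per package
    hits: Dict[str, List[str]] = {}
    for name, text in package_texts.items():
        normalized = _normalize(text)
        found = [p for p in phrases if _normalize(p) in normalized]
        if found:
            hits[name] = found
    return hits


# Placeholder texts and phrase, for illustration only.
example_texts = {
    "T0": "early polls only",
    "T3": "note:  Placeholder final\nresult appears here",
}
print(find_leaks(example_texts, ["placeholder final result"]))
```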

Notes

Two caveats are documented in ISSUE_RESPONSE.md:

The issue requested seed_T0/T1/T2/T3; this PR uses assembled_T0.md through assembled_T3.md as the equivalent artifacts.
The saved runs are documented as Gemma probe runs, while the primary model policy target is Qwen. The temporal experiment is complete, but a strict primary-model pass can be added later if required.

Verification

python3 -m py_compile \
  backend/app/utils/llm_client.py \
  backend/app/services/zep_tools.py \
  backend/app/services/report_agent_quality_guards.py \
  backtesting/case-b-s2-bolivia-2025-runoff/eval_objective.py

The evaluator was run for T0, T1, T2, and T3; scored outputs are committed under the case output folders.

LucasErcolano · 2026-06-06T17:49:55Z

Qualitative gap analysis for properly closing the S2 issue (#17):

CI / PR Hygiene: currently fails because the body lacks the exact expected sections. Add:
- ## Linked issue with Closes #17
- ## How to test with commands/verification.
Automatic linking: GitHub does not detect closingIssuesReferences; use Closes #17 under the expected section.
T0/T1/T2/T3 artifacts: the issue asked for seed_T0/T1/T2/T3; the PR uses assembled_T0.md etc. For a strict close, add or rename the artifacts to seed_T0, seed_T1, seed_T2, seed_T3, or formally document the equivalence.
Run everything the issue asks for: leave evidence of complete T0, T1, T2, and T3 runs that use the same question.md and differ only in the available evidence, with the ground truth kept out of the input and an objective evaluation for each package.
Fixed primary model: the saved runs are listed as gemma_probe, while the issue asks for a fixed primary model so that architecture effects are not mixed with model effects. A clean pass with the primary model defined for S2 is missing; alternatively, document equivalent evidence if it already exists.
Seeds/replicas: if the S2 rule of 3 runs for the best condition applies to this issue, add those 3 runs and report mean, standard deviation, min/max range, narrative stability, cost per run, and failures/invalid parses.
Model ladder: if a model ladder is run, it must be kept separate from the main experiment. The T0/T1/T2/T3 comparison must be done with a fixed primary model.
Complexity gate: document an explicit checklist: at least 6 documents, 3 dates, 3 sources/types, 2 causal hypotheses, 1 temporally valid noise document, >20 entities, ground truth outside the input, post-cutoff event, defined metric.
Technical scope: the PR includes fixes in llm_client.py, zep_tools.py, and the quality guards. If they are needed to complete the run, explain that dependency; if not, split them out so the experiment is easier to audit.
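
As an illustration of the replica reporting requested above (mean, standard deviation, min/max range, invalid parses), here is a minimal sketch. It assumes each run yields a single numeric score, or `None` when the output could not be parsed; `summarize_replicas` and the example scores are hypothetical and do not come from the committed runs.

```python
import statistics
from typing import Dict, List, Optional


def summarize_replicas(scores: List[Optional[float]]) -> Dict[str, float]:
    """Summarize replica scores; None marks a failed or unparseable run."""
    valid = [s for s in scores if s is not None]
    if not valid:
        raise ValueError("no valid runs to summarize")
    return {
        "n_valid": len(valid),
        "n_invalid": len(scores) - len(valid),
        "mean": statistics.mean(valid),
        # Sample standard deviation needs at least two valid runs.
        "stdev": statistics.stdev(valid) if len(valid) > 1 else 0.0,
        "min": min(valid),
        "max": max(valid),
    }


# Made-up scores for three valid runs plus one invalid parse.
example_summary = summarize_replicas([0.5, 0.75, 1.0, None])
print(example_summary)
```

Narrative stability and cost per run are not covered by this sketch; they would need their own fields once the run logs define how they are measured.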

In summary: to close #17, the PR must demonstrate full compliance on the temporal packages, the fixed primary model, the required seeds/replicas, the complexity gate, and the objective evaluation. The current gemma_probe evidence is not enough for a strict close unless the requested primary configuration was also run.
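
The complexity gate checklist from the list above can be expressed as an explicit, auditable check. The thresholds below are the ones stated in this comment; the `CaseStats` dataclass, its field names, and `complexity_gate_failures` are hypothetical, and the counts would have to be filled in by hand or by a separate script.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CaseStats:
    documents: int
    distinct_dates: int
    source_types: int
    causal_hypotheses: int
    valid_noise_documents: int
    entities: int
    ground_truth_outside_input: bool
    event_is_post_cutoff: bool
    metric_defined: bool


def complexity_gate_failures(s: CaseStats) -> List[str]:
    """Return the labels of all checklist items that are not satisfied."""
    checks = [
        ("at least 6 documents", s.documents >= 6),
        ("at least 3 dates", s.distinct_dates >= 3),
        ("at least 3 sources/types", s.source_types >= 3),
        ("at least 2 causal hypotheses", s.causal_hypotheses >= 2),
        ("at least 1 temporally valid noise document", s.valid_noise_documents >= 1),
        ("more than 20 entities", s.entities > 20),
        ("ground truth outside the input", s.ground_truth_outside_input),
        ("post-cutoff event", s.event_is_post_cutoff),
        ("defined metric", s.metric_defined),
    ]
    return [label for label, ok in checks if not ok]


# Made-up counts: everything passes except the entity threshold.
example_stats = CaseStats(6, 3, 3, 2, 1, 20, True, True, True)
print(complexity_gate_failures(example_stats))
```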

BrunoDC-dev added 4 commits June 3, 2026 17:27

Stabilize report agent backtesting flow

b644b2a

Add Bolivia runoff temporal backtest

16ba3ff

Expand Bolivia runoff issue response

1cad2bb

Add issue 17 acceptance checklist

d67ff02

BrunoDC-dev mentioned this pull request Jun 3, 2026

S2 - Researcher 1: Qualitative case (Issue #12) + Line 1 - Temporal update with post-cutoff evidence #17

Open

16 tasks

BrunoDC-dev wants to merge 4 commits into main from feat/issue-17-bolivia-runoff-backtesting-pr

2 participants
