S2 temporal backtest: Bolivia 2025 runoff #24

Draft
BrunoDC-dev wants to merge 4 commits into main from feat/issue-17-bolivia-runoff-backtesting-pr

@BrunoDC-dev

Collaborator

Summary

Addresses #17 by adding a political-social S2 temporal backtesting case for the 2025 Bolivia presidential runoff.

What changed

  • Added backtesting/case-b-s2-bolivia-2025-runoff/ with case card, manifest, temporal packages, question, rubric, private ground truth, evaluator, run notes, reports, and scored outputs.
  • Added four cumulative evidence packages: T0, T1, T2, T3.
  • Added RESULTS.md and ISSUE_RESPONSE.md with the full interpretation of each temporal run.
  • Added report-agent stability fixes needed to complete the runs reliably.
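
The cumulative property of the four evidence packages can be checked mechanically. A minimal sketch, assuming each package can be represented as a set of document identifiers; the `check_cumulative` helper and the identifiers are hypothetical and are not taken from this PR's manifest:

```python
from typing import Dict, List, Set


def check_cumulative(packages: Dict[str, Set[str]], order: List[str]) -> List[str]:
    """Return violations where a later package drops documents from an earlier one."""
    violations = []
    for earlier, later in zip(order, order[1:]):
        missing = packages[earlier] - packages[later]
        if missing:
            violations.append(f"{later} is missing {sorted(missing)} from {earlier}")
    return violations


# Hypothetical document identifiers, for illustration only.
example_packages = {
    "T0": {"doc_a"},
    "T1": {"doc_a", "doc_b"},
    "T2": {"doc_a", "doc_b", "doc_c"},
    "T3": {"doc_a", "doc_b", "doc_c", "doc_d"},
}
problems = check_cumulative(example_packages, ["T0", "T1", "T2", "T3"])
print(problems or "packages are cumulative")
```

An empty list means every package contains all documents of the one before it, which is what "cumulative" requires.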

Main findings

  • T0 failed to identify the correct runoff field with only early evidence.
  • T1 was the best run: after first-round surprise evidence, MiroFish predicted Rodrigo Paz correctly and nearly matched the final margin.
  • T2 and T3 shifted incorrectly toward Quiroga, showing salience/recency bias from platform framing and a late poll.
  • The T3 football-noise document did not materially affect the structured forecast.
  • No direct final-result leakage was found in the intended input packages.
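
A leakage check of this kind can be approximated with a plain string scan over the input packages. The sketch below is an assumption about how such a check might look, not the procedure used in this PR; `find_leaks`, the document texts, and the forbidden phrases are hypothetical placeholders.

```python
import re
from typing import Dict, Iterable, List, Tuple


def find_leaks(documents: Dict[str, str], forbidden: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (document name, phrase) pairs where a forbidden phrase appears."""
    phrases = list(forbidden)
    hits = []
    for name, text in documents.items():
        for phrase in phrases:
            # Case-insensitive literal match; re.escape keeps the phrase literal.
            if re.search(re.escape(phrase), text, flags=re.IGNORECASE):
                hits.append((name, phrase))
    return hits


# Placeholder texts: real phrases would be derived from the private ground truth.
docs = {
    "assembled_T0.md": "Early polling summary before the first round.",
    "assembled_T1.md": "First-round results surprised most observers.",
}
print(find_leaks(docs, ["wins the runoff", "final runoff result"]))
```

A literal scan only catches direct leakage. Paraphrased or indirect hints about the outcome would still need manual review.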

Notes

Two caveats are documented in ISSUE_RESPONSE.md:

  • The issue requested seed_T0/T1/T2/T3; this PR uses assembled_T0.md through assembled_T3.md as the equivalent artifacts.
  • The saved runs are documented as Gemma probe runs, while the primary model policy target is Qwen. The temporal experiment is complete, but a strict primary-model pass can be added later if required.

Verification

python3 -m py_compile \
  backend/app/utils/llm_client.py \
  backend/app/services/zep_tools.py \
  backend/app/services/report_agent_quality_guards.py \
  backtesting/case-b-s2-bolivia-2025-runoff/eval_objective.py

The evaluator was run for T0, T1, T2, and T3; scored outputs are committed under the case output folders.
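
For readers unfamiliar with objective scoring of an election forecast, the sketch below shows one common shape: winner correctness plus absolute error on the margin. It is not the logic of eval_objective.py, whose contents are not shown here; the `Forecast` type, its field names, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Forecast:
    winner: str
    margin_pts: float  # predicted winner's lead, in percentage points


def score(forecast: Forecast, true_winner: str, true_margin_pts: float) -> dict:
    """Score one forecast against ground truth kept outside the model input."""
    winner_correct = forecast.winner == true_winner
    # Express the predicted margin from the true winner's perspective:
    # a forecast for the wrong candidate counts as a negative margin.
    signed_margin = forecast.margin_pts if winner_correct else -forecast.margin_pts
    return {
        "winner_correct": winner_correct,
        "margin_abs_error": abs(signed_margin - true_margin_pts),
    }


# Placeholder values, not results from this PR.
print(score(Forecast("Candidate A", 8.0), "Candidate A", 10.0))
```

Signing the margin this way means a confident forecast for the wrong candidate is penalized more than a narrow one, which a plain absolute difference of the two margins would miss.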

@LucasErcolano

LucasErcolano commented Jun 6, 2026

Owner

Qualitative gap analysis for closing the S2 issue (#17) properly:

  • CI / PR Hygiene: currently fails because the body lacks the exact expected sections. Add:
    • ## Linked issue with Closes #17
    • ## How to test with commands/verification steps.
  • Automatic linking: GitHub does not detect closingIssuesReferences; use Closes #17 under the expected section.
  • T0/T1/T2/T3 artifacts: the issue asked for seed_T0/T1/T2/T3; the PR uses assembled_T0.md etc. For strict closure, add or rename artifacts to seed_T0, seed_T1, seed_T2, seed_T3, or formally document the equivalence.
  • Run everything the issue asks for: leave evidence of complete T0, T1, T2, and T3 runs with the same question.md, varying only the available evidence, with ground truth kept out of the input and an objective evaluation for each package.
  • Fixed primary model: the saved runs are recorded as gemma_probe, while the issue asks for a fixed primary model so that architecture effects are not confounded with model effects. A clean pass with the primary model defined for S2 is missing; alternatively, document equivalent evidence if it already exists.
  • Seeds/replicas: if the issue applies the S2 rule of running the best condition 3 times, add those 3 runs and report mean, standard deviation, min/max range, narrative stability, cost per run, and failures/invalid parses.
  • Model ladder: if a model ladder is run, it must be kept separate from the main experiment. The T0/T1/T2/T3 comparison must be done with a fixed primary model.
  • Complexity gate: document an explicit checklist: at least 6 documents, 3 dates, 3 sources/types, 2 causal hypotheses, 1 temporally valid noise item, >20 entities, ground truth outside the input, a post-cutoff event, and a defined metric.
  • Technical scope: the PR includes fixes in llm_client.py, zep_tools.py, and the quality guards. If they are needed to complete the run, explain that dependency; if not, split them out so the experiment is easier to audit.
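
The replica statistics requested in the seeds/replicas item (mean, deviation, min/max range) are simple to compute once per-run scores exist. A minimal sketch using Python's standard `statistics` module; the `summarize_runs` helper and the scores are hypothetical, not results from this PR:

```python
import statistics
from typing import Dict, Sequence


def summarize_runs(scores: Sequence[float]) -> Dict[str, float]:
    """Summarize per-run scores for one condition."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate a deviation")
    return {
        "n": len(scores),
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),  # sample standard deviation
        "min": min(scores),
        "max": max(scores),
    }


# Placeholder scores for three replicas of the same condition.
print(summarize_runs([0.5, 0.75, 1.0]))
```

Narrative stability, cost per run, and invalid parses are not covered by this sketch and would need their own fields in the report.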

In summary: to close #17, the PR must demonstrate full compliance on the temporal packages, fixed primary model, required seeds/replicas, complexity gate, and objective evaluation. The current gemma_probe evidence is not sufficient for strict closure unless the requested primary configuration was also run.
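
The complexity gate lends itself to an automated checklist. The sketch below encodes the numeric thresholds listed in this comment; the key names, the `complexity_gate` helper, and the example counts are hypothetical and do not describe the Bolivia case itself.

```python
from typing import Any, Dict, List

# Numeric thresholds from the complexity gate checklist; key names are hypothetical.
MINIMUMS = {
    "documents": 6,
    "dates": 3,
    "source_types": 3,
    "causal_hypotheses": 2,
    "noise_documents": 1,
    "entities": 21,  # the checklist asks for more than 20
}
REQUIRED_FLAGS = ["ground_truth_outside_input", "post_cutoff_event", "metric_defined"]


def complexity_gate(case: Dict[str, Any]) -> List[str]:
    """Return the checklist items a case fails; an empty list means the gate passes."""
    failures = [
        f"{key}: {case.get(key, 0)} < {minimum}"
        for key, minimum in MINIMUMS.items()
        if case.get(key, 0) < minimum
    ]
    failures += [flag for flag in REQUIRED_FLAGS if not case.get(flag, False)]
    return failures


# Placeholder counts, not the counts of the Bolivia case.
example = {
    "documents": 6, "dates": 3, "source_types": 3, "causal_hypotheses": 2,
    "noise_documents": 1, "entities": 20, "ground_truth_outside_input": True,
    "post_cutoff_event": True, "metric_defined": True,
}
print(complexity_gate(example))
```

Returning the list of failed items, rather than a single boolean, makes the result easy to paste into the PR as the explicit checklist the gate asks for.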
