Add issue 10 backtesting case and stabilize LLM JSON handling #15

Draft: BrunoDC-dev wants to merge 3 commits into main from feat/issue-10-backtesting-case-a

BrunoDC-dev (Collaborator) commented on May 25, 2026

Summary

  • Adds the backtesting case for issue #10 (S1 - Backtesting case A: simple self-verifiable event): Argentina vs Colombia, Copa America 2024, with cutoff-controlled input, hidden ground truth, run notes, raw outputs, translated output, and objective evaluation.
  • Records the successful run report_3736fb6ac644: predicted Argentina, matching the ground truth winner.
  • Hardens OpenAI-compatible LLM calls used by ontology/Graphiti flows with timeouts, retries, JSON response mode, and schema normalization for common provider deviations.
  • Keeps Graphiti node attribute extraction disabled by default to avoid oversized/truncated JSON during the pilot flow.
  • Merges latest main, including the report-agent localization guards that should help prevent Chinese report output going forward.
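
As an illustration of the hardening described above, here is a minimal sketch of a retry wrapper with JSON cleanup and schema normalization. It is not the code in this PR: `strip_code_fence`, `normalize_payload`, and `call_json_with_retries` are hypothetical names, and the `call` argument stands in for whatever OpenAI-compatible request the backend makes (where the timeout and a JSON response mode such as `response_format={"type": "json_object"}` would be configured).

```python
import json
import re
import time
from typing import Any, Callable


def strip_code_fence(text: str) -> str:
    """Remove a surrounding Markdown code fence if the provider added one."""
    match = re.match(r"^\s*`{3}(?:json)?\s*(.*?)\s*`{3}\s*$", text, re.DOTALL)
    return match.group(1) if match else text.strip()


def normalize_payload(payload: Any, list_keys: tuple[str, ...]) -> dict:
    """Coerce common deviations into the expected shape (hypothetical rules).

    - a bare list is wrapped under the first expected key
    - a missing or null list field becomes an empty list
    - a single object where a list was expected becomes a one-item list
    """
    if isinstance(payload, list):
        payload = {list_keys[0]: payload}
    if not isinstance(payload, dict):
        raise ValueError("expected a JSON object")
    for key in list_keys:
        value = payload.get(key)
        if value is None:
            payload[key] = []
        elif not isinstance(value, list):
            payload[key] = [value]
    return payload


def call_json_with_retries(
    call: Callable[[], str],
    list_keys: tuple[str, ...],
    max_attempts: int = 3,
    backoff_seconds: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Run `call`, parse its text as JSON, and retry on timeouts or bad JSON."""
    last_error: Exception | None = None
    for attempt in range(1, max_attempts + 1):
        try:
            raw = call()  # assumed to enforce its own request timeout
            return normalize_payload(json.loads(strip_code_fence(raw)), list_keys)
        except (ValueError, TimeoutError) as exc:  # JSONDecodeError is a ValueError
            last_error = exc
            if attempt < max_attempts:
                sleep(backoff_seconds * 2 ** (attempt - 1))  # exponential backoff
    raise RuntimeError(f"LLM call failed after {max_attempts} attempts") from last_error
```

Passing `sleep` as a parameter keeps the wrapper testable without real delays; a production version would likely also distinguish retryable HTTP errors from permanent ones.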

Linked issue

Closes #10

How to test

  • cd backend && PYTEST_DISABLE_PLUGIN_AUTOLOAD=1 uv run --frozen pytest ../tests/test_report_agent_resilience.py
  • cd backend && uv run --frozen python -m compileall app
  • npm run build

Validation

  • cd backend && PYTEST_DISABLE_PLUGIN_AUTOLOAD=1 uv run --frozen pytest ../tests/test_report_agent_resilience.py -> 21 passed
  • cd backend && uv run --frozen python -m compileall app -> passed
  • npm run build -> passed, with existing Vite chunk-size/dynamic-import warnings

Notes

  • Backtesting.pdf and prueba-comedor.txt were intentionally left untracked because they are local/reference artifacts, not required for the PR.
  • The successful report is objectively correct for the winner metric, but the original raw output had language-quality issues. The translated copy is included only as an analysis artifact, while the raw output remains preserved.

BrunoDC-dev (Collaborator, Author) commented:

Updated validation status:

  • The GitHub Actions validate check initially failed only because of the PR body format: the ## Linked issue and ## How to test sections were missing.
  • Those sections have now been added, with Closes #10 and the test steps.
  • The new validate run passed.

Local validation performed before opening the PR:

  • cd backend && PYTEST_DISABLE_PLUGIN_AUTOLOAD=1 uv run --frozen pytest ../tests/test_report_agent_resilience.py -> 21 passed
  • cd backend && uv run --frozen python -m compileall app -> passed
  • npm run build -> passed, with only the existing Vite warnings about chunk size/dynamic imports

Current status: the PR is in order as far as checks and validation go. It remains a draft so we can review it before marking it ready.

BrunoDC-dev (Collaborator, Author) commented:

Mapping against the acceptance criteria of issue #10:

  • Case sheet with domain, question, x, delta, and actual result: included in backtesting/case-a/README.md and backtesting/case-a/ground-truth.md.
  • Input documents versioned/listed with date/source: included in backtesting/case-a/input/source-01-opta-preview.txt and backtesting/case-a/input/source-02-conmebol-preview.txt; the source and cutoff details are documented in README.md.
  • The question contains no information later than x: the prompt is saved in backtesting/case-a/prompt.md, with an explicit cutoff of July 13, 2024 and without the final result.
  • The system output is saved: outputs are preserved in backtesting/case-a/output/report_2d2de41798cf.md and backtesting/case-a/output/report_3736fb6ac644.md; report_3736fb6ac644.es.md is also added as an auxiliary translation of the successful report.
  • The evaluation states hit/miss with a brief justification: documented in backtesting/case-a/evaluation.md.
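
The cutoff criterion above could be checked mechanically along these lines. This is only a sketch: the `SourceDoc` type, the file paths, and the publication dates in the usage example are hypothetical placeholders, not the real metadata of the case's input files; only the July 13, 2024 cutoff comes from this PR.

```python
from dataclasses import dataclass
from datetime import date

# Cutoff stated in this PR (x = July 13, 2024).
CUTOFF = date(2024, 7, 13)


@dataclass(frozen=True)
class SourceDoc:
    """Hypothetical record for one input document."""
    path: str
    published: date


def sources_after_cutoff(sources: list[SourceDoc], cutoff: date = CUTOFF) -> list[str]:
    """Return the paths of documents published after the cutoff (leaks)."""
    return [doc.path for doc in sources if doc.published > cutoff]


# Placeholder data for illustration only.
example_sources = [
    SourceDoc("input/example-preview.txt", date(2024, 7, 12)),
    SourceDoc("input/example-match-report.txt", date(2024, 7, 15)),
]
leaks = sources_after_cutoff(example_sources)
```

An empty result would mean that no listed document postdates the cutoff; it would not detect post-cutoff information pasted into an older document.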

Expected evidence:

  • Case folder/document: backtesting/case-a/
  • Input used: backtesting/case-a/input/
  • MiroFish output: backtesting/case-a/output/
  • Initial objective evaluation: backtesting/case-a/evaluation.md

Result of the evaluable run: MiroFish predicted Argentina and the ground truth was also Argentina, so the initial objective metric is recorded as a hit. Note: the raw report had language/quality problems, which are recorded in the evaluation, but they do not change the hit/miss outcome of the binary metric.
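
The binary winner metric described here amounts to a normalized string comparison. A minimal sketch, assuming the predicted winner has already been extracted from the report as a plain team name (the function names are hypothetical and are not taken from evaluation.md):

```python
def normalize_team(name: str) -> str:
    """Trim whitespace and ignore case so 'Argentina ' equals 'argentina'."""
    return name.strip().casefold()


def winner_metric(predicted: str, ground_truth: str) -> str:
    """Return 'hit' when the predicted winner matches the ground truth, else 'miss'."""
    return "hit" if normalize_team(predicted) == normalize_team(ground_truth) else "miss"


# Values from this PR: the run predicted Argentina and the ground truth is Argentina.
outcome = winner_metric("Argentina", "Argentina")
```

Report language or quality problems do not enter this comparison, which is why they are tracked separately from the hit/miss result.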
