S1 Backtesting Case C: PILOT-ARG-2025-Q1 evaluation + critical pipeline fixes #16
Open
LucasErcolano wants to merge 6 commits into
Successful run: 40/40 OASIS epochs, 121 Twitter actions, 18 agents. Report: structured Argentine 2025 political-economic prediction.
Patches:
- graphiti_backend.py: parallel chunk processing with asyncio.gather
- run_twitter_simulation.py: add PlatformActionLogger for actions.jsonl IPC (backend could not detect OASIS progress without this)
- mirofish_headless.py: save report + verdict files on success (previously only saved on BLOCKED/failed runs)
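The actions.jsonl IPC mentioned in the patch list can be pictured with a minimal sketch. This is not the PR's PlatformActionLogger: the class name `JsonlActionLogger`, the `read_progress` helper, and the field names (`event_type`, `round`, `agent_id`, `action`) are assumptions chosen for illustration. Only the event names and the twitter/actions.jsonl path come from the PR text.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


class JsonlActionLogger:
    """Hypothetical stand-in for PlatformActionLogger: one JSON object per line."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self.path.parent.mkdir(parents=True, exist_ok=True)

    def log(self, event_type: str, **fields) -> None:
        record = {
            "event_type": event_type,  # assumed field name
            "timestamp": datetime.now(timezone.utc).isoformat(),
            **fields,
        }
        # Append mode lets a separate process tail the file while it grows.
        with self.path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_progress(path: Path) -> dict:
    """Summarize the log the way a polling runner might (assumed logic)."""
    summary = {"rounds_done": 0, "actions": 0, "finished": False}
    if not path.exists():
        return summary
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            kind = json.loads(line).get("event_type")
            if kind == "round_end":
                summary["rounds_done"] += 1
            elif kind == "agent_action":
                summary["actions"] += 1
            elif kind == "simulation_end":
                summary["finished"] = True
    return summary


def demo() -> dict:
    """Write a two-round toy simulation log and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        log_path = Path(tmp) / "twitter" / "actions.jsonl"
        logger = JsonlActionLogger(log_path)
        logger.log("simulation_start", total_rounds=2)
        for round_num in (1, 2):
            logger.log("round_start", round=round_num)
            logger.log("agent_action", round=round_num, agent_id=0, action="CREATE_POST")
            logger.log("round_end", round=round_num)
        logger.log("simulation_end")
        return read_progress(log_path)


if __name__ == "__main__":
    print(demo())
```

Without a writer of this kind, a reader like `read_progress` never sees a `round_end` or `simulation_end` event, which matches the symptom described: the backend could not tell that OASIS was progressing or had finished.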
Applied 6-criterion rubric (1-5) to the TOP1 DeepInfra run output:
- Specificity: 3 (scenarios defined but no numerical probabilities)
- Plausibility: 4 (electoral Scenario B confirmed at ~40.7%)
- Coverage: 3 (missing seat estimates and explicit probabilities)
- Causal consistency: 4 (inflation-perception-vote chain is correct)
- Absence of hindsight information: 4 (no data leakage detected)
- Strategic utility: 3 (actionable alerts but no timelines/probabilities)
Key finding: MiroFish correctly identified the electoral scenario (35-42%) and inflation range (30-40%), but fell short on quantified probabilities and legislative impact estimates that the prompt specifically requested.
Closes #12
Linked issue
Closes #12
What this PR does
Completes S1 of Backtesting Case C (Argentina 2025) with a successful MiroFish run, an evaluation against a 1-5 rubric, and critical pipeline patches that allowed the run to complete.
Contents
1. Successful PILOT-ARG-2025-Q1 run (TOP1 DeepInfra)
POLL_01_CB_Consultora_Diciembre_2024.pdf (TOP1)
runs/headless/top1-deepinfra-pilot-arg-2025-q1-poll/
- run_manifest.json: provenance, status, IDs
- verdict_raw.json: full report data
- mirofish_report_raw.md: structured prediction (14,319 chars)
- request_trace.json: full API call trace
- run_config.json, run_hashes.json
2. S1 evaluation with rubric
File: cases/PILOT-ARG-2025-Q1/S1_evaluation.md
Key finding: MiroFish correctly identified the electoral scenario (LLA 35-42%, actual ~40.7%) and the inflation range (30-40%, actual ~30-35%), but did not assign numerical probabilities or estimate legislative impact as the prompt requested.
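As a quick sanity check on the rubric, the six scores reported for this run can be aggregated in a few lines. The PR does not state an overall score or a weighting, so the unweighted total and mean below are assumptions for illustration; the English key names and the `summarize` helper are also hypothetical.

```python
from statistics import mean

# Scores as reported for the TOP1 DeepInfra run (scale 1-5).
SCORES = {
    "specificity": 3,
    "plausibility": 4,
    "coverage": 3,
    "causal_consistency": 4,
    "absence_of_hindsight_information": 4,
    "strategic_utility": 3,
}


def summarize(scores: dict, lo: int = 1, hi: int = 5) -> dict:
    """Validate the range and compute an unweighted summary (assumed aggregation)."""
    for name, value in scores.items():
        if not lo <= value <= hi:
            raise ValueError(f"{name}={value} is outside the {lo}-{hi} scale")
    lowest = min(scores.values())
    return {
        "total": sum(scores.values()),
        "max_total": hi * len(scores),
        "mean": mean(scores.values()),
        # Criteria tied for the lowest score, in alphabetical order.
        "weakest": sorted(name for name, value in scores.items() if value == lowest),
    }


if __name__ == "__main__":
    print(summarize(SCORES))
```

Under that unweighted assumption the run totals 21 of 30 (mean 3.5), and the three lowest criteria are specificity, coverage, and strategic utility. These are the criteria whose justifications cite missing probabilities, seat estimates, or timelines.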
3. Critical pipeline patches
Three bugs prevented any run from completing the full flow:
a) backend/scripts/run_twitter_simulation.py: IPC actions.jsonl gap
- run_twitter_simulation.py did not generate twitter/actions.jsonl, the file that simulation_runner.py uses to detect OASIS progress and completion
- Added PlatformActionLogger to record simulation_start, round_start, round_end, per-agent actions, and simulation_end
b) tools/mirofish_headless.py: report not saved on success
- mirofish_report_raw.md and verdict_raw.json were saved only on failure (except path)
c) backend/app/graph/graphiti_backend.py: parallel chunk processing
- asyncio.gather with configurable GRAPHITI_CHUNK_PARALLELISM
4. Acceptance checklist for #12
- S1_evaluation.md and issue comment
- S1_evaluation.md
- runs/headless/top1-deepinfra-pilot-arg-2025-q1-poll/
How to test
No runtime test applies. This PR adds evaluation artifacts and pipeline patches that require:
Evaluation review: Read cases/PILOT-ARG-2025-Q1/S1_evaluation.md and verify the rubric scores are justified by the report content in runs/headless/top1-deepinfra-pilot-arg-2025-q1-poll/mirofish_report_raw.md.
Pipeline patches (manual verification only; requires Neo4j + DeepInfra API key):
- Run tools/mirofish_headless.py with any case input and --platform twitter --max-rounds 5
- twitter/actions.jsonl is generated in the simulation directory (patch a)
- mirofish_report_raw.md and verdict_raw.json appear in the output dir after completion (patch b)
Code review: Inspect the 3 patched files for correctness:
- backend/scripts/run_twitter_simulation.py: PlatformActionLogger integration
- tools/mirofish_headless.py: report save on success
- backend/app/graph/graphiti_backend.py: parallel chunk processing
Notes for S2