[codex] S2 Issue 19 scheduled injection and ReportAgent evaluation by Joacocade · Pull Request #25 · LucasErcolano/MiroFish

Joacocade · 2026-06-08T00:42:59Z

Linked issue

Closes #19

Summary

Add scheduled intra-run Reddit injection support for Issue 19 using event_config.scheduled_events.
Add the S2 Issue 19 benchmark packet, acceptance response, evidence policy, V1/V2 evaluation artifacts, and DeepInfra robustness artifacts.
Add ReportAgent artifact_only mode so per-condition reports can be generated from isolated evidence bundles without shared graph/Zep tool reads.
Generate artifact-only ReportAgent outputs for Qwen, Gemma, and Llama across all six V2 conditions.
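To make the scheduled-injection item concrete, here is a minimal sketch of how events due at a given round could be selected. Only the `event_config.scheduled_events` key comes from this PR. The per-event fields (`round`, `content`), the sample values, and the `due_events` helper are assumptions for illustration and may not match the runner's actual schema.

```python
from typing import Any

# Hypothetical config shape: only the `scheduled_events` key name comes from the PR.
# The per-event fields (`round`, `content`) and their values are assumed.
event_config: dict[str, Any] = {
    "scheduled_events": [
        {"round": 3, "content": "Example injected post A"},
        {"round": 5, "content": "Example injected post B"},
    ]
}


def due_events(config: dict[str, Any], current_round: int) -> list[dict[str, Any]]:
    """Return the scheduled events whose round matches the current round."""
    events = config.get("scheduled_events", [])
    return [event for event in events if event.get("round") == current_round]
```

A runner loop would call `due_events(event_config, current_round)` once per round and inject whatever comes back. A round with no matching events yields an empty list.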

Evidence

Issue checklist: backtesting/case-a-s2-positional-noise/ISSUE_RESPONSE.md
V1 report: backtesting/case-a-s2-positional-noise/evaluation/final_issue_report.md
V2 report: backtesting/case-a-s2-positional-noise-v2/evaluation/final_v2_report.md
DeepInfra report: backtesting/case-a-s2-positional-noise-v2/evaluation_deepinfra/final_deepinfra_report.md
ReportAgent manifest: backtesting/case-a-s2-positional-noise-v2/evaluation_report_agent/report_agent_manifest.csv

ReportAgent manifest status: 18 rows, 18 completed, 0 errors.

How to test

From backend/:

uv run --frozen pytest ../tests/test_report_agent_resilience.py -q

Expected result: 23 passed.

ReportAgent artifact-only verification already completed for:

qwen/qwen3-8b
google/gemma-3-27b-it
meta-llama/Llama-3.3-70B-Instruct-Turbo

The committed manifest at backtesting/case-a-s2-positional-noise-v2/evaluation_report_agent/report_agent_manifest.csv has 18 rows, 18 completed, and 0 errors.
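The manifest counts can be re-checked with a few lines of standard-library Python. This is a sketch only: the column names (`model`, `condition`, `status`) and the `completed` status value are assumptions about the CSV layout, and the in-memory sample stands in for the real 18-row file.

```python
import csv
import io
from collections import Counter


def summarize_manifest(csv_text: str, status_column: str = "status") -> Counter:
    """Count manifest rows per value of the (assumed) status column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[status_column] for row in reader)


# Tiny in-memory stand-in for the manifest; column names are hypothetical.
sample = "model,condition,status\nm1,c1,completed\nm1,c2,completed\n"
counts = summarize_manifest(sample)
```

To run it against the committed file, pass the file's text instead of `sample` and compare `sum(counts.values())` and `counts["completed"]` with the figures stated above.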

Also verified no <tool_call>, <tool_code>, Thought, or tool-failure leakage in committed ReportAgent outputs.
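A leak check of this kind can be approximated with a small scanner. The sketch below covers the three literal markers named above. The exact matching rules (for example, treating `Thought` as a line-leading `Thought:` label) are assumptions, and tool-failure messages are not modeled because their wording is not given here.

```python
import re

# Markers named in the PR description; the exact matching rules are assumed.
LEAK_PATTERNS = [
    re.compile(r"<tool_call>"),
    re.compile(r"<tool_code>"),
    re.compile(r"^\s*Thought\s*:", re.MULTILINE),
]


def find_leaks(text: str) -> list[str]:
    """Return the regex patterns that match anywhere in the report text."""
    return [pattern.pattern for pattern in LEAK_PATTERNS if pattern.search(text)]
```

An empty list from `find_leaks` on every committed report would be consistent with the statement above.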

Notes

The runs/ SQLite/log artifacts remain local reproducibility evidence and are intentionally not committed. The committed PR evidence is the compact summaries, metrics, narrative scores, final reports, and ReportAgent outputs.

Route OASIS simulation agents to different LLMs via a YAML model map and record per-call telemetry (tokens, latency, estimated cost) so every agent action is traceable to the model that produced it. Fully opt-in via --model-map; single-model behavior is unchanged without it.

- model_router.py: load/validate model map, resolve ModelPolicy per agent (precedence by_agent_id > by_role > default), lazy CAMEL backend build. Secrets via env only (literal api_key rejected); fallback off by default.
- llm_telemetry.py: instrument the CAMEL backend INSTANCE (run/arun), not LLMClient (which is not in the agent LLM path), writing one JSONL record per call with cost estimation and leak flags.
- run_reddit_simulation.py: --model-map flag, per-agent routed backends, redacted model_routing_audit.jsonl, round-stamped telemetry.
- scripts/export_telemetry.py: standalone CSV + summary export (stdlib only).
- configs/model_map_example.yaml + model_prices.yaml, runs/smoke_multimodel/ recipe, docs/multimodel_agents.md.
- tests/test_model_routing.py: 21 tests (validation, precedence, secrets, cost, telemetry wrapper).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
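The precedence rule (by_agent_id > by_role > default) can be sketched as follows. The names `ModelPolicy`, `by_agent_id`, `by_role`, and `default` come from the commit message. The dataclass fields, the map layout, and the `resolve_policy` function are assumptions, not the contents of model_router.py.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class ModelPolicy:
    # Assumed minimal shape; the real ModelPolicy likely carries more fields.
    model: str


def resolve_policy(model_map: dict[str, Any], agent_id: int, role: Optional[str]) -> ModelPolicy:
    """Resolve a policy with precedence by_agent_id > by_role > default."""
    by_agent = model_map.get("by_agent_id", {})
    by_role = model_map.get("by_role", {})
    if agent_id in by_agent:
        return ModelPolicy(by_agent[agent_id])
    if role is not None and role in by_role:
        return ModelPolicy(by_role[role])
    return ModelPolicy(model_map["default"])


# Hypothetical map with placeholder model names.
example_map = {
    "default": "model-default",
    "by_role": {"moderator": "model-role"},
    "by_agent_id": {7: "model-agent"},
}
```

With this map, agent 7 gets `model-agent` even if its role is `moderator`, another moderator gets `model-role`, and every other agent falls through to `model-default`.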

…LM telemetry (#21)

Supersedes the spike's inline agent_configs llm_* routing with the configurable agent_model_map.yaml router + per-call telemetry, as the spike itself called for. Spike evidence docs are preserved.

# Conflicts:
#	backend/scripts/run_reddit_simulation.py

- run_parallel_simulation.py is single-model per platform (not wired): concurrent platforms make a shared sink.current_round racy; full wiring needs per-platform sinks/round contexts.
- SDK-internal retries are below the instrumented run()/arun(): one telemetry row per top-level call (final usage or final error).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
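The "one row per top-level call" behavior follows from wrapping the method on the backend instance: retries inside the wrapped call never pass through the wrapper again. A minimal sketch, assuming a synchronous `run` method only. The `instrument` helper, the record fields, and `FakeBackend` are hypothetical and do not reproduce llm_telemetry.py.

```python
import json
import time
from typing import Any, Callable


def instrument(backend: Any, sink: list[str], model: str) -> None:
    """Wrap backend.run on this instance only, recording one JSON line per call."""
    original: Callable[..., Any] = backend.run

    def wrapped(*args: Any, **kwargs: Any) -> Any:
        start = time.perf_counter()
        record: dict[str, Any] = {"model": model, "error": None}
        try:
            return original(*args, **kwargs)
        except Exception as exc:
            # Only the final error of the top-level call is visible here.
            record["error"] = type(exc).__name__
            raise
        finally:
            record["latency_s"] = time.perf_counter() - start
            sink.append(json.dumps(record))

    backend.run = wrapped  # instance attribute shadows the class method


class FakeBackend:
    """Stand-in for a model backend; not the CAMEL API."""

    def run(self, prompt: str) -> str:
        return prompt.upper()
```

Because the wrapper is set on the instance, other backend objects of the same class stay uninstrumented, which is what allows per-agent routing to attach a different model label to each backend.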

…#21)

Closes the issue's 'Smoke run con 2 modelos reales' ("Smoke run with 2 real models") checkbox:
- 18 LLM calls, 9 per model; agents 0-9 -> gemini-2.5-flash-lite (by_agent_id), default -> gemini-3.1-flash-lite
- every call traceable to (model, provider, tokens, cost, round) in llm_telemetry.jsonl; routing audit + CSV/JSONL export committed
- adds the no-GPU variant (any multi-model OpenAI-compatible endpoint) alongside the original local-vLLM recipe

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
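A per-model call count like the one quoted above can be recomputed from a JSONL telemetry file with the standard library. This is a sketch under assumptions: the `model` field name is taken from the tuple listed in the commit message, but the exact record layout of llm_telemetry.jsonl is not shown here, and the sample lines use placeholder values.

```python
import json
from collections import Counter


def calls_per_model(jsonl_text: str, field: str = "model") -> Counter:
    """Count telemetry records per model, skipping blank lines."""
    counts: Counter = Counter()
    for line in jsonl_text.splitlines():
        if line.strip():
            counts[json.loads(line)[field]] += 1
    return counts


# Placeholder records; the real file is expected to hold one line per LLM call.
sample_jsonl = '{"model": "model-a", "round": 1}\n{"model": "model-b", "round": 1}\n\n{"model": "model-a", "round": 2}\n'
per_model = calls_per_model(sample_jsonl)
```

Running the same function over the committed telemetry file would give the per-model split to compare with the smoke-run figures.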

Joacocade and others added 14 commits May 22, 2026 16:40

feat: spike per-agent local LLM routing

c62d1e1

chore: add Argentina 2025 pilot case artifacts

d68c4d6

chore: keep pilot artifacts isolated to cases

9010e8b

refactor report agent localization guards

b824bce

feat: add scheduled reddit injection runner

cbd7e6d

docs: add s2 issue19 benchmark packet

eb88f17

docs: add issue19 acceptance response

dcae4cf

docs: clarify issue19 evidence policy

f71f8a6

feat: add artifact-only report agent evaluation

3f77489

docs: polish report agent readiness notes

36db92a

Joacocade marked this pull request as ready for review June 8, 2026 01:39


Joacocade wants to merge 14 commits into main from codex/s2-issue19-baseline


3 participants
