[codex] S2 Issue 19 scheduled injection and ReportAgent evaluation #25
Open
Joacocade wants to merge 14 commits into
Route OASIS simulation agents to different LLMs via a YAML model map and record per-call telemetry (tokens, latency, estimated cost) so every agent action is traceable to the model that produced it. Fully opt-in via --model-map; single-model behavior is unchanged without it.

- model_router.py: load/validate model map, resolve ModelPolicy per agent (precedence by_agent_id > by_role > default), lazy CAMEL backend build. Secrets via env only (literal api_key rejected); fallback off by default.
- llm_telemetry.py: instrument the CAMEL backend INSTANCE (run/arun) rather than LLMClient, which is not in the agent LLM path; writes one JSONL record per call with cost estimation and leak flags.
- run_reddit_simulation.py: --model-map flag, per-agent routed backends, redacted model_routing_audit.jsonl, round-stamped telemetry.
- scripts/export_telemetry.py: standalone CSV + summary export (stdlib only).
- configs/model_map_example.yaml + model_prices.yaml, runs/smoke_multimodel/ recipe, docs/multimodel_agents.md.
- tests/test_model_routing.py: 21 tests (validation, precedence, secrets, cost, telemetry wrapper).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
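The precedence rule in this commit message (by_agent_id > by_role > default) can be illustrated with a minimal sketch. This is not the code in model_router.py: the ModelPolicy fields, the resolve_policy name, and the dict layout of the loaded map are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class ModelPolicy:
    # Hypothetical fields; the real ModelPolicy may carry more (or different) data.
    model: str
    provider: str = "openai-compatible"


def resolve_policy(
    model_map: Mapping[str, Any], agent_id: int, role: Optional[str] = None
) -> ModelPolicy:
    """Pick a policy with precedence by_agent_id > by_role > default."""
    by_agent = model_map.get("by_agent_id", {})
    by_role = model_map.get("by_role", {})
    if agent_id in by_agent:
        entry = by_agent[agent_id]
    elif role is not None and role in by_role:
        entry = by_role[role]
    else:
        entry = model_map["default"]
    # Secrets are expected to come from the environment, so a literal key is rejected.
    if "api_key" in entry:
        raise ValueError("literal api_key is not allowed; use an environment variable")
    return ModelPolicy(**entry)


# Assumed shape of a model map after YAML loading (placeholder model names).
example_map = {
    "default": {"model": "model-b"},
    "by_role": {"moderator": {"model": "model-c"}},
    "by_agent_id": {0: {"model": "model-a"}},
}
```

With this map, agent 0 resolves to model-a even when its role is moderator, any other moderator resolves to model-c, and everything else falls through to model-b.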
…LM telemetry (#21)

Supersedes the spike's inline agent_configs llm_* routing with the configurable agent_model_map.yaml router + per-call telemetry, as the spike itself called for. Spike evidence docs are preserved.

# Conflicts:
#	backend/scripts/run_reddit_simulation.py
- run_parallel_simulation.py is single-model per platform (not wired): concurrent platforms make a shared sink.current_round racy; full wiring needs per-platform sinks/round contexts.
- SDK-internal retries are below the instrumented run()/arun(): one telemetry row per top-level call (final usage or final error).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
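A minimal sketch of why instance-level instrumentation yields one telemetry row per top-level call: the wrapper sits above any retries the backend performs internally. FakeBackend, instrument_backend, the record fields, and the dict-shaped usage attribute are all hypothetical stand-ins, not the CAMEL API or the code in llm_telemetry.py; only run() is wrapped here, and arun() would follow the same pattern with an async wrapper.

```python
import json
import time
from pathlib import Path
from types import SimpleNamespace


class FakeBackend:
    """Stand-in backend that retries internally, below the wrapped run()."""

    def __init__(self):
        self.attempts = 0

    def run(self, messages):
        for _ in range(2):  # simulated SDK-internal retry loop
            self.attempts += 1
        return SimpleNamespace(usage={"total_tokens": 42})


def instrument_backend(backend, sink_path, model_name):
    """Wrap run() on this one instance and append one JSONL row per call."""
    original_run = backend.run

    def run(*args, **kwargs):
        start = time.perf_counter()
        record = {"model": model_name, "total_tokens": None, "error": None}
        try:
            response = original_run(*args, **kwargs)
            usage = getattr(response, "usage", None) or {}
            record["total_tokens"] = usage.get("total_tokens")
            return response
        except Exception as exc:
            record["error"] = type(exc).__name__  # final error only
            raise
        finally:
            record["latency_s"] = round(time.perf_counter() - start, 6)
            with Path(sink_path).open("a", encoding="utf-8") as fh:
                fh.write(json.dumps(record) + "\n")

    backend.run = run  # instance attribute shadows the class method
    return backend
```

Because the wrapper is attached to the instance, other backends of the same class stay uninstrumented, and the two internal attempts in FakeBackend still produce a single row.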
…#21)

Closes the issue's 'Smoke run con 2 modelos reales' (smoke run with 2 real models) checkbox:
- 18 LLM calls, 9 per model; agents 0-9 -> gemini-2.5-flash-lite (by_agent_id), default -> gemini-3.1-flash-lite
- every call traceable to (model, provider, tokens, cost, round) in llm_telemetry.jsonl; routing audit + CSV/JSONL export committed
- adds the no-GPU variant (any multi-model OpenAI-compatible endpoint) alongside the original local-vLLM recipe

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
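The per-model call counts and cost figures mentioned above can be derived from a telemetry JSONL file with the standard library alone. The sketch below is not scripts/export_telemetry.py; the field names (model, prompt_tokens, completion_tokens) and the price table layout (USD per million tokens, keyed by input/output) are assumptions, and the model names and prices are placeholders.

```python
import json
from collections import defaultdict


def summarize(lines, prices_per_mtok):
    """Aggregate calls, tokens, and estimated cost per model from JSONL lines."""
    summary = defaultdict(
        lambda: {"calls": 0, "prompt_tokens": 0, "completion_tokens": 0, "cost_usd": 0.0}
    )
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log
        row = json.loads(line)
        entry = summary[row["model"]]
        prompt = row.get("prompt_tokens") or 0
        completion = row.get("completion_tokens") or 0
        # Unknown models get a zero price rather than failing the export.
        price = prices_per_mtok.get(row["model"], {"input": 0.0, "output": 0.0})
        entry["calls"] += 1
        entry["prompt_tokens"] += prompt
        entry["completion_tokens"] += completion
        entry["cost_usd"] += prompt / 1e6 * price["input"] + completion / 1e6 * price["output"]
    return dict(summary)


# Placeholder rows and prices, for illustration only.
sample_lines = [
    '{"model": "model-a", "prompt_tokens": 1000000, "completion_tokens": 0}',
    '{"model": "model-a", "prompt_tokens": 1000000, "completion_tokens": 0}',
    '{"model": "model-b", "prompt_tokens": 0, "completion_tokens": 500000}',
]
sample_prices = {
    "model-a": {"input": 1.0, "output": 2.0},
    "model-b": {"input": 1.0, "output": 2.0},
}
```

Calling summarize(sample_lines, sample_prices) groups the three placeholder rows into two per-model entries; a real run would read the lines from llm_telemetry.jsonl and the prices from a file such as model_prices.yaml.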
Linked issue
Closes #19
Summary
- event_config.scheduled_events.
- artifact_only mode so per-condition reports can be generated from isolated evidence bundles without shared graph/Zep tool reads.
Evidence
- backtesting/case-a-s2-positional-noise/ISSUE_RESPONSE.md
- backtesting/case-a-s2-positional-noise/evaluation/final_issue_report.md
- backtesting/case-a-s2-positional-noise-v2/evaluation/final_v2_report.md
- backtesting/case-a-s2-positional-noise-v2/evaluation_deepinfra/final_deepinfra_report.md
- backtesting/case-a-s2-positional-noise-v2/evaluation_report_agent/report_agent_manifest.csv
ReportAgent manifest status: 18 rows, 18 completed, 0 errors.
How to test
From backend/:
Expected result: 23 passed.
ReportAgent artifact-only verification already completed for:
- qwen/qwen3-8b
- google/gemma-3-27b-it
- meta-llama/Llama-3.3-70B-Instruct-Turbo
The committed manifest at backtesting/case-a-s2-positional-noise-v2/evaluation_report_agent/report_agent_manifest.csv has 18 rows, 18 completed, and 0 errors.
Also verified no <tool_call>, <tool_code>, Thought, or tool-failure leakage in committed ReportAgent outputs.
Notes
The runs/ SQLite/log artifacts remain local reproducibility evidence and are intentionally not committed. The committed PR evidence is the compact summaries, metrics, narrative scores, final reports, and ReportAgent outputs.
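The manifest and leakage checks described under How to test could be reproduced with a short script along these lines. It is a sketch under stated assumptions: the status column name, its completed/error values, and the exact leak markers (including treating a line starting with "Thought:" as leakage) are guesses, not the manifest's documented schema or the check actually used for this PR.

```python
import csv
import io
import re
from collections import Counter

# Assumed markers: literal tool tags, or a line that starts with "Thought:".
LEAK_PATTERN = re.compile(r"<tool_call>|<tool_code>|^\s*Thought\s*:", re.MULTILINE)


def manifest_status(csv_text, status_column="status"):
    """Return (row_count, Counter of status values) for a manifest CSV."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return len(rows), Counter(row[status_column] for row in rows)


def has_leak(report_text):
    """True if the report text contains any of the assumed leak markers."""
    return LEAK_PATTERN.search(report_text) is not None


# Placeholder manifest content, for illustration only.
sample_manifest = "condition,model,status\nc1,model-a,completed\nc2,model-a,error\n"
```

Against the committed manifest, one would read the CSV from disk, confirm that the row count equals the completed count, and run has_leak over each committed ReportAgent output.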