
[codex] S2 Issue 19 scheduled injection and ReportAgent evaluation#25

Open
Joacocade wants to merge 14 commits into main from codex/s2-issue19-baseline

Joacocade (Collaborator) commented Jun 8, 2026

Linked issue

Closes #19

Summary

  • Add scheduled intra-run Reddit injection support for Issue 19 using event_config.scheduled_events.
  • Add the S2 Issue 19 benchmark packet, acceptance response, evidence policy, V1/V2 evaluation artifacts, and DeepInfra robustness artifacts.
  • Add ReportAgent artifact_only mode so per-condition reports can be generated from isolated evidence bundles without shared graph/Zep tool reads.
  • Generate artifact-only ReportAgent outputs for Qwen, Gemma, and Llama across all six V2 conditions.
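As a rough illustration of the scheduled-injection idea, the sketch below picks the events due in a given round from an `event_config` dictionary. Only the `event_config.scheduled_events` key comes from this PR; the `at_round` and `content` fields, the `ScheduledEvent` class, and `due_events` are hypothetical names chosen for the example and may not match the actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScheduledEvent:
    """Hypothetical shape of one entry in event_config.scheduled_events."""
    at_round: int  # assumed field: simulation round at which to inject
    content: str   # assumed field: text of the injected Reddit post


def due_events(event_config: dict, current_round: int) -> list:
    """Return the events scheduled for exactly `current_round`."""
    raw = event_config.get("scheduled_events", [])
    events = [ScheduledEvent(**item) for item in raw]
    return [event for event in events if event.at_round == current_round]


# Illustrative config only; the values are made up for the example.
event_config = {
    "scheduled_events": [
        {"at_round": 2, "content": "injected post A"},
        {"at_round": 5, "content": "injected post B"},
    ]
}

for rnd in range(1, 7):
    for event in due_events(event_config, rnd):
        print(f"round {rnd}: inject {event.content!r}")
```

Keeping the schedule in config rather than in code means the same runner can replay different injection positions without modification, which is presumably what a positional-sensitivity benchmark needs.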

Evidence

  • Issue checklist: backtesting/case-a-s2-positional-noise/ISSUE_RESPONSE.md
  • V1 report: backtesting/case-a-s2-positional-noise/evaluation/final_issue_report.md
  • V2 report: backtesting/case-a-s2-positional-noise-v2/evaluation/final_v2_report.md
  • DeepInfra report: backtesting/case-a-s2-positional-noise-v2/evaluation_deepinfra/final_deepinfra_report.md
  • ReportAgent manifest: backtesting/case-a-s2-positional-noise-v2/evaluation_report_agent/report_agent_manifest.csv

ReportAgent manifest status: 18 rows, 18 completed, 0 errors.

How to test

From backend/:

uv run --frozen pytest ../tests/test_report_agent_resilience.py -q

Expected result: 23 passed.

ReportAgent artifact-only verification already completed for:

  • qwen/qwen3-8b
  • google/gemma-3-27b-it
  • meta-llama/Llama-3.3-70B-Instruct-Turbo

The committed manifest at backtesting/case-a-s2-positional-noise-v2/evaluation_report_agent/report_agent_manifest.csv has 18 rows, 18 completed, and 0 errors.
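A reviewer could re-check the manifest counts with something like the sketch below. The 18 rows are consistent with three models across six V2 conditions. The `status` column name and the `completed` / `error` values are assumptions; adjust them to the real CSV header.

```python
import csv
from collections import Counter


def summarize_manifest(lines, status_column="status"):
    """Count rows by status. `lines` is any iterable of CSV lines (e.g. an open file)."""
    rows = list(csv.DictReader(lines))
    counts = Counter(row[status_column] for row in rows)
    return {
        "rows": len(rows),
        "completed": counts["completed"],  # assumed status value
        "errors": counts["error"],         # assumed status value
    }


# Made-up sample rows, not taken from the committed manifest.
sample = [
    "model,condition,status",
    "model-a,c1,completed",
    "model-a,c2,completed",
    "model-b,c1,error",
]
summary = summarize_manifest(sample)
print(summary)

# Against the real file (path from this PR), usage would look like:
# with open(".../evaluation_report_agent/report_agent_manifest.csv", newline="") as fh:
#     print(summarize_manifest(fh))
```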

Also verified no <tool_call>, <tool_code>, Thought, or tool-failure leakage in committed ReportAgent outputs.
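The leakage check can be approximated with a small scanner like the one below. This is a sketch, not the check that was actually run: the pattern set is an assumption, and in particular `Thought` is matched only as a ReAct-style `Thought:` line prefix.

```python
import re

# Assumed leak markers; extend to match whatever the real outputs could contain.
LEAK_PATTERNS = {
    "tool_call": re.compile(r"<tool_call>"),
    "tool_code": re.compile(r"<tool_code>"),
    "thought": re.compile(r"^\s*Thought\s*:", re.MULTILINE),
}


def find_leaks(text: str) -> list:
    """Return the names of all leak patterns found in `text`, in declaration order."""
    return [name for name, pattern in LEAK_PATTERNS.items() if pattern.search(text)]


clean_report = "## Report\nEngagement fell after the injected post."
leaky_report = "Thought: I should query the graph\n<tool_call>{}</tool_call>"

print(find_leaks(clean_report))
print(find_leaks(leaky_report))
```

Running `find_leaks` over every committed ReportAgent output and asserting an empty result for each would turn the manual verification into a regression test.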

Notes

The runs/ SQLite/log artifacts remain local reproducibility evidence and are intentionally not committed. The committed PR evidence is the compact summaries, metrics, narrative scores, final reports, and ReportAgent outputs.

Joacocade and others added 14 commits May 22, 2026 16:40
Route OASIS simulation agents to different LLMs via a YAML model map and
record per-call telemetry (tokens, latency, estimated cost) so every agent
action is traceable to the model that produced it. Fully opt-in via
--model-map; single-model behavior is unchanged without it.

- model_router.py: load/validate model map, resolve ModelPolicy per agent
  (precedence by_agent_id > by_role > default), lazy CAMEL backend build.
  Secrets via env only (literal api_key rejected); fallback off by default.
- llm_telemetry.py: instrument the CAMEL backend INSTANCE (run/arun) — not
  LLMClient, which is not in the agent LLM path — writing one JSONL record
  per call with cost estimation and leak flags.
- run_reddit_simulation.py: --model-map flag, per-agent routed backends,
  redacted model_routing_audit.jsonl, round-stamped telemetry.
- scripts/export_telemetry.py: standalone CSV + summary export (stdlib only).
- configs/model_map_example.yaml + model_prices.yaml, runs/smoke_multimodel/
  recipe, docs/multimodel_agents.md.
- tests/test_model_routing.py: 21 tests (validation, precedence, secrets,
  cost, telemetry wrapper).
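The precedence rule (by_agent_id > by_role > default) and the env-only secrets rule might look roughly like this. `ModelPolicy` is named in the commit message, but its fields here (`model`, `api_key_env`), the helper names, and the map layout are assumptions for illustration, not the contents of model_router.py.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelPolicy:
    model: str
    api_key_env: str  # assumed field: NAME of an env var, never the key itself


def validate_entry(entry: dict) -> ModelPolicy:
    """Reject literal secrets and build a policy from one map entry."""
    if "api_key" in entry:
        raise ValueError("literal api_key is not allowed; reference an env var instead")
    return ModelPolicy(model=entry["model"], api_key_env=entry["api_key_env"])


def resolve_policy(model_map: dict, agent_id: int, role: Optional[str]) -> ModelPolicy:
    """Resolve with precedence by_agent_id > by_role > default."""
    by_agent = model_map.get("by_agent_id", {})
    by_role = model_map.get("by_role", {})
    if agent_id in by_agent:
        return validate_entry(by_agent[agent_id])
    if role is not None and role in by_role:
        return validate_entry(by_role[role])
    return validate_entry(model_map["default"])


# Hypothetical map; model names and env var names are placeholders.
model_map = {
    "default": {"model": "model-default", "api_key_env": "DEFAULT_KEY"},
    "by_role": {"moderator": {"model": "model-role", "api_key_env": "ROLE_KEY"}},
    "by_agent_id": {0: {"model": "model-agent", "api_key_env": "AGENT_KEY"}},
}

print(resolve_policy(model_map, 0, "moderator").model)  # agent id wins over role
print(resolve_policy(model_map, 7, "moderator").model)  # role wins over default
print(resolve_policy(model_map, 7, None).model)         # falls back to default
```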

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…LM telemetry (#21)

Supersedes the spike's inline agent_configs llm_* routing with the
configurable agent_model_map.yaml router + per-call telemetry, as the
spike itself called for. Spike evidence docs are preserved.

# Conflicts:
#	backend/scripts/run_reddit_simulation.py
- run_parallel_simulation.py is single-model per platform (not wired):
  concurrent platforms make a shared sink.current_round racy; full wiring
  needs per-platform sinks/round contexts.
- SDK-internal retries are below the instrumented run()/arun(): one
  telemetry row per top-level call (final usage or final error).
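To make the "one row per top-level call" behavior concrete, here is a minimal sketch of instance-level instrumentation. `FakeBackend` is a stand-in, not the CAMEL class, and the row fields are assumptions; an async `arun` wrapper would follow the same pattern. Because the wrapper sits above `run`, any retries inside the wrapped call are invisible to it and only the final outcome is recorded.

```python
import json
import time
from typing import Any, Callable


class FakeBackend:
    """Stand-in for a model backend (hypothetical; not the CAMEL API)."""

    def __init__(self, fail: bool = False):
        self.fail = fail

    def run(self, messages: list) -> dict:
        if self.fail:
            raise RuntimeError("upstream error")
        return {"text": "ok", "usage": {"total_tokens": 12}}


def instrument(backend: Any, write_row: Callable[[str], None], model: str) -> None:
    """Patch `run` on this one instance so each top-level call emits one JSONL row."""
    original_run = backend.run  # bound method captured before patching

    def run_with_telemetry(*args, **kwargs):
        start = time.perf_counter()
        row = {"model": model, "error": None, "total_tokens": None}
        try:
            result = original_run(*args, **kwargs)
            row["total_tokens"] = result["usage"]["total_tokens"]
            return result
        except Exception as exc:
            row["error"] = repr(exc)  # final error only; inner retries are not seen
            raise
        finally:
            row["latency_s"] = round(time.perf_counter() - start, 6)
            write_row(json.dumps(row))

    backend.run = run_with_telemetry  # instance attribute shadows the class method


rows = []
ok_backend = FakeBackend()
instrument(ok_backend, rows.append, "model-a")
ok_backend.run([])

bad_backend = FakeBackend(fail=True)
instrument(bad_backend, rows.append, "model-b")
try:
    bad_backend.run([])
except RuntimeError:
    pass

print(len(rows))
```

Patching the instance rather than the class keeps other backends untouched, which matters when each agent is routed to its own backend.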

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…#21)

Closes the issue's 'Smoke run with 2 real models' checkbox:
- 18 LLM calls, 9 per model; agents 0-9 -> gemini-2.5-flash-lite
  (by_agent_id), default -> gemini-3.1-flash-lite
- every call traceable to (model, provider, tokens, cost, round) in
  llm_telemetry.jsonl; routing audit + CSV/JSONL export committed
- adds the no-GPU variant (any multi-model OpenAI-compatible endpoint)
  alongside the original local-vLLM recipe
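A per-model rollup of the telemetry JSONL, in the spirit of the export script, could be sketched as follows. The record fields (`model`, `prompt_tokens`, `completion_tokens`), the price table, and every number below are hypothetical placeholders, not values from llm_telemetry.jsonl or model_prices.yaml.

```python
import json
from collections import defaultdict

# Hypothetical prices in USD per 1M tokens; not real provider pricing.
PRICES = {
    "model-a": {"input": 0.10, "output": 0.40},
    "model-b": {"input": 0.20, "output": 0.80},
}


def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Estimated cost in USD from token counts and per-million-token prices."""
    price = PRICES[model]
    return (prompt_tokens * price["input"] + completion_tokens * price["output"]) / 1_000_000


def summarize(jsonl_lines) -> dict:
    """Aggregate call count, total tokens, and estimated cost per model."""
    summary = defaultdict(lambda: {"calls": 0, "tokens": 0, "cost_usd": 0.0})
    for line in jsonl_lines:
        record = json.loads(line)
        entry = summary[record["model"]]
        entry["calls"] += 1
        entry["tokens"] += record["prompt_tokens"] + record["completion_tokens"]
        entry["cost_usd"] += estimate_cost(
            record["model"], record["prompt_tokens"], record["completion_tokens"]
        )
    return dict(summary)


# Made-up records for illustration only.
sample_lines = [
    json.dumps({"model": "model-a", "prompt_tokens": 1000, "completion_tokens": 200}),
    json.dumps({"model": "model-a", "prompt_tokens": 500, "completion_tokens": 100}),
    json.dumps({"model": "model-b", "prompt_tokens": 800, "completion_tokens": 50}),
]
report = summarize(sample_lines)
for model, stats in sorted(report.items()):
    print(model, stats["calls"], stats["tokens"], f"{stats['cost_usd']:.6f}")
```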

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Joacocade marked this pull request as ready for review June 8, 2026 01:39

Development

Successfully merging this pull request may close these issues.

S2 - Researcher 3: Positional sensitivity (Line 3) + Temporal noise (Line 4) combined

3 participants