S2: Multi-model agents (configurable routing) + LLM observability #14
Joacocade wants to merge 9 commits into
Review against issue #8: Verdict: it meets the requirements for an S1 spike fairly well. I did not merge anything. Acceptance checklist:
Local checks I ran:
Minor observations:
Conclusion: for the scope of issue #8, I consider it mergeable once the maintainer gives the OK.
Route OASIS simulation agents to different LLMs via a YAML model map and record per-call telemetry (tokens, latency, estimated cost) so every agent action is traceable to the model that produced it. Fully opt-in via --model-map; single-model behavior is unchanged without it.
- model_router.py: load/validate model map, resolve ModelPolicy per agent (precedence by_agent_id > by_role > default), lazy CAMEL backend build. Secrets via env only (literal api_key rejected); fallback off by default.
- llm_telemetry.py: instrument the CAMEL backend INSTANCE (run/arun), not LLMClient, which is not in the agent LLM path, writing one JSONL record per call with cost estimation and leak flags.
- run_reddit_simulation.py: --model-map flag, per-agent routed backends, redacted model_routing_audit.jsonl, round-stamped telemetry.
- scripts/export_telemetry.py: standalone CSV + summary export (stdlib only).
- configs/model_map_example.yaml + model_prices.yaml, runs/smoke_multimodel/ recipe, docs/multimodel_agents.md.
- tests/test_model_routing.py: 21 tests (validation, precedence, secrets, cost, telemetry wrapper).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
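The precedence rule in this commit (by_agent_id > by_role > default) and the rejection of literal api_key values can be sketched roughly as follows. This is a minimal illustration, not the code in model_router.py: the map layout, the `api_key_env` key, the provider name, and the two-field `ModelPolicy` are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class ModelPolicy:
    """Illustrative shape only; the real ModelPolicy may carry more fields."""
    provider: str
    model: str


def validate_model_map(model_map: Mapping[str, Any]) -> None:
    """Require a default entry and refuse any literal api_key in the map."""
    if "default" not in model_map:
        raise ValueError("model map needs a 'default' entry")

    def walk(node: Any, path: str) -> None:
        if isinstance(node, Mapping):
            for key, value in node.items():
                if key == "api_key":
                    raise ValueError(f"literal api_key at {path}; use an env var")
                walk(value, f"{path}.{key}")

    walk(model_map, "model_map")


def resolve_policy(model_map: Mapping[str, Any], agent_id: int,
                   role: Optional[str] = None) -> ModelPolicy:
    """Pick the most specific entry: agent id, then role, then default."""
    by_agent = model_map.get("by_agent_id", {})
    by_role = model_map.get("by_role", {})
    if agent_id in by_agent:
        entry = by_agent[agent_id]
    elif role is not None and role in by_role:
        entry = by_role[role]
    else:
        entry = model_map["default"]
    return ModelPolicy(provider=entry["provider"], model=entry["model"])


# Assumed layout; "api_key_env" names an environment variable, not a secret.
EXAMPLE_MAP = {
    "default": {"provider": "example", "model": "model-default",
                "api_key_env": "EXAMPLE_API_KEY"},
    "by_role": {"moderator": {"provider": "example", "model": "model-role"}},
    "by_agent_id": {3: {"provider": "example", "model": "model-agent"}},
}
```

With this layout, agent 3 resolves to its own entry even when it also has a mapped role, and an agent with neither an id entry nor a mapped role falls through to the default.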
…LM telemetry (#21)
Supersedes the spike's inline agent_configs llm_* routing with the configurable agent_model_map.yaml router + per-call telemetry, as the spike itself called for. Spike evidence docs are preserved.
# Conflicts:
# backend/scripts/run_reddit_simulation.py
- run_parallel_simulation.py is single-model per platform (not wired): concurrent platforms make a shared sink.current_round racy; full wiring needs per-platform sinks/round contexts.
- SDK-internal retries are below the instrumented run()/arun(): one telemetry row per top-level call (final usage or final error).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
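The "one telemetry row per top-level call" behavior follows from where the wrapper sits. A rough sketch, assuming only that the backend object exposes a synchronous run method (arun would need an async twin): `instrument_backend`, `RetryingBackend`, and the record layout are hypothetical stand-ins, not the code in llm_telemetry.py.

```python
import hashlib
import io
import json
import time
from typing import Any, Callable, TextIO


def _digest(value: Any) -> str:
    """Short fingerprint so prompts and responses are not stored verbatim."""
    return hashlib.sha256(repr(value).encode("utf-8")).hexdigest()[:16]


def instrument_backend(backend: Any, sink: TextIO, *, agent_id: int, model: str,
                       current_round: Callable[[], int]) -> Any:
    """Patch run() on this one instance so each call writes one JSONL row."""
    original_run = backend.run  # bound method captured before patching

    def run(*args: Any, **kwargs: Any) -> Any:
        record = {
            "agent_id": agent_id,
            "model": model,
            "round": current_round(),
            "prompt_hash": _digest((args, kwargs)),
            "response_hash": None,
            "error": None,
        }
        started = time.perf_counter()
        try:
            response = original_run(*args, **kwargs)
            record["response_hash"] = _digest(response)
            return response
        except Exception as exc:
            record["error"] = f"{type(exc).__name__}: {exc}"
            raise
        finally:
            # Runs on success and on failure, so every call leaves a row.
            record["latency_ms"] = round((time.perf_counter() - started) * 1000, 3)
            sink.write(json.dumps(record) + "\n")

    backend.run = run  # instance attribute shadows the class method
    return backend


class RetryingBackend:
    """Stand-in for a backend whose SDK retries internally (hypothetical)."""

    def __init__(self, failures_before_success: int) -> None:
        self.failures_left = failures_before_success
        self.attempts = 0

    def run(self, messages: list) -> dict:
        while True:
            self.attempts += 1
            if self.failures_left == 0:
                return {"content": f"echo: {messages[-1]}"}
            self.failures_left -= 1  # swallowed retry, invisible to the wrapper
```

Because the retries happen inside the original run, the wrapper sees a single call and a single outcome, which is why retries do not show up as a separate field.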
…#21)
Closes the issue's 'Smoke run con 2 modelos reales' checkbox:
- 18 LLM calls, 9 per model; agents 0-9 -> gemini-2.5-flash-lite (by_agent_id), default -> gemini-3.1-flash-lite
- every call traceable to (model, provider, tokens, cost, round) in llm_telemetry.jsonl; routing audit + CSV/JSONL export committed
- adds the no-GPU variant (any multi-model OpenAI-compatible endpoint) alongside the original local-vLLM recipe
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
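A per-model call count like the one reported for the smoke run can be recomputed from the telemetry file with a few lines of stdlib Python. This is a sketch, not scripts/export_telemetry.py; it assumes one JSON object per line with the model, tokens_in, tokens_out, cost_usd_est, and error fields, and `summarize_by_model` is an illustrative name.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable


def summarize_by_model(lines: Iterable[str]) -> Dict[str, Dict[str, float]]:
    """Aggregate calls, errors, tokens, and estimated cost per model name."""
    totals: Dict[str, Dict[str, float]] = defaultdict(
        lambda: {"calls": 0, "errors": 0, "tokens_in": 0,
                 "tokens_out": 0, "cost_usd_est": 0.0}
    )
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines in the file
        record = json.loads(line)
        row = totals[record["model"]]
        row["calls"] += 1
        row["errors"] += 1 if record.get("error") else 0
        # Failed calls may carry null usage, so treat missing values as zero.
        row["tokens_in"] += record.get("tokens_in") or 0
        row["tokens_out"] += record.get("tokens_out") or 0
        row["cost_usd_est"] += record.get("cost_usd_est") or 0.0
    return dict(totals)
```

Since an open file iterates line by line, passing the handle for llm_telemetry.jsonl straight to `summarize_by_model` is enough to get the per-model totals.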
Review of the S2 update against #21: The PR has progressed a lot and already contains most of what issue #21 asks for: configurable model routing, per-call telemetry, docs, tests, and a smoke run with 2 real models. But the following points are still missing or should be fixed before it can be considered fully closed:
In short: functionally it appears very close to satisfying #21, but it is not ready while CI is red and while the canonical status of
field coverage (#21)
Addresses PR #14 review by @LucasErcolano:
- Canonical config file section: agent_model_map.yaml (runtime) vs configs/model_map_example.yaml (template) vs smoke evidence maps.
- Smoke run section now states the real 2-model run was executed (Gemini, no GPU) and is the final S2 evidence; this fixes the stale "deferred" wording that contradicted README.md.
- Telemetry: explicit Issue #21 required-field coverage table, retries documented as stable (SDK-internal, not a separate field).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
This PR now contains the final S2 feature (issue #21), built on top of the S1 spike it originally introduced.
docs/. `agent_model_map.yaml` + mandatory per-call LLM telemetry with CSV/JSONL export.
Linked issue
Closes #8
Closes #21
What changed (S2)
- `backend/app/services/model_router.py`: configurable routing with precedence `by_agent_id > by_role > default`, fallback off by default, secrets via env only (`validate_model_map` rejects literal `api_key` in YAML).
- `backend/app/services/llm_telemetry.py`: wraps the CAMEL backend instance (the actual agent LLM path; `LLMClient` is never in it). Per call: tokens in/out, latency, estimated cost, prompt/response hashes, JSON validity, errors, round.
- `backend/scripts/run_reddit_simulation.py`: `--model-map` flag, per-agent routed + instrumented backends, `model_routing_audit.jsonl` + `llm_telemetry.jsonl`, round threaded per `env.step`. Single-model behavior unchanged when no map is given.
- `scripts/export_telemetry.py`: standalone (stdlib-only) CSV + JSONL summary export.
- `configs/model_map_example.yaml`, `configs/model_prices.yaml`, `docs/multimodel_agents.md`, `runs/smoke_multimodel/`: real 2-model smoke run executed (Gemini endpoint, no GPU): 18 calls, 9 per model, artifacts committed (`llm_telemetry.jsonl`, routing audit, CSV/JSONL export).
- `backend/tests/test_model_routing.py`: 21 tests (mock providers), all green.
Config files (canonical vs example)
Issue #21 lists `agent_model_map.yaml` and `configs/model_map_example.yaml` as separate deliverables; they coexist by design:
- `configs/model_map_example.yaml`: annotated template. Copy it, don't run it.
- `agent_model_map.yaml` (user-provided, passed via `--model-map`): the canonical runtime config for a real run. Lives wherever the researcher puts it; the runner takes the path as an argument.
- `runs/smoke_multimodel/agent_model_map.yaml` + `agent_model_map.gemini.yaml`: frozen smoke evidence, not meant for reuse.
Documented under "Canonical config file" in `docs/multimodel_agents.md`.
Telemetry schema coverage (#21)
Every field the issue requires is present in `llm_telemetry.jsonl` and carried through to `telemetry.csv`: `agent_id`, `role`, `provider`, `model`, `prompt_hash`, `response_hash`, `tokens_in`, `tokens_out`, `latency_ms`, `round`, `cost_usd_est`, `temperature`, `output_valid_json` (JSON validation), `error`, `leak_flags`. Retries are explicitly not a separate field by design (SDK-internal, below the instrumented call); this is documented as stable in "Telemetry record schema" + "Scope & limitations". Full field-by-field mapping in `docs/multimodel_agents.md`.
Scope & limitations
run_parallel_simulation.pyis not wired (single-model per platform); documented indocs/multimodel_agents.md.experiment_runner.pyintegration (populateExperimentResultmultimodel fields) is deferred until S2: Wiki-backed Report Memory para auditoría temporal #20 (PR feat(issue-20): S2 Wiki-backed Report Memory #23) merges — will land as an isolated commit after rebase.How to test
🤖 Generated with Claude Code