S2: Multi-model agents (configurable routing) + LLM observability by Joacocade · Pull Request #14 · LucasErcolano/MiroFish

Joacocade · 2026-05-22T19:43:37Z

Summary

This PR now contains the final S2 feature (issue #21), built on top of the S1 spike it originally introduced.

S1 spike (original scope, by @Joacocade): validated that two OASIS Reddit agents can be pinned to two different OpenAI-compatible local model endpoints. Evidence docs preserved in docs/.
S2 feature (issue #21, "S2 - Dev 2: Multi-model agents como feature final + Observabilidad"): converts the spike into a configurable, auditable feature: per-agent/role model routing via agent_model_map.yaml, plus mandatory per-call LLM telemetry with CSV/JSONL export.

Linked issue

Closes #8
Closes #21

What changed (S2)

backend/app/services/model_router.py — configurable routing with precedence by_agent_id > by_role > default, fallback off by default, secrets via env only (validate_model_map rejects literal api_key in YAML).
backend/app/services/llm_telemetry.py — wraps the CAMEL backend instance (the actual agent LLM path; LLMClient is never in it). Per call: tokens in/out, latency, estimated cost, prompt/response hashes, JSON validity, errors, round.
backend/scripts/run_reddit_simulation.py — --model-map flag, per-agent routed + instrumented backends, model_routing_audit.jsonl + llm_telemetry.jsonl, round threaded per env.step. Single-model behavior unchanged when no map is given.
scripts/export_telemetry.py — standalone (stdlib-only) CSV + JSONL summary export.
configs/model_map_example.yaml, configs/model_prices.yaml, docs/multimodel_agents.md, runs/smoke_multimodel/ — real 2-model smoke run executed (Gemini endpoint, no GPU): 18 calls, 9 per model, artifacts committed (llm_telemetry.jsonl, routing audit, CSV/JSONL export).
backend/tests/test_model_routing.py — 21 tests (mock providers), all green.
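
A minimal sketch of the precedence rule listed above (by_agent_id > by_role > default), assuming the parsed map is a plain dict with those three sections. The ModelPolicy fields, the resolve_policy name, and the example endpoints are illustrative assumptions; the actual model_router.py may be structured differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelPolicy:
    """Hypothetical policy record; the real ModelPolicy may carry more fields."""
    model: str
    base_url: str


def resolve_policy(model_map: dict, agent_id: int, role: Optional[str] = None) -> ModelPolicy:
    """Pick a policy with precedence by_agent_id > by_role > default."""
    by_agent = model_map.get("by_agent_id", {})
    by_role = model_map.get("by_role", {})
    if agent_id in by_agent:
        entry = by_agent[agent_id]          # most specific rule wins
    elif role is not None and role in by_role:
        entry = by_role[role]               # then the role-level rule
    else:
        entry = model_map["default"]        # otherwise the required default
    return ModelPolicy(model=entry["model"], base_url=entry["base_url"])


# Illustrative map only; model names and URLs are placeholders.
EXAMPLE_MAP = {
    "default": {"model": "model-default", "base_url": "http://localhost:8000/v1"},
    "by_role": {"moderator": {"model": "model-role", "base_url": "http://localhost:8001/v1"}},
    "by_agent_id": {0: {"model": "model-agent", "base_url": "http://localhost:8002/v1"}},
}
```

Calling resolve_policy(EXAMPLE_MAP, 0, "moderator") selects the by_agent_id entry even though a role rule also matches, which is the behavior the precedence order implies.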

Note for @Joacocade: the spike's inline agent_configs[].llm_* routing was superseded by the agent_model_map.yaml router — exactly the S2 replacement the spike called for. Your spike commit and evidence docs are preserved in this branch's history.

Config files (canonical vs example)

Issue #21 lists agent_model_map.yaml and configs/model_map_example.yaml as separate deliverables — they coexist by design:

configs/model_map_example.yaml — annotated template. Copy it, don't run it.
agent_model_map.yaml (user-provided, passed via --model-map) — the canonical runtime config for a real run. Lives wherever the researcher puts it; the runner takes the path as an argument.
runs/smoke_multimodel/agent_model_map.yaml + agent_model_map.gemini.yaml — frozen smoke evidence, not meant for reuse.

Documented under "Canonical config file" in docs/multimodel_agents.md.
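
Because the runtime map is user-provided, a check of the kind attributed to validate_model_map (rejecting literal api_key values so secrets come from the environment) could look roughly like the sketch below. The function names and the api_key_env field are assumptions made for illustration; they are not confirmed by the PR.

```python
import os


def find_literal_secrets(node, path=""):
    """Return the paths of any literal `api_key` entries in a parsed map."""
    hits = []
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}" if path else str(key)
            if key == "api_key":
                hits.append(child)
            hits.extend(find_literal_secrets(value, child))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            hits.extend(find_literal_secrets(item, f"{path}[{index}]"))
    return hits


def resolve_api_key(entry, environ=os.environ):
    """Read the secret from the env var named by `api_key_env` (hypothetical field)."""
    name = entry["api_key_env"]
    if name not in environ:
        raise KeyError(f"environment variable {name} is not set")
    return environ[name]
```

A validator built on this would raise when find_literal_secrets returns a non-empty list, so a map that embeds a key fails before any backend is built.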

Telemetry schema coverage (#21)

Every field the issue requires is present in llm_telemetry.jsonl and carried through to telemetry.csv: agent_id, role, provider, model, prompt_hash, response_hash, tokens_in, tokens_out, latency_ms, round, cost_usd_est, temperature, output_valid_json (JSON validation), error, leak_flags. Retries are explicitly not a separate field by design (SDK-internal, below the instrumented call) — documented as stable in "Telemetry record schema" + "Scope & limitations". Full field-by-field mapping in docs/multimodel_agents.md.
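
To make the field list concrete, here is a rough sketch of how one record with those names could be assembled. SHA-256 for the hashes, per-1k-token prices, and the build_record helper itself are assumptions; the real llm_telemetry.py and configs/model_prices.yaml may use different conventions.

```python
import hashlib
import json


def sha256_hex(text: str) -> str:
    """Hash text so prompts/responses can be compared without storing them."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def is_valid_json(text: str) -> bool:
    """True when the response parses as JSON."""
    try:
        json.loads(text)
    except (TypeError, ValueError):
        return False
    return True


def build_record(*, agent_id, role, provider, model, prompt, response,
                 tokens_in, tokens_out, latency_ms, round_no, temperature,
                 usd_per_1k_in=0.0, usd_per_1k_out=0.0, error=None, leak_flags=()):
    """Assemble one telemetry row using the field names listed above."""
    # Assumed pricing unit: USD per 1,000 tokens, split by direction.
    cost = tokens_in / 1000 * usd_per_1k_in + tokens_out / 1000 * usd_per_1k_out
    return {
        "agent_id": agent_id,
        "role": role,
        "provider": provider,
        "model": model,
        "prompt_hash": sha256_hex(prompt),
        "response_hash": sha256_hex(response),
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
        "latency_ms": latency_ms,
        "round": round_no,
        "cost_usd_est": round(cost, 8),
        "temperature": temperature,
        "output_valid_json": is_valid_json(response),
        "error": error,
        "leak_flags": list(leak_flags),
    }
```

Each record serializes with json.dumps, so appending one line per call yields a JSONL file with a stable key set.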

Scope & limitations

run_parallel_simulation.py is not wired (single-model per platform); documented in docs/multimodel_agents.md.
SDK-internal retries are below the instrumented call: one telemetry row per top-level call.
experiment_runner.py integration (populating the ExperimentResult multimodel fields) is deferred until issue #20 ("S2: Wiki-backed Report Memory para auditoría temporal") merges via PR #23 ("feat(issue-20): S2 Wiki-backed Report Memory"); it will land as an isolated commit after rebase.
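
The "one telemetry row per top-level call" limitation can be pictured with a small wrapper. This is a sketch under assumptions: InstrumentedBackend and FlakyBackend are hypothetical stand-ins, the wrapped object only needs a run(...) method, and the real wrapper also handles arun and token usage.

```python
import time


class InstrumentedBackend:
    """Wraps any object exposing run(...) and logs one row per outer call."""

    def __init__(self, inner, sink, model):
        self._inner = inner
        self._sink = sink      # list-like sink; the real one writes JSONL
        self._model = model

    def run(self, *args, **kwargs):
        start = time.perf_counter()
        error = None
        try:
            return self._inner.run(*args, **kwargs)
        except Exception as exc:
            error = repr(exc)
            raise
        finally:
            # Runs exactly once per outer call, whatever happened inside.
            self._sink.append({
                "model": self._model,
                "latency_ms": (time.perf_counter() - start) * 1000.0,
                "error": error,
            })


class FlakyBackend:
    """Fake backend that retries internally, as an SDK might."""

    def __init__(self, failures_before_success):
        self.attempts = 0
        self._failures = failures_before_success

    def run(self, prompt):
        while True:
            self.attempts += 1
            if self.attempts > self._failures:
                return f"echo: {prompt}"
```

With FlakyBackend(2) wrapped, a single run("hi") makes three inner attempts but appends only one row, which is why retries do not show up as a separate field.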

How to test

cd backend && env -u PYTHONPATH .venv/bin/python -m pytest tests/test_model_routing.py -v   # 21 passed
python scripts/export_telemetry.py --input <sim_dir> --out-csv out.csv --out-summary out.jsonl
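
For readers who want to see what a stdlib-only export of this kind involves, the sketch below aggregates JSONL telemetry lines per model and renders a CSV summary. It is not the code of scripts/export_telemetry.py; the function names and the chosen summary columns are assumptions.

```python
import csv
import io
import json
from collections import defaultdict


def summarize_by_model(jsonl_lines):
    """Aggregate call counts, tokens, cost, and errors per model."""
    totals = defaultdict(lambda: {"calls": 0, "tokens_in": 0, "tokens_out": 0,
                                  "cost_usd_est": 0.0, "errors": 0})
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log
        rec = json.loads(line)
        row = totals[rec["model"]]
        row["calls"] += 1
        row["tokens_in"] += rec.get("tokens_in") or 0
        row["tokens_out"] += rec.get("tokens_out") or 0
        row["cost_usd_est"] += rec.get("cost_usd_est") or 0.0
        if rec.get("error"):
            row["errors"] += 1
    return dict(totals)


def summary_to_csv(summary):
    """Render the per-model summary as CSV text, sorted by model name."""
    fields = ["model", "calls", "tokens_in", "tokens_out", "cost_usd_est", "errors"]
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    for model in sorted(summary):
        writer.writerow({"model": model, **summary[model]})
    return buf.getvalue()
```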

🤖 Generated with Claude Code

LucasErcolano · 2026-05-22T20:52:52Z

Review against issue #8:

Verdict: it meets the requirements for an S1 spike quite well. I did not merge anything.

Acceptance checklist:

Technical note on where the LLM is instantiated/used: OK. The docs explain usage via the OpenAI-compatible API and the limitation that OASIS generate_reddit_agent_graph(...) takes a single model.
Minimal per-agent model config: OK. llm_model, llm_base_url and llm_api_key are added to agent_configs[].
Test run with 2 different models/configs: OK according to the documented evidence. There is a run with Qwen AWQ and Mistral AWQ, exit code 0, with no timeout and no OASIS/CAMEL error.
Auditable logs per agent/model: OK for S1, although it does not log each individual response. model_routing_audit.jsonl records the agent_id -> model/base_url mapping for audit.
Limitations/S2: OK. It is clear that this is Reddit-only, with no UI, dynamic routing, fallback, or provider management, and an OASIS extension along the lines of model_by_agent_id is proposed.

Local checks I ran:

python -m py_compile backend/scripts/run_reddit_simulation.py: OK.
Latest PR Hygiene check: success.

Minor observations:

The local copy of the OASIS graph construction is justified for S1, but it should not keep growing this way in S2.
It only covers the Reddit runner, not Twitter/parallel.
If an agent_id arrived as a string in the config, the mapping might fail to match against the integer range(agent_count). This probably does not affect the current config.
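
The string-versus-integer concern in the last observation can be guarded against by normalizing IDs before the lookup. The helper below is a hypothetical sketch, not code from the spike; it assumes agent IDs are meant to be non-negative integers matching range(agent_count).

```python
def normalize_agent_id(raw):
    """Coerce an agent_id from config into an int, rejecting ambiguous values."""
    if isinstance(raw, bool) or not isinstance(raw, (int, str)):
        raise ValueError(f"agent_id {raw!r} must be an int or a numeric string")
    try:
        value = int(raw)
    except ValueError as exc:
        raise ValueError(f"agent_id {raw!r} is not an integer") from exc
    if value < 0:
        raise ValueError(f"agent_id {raw!r} must be non-negative")
    return value


def normalize_routing(routing):
    """Return a copy of an agent_id -> config mapping keyed by int."""
    normalized = {}
    for raw_key, config in routing.items():
        key = normalize_agent_id(raw_key)
        if key in normalized:
            raise ValueError(f"duplicate agent_id {key}")
        normalized[key] = config
    return normalized
```

After normalization, a key written as "0" in the config matches the integer 0 produced by range(agent_count), and a config that lists both "0" and 0 fails loudly instead of silently shadowing one entry.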

Conclusion: for the scope of issue #8, I consider it ready to merge once the maintainer gives the OK.

Route OASIS simulation agents to different LLMs via a YAML model map and record per-call telemetry (tokens, latency, estimated cost) so every agent action is traceable to the model that produced it. Fully opt-in via --model-map; single-model behavior is unchanged without it.

- model_router.py: load/validate model map, resolve ModelPolicy per agent (precedence by_agent_id > by_role > default), lazy CAMEL backend build. Secrets via env only (literal api_key rejected); fallback off by default.
- llm_telemetry.py: instrument the CAMEL backend INSTANCE (run/arun), not LLMClient, which is not in the agent LLM path, writing one JSONL record per call with cost estimation and leak flags.
- run_reddit_simulation.py: --model-map flag, per-agent routed backends, redacted model_routing_audit.jsonl, round-stamped telemetry.
- scripts/export_telemetry.py: standalone CSV + summary export (stdlib only).
- configs/model_map_example.yaml + model_prices.yaml, runs/smoke_multimodel/ recipe, docs/multimodel_agents.md.
- tests/test_model_routing.py: 21 tests (validation, precedence, secrets, cost, telemetry wrapper).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…LM telemetry (#21)

Supersedes the spike's inline agent_configs llm_* routing with the configurable agent_model_map.yaml router + per-call telemetry, as the spike itself called for. Spike evidence docs are preserved.

# Conflicts:
#	backend/scripts/run_reddit_simulation.py

- run_parallel_simulation.py is single-model per platform (not wired): concurrent platforms make a shared sink.current_round racy; full wiring needs per-platform sinks/round contexts.
- SDK-internal retries are below the instrumented run()/arun(): one telemetry row per top-level call (final usage or final error).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
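
One way to picture the "per-platform round contexts" mentioned in this commit is with contextvars, where each concurrent task sees its own round value instead of a shared attribute. This is only a sketch under the assumption that platforms run as asyncio tasks; platform_loop, run_platforms, and the platform names are hypothetical and do not describe run_parallel_simulation.py.

```python
import asyncio
import contextvars

# Each asyncio task gets its own copy of the context, so set() in one
# platform loop is not visible to the other.
current_round = contextvars.ContextVar("current_round", default=None)


async def platform_loop(name, rounds, records):
    for round_no in range(rounds):
        current_round.set(round_no)
        await asyncio.sleep(0)  # yield so the other platform can interleave
        # A telemetry sink would read the round here, at call time.
        records.append((name, round_no, current_round.get()))


async def run_platforms():
    records = []
    await asyncio.gather(
        platform_loop("reddit", 3, records),
        platform_loop("twitter", 2, records),
    )
    return records
```

With a single shared attribute, the value read after the await could belong to the other platform's round; with a context variable per task, each row keeps the round that its own loop set.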

…#21)

Closes the issue's 'Smoke run con 2 modelos reales' checkbox:
- 18 LLM calls, 9 per model; agents 0-9 -> gemini-2.5-flash-lite (by_agent_id), default -> gemini-3.1-flash-lite
- every call traceable to (model, provider, tokens, cost, round) in llm_telemetry.jsonl; routing audit + CSV/JSONL export committed
- adds the no-GPU variant (any multi-model OpenAI-compatible endpoint) alongside the original local-vLLM recipe

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

LucasErcolano · 2026-06-07T15:43:13Z

Review of the S2 update against #21:

The PR has come a long way and already contains most of what issue #21 asks for: configurable model routing, per-call telemetry, docs, tests, and a smoke run with 2 real models. However, the following points are still missing or should be fixed before it can be considered a complete close:

Red CI from PR Hygiene: the workflow expects the exact section ## Linked issue, but the body uses ## Linked issues. Change the heading to the singular:

## Linked issue

and leave below it:
- Closes #8
- Closes #21
agent_model_map.yaml artifact: the issue explicitly asks for agent_model_map.yaml. The PR has configs/model_map_example.yaml and also runs/smoke_multimodel/agent_model_map.yaml. Make clear in the docs/body which file is the canonical configuration for a real run and which is only an example. If appropriate, add a canonical agent_model_map.yaml or rename/document the mapping.
Very large scope: the PR now mixes the original S1, the S2 feature from #21, and many artifacts from cases/PILOT-ARG-2025-Q1. The files relevant to #21 are fine (model_router.py, llm_telemetry.py, test_model_routing.py, docs/multimodel_agents.md, configs/model_map_example.yaml, runs/smoke_multimodel/, scripts/export_telemetry.py), but the artifacts that are not directly part of multi-model agents/observability should be split out or explicitly justified.
Smoke run with 2 real models: the documented smoke run against the Gemini endpoint looks valid for meeting "2 real models" without a local GPU, but state explicitly that this smoke run is the final S2 evidence and that the models used are audited in model_routing_audit.jsonl + llm_telemetry.jsonl + telemetry.csv.
Reproducibility pack: #21 asks for 1 command to run a case, 1 command to compare/export results, config validation, a standard runs structure, and a README for researchers. The docs cover a good part of this, but an explicit checklist section with those exact commands should be added to docs/multimodel_agents.md or runs/smoke_multimodel/README.md.
Telemetry schema: verify/document that llm_telemetry.jsonl includes everything required: agent_id, role, provider, model, prompt_hash, response_hash, tokens_in, tokens_out, latency_ms, round, cost_usd_est, temperature, errors/retries, JSON validation, and leak flags. If a field does not apply, state that explicitly and keep it stable.
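
The heading mismatch in the first point is the kind of thing an exact-match check catches. The sketch below is a hypothetical illustration of such a check, not the repository's PR Hygiene workflow; the function names and the accepted "Closes #N" line shape are assumptions.

```python
import re

# Matches lines such as "Closes #8" or "- Closes #21" (case-insensitive).
CLOSES_LINE = re.compile(r"^\s*(?:[-*]\s*)?closes\s+#(\d+)\s*$", re.IGNORECASE)


def has_exact_heading(body, heading="## Linked issue"):
    """True only when some line equals the heading exactly (no plural form)."""
    return any(line.strip() == heading for line in body.splitlines())


def closing_refs(body):
    """Collect the issue numbers referenced by 'Closes #N' lines."""
    refs = []
    for line in body.splitlines():
        match = CLOSES_LINE.match(line)
        if match:
            refs.append(int(match.group(1)))
    return refs
```

Under this check, a body headed "## Linked issues" fails while "## Linked issue" passes, which matches the failure mode described in the review.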

In summary: functionally it looks very close to meeting #21, but it is not ready while CI is red and until the canonical status of agent_model_map.yaml, the final smoke run, and the reproducibility pack are made clear.

@LucasErcolano

field coverage (#21)

Addresses PR #14 review by @LucasErcolano:
- Canonical config file section: agent_model_map.yaml (runtime) vs configs/model_map_example.yaml (template) vs smoke evidence maps.
- Smoke run section now states the real 2-model run was executed (Gemini, no GPU) and is the final S2 evidence; this fixes the stale "deferred" wording that contradicted README.md.
- Telemetry: explicit Issue #21 required-field coverage table, retries documented as stable (SDK-internal, not a separate field).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

feat: spike per-agent local LLM routing

c62d1e1

LucasErcolano mentioned this pull request May 27, 2026

S2 - Dev 2: Multi-model agents como feature final + Observabilidad #21

Open

28 tasks

LucasErcolano and others added 6 commits June 1, 2026 14:41

chore: add Argentina 2025 pilot case artifacts

d68c4d6

chore: keep pilot artifacts isolated to cases

9010e8b

refactor report agent localization guards

b824bce

elianaostro changed the title ~~S1 spike: heterogeneous local LLM agents~~ S2: Multi-model agents (configurable routing) + LLM observability Jun 7, 2026

Joacocade wants to merge 9 commits into main from codex/s1-local-llm-spike

Joacocade commented May 22, 2026 • edited by elianaostro


3 participants
