
S2: Multi-model agents (configurable routing) + LLM observability #14

Open
Joacocade wants to merge 9 commits into main from codex/s1-local-llm-spike


@Joacocade commented May 22, 2026

Collaborator

Summary

This PR now contains the final S2 feature (issue #21), built on top of the S1 spike it originally introduced.

  • S1 spike (original scope, by @Joacocade): validated that two OASIS Reddit agents can be pinned to two different OpenAI-compatible local model endpoints. Evidence docs preserved in docs/.
  • S2 feature (issue #21, "S2 - Dev 2: Multi-model agents como feature final + Observabilidad"): converts the spike into a configurable, auditable feature, with per-agent/role model routing via agent_model_map.yaml plus mandatory per-call LLM telemetry with CSV/JSONL export.

Linked issue

Closes #8
Closes #21

What changed (S2)

  • backend/app/services/model_router.py — configurable routing with precedence by_agent_id > by_role > default, fallback off by default, secrets via env only (validate_model_map rejects literal api_key in YAML).
  • backend/app/services/llm_telemetry.py — wraps the CAMEL backend instance (the actual agent LLM path; LLMClient is never in it). Per call: tokens in/out, latency, estimated cost, prompt/response hashes, JSON validity, errors, round.
  • backend/scripts/run_reddit_simulation.py: --model-map flag, per-agent routed + instrumented backends, model_routing_audit.jsonl + llm_telemetry.jsonl, round threaded per env.step. Single-model behavior unchanged when no map is given.
  • scripts/export_telemetry.py — standalone (stdlib-only) CSV + JSONL summary export.
  • configs/model_map_example.yaml, configs/model_prices.yaml, docs/multimodel_agents.md, runs/smoke_multimodel/: a real 2-model smoke run was executed (Gemini endpoint, no GPU) with 18 calls, 9 per model; artifacts are committed (llm_telemetry.jsonl, routing audit, CSV/JSONL export).
  • backend/tests/test_model_routing.py — 21 tests (mock providers), all green.
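
For readers skimming the list above, the precedence rule can be illustrated with a minimal sketch. This is not the code in model_router.py: the map layout, the ModelPolicy fields, and the resolve_policy helper are assumptions made for this example. Only the precedence order (by_agent_id > by_role > default) and the rejection of a literal api_key come from the description above.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class ModelPolicy:
    # Hypothetical fields; the real ModelPolicy may differ.
    provider: str
    model: str
    api_key_env: Optional[str] = None  # name of an env var, never the key itself


def validate_model_map(model_map: Mapping[str, Any]) -> None:
    """Reject a map with no default or with a literal api_key anywhere."""
    if "default" not in model_map:
        raise ValueError("model map needs a 'default' entry")

    def walk(node: Any, path: str) -> None:
        if isinstance(node, Mapping):
            for key, value in node.items():
                if key == "api_key":
                    where = path or "<root>"
                    raise ValueError(f"literal api_key under {where}; use api_key_env")
                walk(value, f"{path}.{key}" if path else str(key))

    walk(model_map, "")


def resolve_policy(model_map: Mapping[str, Any], agent_id: int,
                   role: Optional[str]) -> ModelPolicy:
    """Apply the precedence by_agent_id > by_role > default."""
    by_agent = model_map.get("by_agent_id", {})
    by_role = model_map.get("by_role", {})

    entry = by_agent.get(agent_id)
    if entry is None:
        # YAML keys may have been written as strings.
        entry = by_agent.get(str(agent_id))
    if entry is None and role is not None:
        entry = by_role.get(role)
    if entry is None:
        entry = model_map["default"]
    return ModelPolicy(**entry)


# Illustrative map (placeholder model names), as it might look after YAML loading.
EXAMPLE_MAP = {
    "default": {"provider": "openai_compatible", "model": "model-b",
                "api_key_env": "LLM_API_KEY"},
    "by_role": {"moderator": {"provider": "openai_compatible", "model": "model-c"}},
    "by_agent_id": {0: {"provider": "openai_compatible", "model": "model-a"}},
}

validate_model_map(EXAMPLE_MAP)
```

With this map, agent 0 resolves to model-a even if its role is moderator, any other moderator resolves to model-c, and everything else falls through to the default.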

Note for @Joacocade: the spike's inline agent_configs[].llm_* routing was superseded by the agent_model_map.yaml router — exactly the S2 replacement the spike called for. Your spike commit and evidence docs are preserved in this branch's history.

Config files (canonical vs example)

Issue #21 lists agent_model_map.yaml and configs/model_map_example.yaml as separate deliverables — they coexist by design:

  • configs/model_map_example.yaml — annotated template. Copy it, don't run it.
  • agent_model_map.yaml (user-provided, passed via --model-map) — the canonical runtime config for a real run. Lives wherever the researcher puts it; the runner takes the path as an argument.
  • runs/smoke_multimodel/agent_model_map.yaml + agent_model_map.gemini.yaml: frozen smoke evidence, not meant for reuse.

Documented under "Canonical config file" in docs/multimodel_agents.md.

Telemetry schema coverage (#21)

Every field the issue requires is present in llm_telemetry.jsonl and carried through to telemetry.csv: agent_id, role, provider, model, prompt_hash, response_hash, tokens_in, tokens_out, latency_ms, round, cost_usd_est, temperature, output_valid_json (JSON validation), error, leak_flags. Retries are explicitly not a separate field by design (SDK-internal, below the instrumented call) — documented as stable in "Telemetry record schema" + "Scope & limitations". Full field-by-field mapping in docs/multimodel_agents.md.
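
To make the schema concrete, here is a rough sketch of how one record with these fields could be assembled around a single call. It is an illustration under assumptions, not the wrapper in llm_telemetry.py: the instrumented_call helper, the meta and prices arguments, the per-1k-token price unit, the list-based sink, and the choice to record errors instead of re-raising them are all hypothetical. The field names are the ones listed above.

```python
import hashlib
import json
import time
from typing import Any, Callable, Dict, List, Tuple

REQUIRED_FIELDS = (
    "agent_id", "role", "provider", "model", "prompt_hash", "response_hash",
    "tokens_in", "tokens_out", "latency_ms", "round", "cost_usd_est",
    "temperature", "output_valid_json", "error", "leak_flags",
)


def _sha256(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def _is_json(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except (TypeError, ValueError):
        return False


def instrumented_call(
    call: Callable[[str], Tuple[str, int, int]],  # returns (text, tokens_in, tokens_out)
    prompt: str,
    *,
    meta: Dict[str, Any],          # agent_id, role, provider, model, temperature, round
    prices: Tuple[float, float],   # assumed unit: USD per 1k input / output tokens
    sink: List[Dict[str, Any]],
) -> Dict[str, Any]:
    """Run one call and append exactly one telemetry record to the sink."""
    text, tokens_in, tokens_out, error = "", 0, 0, None
    start = time.perf_counter()
    try:
        text, tokens_in, tokens_out = call(prompt)
    except Exception as exc:  # recorded here; a real wrapper might re-raise
        error = f"{type(exc).__name__}: {exc}"
    latency_ms = (time.perf_counter() - start) * 1000.0

    price_in, price_out = prices
    record = {
        "agent_id": meta["agent_id"],
        "role": meta["role"],
        "provider": meta["provider"],
        "model": meta["model"],
        "prompt_hash": _sha256(prompt),
        "response_hash": _sha256(text) if error is None else None,
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
        "latency_ms": latency_ms,
        "round": meta["round"],
        "cost_usd_est": tokens_in / 1000 * price_in + tokens_out / 1000 * price_out,
        "temperature": meta["temperature"],
        "output_valid_json": error is None and _is_json(text),
        "error": error,
        "leak_flags": [],  # leak detection is out of scope for this sketch
    }
    sink.append(record)
    return record
```

Hashing the prompt and response keeps the log auditable without storing raw text, and emitting one row per top-level call matches the "retries are not a separate field" decision described above.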

Scope & limitations

How to test

cd backend && env -u PYTHONPATH .venv/bin/python -m pytest tests/test_model_routing.py -v   # 21 passed
python scripts/export_telemetry.py --input <sim_dir> --out-csv out.csv --out-summary out.jsonl
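
As a rough idea of what a stdlib-only export of this kind involves, the sketch below turns JSONL telemetry lines into CSV text and a per-model summary. It is not scripts/export_telemetry.py; the function names and the summary columns are assumptions chosen for illustration.

```python
import csv
import io
import json
from collections import defaultdict
from typing import Any, Dict, Iterable, List


def load_records(lines: Iterable[str]) -> List[Dict[str, Any]]:
    """Parse JSONL lines, skipping blank ones."""
    return [json.loads(line) for line in lines if line.strip()]


def to_csv_text(records: List[Dict[str, Any]]) -> str:
    """Render records as CSV; columns are the sorted union of all keys."""
    fieldnames = sorted({key for record in records for key in record})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)  # missing keys are written as empty cells
    return buffer.getvalue()


def summarize_by_model(records: List[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Aggregate calls, tokens, estimated cost and errors per model."""
    totals: Dict[str, Dict[str, Any]] = defaultdict(
        lambda: {"calls": 0, "tokens_in": 0, "tokens_out": 0,
                 "cost_usd_est": 0.0, "errors": 0}
    )
    for record in records:
        row = totals[record["model"]]
        row["calls"] += 1
        row["tokens_in"] += record.get("tokens_in") or 0
        row["tokens_out"] += record.get("tokens_out") or 0
        row["cost_usd_est"] += record.get("cost_usd_est") or 0.0
        if record.get("error"):
            row["errors"] += 1
    return dict(totals)
```

A per-model summary like this is the kind of check that lets a reader confirm a claim such as "18 calls, 9 per model" directly from llm_telemetry.jsonl.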

🤖 Generated with Claude Code

@LucasErcolano

Owner

Review against issue #8:

Verdict: it meets the requirements for an S1 spike fairly well. I did not merge anything.

Acceptance checklist:

  • Technical note on where the LLM is instantiated/used: OK. The docs explain usage via an OpenAI-compatible API and the limitation that OASIS generate_reddit_agent_graph(...) takes a single model.
  • Minimal per-agent model config: OK. llm_model, llm_base_url and llm_api_key are added to agent_configs[].
  • Test run with 2 different models/configs: OK according to the documented evidence. There is a run with Qwen AWQ and Mistral AWQ, exit code 0, with no timeout and no OASIS/CAMEL error.
  • Auditable logs per agent/model: OK for S1, although this is not logging of each individual response. model_routing_audit.jsonl records the agent_id -> model/base_url mapping.
  • Limitations/S2: OK. It is clear that this is Reddit-only, with no UI, dynamic routing, fallback or provider management, and an OASIS extension along the lines of model_by_agent_id is proposed.

Local checks I ran:

  • python -m py_compile backend/scripts/run_reddit_simulation.py: OK.
  • Latest PR Hygiene check: success.

Minor observations:

  • The local copy of the OASIS graph construction is justified for S1, but it should not keep growing this way in S2.
  • It only covers the Reddit runner; not Twitter/parallel.
  • If an agent_id came in as a string in the config, the mapping might not match against the integer range(agent_count). This probably does not affect the current config.
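
The agent_id type mismatch in the last observation can be illustrated with a short sketch. The normalize_agent_ids helper is hypothetical and not part of the PR; it only shows one way to coerce config keys to integers before comparing them with range(agent_count).

```python
from typing import Any, Dict


def normalize_agent_ids(by_agent_id: Dict[Any, Any]) -> Dict[int, Any]:
    """Coerce agent_id keys to int so "3" and 3 refer to the same agent."""
    normalized: Dict[int, Any] = {}
    for raw_key, value in by_agent_id.items():
        if isinstance(raw_key, bool) or not isinstance(raw_key, (int, str)):
            raise TypeError(f"unsupported agent_id type: {type(raw_key).__name__}")
        key = int(raw_key)  # raises ValueError for non-numeric strings
        if key in normalized:
            raise ValueError(f"duplicate agent_id {key} after normalization")
        normalized[key] = value
    return normalized


# Without normalization a string key silently never matches an integer id.
raw = {"0": "model-a", 1: "model-b"}
missed = [agent_id for agent_id in range(2) if agent_id not in raw]
fixed = normalize_agent_ids(raw)
```

Here missed contains agent 0, because the string key "0" is not equal to the integer 0, while fixed maps both agents as intended.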

Conclusion: for the scope of issue #8, I consider it ready to merge once the maintainer gives the OK.

LucasErcolano and others added 6 commits June 1, 2026 14:41
Route OASIS simulation agents to different LLMs via a YAML model map and
record per-call telemetry (tokens, latency, estimated cost) so every agent
action is traceable to the model that produced it. Fully opt-in via
--model-map; single-model behavior is unchanged without it.

- model_router.py: load/validate model map, resolve ModelPolicy per agent
  (precedence by_agent_id > by_role > default), lazy CAMEL backend build.
  Secrets via env only (literal api_key rejected); fallback off by default.
- llm_telemetry.py: instrument the CAMEL backend INSTANCE (run/arun) — not
  LLMClient, which is not in the agent LLM path — writing one JSONL record
  per call with cost estimation and leak flags.
- run_reddit_simulation.py: --model-map flag, per-agent routed backends,
  redacted model_routing_audit.jsonl, round-stamped telemetry.
- scripts/export_telemetry.py: standalone CSV + summary export (stdlib only).
- configs/model_map_example.yaml + model_prices.yaml, runs/smoke_multimodel/
  recipe, docs/multimodel_agents.md.
- tests/test_model_routing.py: 21 tests (validation, precedence, secrets,
  cost, telemetry wrapper).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…LM telemetry (#21)

Supersedes the spike's inline agent_configs llm_* routing with the
configurable agent_model_map.yaml router + per-call telemetry, as the
spike itself called for. Spike evidence docs are preserved.

# Conflicts:
#	backend/scripts/run_reddit_simulation.py
- run_parallel_simulation.py is single-model per platform (not wired):
  concurrent platforms make a shared sink.current_round racy; full wiring
  needs per-platform sinks/round contexts.
- SDK-internal retries are below the instrumented run()/arun(): one
  telemetry row per top-level call (final usage or final error).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
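
The race noted in the first limitation above, and the per-platform sink idea suggested as the fix, can be sketched as follows. The TelemetrySink class, the platform names, and the run_platform loop are assumptions for illustration; only the notion of a current_round stamped onto each record comes from the commit message.

```python
import asyncio
from typing import Any, Dict, List


class TelemetrySink:
    """One sink per platform, so each platform owns its own current_round."""

    def __init__(self, platform: str) -> None:
        self.platform = platform
        self.current_round = 0
        self.records: List[Dict[str, Any]] = []

    def record(self, agent_id: int) -> None:
        self.records.append(
            {"platform": self.platform, "agent_id": agent_id,
             "round": self.current_round}
        )


async def run_platform(sink: TelemetrySink, rounds: int, agents: int) -> None:
    for round_index in range(1, rounds + 1):
        sink.current_round = round_index
        for agent_id in range(agents):
            # Stand-in for awaiting an LLM call; yields to the other platform.
            await asyncio.sleep(0)
            sink.record(agent_id)


async def main() -> Dict[str, TelemetrySink]:
    sinks = {name: TelemetrySink(name) for name in ("reddit", "twitter")}
    await asyncio.gather(
        run_platform(sinks["reddit"], rounds=3, agents=2),
        run_platform(sinks["twitter"], rounds=5, agents=2),
    )
    return sinks
```

If both coroutines wrote to a single shared sink, a record awaited during one platform's round could be stamped with the other platform's round number; separate sinks avoid that without any locking.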
@elianaostro changed the title from "S1 spike: heterogeneous local LLM agents" to "S2: Multi-model agents (configurable routing) + LLM observability" on Jun 7, 2026
…#21)

Closes the issue's 'Smoke run con 2 modelos reales' checkbox:
- 18 LLM calls, 9 per model; agents 0-9 -> gemini-2.5-flash-lite
  (by_agent_id), default -> gemini-3.1-flash-lite
- every call traceable to (model, provider, tokens, cost, round) in
  llm_telemetry.jsonl; routing audit + CSV/JSONL export committed
- adds the no-GPU variant (any multi-model OpenAI-compatible endpoint)
  alongside the original local-vLLM recipe

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@LucasErcolano

Owner

Review of the S2 update against #21:

The PR has come a long way and already contains most of what issue #21 asks for: configurable model routing, per-call telemetry, docs, tests and a smoke run with 2 real models. However, the following points are still missing or should be fixed before this can be considered a complete close-out:

  • CI is red on PR Hygiene: the workflow expects the exact section ## Linked issue, but the body uses ## Linked issues. Change the heading to the singular:

    ## Linked issue

    and leave below it:

    • Closes #8
    • Closes #21
  • agent_model_map.yaml artifact: the issue explicitly asks for agent_model_map.yaml. The PR has configs/model_map_example.yaml and also runs/smoke_multimodel/agent_model_map.yaml. Make clear in the docs/body which file is the canonical configuration for a real run and which is only an example. If appropriate, add a canonical agent_model_map.yaml or rename/document the mapping.

  • Very large scope: the PR now mixes the original S1, the S2 feature from #21 and many artifacts from cases/PILOT-ARG-2025-Q1. The files relevant to #21 are fine (model_router.py, llm_telemetry.py, test_model_routing.py, docs/multimodel_agents.md, configs/model_map_example.yaml, runs/smoke_multimodel/, scripts/export_telemetry.py), but the artifacts that are not a direct part of multi-model agents/observability should be split out or explicitly justified.

  • Smoke run with 2 real models: the documented smoke run against a Gemini endpoint looks valid for satisfying "2 real models" without a local GPU, but state explicitly that this smoke run is the final S2 evidence and that the models used are audited in model_routing_audit.jsonl + llm_telemetry.jsonl + telemetry.csv.

  • Reproducibility pack: #21 asks for 1 command to run a case, 1 command to compare/export results, config validation, a standard run directory structure and a README for researchers. The docs cover a good part of this, but it would help to add an explicit checklist section in docs/multimodel_agents.md or runs/smoke_multimodel/README.md with those exact commands.

  • Telemetry schema: verify/document that llm_telemetry.jsonl includes everything required: agent_id, role, provider, model, prompt_hash, response_hash, tokens_in, tokens_out, latency_ms, round, cost_usd_est, temperature, errors/retries, JSON validation and leak flags. If any field does not apply, state that explicitly and keep it stable.

In summary: functionally it looks very close to satisfying #21, but it is not ready while CI is red and while the canonical status of agent_model_map.yaml, the final smoke run and the reproducibility pack remain unclear.

 field coverage (#21)

Addresses PR #14 review by @LucasErcolano:
- Canonical config file section: agent_model_map.yaml (runtime) vs
  configs/model_map_example.yaml (template) vs smoke evidence maps.
- Smoke run section now states the real 2-model run was executed
  (Gemini, no GPU) and is the final S2 evidence — fixes the stale
  "deferred" wording that contradicted README.md.
- Telemetry: explicit Issue #21 required-field coverage table,
  retries documented as stable (SDK-internal, not a separate field).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


Development

Successfully merging this pull request may close these issues.

S2 - Dev 2: Multi-model agents como feature final + Observabilidad
S1 - Spike: heterogeneous LLM agents in one simulation run

3 participants