From 30a607ce0635c455931295c92f5d3478c5b2f577 Mon Sep 17 00:00:00 2001
From: Cayenne226
Date: Fri, 8 May 2026 21:17:10 +0800
Subject: [PATCH 1/3] feat: add Cayenne226 member profile and OrchestratorR1 project (AAAI 2027)

---
 lab/MEMBERS.md                      |  2 +-
 members/cayenne/ME.md               | 34 +++++++++++++++
 projects/OrchestratorR1/GOAL.md     | 27 ++++++++++++
 projects/OrchestratorR1/PROGRESS.md | 65 +++++++++++++++++++++++++++++
 projects/OrchestratorR1/PROJECT.md  | 36 ++++++++++++++++
 5 files changed, 163 insertions(+), 1 deletion(-)
 create mode 100644 members/cayenne/ME.md
 create mode 100644 projects/OrchestratorR1/GOAL.md
 create mode 100644 projects/OrchestratorR1/PROGRESS.md
 create mode 100644 projects/OrchestratorR1/PROJECT.md

diff --git a/lab/MEMBERS.md b/lab/MEMBERS.md
index 47ea3ff..f135519 100644
--- a/lab/MEMBERS.md
+++ b/lab/MEMBERS.md
@@ -5,7 +5,7 @@

 | Name | Role | Responsibilities | Projects | Can Help With | Joined |
 |------|------|-----------------|----------|---------------|--------|
-| | Admin | | | | |
+| Cayenne | Member | RL training, evaluation, paper writing | OrchestratorR1 | GRPO/PPO training, API-only eval pipelines, Qwen2.5 series | 2026-05 |

 ## Roles

diff --git a/members/cayenne/ME.md b/members/cayenne/ME.md
new file mode 100644
index 0000000..629c360
--- /dev/null
+++ b/members/cayenne/ME.md
@@ -0,0 +1,34 @@
+# Cayenne
+
+## Research Area
+
+Reinforcement Learning for LLM Agents, Multi-Agent Orchestration, QA / Reasoning
+
+## Current Projects
+
+- [OrchestratorR1](../../projects/OrchestratorR1/) — Lead, RL training and evaluation (targeting AAAI 2027)
+
+## AI Tool
+
+- [ ] OpenAI Codex CLI
+- [x] Claude Code
+
+## Expertise & Skills
+
+- **Languages:** Python, Bash
+- **Frameworks:** PyTorch, Hugging Face Transformers, trl, veRL
+- **Domains:** LLM Training (SFT/GRPO/PPO), Multi-Agent Systems, QA Benchmarking
+
+## Hardware
+
+- Local: RTX 3060 Laptop (API-only tasks, no heavy training)
+- Remote: A100 access pending (needed for Phase 2 training)
+
+## Sharing
+
+- **API-only eval pipelines**: ReAct, Self-Reflection, Direct baselines — can help set up fast
+- **veRL / trl training**: Familiar with GRPO/PPO on Qwen2.5 series
+
+## Links
+
+- GitHub: [Cayenne226](https://github.com/Cayenne226)
diff --git a/projects/OrchestratorR1/GOAL.md b/projects/OrchestratorR1/GOAL.md
new file mode 100644
index 0000000..1aeeac8
--- /dev/null
+++ b/projects/OrchestratorR1/GOAL.md
@@ -0,0 +1,27 @@
+# OrchestratorR1 — Goal
+
+## Target Venue
+**AAAI 2027**
+- Abstract deadline: early August 2026
+- Full paper deadline: mid-August 2026
+- Window from today (2026-05-06): ~3 months
+
+> Previously targeted NeurIPS 2026 (May 2026). Switched to AAAI 2027 to allow proper completion of RL training and ablations.
+
+## Research Question
+Can a base LLM be trained, via RL, to **orchestrate** multiple specialized agents (heterogeneous LLM routes) and aggregate their outputs into a single high-quality answer — outperforming both single-model baselines and fixed pipelines, while staying cost-aware?
+
+## Hypotheses
+1. **H1 — RL > heuristics:** GRPO-trained orchestrator beats Fixed-Pipeline and Direct-* baselines on multi-hop QA + GPQA.
+2. **H2 — Cost-aware policy:** A non-zero `cost_coe` produces orchestration policies on the cost/quality Pareto frontier without large accuracy loss.
+3. **H3 — Generalization:** Policy trained on NQ + HotpotQA transfers to held-out 2WikiMultihop / MuSiQue / Bamboogle / GPQA.
+
+## Success Criteria
+- Beat strongest baseline (Direct-GPT-4o or Fixed-Pipeline) by ≥3 EM on average across QA suite.
+- Demonstrate cost-quality Pareto curve via `cost_coe` sweep.
+- Provide ablations: SFT-only vs SFT+GRPO, w/ vs w/o cost penalty, max_turns sensitivity.
+
+## Non-Goals
+- New base model architecture.
+- Beating SOTA on any single dataset in absolute terms.
+- Production-grade serving infrastructure.
diff --git a/projects/OrchestratorR1/PROGRESS.md b/projects/OrchestratorR1/PROGRESS.md
new file mode 100644
index 0000000..91a901c
--- /dev/null
+++ b/projects/OrchestratorR1/PROGRESS.md
@@ -0,0 +1,65 @@
+# OrchestratorR1 — Progress
+
+> **Live status doc.** Update when tasks complete or blockers change. Last updated: 2026-05-06.
+> Target: **AAAI 2027** — Abstract early Aug, Full paper mid-Aug 2026.
+
+## Snapshot
+- **Done:** 7 / 33 (added B5 ReAct, B6 Self-Reflection — 2026-05-08)
+- **In progress:** agentlab git sync
+- **Blocked by:** A100 access (user has 3060 Laptop only)
+
+## 3-Month Plan (May → Aug 2026)
+
+### Phase 1 — May 2026: Unblock + finish CPU/API-only work
+| ID | Task | Status | Notes |
+|---|---|---|---|
+| T1.1–T1.4, T1.7 | Data prep (all datasets) | ✅ done | parquet ready |
+| B1 | Direct-GPT-4o baseline | ✅ done | QA + GPQA |
+| B2 | Direct-Strong baseline | ✅ done | |
+| B3 | Direct-Cheap baseline | ✅ done | |
+| B4 | Fixed-Pipeline baseline | ✅ done | |
+| B5 | ReAct baseline (`eval_react.py`) | ✅ done | EM=0.199 F1=0.308 avg_turns=4.3 avg_cost=$0.00070 (3000 QA) |
+| B6 | Self-Reflection baseline | ✅ done | EM=0.323 F1=0.445 avg_turns=5.0 avg_cost=$0.00553 (3000 QA) |
+| B7 | Conductor / Router-R1 reference data | ⬜ todo | |
+| INF1 | Secure A100 access (cloud / lab cluster) | 🔴 blocker | unblocks Phase 2 |
+| W1 | Re-scope writing for AAAI format (7+2 pages) | 🟡 deferred | start in late Jul |
+
+### Phase 2 — Jun–Jul 2026: Core training + ablations
+- SFT warmup on `data/sft_warmup.jsonl` (7B, QLoRA + DDP fallback if needed)
+- GRPO main training on QA suite
+- Ablations: cost_coe ∈ {0, 0.01, 0.1}, max_turns ∈ {2, 4, 6}, SFT-only vs SFT+GRPO
+- Held-out eval: 2Wiki / MuSiQue / Bamboogle / GPQA
+- Cost-quality Pareto plot
+
+### Phase 3 — Early Aug 2026: Writing sprint
+- Adapt `intro_draft_aaai2027*.md` to AAAI 7+2 page format
+- Method, experiments, related work polish
+- Submit abstract → submit full paper
+
+## Active Blockers
+1. **A100 access** — every training task downstream depends on this. Action: investigate cloud (RunPod / Vast / Lambda) vs lab cluster reservation.
+
+## Recent Changes (from git)
+- Switched 7B LoRA → QLoRA (4-bit) + DDP to fix OOM on 3090
+- FSDP → FULL_SHARD (ZeRO-3) attempt for 7B LoRA OOM mitigation
+- H200 7B training mode configured (a6d568d)
+
+## QA Baseline Summary (3000 samples, 2026-05-08)
+
+| Method | EM | F1 | AvgCost | AvgTurns |
+|--------|----|----|---------|----------|
+| Direct-GPT-4o | **0.367** | **0.474** | $0.00041 | 1.0 |
+| Self-Reflection (GPT-4o ×5) | 0.323 | 0.445 | $0.00553 | 5.0 |
+| Direct-Strong | 0.182 | 0.332 | $0.00038 | 1.0 |
+| Direct-Cheap | 0.192 | 0.317 | $0.00001 | 1.0 |
+| ReAct (Qwen2.5-7B) | 0.199 | 0.308 | $0.00070 | 4.3 |
+| Fixed-Pipeline | 0.046 | 0.227 | $0.00255 | 5.1 |
+
+Key takeaways:
+- Self-Reflection costs **13× more** than Direct-GPT-4o yet scores 4.4 EM lower → single-model reflection not worth the cost
+- ReAct (7B) uses 4.3 turns but still underperforms Direct-GPT-4o → tool-use alone insufficient without orchestration training
+- OrchestratorR1 target: beat Direct-GPT-4o (EM>0.367) at comparable or lower cost
+
+## Decisions Log
+- **2026-05-06:** Re-targeted NeurIPS 2026 → **AAAI 2027**. Reason: training still blocked on A100; rushing for May 6 deadline would mean shipping incomplete experiments. Aug deadline gives 3 months for full RL training + ablations.
+- **2026-05-06:** Writing frozen until late July. Existing `intro_draft_aaai2027*.md` not rewritten in AAAI style yet.
diff --git a/projects/OrchestratorR1/PROJECT.md b/projects/OrchestratorR1/PROJECT.md
new file mode 100644
index 0000000..43f4257
--- /dev/null
+++ b/projects/OrchestratorR1/PROJECT.md
@@ -0,0 +1,36 @@
+# OrchestratorR1 — Project Overview
+
+## What
+LLM trained via RL (PPO/GRPO) to act as an **orchestrator**: emits `Model:query` calls to a heterogeneous agent pool, ingests `` responses, and produces a final ``. Reward = EM/F1 minus optional API cost.
+
+## Codebase
+- Path: `f:/data/code/Router-R1/OrchestratorR1/`
+- Built on Router-R1 (NeurIPS 2025) which is built on veRL (Bytedance).
+- Key modules:
+  - `orchestrator_r1/orchestrator/generation.py` — multi-turn rollout loop
+  - `orchestrator_r1/orchestrator/parser.py` — `` / `` parsing
+  - `orchestrator_r1/agent_pool/agent_registry.py` — symbolic name → API endpoint + pricing
+  - `training/train.py` — PPO/GRPO orchestration (Ray + FSDP)
+  - `eval/eval_orchestrator.py`, `eval/eval_react.py`, `eval/baselines.py`
+
+## Stack
+veRL · vLLM · Ray · FSDP · PPO/GRPO · wandb · Hydra
+
+## Datasets
+NQ, TriviaQA, PopQA, HotpotQA, 2WikiMultihop, MuSiQue, Bamboogle, GPQA.
+
+## Agent Pool (current)
+| Symbol | Backing model | $/1M tok |
+|---|---|---|
+| Qwen / Llama-8B / Mistral | gpt-4o-mini | 0.15 |
+| Llama-70B | gpt-4o | 2.50 |
+| Gemma | gemini-2.5-flash | 0.15 |
+| Claude | claude-sonnet-4-6 | 3.00 |
+| Gemini | gemini-2.5-pro | 1.25 |
+
+## Repo-level Docs (source of truth)
+- `RESEARCH_PLAN.md` — full plan
+- `FRAMEWORK.md` — architecture
+- `COMPARISON.md` — vs Router-R1 / ReAct / Self-Reflection
+- `docs/related_work_roadmap.md`
+- `docs/intro_draft_aaai2027*.md` — paper draft (frozen until Aug writing sprint)

From 91b6031be508b2f71e76906e6dc20353609eac5e Mon Sep 17 00:00:00 2001
From: Cayenne226
Date: Fri, 8 May 2026 21:30:04 +0800
Subject: [PATCH 2/3] docs: add code repo link to OrchestratorR1 PROJECT.md

---
 projects/OrchestratorR1/PROJECT.md | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/projects/OrchestratorR1/PROJECT.md b/projects/OrchestratorR1/PROJECT.md
index 43f4257..a854c19 100644
--- a/projects/OrchestratorR1/PROJECT.md
+++ b/projects/OrchestratorR1/PROJECT.md
@@ -1,5 +1,9 @@
 # OrchestratorR1 — Project Overview

+## Code Repository
+
+**https://github.com/Cayenne226/OrchestratorR1**
+
 ## What
 LLM trained via RL (PPO/GRPO) to act as an **orchestrator**: emits `Model:query` calls to a heterogeneous agent pool, ingests `` responses, and produces a final ``. Reward = EM/F1 minus optional API cost.

From dfa06908ccfb86e49cde8f5ac322f17246746e7f Mon Sep 17 00:00:00 2001
From: Cayenne226
Date: Fri, 8 May 2026 21:43:27 +0800
Subject: [PATCH 3/3] docs: add HuggingFace dataset link to OrchestratorR1 PROJECT.md

---
 projects/OrchestratorR1/PROJECT.md | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/projects/OrchestratorR1/PROJECT.md b/projects/OrchestratorR1/PROJECT.md
index a854c19..1968f1f 100644
--- a/projects/OrchestratorR1/PROJECT.md
+++ b/projects/OrchestratorR1/PROJECT.md
@@ -4,6 +4,10 @@

 **https://github.com/Cayenne226/OrchestratorR1**

+## Dataset
+
+**https://huggingface.co/datasets/Cayenne226/OrchestratorR1-data**
+
 ## What
 LLM trained via RL (PPO/GRPO) to act as an **orchestrator**: emits `Model:query` calls to a heterogeneous agent pool, ingests `` responses, and produces a final ``. Reward = EM/F1 minus optional API cost.
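
Review note: the reward described in `PROJECT.md` (EM/F1 minus an optional API cost) could be sketched roughly as below. This is a minimal illustration under stated assumptions, not the code in `training/train.py`: the function names, the SQuAD-style answer normalization, and the linear penalty `cost_coe * api_cost_usd` are all hypothetical. The patch itself only says that the reward is EM/F1 minus an optional cost term and that `cost_coe` is swept over {0, 0.01, 0.1}.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Assumed SQuAD-style normalization: lowercase, strip punctuation and articles."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(pred: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(pred) == normalize(gold))


def token_f1(pred: str, gold: str) -> float:
    """Token-level F1 between normalized prediction and gold answer."""
    p, g = normalize(pred).split(), normalize(gold).split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def reward(pred: str, gold: str, api_cost_usd: float,
           cost_coe: float = 0.0, metric=exact_match) -> float:
    """Hypothetical reward: answer quality minus a linear cost penalty.

    cost_coe = 0 recovers the pure EM/F1 reward (the "w/o cost penalty" ablation).
    """
    return metric(pred, gold) - cost_coe * api_cost_usd
```

With per-query costs in the range reported in the baseline table (about $0.0004 to $0.006), a penalty on raw dollars would be very small for `cost_coe` ≤ 0.1, so the actual implementation may rescale or normalize cost before applying the coefficient; that detail is not stated in this patch.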