From 30a607ce0635c455931295c92f5d3478c5b2f577 Mon Sep 17 00:00:00 2001
From: Cayenne226 <cayenne226@github.com>
Date: Fri, 8 May 2026 21:17:10 +0800
Subject: [PATCH 1/3] feat: add Cayenne226 member profile and OrchestratorR1
 project (AAAI 2027)

---
 lab/MEMBERS.md                      |  2 +-
 members/cayenne/ME.md               | 34 +++++++++++++++
 projects/OrchestratorR1/GOAL.md     | 27 ++++++++++++
 projects/OrchestratorR1/PROGRESS.md | 65 +++++++++++++++++++++++++++++
 projects/OrchestratorR1/PROJECT.md  | 36 ++++++++++++++++
 5 files changed, 163 insertions(+), 1 deletion(-)
 create mode 100644 members/cayenne/ME.md
 create mode 100644 projects/OrchestratorR1/GOAL.md
 create mode 100644 projects/OrchestratorR1/PROGRESS.md
 create mode 100644 projects/OrchestratorR1/PROJECT.md

diff --git a/lab/MEMBERS.md b/lab/MEMBERS.md
index 47ea3ff..f135519 100644
--- a/lab/MEMBERS.md
+++ b/lab/MEMBERS.md
@@ -5,7 +5,7 @@
 
 | Name | Role | Responsibilities | Projects | Can Help With | Joined |
 |------|------|-----------------|----------|---------------|--------|
-| <!-- Your Name --> | Admin | <!-- Your responsibilities --> | <!-- project-a, project-b --> | <!-- Your expertise --> | <!-- YYYY-MM --> |
+| Cayenne | Member | RL training, evaluation, paper writing | OrchestratorR1 | GRPO/PPO training, API-only eval pipelines, Qwen2.5 series | 2026-05 |
 
 ## Roles
 
diff --git a/members/cayenne/ME.md b/members/cayenne/ME.md
new file mode 100644
index 0000000..629c360
--- /dev/null
+++ b/members/cayenne/ME.md
@@ -0,0 +1,34 @@
+# Cayenne
+
+## Research Area
+
+Reinforcement Learning for LLM Agents, Multi-Agent Orchestration, QA / Reasoning
+
+## Current Projects
+
+- [OrchestratorR1](../../projects/OrchestratorR1/) — Lead, RL training and evaluation (targeting AAAI 2027)
+
+## AI Tool
+
+- [ ] OpenAI Codex CLI
+- [x] Claude Code
+
+## Expertise & Skills
+
+- **Languages:** Python, Bash
+- **Frameworks:** PyTorch, Hugging Face Transformers, trl, veRL
+- **Domains:** LLM Training (SFT/GRPO/PPO), Multi-Agent Systems, QA Benchmarking
+
+## Hardware
+
+- Local: RTX 3060 Laptop (API-only tasks, no heavy training)
+- Remote: A100 access pending (needed for Phase 2 training)
+
+## Sharing
+
+- **API-only eval pipelines**: ReAct, Self-Reflection, Direct baselines — can help set these up quickly
+- **veRL / trl training**: Familiar with GRPO/PPO on Qwen2.5 series
+
+## Links
+
+- GitHub: [Cayenne226](https://github.com/Cayenne226)
diff --git a/projects/OrchestratorR1/GOAL.md b/projects/OrchestratorR1/GOAL.md
new file mode 100644
index 0000000..1aeeac8
--- /dev/null
+++ b/projects/OrchestratorR1/GOAL.md
@@ -0,0 +1,27 @@
+# OrchestratorR1 — Goal
+
+## Target Venue
+**AAAI 2027**
+- Abstract deadline: early August 2026
+- Full paper deadline: mid-August 2026
+- Window from today (2026-05-06): ~3 months
+
+> Previously targeted NeurIPS 2026 (deadline in May 2026). Switched to AAAI 2027 to allow proper completion of RL training and ablations.
+
+## Research Question
+Can a base LLM be trained, via RL, to **orchestrate** multiple specialized agents (heterogeneous LLM routes) and aggregate their outputs into a single high-quality answer — outperforming both single-model baselines and fixed pipelines, while staying cost-aware?
+
+## Hypotheses
+1. **H1 — RL > heuristics:** GRPO-trained orchestrator beats Fixed-Pipeline and Direct-* baselines on multi-hop QA + GPQA.
+2. **H2 — Cost-aware policy:** A non-zero `cost_coe` produces orchestration policies on the cost/quality Pareto frontier without large accuracy loss.
+3. **H3 — Generalization:** Policy trained on NQ + HotpotQA transfers to held-out 2WikiMultihop / MuSiQue / Bamboogle / GPQA.
+
+## Success Criteria
+- Beat strongest baseline (Direct-GPT-4o or Fixed-Pipeline) by ≥3 EM on average across QA suite.
+- Demonstrate cost-quality Pareto curve via `cost_coe` sweep.
+- Provide ablations: SFT-only vs SFT+GRPO, w/ vs w/o cost penalty, max_turns sensitivity.
+
+## Non-Goals
+- New base model architecture.
+- Beating SOTA on any single dataset in absolute terms.
+- Production-grade serving infrastructure.
diff --git a/projects/OrchestratorR1/PROGRESS.md b/projects/OrchestratorR1/PROGRESS.md
new file mode 100644
index 0000000..91a901c
--- /dev/null
+++ b/projects/OrchestratorR1/PROGRESS.md
@@ -0,0 +1,65 @@
+# OrchestratorR1 — Progress
+
+> **Live status doc.** Update when tasks complete or blockers change. Last updated: 2026-05-08.
+> Target: **AAAI 2027** — Abstract early Aug, Full paper mid-Aug 2026.
+
+## Snapshot
+- **Done:** 7 / 33 (added B5 ReAct, B6 Self-Reflection — 2026-05-08)
+- **In progress:** agentlab git sync
+- **Blocked by:** A100 access (only an RTX 3060 Laptop GPU is available locally)
+
+## 3-Month Plan (May → Aug 2026)
+
+### Phase 1 — May 2026: Unblock + finish CPU/API-only work
+| ID | Task | Status | Notes |
+|---|---|---|---|
+| T1.1–T1.4, T1.7 | Data prep (all datasets) | ✅ done | parquet ready |
+| B1 | Direct-GPT-4o baseline | ✅ done | QA + GPQA |
+| B2 | Direct-Strong baseline | ✅ done | |
+| B3 | Direct-Cheap baseline | ✅ done | |
+| B4 | Fixed-Pipeline baseline | ✅ done | |
+| B5 | ReAct baseline (`eval_react.py`) | ✅ done | EM=0.199 F1=0.308 avg_turns=4.3 avg_cost=$0.00070 (3000 QA) |
+| B6 | Self-Reflection baseline | ✅ done | EM=0.323 F1=0.445 avg_turns=5.0 avg_cost=$0.00553 (3000 QA) |
+| B7 | Conductor / Router-R1 reference data | ⬜ todo | |
+| INF1 | Secure A100 access (cloud / lab cluster) | 🔴 blocker | unblocks Phase 2 |
+| W1 | Re-scope writing for AAAI format (7+2 pages) | 🟡 deferred | start in late Jul |
+
+### Phase 2 — Jun–Jul 2026: Core training + ablations
+- SFT warmup on `data/sft_warmup.jsonl` (7B, QLoRA + DDP fallback if needed)
+- GRPO main training on QA suite
+- Ablations: cost_coe ∈ {0, 0.01, 0.1}, max_turns ∈ {2, 4, 6}, SFT-only vs SFT+GRPO
+- Held-out eval: 2Wiki / MuSiQue / Bamboogle / GPQA
+- Cost-quality Pareto plot
+
+### Phase 3 — Early Aug 2026: Writing sprint
+- Adapt `intro_draft_aaai2027*.md` to AAAI 7+2 page format
+- Method, experiments, related work polish
+- Submit abstract → submit full paper
+
+## Active Blockers
+1. **A100 access** — every downstream training task depends on this. Action: compare cloud options (RunPod / Vast / Lambda) against a lab cluster reservation.
+
+## Recent Changes (from git)
+- Switched 7B LoRA → QLoRA (4-bit) + DDP to fix OOM on 3090
+- Attempted FSDP `FULL_SHARD` (ZeRO-3-style sharding) to mitigate 7B LoRA OOM
+- H200 7B training mode configured (a6d568d)
+
+## QA Baseline Summary (3000 samples, 2026-05-08)
+
+| Method | EM | F1 | AvgCost | AvgTurns |
+|--------|----|----|---------|----------|
+| Direct-GPT-4o | **0.367** | **0.474** | $0.00041 | 1.0 |
+| Self-Reflection (GPT-4o ×5) | 0.323 | 0.445 | $0.00553 | 5.0 |
+| Direct-Strong | 0.182 | 0.332 | $0.00038 | 1.0 |
+| Direct-Cheap | 0.192 | 0.317 | $0.00001 | 1.0 |
+| ReAct (Qwen2.5-7B) | 0.199 | 0.308 | $0.00070 | 4.3 |
+| Fixed-Pipeline | 0.046 | 0.227 | $0.00255 | 5.1 |
+
+Key takeaways:
+- Self-Reflection costs roughly **13×** as much as Direct-GPT-4o ($0.00553 vs $0.00041) yet scores 4.4 EM points lower (0.323 vs 0.367) → single-model reflection is not worth the cost
+- ReAct (Qwen2.5-7B) averages 4.3 turns but still underperforms Direct-GPT-4o (EM 0.199 vs 0.367) → tool use alone is insufficient without orchestration training
+- OrchestratorR1 target: beat Direct-GPT-4o (EM>0.367) at comparable or lower cost
+
+## Decisions Log
+- **2026-05-06:** Re-targeted NeurIPS 2026 → **AAAI 2027**. Reason: training is still blocked on A100 access; rushing for the May 6 deadline would mean submitting incomplete experiments. The August deadline leaves about 3 months for full RL training + ablations.
+- **2026-05-06:** Writing frozen until late July. The existing `intro_draft_aaai2027*.md` drafts have not yet been rewritten in AAAI style.
diff --git a/projects/OrchestratorR1/PROJECT.md b/projects/OrchestratorR1/PROJECT.md
new file mode 100644
index 0000000..43f4257
--- /dev/null
+++ b/projects/OrchestratorR1/PROJECT.md
@@ -0,0 +1,36 @@
+# OrchestratorR1 — Project Overview
+
+## What
+LLM trained via RL (PPO/GRPO) to act as an **orchestrator**: emits `<search>Model:query</search>` calls to a heterogeneous agent pool, ingests `<information>` responses, and produces a final `<answer>`. Reward = EM/F1 minus optional API cost.
+
+## Codebase
+- Path: `f:/data/code/Router-R1/OrchestratorR1/`
+- Built on Router-R1 (NeurIPS 2025), which is itself built on veRL (ByteDance).
+- Key modules:
+  - `orchestrator_r1/orchestrator/generation.py` — multi-turn rollout loop
+  - `orchestrator_r1/orchestrator/parser.py` — `<search>` / `<answer>` parsing
+  - `orchestrator_r1/agent_pool/agent_registry.py` — symbolic name → API endpoint + pricing
+  - `training/train.py` — PPO/GRPO orchestration (Ray + FSDP)
+  - `eval/eval_orchestrator.py`, `eval/eval_react.py`, `eval/baselines.py`
+
+## Stack
+veRL · vLLM · Ray · FSDP · PPO/GRPO · wandb · Hydra
+
+## Datasets
+NQ, TriviaQA, PopQA, HotpotQA, 2WikiMultihop, MuSiQue, Bamboogle, GPQA.
+
+## Agent Pool (current)
+| Symbol | Backing model | $/1M tok |
+|---|---|---|
+| Qwen / Llama-8B / Mistral | gpt-4o-mini | 0.15 |
+| Llama-70B | gpt-4o | 2.50 |
+| Gemma | gemini-2.5-flash | 0.15 |
+| Claude | claude-sonnet-4-6 | 3.00 |
+| Gemini | gemini-2.5-pro | 1.25 |
+
+## Repo-level Docs (source of truth)
+- `RESEARCH_PLAN.md` — full plan
+- `FRAMEWORK.md` — architecture
+- `COMPARISON.md` — vs Router-R1 / ReAct / Self-Reflection
+- `docs/related_work_roadmap.md`
+- `docs/intro_draft_aaai2027*.md` — paper draft (frozen until Aug writing sprint)

From 91b6031be508b2f71e76906e6dc20353609eac5e Mon Sep 17 00:00:00 2001
From: Cayenne226 <cayenne226@github.com>
Date: Fri, 8 May 2026 21:30:04 +0800
Subject: [PATCH 2/3] docs: add code repo link to OrchestratorR1 PROJECT.md

---
 projects/OrchestratorR1/PROJECT.md | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/projects/OrchestratorR1/PROJECT.md b/projects/OrchestratorR1/PROJECT.md
index 43f4257..a854c19 100644
--- a/projects/OrchestratorR1/PROJECT.md
+++ b/projects/OrchestratorR1/PROJECT.md
@@ -1,5 +1,9 @@
 # OrchestratorR1 — Project Overview
 
+## Code Repository
+
+**https://github.com/Cayenne226/OrchestratorR1**
+
 ## What
 LLM trained via RL (PPO/GRPO) to act as an **orchestrator**: emits `<search>Model:query</search>` calls to a heterogeneous agent pool, ingests `<information>` responses, and produces a final `<answer>`. Reward = EM/F1 minus optional API cost.
 

From dfa06908ccfb86e49cde8f5ac322f17246746e7f Mon Sep 17 00:00:00 2001
From: Cayenne226 <cayenne226@github.com>
Date: Fri, 8 May 2026 21:43:27 +0800
Subject: [PATCH 3/3] docs: add HuggingFace dataset link to OrchestratorR1
 PROJECT.md

---
 projects/OrchestratorR1/PROJECT.md | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/projects/OrchestratorR1/PROJECT.md b/projects/OrchestratorR1/PROJECT.md
index a854c19..1968f1f 100644
--- a/projects/OrchestratorR1/PROJECT.md
+++ b/projects/OrchestratorR1/PROJECT.md
@@ -4,6 +4,10 @@
 
 **https://github.com/Cayenne226/OrchestratorR1**
 
+## Dataset
+
+**https://huggingface.co/datasets/Cayenne226/OrchestratorR1-data**
+
 ## What
 LLM trained via RL (PPO/GRPO) to act as an **orchestrator**: emits `<search>Model:query</search>` calls to a heterogeneous agent pool, ingests `<information>` responses, and produces a final `<answer>`. Reward = EM/F1 minus optional API cost.