2 changes: 1 addition & 1 deletion lab/MEMBERS.md
@@ -5,7 +5,7 @@

| Name | Role | Responsibilities | Projects | Can Help With | Joined |
|------|------|-----------------|----------|---------------|--------|
| <!-- Your Name --> | Admin | <!-- Your responsibilities --> | <!-- project-a, project-b --> | <!-- Your expertise --> | <!-- YYYY-MM --> |
| Cayenne | Member | RL training, evaluation, paper writing | OrchestratorR1 | GRPO/PPO training, API-only eval pipelines, Qwen2.5 series | 2026-05 |

## Roles

34 changes: 34 additions & 0 deletions members/cayenne/ME.md
@@ -0,0 +1,34 @@
# Cayenne

## Research Area

Reinforcement Learning for LLM Agents, Multi-Agent Orchestration, QA / Reasoning

## Current Projects

- [OrchestratorR1](../../projects/OrchestratorR1/) — Lead, RL training and evaluation (targeting AAAI 2027)

## AI Tool

- [ ] OpenAI Codex CLI
- [x] Claude Code

## Expertise & Skills

- **Languages:** Python, Bash
- **Frameworks:** PyTorch, Hugging Face Transformers, trl, veRL
- **Domains:** LLM Training (SFT/GRPO/PPO), Multi-Agent Systems, QA Benchmarking

## Hardware

- Local: RTX 3060 Laptop (API-only tasks, no heavy training)
- Remote: A100 access pending (needed for Phase 2 training)

## Sharing

- **API-only eval pipelines**: ReAct, Self-Reflection, Direct baselines — can help set up fast
- **veRL / trl training**: Familiar with GRPO/PPO on Qwen2.5 series

## Links

- GitHub: [Cayenne226](https://github.com/Cayenne226)
27 changes: 27 additions & 0 deletions projects/OrchestratorR1/GOAL.md
@@ -0,0 +1,27 @@
# OrchestratorR1 — Goal

## Target Venue
**AAAI 2027**
- Abstract deadline: early August 2026
- Full paper deadline: mid-August 2026
- Window from today (2026-05-06): ~3 months

> Previously targeted NeurIPS 2026 (deadline in May 2026). Switched to AAAI 2027 to leave time to complete RL training and ablations properly.

## Research Question
Can a base LLM be trained, via RL, to **orchestrate** multiple specialized agents (heterogeneous LLM routes) and aggregate their outputs into a single high-quality answer — outperforming both single-model baselines and fixed pipelines, while staying cost-aware?

## Hypotheses
1. **H1 — RL > heuristics:** GRPO-trained orchestrator beats Fixed-Pipeline and Direct-* baselines on multi-hop QA + GPQA.
2. **H2 — Cost-aware policy:** A non-zero `cost_coe` produces orchestration policies on the cost/quality Pareto frontier without large accuracy loss.
3. **H3 — Generalization:** Policy trained on NQ + HotpotQA transfers to held-out 2WikiMultihop / MuSiQue / Bamboogle / GPQA.

## Success Criteria
- Beat the strongest baseline (Direct-GPT-4o or Fixed-Pipeline) by ≥3 EM points on average across the QA suite.
- Demonstrate cost-quality Pareto curve via `cost_coe` sweep.
- Provide ablations: SFT-only vs SFT+GRPO, w/ vs w/o cost penalty, max_turns sensitivity.

## Non-Goals
- New base model architecture.
- Beating SOTA on any single dataset in absolute terms.
- Production-grade serving infrastructure.
65 changes: 65 additions & 0 deletions projects/OrchestratorR1/PROGRESS.md
@@ -0,0 +1,65 @@
# OrchestratorR1 — Progress

> **Live status doc.** Update when tasks complete or blockers change. Last updated: 2026-05-08.
> Target: **AAAI 2027** — Abstract early Aug, Full paper mid-Aug 2026.

## Snapshot
- **Done:** 7 / 33 (added B5 ReAct, B6 Self-Reflection — 2026-05-08)
- **In progress:** agentlab git sync
- **Blocked by:** A100 access (only an RTX 3060 Laptop GPU is available locally)

## 3-Month Plan (May → Aug 2026)

### Phase 1 — May 2026: Unblock + finish CPU/API-only work
| ID | Task | Status | Notes |
|---|---|---|---|
| T1.1–T1.4, T1.7 | Data prep (all datasets) | ✅ done | parquet ready |
| B1 | Direct-GPT-4o baseline | ✅ done | QA + GPQA |
| B2 | Direct-Strong baseline | ✅ done | |
| B3 | Direct-Cheap baseline | ✅ done | |
| B4 | Fixed-Pipeline baseline | ✅ done | |
| B5 | ReAct baseline (`eval_react.py`) | ✅ done | EM=0.199 F1=0.308 avg_turns=4.3 avg_cost=$0.00070 (3000 QA) |
| B6 | Self-Reflection baseline | ✅ done | EM=0.323 F1=0.445 avg_turns=5.0 avg_cost=$0.00553 (3000 QA) |
| B7 | Conductor / Router-R1 reference data | ⬜ todo | |
| INF1 | Secure A100 access (cloud / lab cluster) | 🔴 blocker | unblocks Phase 2 |
| W1 | Re-scope writing for AAAI format (7+2 pages) | 🟡 deferred | start in late Jul |

### Phase 2 — Jun–Jul 2026: Core training + ablations
- SFT warmup on `data/sft_warmup.jsonl` (7B, QLoRA + DDP fallback if needed)
- GRPO main training on QA suite
- Ablations: cost_coe ∈ {0, 0.01, 0.1}, max_turns ∈ {2, 4, 6}, SFT-only vs SFT+GRPO
- Held-out eval: 2Wiki / MuSiQue / Bamboogle / GPQA
- Cost-quality Pareto plot
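
The ablation values listed above can be enumerated programmatically. The sketch below assumes a full cross product of `cost_coe` and `max_turns` (9 runs); the plan does not say whether the sweep is a full grid or one-factor-at-a-time, and the `run_name` format is hypothetical.

```python
from itertools import product

# Values copied from the Phase 2 ablation bullet.
COST_COES = [0.0, 0.01, 0.1]
MAX_TURNS = [2, 4, 6]


def ablation_grid():
    """Return one config dict per (cost_coe, max_turns) combination.

    Assumption: a full cross product. The run_name format is illustrative only.
    """
    return [
        {"cost_coe": c, "max_turns": t, "run_name": f"grpo_cc{c}_mt{t}"}
        for c, t in product(COST_COES, MAX_TURNS)
    ]
```

Iterating over `ablation_grid()` yields the configs in a stable order, which makes it easy to map each run to a row in the eventual ablation table. The SFT-only vs SFT+GRPO comparison would be an additional axis not shown here.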

### Phase 3 — Early Aug 2026: Writing sprint
- Adapt `intro_draft_aaai2027*.md` to AAAI 7+2 page format
- Method, experiments, related work polish
- Submit abstract → submit full paper

## Active Blockers
1. **A100 access** — every training task downstream depends on this. Action: investigate cloud (RunPod / Vast / Lambda) vs lab cluster reservation.

## Recent Changes (from git)
- Switched 7B LoRA → QLoRA (4-bit) + DDP to fix OOM on 3090
- Switched the FSDP sharding strategy to FULL_SHARD (ZeRO-3 equivalent) in an attempt to mitigate the 7B LoRA OOM
- H200 7B training mode configured (a6d568d)

## QA Baseline Summary (3000 samples, 2026-05-08)

| Method | EM | F1 | AvgCost | AvgTurns |
|--------|----|----|---------|----------|
| Direct-GPT-4o | **0.367** | **0.474** | $0.00041 | 1.0 |
| Self-Reflection (GPT-4o ×5) | 0.323 | 0.445 | $0.00553 | 5.0 |
| Direct-Strong | 0.182 | 0.332 | $0.00038 | 1.0 |
| Direct-Cheap | 0.192 | 0.317 | $0.00001 | 1.0 |
| ReAct (Qwen2.5-7B) | 0.199 | 0.308 | $0.00070 | 4.3 |
| Fixed-Pipeline | 0.046 | 0.227 | $0.00255 | 5.1 |

Key takeaways:
- Self-Reflection costs about **13.5×** as much as Direct-GPT-4o (AvgCost $0.00553 vs. $0.00041) yet scores 4.4 EM points lower → single-model reflection is not worth the cost
- ReAct (Qwen2.5-7B) averages 4.3 turns but still underperforms Direct-GPT-4o (EM 0.199 vs. 0.367) → tool use alone is insufficient without orchestration training
- OrchestratorR1 target: beat Direct-GPT-4o (EM>0.367) at comparable or lower cost
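
The cost/quality comparison in the table can be checked with a small dominance test. This is a minimal sketch using only the EM and AvgCost values from the table above; the function name `pareto_frontier` is illustrative and not part of the repository.

```python
# (method, avg_cost_usd, em) copied from the QA Baseline Summary table.
BASELINES = [
    ("Direct-GPT-4o", 0.00041, 0.367),
    ("Self-Reflection", 0.00553, 0.323),
    ("Direct-Strong", 0.00038, 0.182),
    ("Direct-Cheap", 0.00001, 0.192),
    ("ReAct", 0.00070, 0.199),
    ("Fixed-Pipeline", 0.00255, 0.046),
]


def pareto_frontier(points):
    """Return the methods that no other method beats on both cost and EM."""
    frontier = []
    for name, cost, em in points:
        # A point is dominated if another is at least as good on both axes
        # and strictly better on at least one.
        dominated = any(
            (c <= cost and e >= em) and (c < cost or e > em)
            for _, c, e in points
        )
        if not dominated:
            frontier.append(name)
    return frontier


def cost_ratio(points, method, reference):
    """Ratio of average cost between two methods in the table."""
    cost = {name: c for name, c, _ in points}
    return cost[method] / cost[reference]
```

On these numbers only Direct-GPT-4o and Direct-Cheap are non-dominated, so a trained orchestrator would need to land above the line between those two points to extend the frontier.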

## Decisions Log
- **2026-05-06:** Re-targeted NeurIPS 2026 → **AAAI 2027**. Reason: training is still blocked on A100 access; rushing for the May 6 deadline would mean shipping incomplete experiments. The August deadline gives ~3 months for full RL training + ablations.
- **2026-05-06:** Writing frozen until late July. The existing `intro_draft_aaai2027*.md` has not yet been rewritten in AAAI style.
44 changes: 44 additions & 0 deletions projects/OrchestratorR1/PROJECT.md
@@ -0,0 +1,44 @@
# OrchestratorR1 — Project Overview

## Code Repository

**https://github.com/Cayenne226/OrchestratorR1**

## Dataset

**https://huggingface.co/datasets/Cayenne226/OrchestratorR1-data**

## What
An LLM trained via RL (PPO/GRPO) to act as an **orchestrator**: it emits `<search>Model:query</search>` calls to a heterogeneous agent pool, ingests the `<information>` responses, and produces a final `<answer>`. Reward = EM/F1 minus an optional API-cost penalty.
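
As a rough illustration of the reward described above, the sketch below uses SQuAD-style answer normalization for EM/F1 and subtracts a linear cost penalty. The normalization rules, the linear form of the penalty, and the function names are assumptions; the actual reward code in the repository may scale or clip the cost differently.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(pred: str, gold: str) -> float:
    return float(normalize(pred) == normalize(gold))


def f1_score(pred: str, gold: str) -> float:
    """Token-level F1 between the normalized prediction and gold answer."""
    pred_toks, gold_toks = normalize(pred).split(), normalize(gold).split()
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)


def reward(pred, gold, api_cost_usd, cost_coe=0.0, metric=exact_match):
    """Task score minus a cost penalty (assumed linear); cost_coe=0 disables it."""
    return metric(pred, gold) - cost_coe * api_cost_usd
```

With `cost_coe=0` the reward reduces to plain EM (or F1 if `metric=f1_score` is passed), which corresponds to the "w/o cost penalty" ablation.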

## Codebase
- Path: `f:/data/code/Router-R1/OrchestratorR1/`
- Built on Router-R1 (NeurIPS 2025), which is itself built on veRL (ByteDance).
- Key modules:
  - `orchestrator_r1/orchestrator/generation.py`: multi-turn rollout loop
  - `orchestrator_r1/orchestrator/parser.py`: `<search>` / `<answer>` parsing
  - `orchestrator_r1/agent_pool/agent_registry.py`: symbolic name → API endpoint + pricing
  - `training/train.py`: PPO/GRPO orchestration (Ray + FSDP)
  - `eval/eval_orchestrator.py`, `eval/eval_react.py`, `eval/baselines.py`
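
To make the `<search>` / `<answer>` protocol concrete, here is a minimal parsing sketch. It is not the code in `parser.py`; the regexes, the `parse_action` name, and the rule of acting on whichever tag appears first are assumptions made for illustration.

```python
import re
from typing import Optional, Tuple

# <search>Model:query</search>; the model symbol ends at the first colon.
SEARCH_RE = re.compile(r"<search>\s*([^:<]+?)\s*:\s*(.*?)\s*</search>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def parse_action(text: str) -> Tuple[str, Optional[str], str]:
    """Classify one generation step.

    Returns ("search", model, query), ("answer", None, answer),
    or ("invalid", None, "") when neither tag is well formed.
    """
    search = SEARCH_RE.search(text)
    answer = ANSWER_RE.search(text)
    # Assumption: act on whichever tag appears first in the generation.
    if search and (not answer or search.start() < answer.start()):
        return "search", search.group(1), search.group(2)
    if answer:
        return "answer", None, answer.group(1).strip()
    return "invalid", None, ""


def wrap_information(response: str) -> str:
    """Format an agent response for the next turn of the rollout."""
    return f"<information>{response}</information>"
```

A rollout loop would call `parse_action` on each generation, dispatch `search` actions to the agent pool, append `wrap_information(...)` to the context, and stop on `answer` or when `max_turns` is reached.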

## Stack
veRL · vLLM · Ray · FSDP · PPO/GRPO · wandb · Hydra

## Datasets
NQ, TriviaQA, PopQA, HotpotQA, 2WikiMultihop, MuSiQue, Bamboogle, GPQA.

## Agent Pool (current)
| Symbol | Backing model | Price (USD / 1M tokens) |
|---|---|---|
| Qwen / Llama-8B / Mistral | gpt-4o-mini | 0.15 |
| Llama-70B | gpt-4o | 2.50 |
| Gemma | gemini-2.5-flash | 0.15 |
| Claude | claude-sonnet-4-6 | 3.00 |
| Gemini | gemini-2.5-pro | 1.25 |
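
The table above can be read as a lookup from symbolic name to backing model and price. The sketch below mirrors it as a plain dictionary; the `Agent` dataclass, the `call_cost_usd` helper, and the use of a single per-token price (rather than separate input and output rates) are assumptions, not the structure of `agent_registry.py`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Agent:
    backing_model: str
    usd_per_million_tokens: float


# Symbol -> backing model and price, copied from the Agent Pool table.
AGENT_POOL = {
    "Qwen": Agent("gpt-4o-mini", 0.15),
    "Llama-8B": Agent("gpt-4o-mini", 0.15),
    "Mistral": Agent("gpt-4o-mini", 0.15),
    "Llama-70B": Agent("gpt-4o", 2.50),
    "Gemma": Agent("gemini-2.5-flash", 0.15),
    "Claude": Agent("claude-sonnet-4-6", 3.00),
    "Gemini": Agent("gemini-2.5-pro", 1.25),
}


def call_cost_usd(symbol: str, n_tokens: int) -> float:
    """Estimated cost of one call, assuming a single flat per-token price."""
    agent = AGENT_POOL[symbol]
    return agent.usd_per_million_tokens * n_tokens / 1_000_000
```

For example, `call_cost_usd("Llama-70B", 1000)` prices a 1,000-token call at the gpt-4o rate. Summing these per-call costs over a rollout gives the cost term that the reward penalizes.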

## Repo-level Docs (source of truth)
- `RESEARCH_PLAN.md` — full plan
- `FRAMEWORK.md` — architecture
- `COMPARISON.md` — vs Router-R1 / ReAct / Self-Reflection
- `docs/related_work_roadmap.md`
- `docs/intro_draft_aaai2027*.md` — paper draft (frozen until Aug writing sprint)