Background
The current agent loop in ml-intern generally follows a single reasoning path. When debugging non-trivial training pipelines or making architecture-level decisions, the agent can get stuck in local optima or repeat the same failed attempts (“doom loop”).
A common mitigation is to generate multiple candidate approaches at test time and select the best one (“Best-of-N”). A shallow Tree-of-Thought (ToT) search with depth=1 is effectively Best-of-N, and fixing the depth keeps the search bounded and safer to run.
Terminology (glossary)
- Best-of-N: generate N candidate solutions and pick the best via scoring.
- ToT (Tree of Thought): explore a tree of reasoning steps; in Phase 1 we keep it at depth=1 (shallow) for safety.
- Sandbox evaluation: run checks/tests in an isolated runtime to score candidates objectively.
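As a minimal sketch of how the first two terms relate, a depth=1 search reduces to scoring N candidates once and taking the maximum. The names `best_of_n`, `generate`, and `score` are illustrative assumptions, not existing ml-intern APIs.

```python
from typing import Callable, List, Tuple


def best_of_n(
    generate: Callable[[int], List[str]],
    score: Callable[[str], float],
    n: int,
) -> Tuple[str, float]:
    """Depth-1 search: generate n candidates, score each once, return the best.

    `generate` and `score` are hypothetical callables supplied by the caller.
    """
    candidates = generate(n)
    if not candidates:
        raise ValueError("generator returned no candidates")
    scored = [(candidate, score(candidate)) for candidate in candidates]
    # max() keeps the first candidate on ties, so the result is deterministic.
    return max(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Toy stand-ins: candidates are fixed strings, the score is their length.
    plans = ["lower lr", "add gradient clipping", "swap optimizer"]
    best, value = best_of_n(lambda n: plans[:n], lambda p: float(len(p)), n=3)
    print(best, value)
```

In a real tool the scorer would be the sandbox check described in the proposal; string length is used here only to keep the sketch runnable.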
Proposal (Phase 1 scope: depth=1, non-breaking)
Introduce a new pluggable tool (e.g. reasoning_search) registered via the existing tool system (ToolSpec / ToolRouter in agent/core/tools.py).
When the agent hits a hard decision point (or repeated failure), it can call this tool to execute a bounded search:
- Candidate generation: generate N diverse candidate plans/patch approaches for the current decision.
- Evaluation (sandbox-first): score candidates primarily via sandbox execution (e.g. run a short command/test/lint or a minimal repro).
- If sandbox evaluation is unavailable for the specific task, fall back to a lightweight LLM-as-a-judge scorer with strict token limits.
- Selection: return the best candidate (plus minimal metadata) back to the main loop.
This approach keeps the main loop stable and avoids rewriting the agent orchestrator into a full ToT engine in Phase 1.
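The sandbox-first scoring with a judge fallback could look roughly like the sketch below. It assumes a candidate is checked by running a short command and that exit code 0 means success. A local subprocess stands in for the isolated runtime, and `sandbox_score`, `score_candidate`, and the `judge` callable are hypothetical names.

```python
import subprocess
import sys
from typing import Callable, List, Optional, Tuple


def sandbox_score(command: List[str], timeout: float) -> Optional[float]:
    """Run a check command: 1.0 on exit code 0, 0.0 on failure or timeout.

    Returns None when the command cannot be started at all, which the caller
    treats as "sandbox unavailable". A real sandbox would add isolation; this
    sketch only uses a local subprocess.
    """
    try:
        result = subprocess.run(command, capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return 0.0
    except OSError:
        return None
    return 1.0 if result.returncode == 0 else 0.0


def score_candidate(
    command: Optional[List[str]],
    timeout: float,
    judge: Callable[[], float],
) -> Tuple[float, str]:
    """Prefer the sandbox result; fall back to the judge only when no check can run."""
    score = sandbox_score(command, timeout) if command else None
    if score is None:
        return judge(), "judge"
    return score, "sandbox"


if __name__ == "__main__":
    passing = [sys.executable, "-c", "raise SystemExit(0)"]
    failing = [sys.executable, "-c", "raise SystemExit(1)"]
    print(score_candidate(passing, 30.0, judge=lambda: 0.5))  # sandbox path
    print(score_candidate(failing, 30.0, judge=lambda: 0.5))  # sandbox path
    print(score_candidate(None, 30.0, judge=lambda: 0.5))     # judge path
```

Returning the scoring source alongside the score is one way to keep the fallback visible in the metadata handed back to the main loop.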
Safety / Cost Control (must-have)
To prevent runaway cost and context-window issues:
- Depth fixed to 1 in Phase 1.
- n_candidates capped (e.g. 2–5) and configurable.
- Strict per-call budgets: token caps + timeout for evaluation.
- Candidate de-duplication (avoid scoring the same candidate multiple times).
- Feature disabled by default (opt-in), or enabled only for specific tasks.
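One possible shape for these limits is a small frozen config object plus a de-duplication helper. This is a sketch under assumptions: `SearchLimits` and `dedupe` are hypothetical names, the default values are placeholders, and the 2 to 5 clamp simply mirrors the example range above.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Bounds taken from the example range in the list above (assumption, not a spec).
MIN_CANDIDATES = 2
MAX_CANDIDATES = 5


@dataclass(frozen=True)
class SearchLimits:
    """Hypothetical per-call limits; all default values are placeholders."""

    enabled: bool = False      # opt-in: off unless explicitly turned on
    depth: int = 1             # fixed to 1 in Phase 1
    n_candidates: int = 3      # clamped into [MIN_CANDIDATES, MAX_CANDIDATES]
    token_budget: int = 2000   # per-call token cap
    timeout: float = 60.0      # evaluation timeout in seconds
    max_concurrency: int = 1   # candidates evaluated in parallel

    def __post_init__(self) -> None:
        if self.depth != 1:
            raise ValueError("Phase 1 supports depth=1 only")
        clamped = max(MIN_CANDIDATES, min(MAX_CANDIDATES, self.n_candidates))
        # Frozen dataclasses block normal assignment, so bypass it here.
        object.__setattr__(self, "n_candidates", clamped)


def dedupe(candidates: Iterable[str]) -> List[str]:
    """Drop candidates equal to an earlier one after whitespace/case normalization."""
    seen = set()
    unique = []
    for candidate in candidates:
        key = " ".join(candidate.split()).lower()
        if key not in seen:
            seen.add(key)
            unique.append(candidate)
    return unique


if __name__ == "__main__":
    print(SearchLimits(n_candidates=10).n_candidates)
    print(dedupe(["Lower LR", "lower  lr", "clip grads"]))
```

Text normalization only catches near-identical candidates; detecting semantically equivalent plans would need a stronger comparison, which is outside this sketch.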
Acceptance Criteria
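- Tool implementation: a tool (e.g. reasoning_search) that returns a single “best” candidate to the caller.
- Sandbox-first scoring
- Config & limits: limits (n_candidates, timeout, token_budget, max_concurrency) in a central config (e.g. agent/config.py).
- Testing
- UI compatibility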
Alternatives Considered
Rewriting the agent loop into a general ToT/beam-search orchestrator was considered, but it risks:
- larger context-window pressure,
- harder maintenance and UI integration,
- higher cost by default.
Phase 1 explicitly avoids that by encapsulating the search inside a tool.
Open Questions
- What timeout is acceptable for sandbox evaluation to balance speed vs. reliability?
- How should we represent the internal “candidate evaluation trace” in existing traces/logging without UI changes?