[Feature] Tool-based shallow Tree-of-Thought (depth=1 Best-of-N) with sandbox-first scoring for hard decisions #276

@Mikito717

Background

The current agent loop in ml-intern generally follows a single reasoning path. When debugging non-trivial training pipelines or making architecture-level decisions, the agent can get stuck in a local optimum or keep repeating the same failed attempt (a “doom loop”).

A common mitigation is to generate multiple candidate approaches at test time and select the best one (“Best-of-N”). A shallow Tree-of-Thought (ToT) with depth=1 is effectively Best-of-N: candidates are compared once, with no further branching, so the search stays bounded and its cost stays predictable.

Terminology (glossary)

  • Best-of-N: generate N candidate solutions and pick the best via scoring.
  • ToT (Tree of Thought): explore a tree of reasoning steps; in Phase 1 we keep it depth=1 (shallow) for safety.
  • Sandbox evaluation: run checks/tests in an isolated runtime to score candidates objectively.

Proposal (Phase 1 scope: depth=1, non-breaking)

Introduce a new pluggable tool (e.g. reasoning_search) registered via the existing tool system (ToolSpec / ToolRouter in agent/core/tools.py).

When the agent hits a hard decision point (or repeated failure), it can call this tool to execute a bounded search:

  1. Candidate generation: generate N diverse candidate plans/patch approaches for the current decision.
  2. Evaluation (sandbox-first): score candidates primarily via sandbox execution (e.g. run a short command/test/lint or a minimal repro).
    • If sandbox evaluation is unavailable for the specific task, fall back to a lightweight LLM-as-a-judge scorer with strict token limits.
  3. Selection: return the best candidate (plus minimal metadata) back to the main loop.

This approach keeps the main loop stable and avoids rewriting the agent orchestrator into a full ToT engine in Phase 1.
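
The three steps above can be sketched as a single function. This is a minimal illustration, not the proposed implementation: `Candidate`, the `generate` and `evaluate` callables, and the default of 3 candidates are all hypothetical names and values chosen for the example, and the real tool would be wired in through ToolSpec / ToolRouter rather than called directly.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Candidate:
    plan: str           # the proposed plan or patch approach
    score: float = 0.0  # filled in by the evaluator


def reasoning_search(
    generate: Callable[[int], List[str]],  # hypothetical: returns up to n plans
    evaluate: Callable[[str], float],      # hypothetical: higher is better
    n_candidates: int = 3,                 # assumed default
) -> Optional[Candidate]:
    """Depth-1 search: generate N plans, score each once, return the best."""
    plans = generate(n_candidates)[:n_candidates]  # never exceed the cap
    scored = [Candidate(plan=p, score=evaluate(p)) for p in plans]
    if not scored:
        return None
    # max() returns the first maximal element, so ties go to the earliest plan.
    return max(scored, key=lambda c: c.score)


# Toy usage with hard-coded plans and scores standing in for the LLM and sandbox.
toy_scores = {"lower lr": 0.2, "clip gradients": 0.9}
best = reasoning_search(
    generate=lambda n: ["lower lr", "clip gradients", "swap optimizer"][:n],
    evaluate=lambda plan: toy_scores.get(plan, 0.0),
)
```

Because there is no recursion or re-expansion, the number of evaluator calls is at most `n_candidates`, which is what makes the depth=1 variant easy to budget.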

Safety / Cost Control (must-have)

To prevent runaway cost and context-window issues:

  • Depth fixed to 1 in Phase 1.
  • n_candidates capped (e.g. 2–5) and configurable.
  • Strict per-call budgets: token caps + timeout for evaluation.
  • Candidate de-duplication (avoid scoring the same candidate multiple times).
  • Feature disabled by default (opt-in), or enabled only for specific tasks.
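
Two of these controls, the candidate cap and de-duplication, are simple enough to sketch. The helper names `clamp_n` and `dedupe`, and the ceiling of 5, are assumptions made for illustration (the ceiling mirrors the "2–5" example above). Note that this only catches textual near-duplicates; two differently worded but equivalent plans would still both be scored.

```python
import hashlib
from typing import Iterable, List

MAX_CANDIDATES = 5  # assumed hard ceiling, taken from the "2–5" example range


def clamp_n(requested: int, ceiling: int = MAX_CANDIDATES) -> int:
    """Keep the requested candidate count within [1, ceiling]."""
    return max(1, min(requested, ceiling))


def dedupe(candidates: Iterable[str]) -> List[str]:
    """Drop candidates that are identical after whitespace/case normalization."""
    seen = set()
    unique = []
    for text in candidates:
        normalized = " ".join(text.split()).lower()
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(text)  # keep the first spelling encountered
    return unique
```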

Acceptance Criteria

Tool implementation

  • A new tool (e.g. reasoning_search) that returns a single “best” candidate to the caller.

Sandbox-first scoring

  • Provide a scoring strategy that prefers sandbox results (exit code, tests, etc.).
  • LLM-judge scoring is allowed only as a fallback and must be budget-limited.
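
One possible shape for such a strategy is sketched below. It is an assumption-heavy illustration: `score_candidate` and its parameters are hypothetical, the pass/fail mapping to 1.0/0.0 is just one choice, and `subprocess.run` here executes on the local machine, standing in for whatever isolated runtime the project actually uses. The `judge` callable represents the budget-limited LLM fallback and is only invoked when no runnable check exists.

```python
import subprocess
import sys
from typing import Callable, List, Optional


def score_candidate(
    command: Optional[List[str]],  # check to execute, or None if unavailable
    judge: Callable[[], float],    # hypothetical LLM-judge fallback
    timeout_s: float = 30.0,       # assumed evaluation timeout in seconds
) -> float:
    """Return a score in [0, 1], preferring an executed check over the judge."""
    if command is None:
        # No runnable check for this task: fall back to the judge, clamped.
        return max(0.0, min(1.0, judge()))
    try:
        result = subprocess.run(command, capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return 0.0  # a check that exceeds its budget counts as a failure
    except OSError:
        return 0.0  # the command could not be started at all
    return 1.0 if result.returncode == 0 else 0.0


# Usage: a trivially passing check, run with the current Python interpreter.
passing_check = [sys.executable, "-c", "raise SystemExit(0)"]
sandbox_score = score_candidate(passing_check, judge=lambda: 0.5)
```

Treating a timeout as a plain failure keeps the scorer total: every candidate gets a number, and a hanging test cannot stall the search.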

Config & limits

  • Add configuration options (e.g. n_candidates, timeout, token_budget, max_concurrency) in a central config (e.g. agent/config.py).
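
As a rough sketch of what those options might look like, the dataclass below groups them with validation. The field names follow the ones listed above; the class name `ReasoningSearchConfig`, every default value, and the 2–5 bounds are assumptions for illustration and would need to match whatever conventions agent/config.py already follows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReasoningSearchConfig:
    enabled: bool = False      # opt-in, per the safety requirements
    n_candidates: int = 3      # assumed default within the 2–5 example range
    timeout: float = 30.0      # assumed per-candidate evaluation timeout (s)
    token_budget: int = 2000   # assumed cap for the LLM-judge fallback
    max_concurrency: int = 1   # assumed: evaluate candidates sequentially

    def __post_init__(self) -> None:
        # Reject out-of-range values at construction time.
        if not 2 <= self.n_candidates <= 5:
            raise ValueError("n_candidates must be between 2 and 5")
        if self.timeout <= 0 or self.token_budget <= 0:
            raise ValueError("timeout and token_budget must be positive")
        if self.max_concurrency < 1:
            raise ValueError("max_concurrency must be at least 1")


default_config = ReasoningSearchConfig()
```

Validating in `__post_init__` means a misconfigured cap fails at startup instead of surfacing as an unexpectedly expensive search later.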

Testing

  • Add regression tests with a mocked environment and fixed seed demonstrating the tool:
    • generates multiple candidates,
    • prunes poor candidates,
    • returns the expected best candidate.
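
A regression test along these lines might look like the sketch below. `search_with_pruning` is a hypothetical stand-in for the tool under test, and the `min_score` threshold is one assumed way to define "poor"; the fake generator and fake score table replace the LLM and the sandbox so the test is deterministic.

```python
import random
from typing import Callable, Dict, List, Tuple


def search_with_pruning(
    generate: Callable[[int], List[str]],
    score: Callable[[str], float],
    n: int,
    min_score: float,
) -> Tuple[str, List[str]]:
    """Stand-in for the tool: returns (best candidate, pruned candidates)."""
    scores: Dict[str, float] = {c: score(c) for c in generate(n)}
    pruned = [c for c, s in scores.items() if s < min_score]
    kept = {c: s for c, s in scores.items() if s >= min_score}
    if not kept:
        raise ValueError("every candidate fell below min_score")
    return max(kept, key=lambda c: kept[c]), pruned


def test_returns_expected_best() -> None:
    rng = random.Random(0)  # fixed seed keeps the fake generator deterministic
    pool = ["plan-a", "plan-b", "plan-c", "plan-d"]

    def fake_generate(n: int) -> List[str]:
        return rng.sample(pool, n)  # mocked stand-in for LLM sampling

    fake_scores = {"plan-a": 0.1, "plan-b": 0.9, "plan-c": 0.4, "plan-d": 0.6}
    best, pruned = search_with_pruning(
        fake_generate, fake_scores.__getitem__, n=4, min_score=0.5
    )

    assert best == "plan-b"                        # highest-scoring candidate
    assert sorted(pruned) == ["plan-a", "plan-c"]  # both scored below 0.5
```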

UI compatibility

  • No frontend changes required. Emit only standard tool status/log events (e.g. “searching…”, “evaluating candidate 2/4…”).
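
To illustrate how progress could be reported without new event types, the sketch below routes plain status strings through a callback. The `emit` callback is a hypothetical placeholder for whatever the existing tool status/log mechanism is; only the message texts come from the examples above.

```python
from typing import Callable, List


def evaluate_with_status(
    candidates: List[str],
    evaluate: Callable[[str], float],
    emit: Callable[[str], None],  # hypothetical hook into existing status events
) -> List[float]:
    """Score each candidate in order, emitting one status line per step."""
    emit("searching…")
    scores = []
    total = len(candidates)
    for index, candidate in enumerate(candidates, start=1):
        emit(f"evaluating candidate {index}/{total}…")
        scores.append(evaluate(candidate))
    return scores


# Usage: collect the messages in a list instead of sending them to a UI.
messages: List[str] = []
status_scores = evaluate_with_status(["a", "bb"], lambda c: float(len(c)), messages.append)
```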

Alternatives Considered

Rewriting the agent loop into a general ToT/beam-search orchestrator was considered, but it risks:

  • larger context-window pressure,
  • harder maintenance and UI integration,
  • higher cost by default.

Phase 1 explicitly avoids that by encapsulating the search inside a tool.

Open Questions

  1. What timeout is acceptable for sandbox evaluation to balance speed vs. reliability?
  2. How should we represent the internal “candidate evaluation trace” in existing traces/logging without UI changes?
