[Feature] Tool-based shallow Tree-of-Thought (depth=1 Best-of-N) with sandbox-first scoring for hard decisions #276

## Background
The current agent loop in `ml-intern` generally follows a single reasoning path. When debugging non-trivial training pipelines or making architecture-level decisions, the agent can get stuck in local optima or repeat the same failed attempts (“doom loop”).

A common mitigation is to generate multiple candidate approaches at test time and select the best one (“Best-of-N”). A shallow Tree-of-Thought (ToT) with `depth=1` is effectively Best-of-N: every candidate is scored once and none is expanded further, which keeps the search bounded and safer than a deeper tree.

### Terminology (glossary)
- **Best-of-N**: generate **N** candidate solutions and pick the best via scoring.
- **ToT (Tree of Thought)**: explore a tree of reasoning steps; in Phase 1 we keep it `depth=1` (shallow) for safety.
- **Sandbox evaluation**: run checks/tests in an isolated runtime to score candidates objectively.

## Proposal (Phase 1 scope: depth=1, non-breaking)
Introduce a new pluggable tool (e.g. `reasoning_search`) registered via the existing tool system (`ToolSpec` / `ToolRouter` in `agent/core/tools.py`).

When the agent hits a hard decision point (or repeated failure), it can call this tool to execute a bounded search:

1. **Candidate generation**: generate **N** diverse candidate plans/patch approaches for the current decision.
2. **Evaluation (sandbox-first)**: score candidates primarily via sandbox execution (e.g. run a short command/test/lint or a minimal repro).
   - If sandbox evaluation is unavailable for the specific task, fall back to a lightweight **LLM-as-a-judge** scorer with strict token limits.
3. **Selection**: return the best candidate (plus minimal metadata) back to the main loop.
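
The three steps above can be sketched as a single function. This is a minimal illustration, not the proposed implementation: `ScoredCandidate`, the `sandbox_score` and `judge_score` callables, and the convention that `sandbox_score` returns `None` when the sandbox is unavailable are all assumptions made for this example. How the tool would be wired into `ToolSpec` / `ToolRouter` is left out because that interface is not described here. The sketch also assumes both scorers return values on the same scale; a real implementation would need to normalize them.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class ScoredCandidate:
    """Hypothetical result record: the plan, its score, and which scorer produced it."""
    plan: str
    score: float
    scorer: str  # "sandbox" or "llm_judge"


def reasoning_search(
    candidates: Sequence[str],
    sandbox_score: Callable[[str], Optional[float]],
    judge_score: Callable[[str], float],
) -> ScoredCandidate:
    """Depth-1 Best-of-N: score each candidate once and return the highest."""
    if not candidates:
        raise ValueError("at least one candidate is required")
    scored = []
    for plan in candidates:
        score = sandbox_score(plan)  # assumption: None means "sandbox unavailable"
        if score is not None:
            scored.append(ScoredCandidate(plan, score, "sandbox"))
        else:
            # Fallback only: the judge is consulted when the sandbox cannot score.
            scored.append(ScoredCandidate(plan, judge_score(plan), "llm_judge"))
    # max() returns the first of equally scored candidates, so ties are stable.
    return max(scored, key=lambda c: c.score)
```

Returning the scorer name alongside the plan is one way to provide the "minimal metadata" mentioned in step 3, so the main loop can tell an objectively tested candidate from a judged one.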

This approach keeps the main loop stable and avoids rewriting the agent orchestrator into a full ToT engine in Phase 1.

## Safety / Cost Control (must-have)
To prevent runaway cost and context-window issues:

- Depth fixed to **1** in Phase 1.
- `n_candidates` capped (e.g. **2–5**) and configurable.
- Strict per-call budgets: a token cap for generation and judging, plus a timeout for each evaluation.
- Candidate de-duplication (avoid scoring the same candidate multiple times).
- Feature disabled by default (opt-in), or enabled only for specific tasks.
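
Two of the controls above, the candidate cap and de-duplication, might look like the following. This is a sketch under stated assumptions: the helper names are hypothetical, the 2 to 5 bounds are taken from the example range in the list, and "same candidate" is approximated by comparing whitespace- and case-normalized text, which would miss semantically equivalent plans that are worded differently.

```python
import re
from typing import Iterable, List

# Assumed bounds, taken from the "2-5" example range; both would be configurable.
MIN_CANDIDATES = 2
MAX_CANDIDATES = 5


def clamp_n_candidates(requested: int) -> int:
    """Force the requested candidate count into the allowed range."""
    return max(MIN_CANDIDATES, min(MAX_CANDIDATES, requested))


def dedupe_candidates(candidates: Iterable[str]) -> List[str]:
    """Drop empty candidates and repeats, keeping the first occurrence in order."""
    seen = set()
    unique = []
    for text in candidates:
        # Normalize whitespace and case so trivially different copies collide.
        key = re.sub(r"\s+", " ", text).strip().lower()
        if key and key not in seen:
            seen.add(key)
            unique.append(text)
    return unique
```

De-duplicating before evaluation means the sandbox budget is spent only on distinct candidates.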

## Acceptance Criteria
### Tool implementation
- [ ] A new tool (e.g. `reasoning_search`) that returns a single “best” candidate to the caller.

### Sandbox-first scoring
- [ ] Provide a scoring strategy that prefers sandbox results (exit code, tests, etc.).
- [ ] LLM-judge scoring is allowed only as a fallback and must be budget-limited.
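
A sandbox-preferring scorer could map a command's exit code to a score, as sketched below. The function name and the scoring convention (1.0 for exit code 0, 0.0 otherwise or on timeout, `None` when the command cannot be launched) are assumptions for illustration. A local `subprocess` call stands in for the project's sandbox runtime here; it provides no isolation by itself.

```python
import subprocess
import sys
from typing import Optional, Sequence


def sandbox_exit_score(command: Sequence[str], timeout_s: float) -> Optional[float]:
    """Score a candidate's check command by its exit code.

    Returns 1.0 on exit code 0, 0.0 on a non-zero exit or a timeout, and
    None when the command cannot be started (treated as "unavailable").
    """
    try:
        result = subprocess.run(
            list(command),
            capture_output=True,  # keep candidate output out of the agent's stdout
            timeout=timeout_s,    # hard wall-clock budget per evaluation
            check=False,          # a failing check is a score, not an exception
        )
    except subprocess.TimeoutExpired:
        return 0.0
    except OSError:
        # e.g. the executable does not exist; the caller may fall back to a judge.
        return None
    return 1.0 if result.returncode == 0 else 0.0
```

For example, `sandbox_exit_score([sys.executable, "-m", "pytest", "-q"], timeout_s=60.0)` would score a candidate by whether its test run passes within a minute. A `None` result is the only case in which the budget-limited LLM judge would be consulted.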

### Config & limits
- [ ] Add configuration options (e.g. `n_candidates`, `timeout`, `token_budget`, `max_concurrency`) in a central config (e.g. `agent/config.py`).
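
One possible shape for these options is a frozen dataclass that validates itself on construction. The class name, the `enabled` flag, and every default value below are assumptions chosen for illustration; only the four option names come from the criterion above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReasoningSearchConfig:
    """Hypothetical config block for the search tool; all defaults are placeholders."""
    enabled: bool = False       # opt-in, matching "disabled by default"
    n_candidates: int = 3       # must stay within the 2-5 cap
    timeout: float = 60.0       # seconds per sandbox evaluation
    token_budget: int = 2000    # tokens available per tool call
    max_concurrency: int = 2    # candidates evaluated in parallel

    def __post_init__(self) -> None:
        if not 2 <= self.n_candidates <= 5:
            raise ValueError("n_candidates must be between 2 and 5")
        if self.timeout <= 0:
            raise ValueError("timeout must be positive")
        if self.token_budget <= 0:
            raise ValueError("token_budget must be positive")
        if self.max_concurrency < 1:
            raise ValueError("max_concurrency must be at least 1")
```

Rejecting out-of-range values at construction keeps the limits from being bypassed by a misconfigured caller.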

### Testing
- [ ] Add regression tests, using a mocked environment and a fixed seed, demonstrating that the tool:
  - [ ] generates multiple candidates,
  - [ ] prunes poor candidates,
  - [ ] returns the expected best candidate.
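
A regression test along these lines could use `unittest` with a mocked scorer and a seeded `random.Random`. Everything here is a stand-in: `generate_candidates` and `best_of_n` are toy functions defined only so the test is self-contained, and the candidate names and scores are made up. The real tests would target the actual tool.

```python
import random
import unittest
from unittest import mock


def generate_candidates(rng, pool, n):
    """Toy generator: draw n distinct candidates using the supplied RNG."""
    return rng.sample(pool, n)


def best_of_n(candidates, score, min_score=0.0):
    """Toy selector: prune candidates at or below min_score, return the best."""
    survivors = [c for c in candidates if score(c) > min_score]
    return max(survivors, key=score) if survivors else None


class ReasoningSearchRegressionTest(unittest.TestCase):
    POOL = ["lower-lr", "add-warmup", "clip-grads", "reset-optimizer"]
    SCORES = {"lower-lr": 0.2, "add-warmup": 0.9,
              "clip-grads": 0.0, "reset-optimizer": 0.0}

    def test_fixed_seed_generates_same_candidates(self):
        first = generate_candidates(random.Random(0), self.POOL, 3)
        second = generate_candidates(random.Random(0), self.POOL, 3)
        self.assertEqual(first, second)       # same seed, same draw
        self.assertEqual(len(set(first)), 3)  # multiple distinct candidates

    def test_prunes_and_returns_best(self):
        # The mock replaces sandbox scoring with a fixed lookup table.
        score = mock.Mock(side_effect=self.SCORES.__getitem__)
        self.assertEqual(best_of_n(self.POOL, score), "add-warmup")
        score.assert_any_call("clip-grads")   # poor candidates were still scored

    def test_all_pruned_returns_none(self):
        self.assertIsNone(best_of_n(["clip-grads"], self.SCORES.__getitem__))
```

The seeded test asserts reproducibility rather than specific drawn values, so it does not depend on the details of a particular RNG implementation.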

### UI compatibility
- [ ] No frontend changes required. Emit only standard tool status/log events (e.g. “searching…”, “evaluating candidate 2/4…”).

## Alternatives Considered
Rewriting the agent loop into a general ToT/beam-search orchestrator was considered, but it risks:

- larger context-window pressure,
- lower maintainability and harder UI integration,
- higher cost by default.

Phase 1 explicitly avoids that by encapsulating the search inside a tool.

## Open Questions
1. What timeout is acceptable for sandbox evaluation to balance speed vs. reliability?
2. How should we represent the internal “candidate evaluation trace” in existing traces/logging without UI changes?
