77 changes: 34 additions & 43 deletions docs/api/selectors.md
Original file line number Diff line number Diff line change
@@ -11,6 +11,8 @@ selector = ModelSelector(
eval_fn=lambda expected, actual: float(actual == expected),
dataset=[(inp, expected), ...],
method="auto", # arm_elimination — strong + cheap
objective_mode="weighted",
lambda_latency=0.2,
)
results = selector.select_best(parallel=True, max_concurrent=20)
results.print_summary()
@@ -24,9 +26,10 @@ results.print_summary()
| `models` | `Dict[str, List]` | Maps node names to candidate model lists (e.g. `{"planner": ["gpt-4o", "gpt-4o-mini"]}`). |
| `eval_fn` | `Callable` | `(expected, actual) -> float` score (higher is better). |
| `dataset` | `Sequence[Tuple]` | `[(input_data, expected_answer), ...]`. |
| `objective_mode` | `str`, **required** | `"weighted"` — one recommended combo via `lambda_cost` / `lambda_latency`. `"pareto"` — empirical frontier (error, latency, cost); matrix UCB uses Chebyshev exploration internally. |
| `model_prices` | `Dict`, optional | Custom pricing overrides: `{"model": {"input_price": x, "output_price": y}}` in $/MTok. Required for cost terms when `lambda_cost > 0`. |
| `lambda_cost` | `float`, optional | Weight on **normalized** per-sample cost in the combined objective. Default `0.0` (disabled). See [Combined objective](#combined-objective-optional-costlatency-weights) below. |
| `lambda_latency` | `float`, optional | Weight on **normalized** per-sample latency in the combined objective. Default `0.0` (disabled). |
| `lambda_cost` | `float` | Weight on **normalized** per-sample cost (**weighted** mode only). |
| `lambda_latency` | `float` | Weight on **normalized** per-sample latency (**weighted** mode only). |
| `node_descriptions` | `Dict[str, str]`, optional | Human-readable descriptions per node — surfaced in `LMProposalModelSelector`. |
| `tracker` | `LLMTracker`, optional | Bring your own. Defaults to a fresh `LLMTracker()` started in the constructor. Pass one in to share a cache across runs, route via a daemon (`AGENTOPT_GATEWAY_URL`), or post-process records after `select_best()` returns. |
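
As a small illustration of the `eval_fn` contract in the table above (`(expected, actual) -> float`, higher is better), here is a minimal sketch. The normalization rule (trim whitespace, ignore case) is an assumption made for this example, not something the library prescribes.

```python
def eval_fn(expected: str, actual: str) -> float:
    """Return 1.0 on a match and 0.0 otherwise (higher is better)."""
    # Assumption: whitespace and letter case should not affect the score.
    normalize = lambda text: str(text).strip().lower()
    return float(normalize(actual) == normalize(expected))


match_score = eval_fn("Paris", " paris\n")
miss_score = eval_fn("Paris", "London")
```

Any callable with this shape can be passed as `eval_fn`; a function returning partial credit in `[0, 1]` fits the same contract.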

@@ -41,67 +44,55 @@ print(tracker.get_usage()) # tracker.stop() already called; records sti

See [tracker.md](tracker.md) for the full tracker surface.

## Combined objective (optional cost/latency weights)

By default, selectors optimize **`eval_fn` score only** (typically accuracy) and break ties with latency, then price. To trade accuracy against cost and latency in one scalar reward, pass optional weights on the constructor (or via `ModelSelector(..., **kwargs)`):

| Parameter | Default | Effect |
|:---|:---|:---|
| `lambda_cost` | `0.0` | Penalizes normalized per-sample **token cost** (USD from the tracker, or `model_prices`). |
| `lambda_latency` | `0.0` | Penalizes normalized per-sample **wall-clock latency** (seconds). |
## Objective mode (required)

Omit both parameters (or leave them at `0.0`) for the original accuracy-centric behavior. Set one or both when you want multi-metric selection.
You must set `objective_mode` on every selector.

### Formula
### `objective_mode="weighted"`

For each datapoint, after observations are recorded:
Set at least one of `lambda_cost` or `lambda_latency` to a value greater than 0. The library marks a single combo as **`is_best`**, ranked by a linear scalarization (score minus weighted, normalized cost and latency):

```
combined = score
- lambda_cost * norm(cost)
- lambda_latency * norm(latency)
combined = score - lambda_cost * norm(cost) - lambda_latency * norm(latency)
```

- **`score`** — return value of `eval_fn` (higher is better).
- **`norm(·)`** — min–max scale to `[0, 1]` using running min/max over **all** samples seen during that selector run (updated as more combos are evaluated).
- **Per combination** — mean of per-datapoint combined values → `ModelResult.combined_objective` (see [results.md](results.md)).
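
The formula above can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: `RunningMinMax` and `combined_objectives` are hypothetical names, the sample tuples are made-up data, and mapping a zero-width range to `0.0` is a choice made for this sketch.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Tuple

Sample = Tuple[float, float, float]  # (score, cost_usd, latency_s)


@dataclass
class RunningMinMax:
    """Hypothetical helper: tracks the min and max of the values seen so far."""
    lo: float = float("inf")
    hi: float = float("-inf")

    def update(self, value: float) -> None:
        self.lo = min(self.lo, value)
        self.hi = max(self.hi, value)

    def norm(self, value: float) -> float:
        # Assumption: a degenerate range (min == max) contributes no penalty.
        if self.hi <= self.lo:
            return 0.0
        return (value - self.lo) / (self.hi - self.lo)


def combined_objectives(
    samples_by_combo: Dict[str, List[Sample]],
    lambda_cost: float = 0.0,
    lambda_latency: float = 0.0,
) -> Dict[str, float]:
    """Mean per-sample combined objective for each combo."""
    cost_norm, latency_norm = RunningMinMax(), RunningMinMax()
    # First pass: fit the normalizers on every sample from the whole run.
    for samples in samples_by_combo.values():
        for _score, cost, latency in samples:
            cost_norm.update(cost)
            latency_norm.update(latency)
    # Second pass: score each sample, then average per combo.
    return {
        combo: mean(
            score
            - lambda_cost * cost_norm.norm(cost)
            - lambda_latency * latency_norm.norm(latency)
            for score, cost, latency in samples
        )
        for combo, samples in samples_by_combo.items()
    }


runs = {
    "gpt-4o": [(1.0, 0.010, 2.0), (1.0, 0.010, 2.0)],
    "gpt-4o-mini": [(1.0, 0.002, 1.0), (0.5, 0.002, 1.0)],
}
result = combined_objectives(runs, lambda_cost=0.3, lambda_latency=0.2)
best = max(result, key=result.get)
```

With these weights the cheaper, faster combo ranks first even though its raw score is lower, which is the tradeoff the lambdas are meant to express.
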
```python
selector = ModelSelector(
...,
objective_mode="weighted",
lambda_cost=0.3,
lambda_latency=0.2,
model_prices={...},
)
results = selector.select_best()
best = results.get_best()
```

This is a **linear scalarization**, not Pareto exploration. Larger `lambda_*` penalize cost/latency more strongly relative to score.
### `objective_mode="pareto"`

### Example
Do **not** pass `lambda_cost` or `lambda_latency`. The library minimizes **error** (`1 - score`), **latency**, and **cost** (when priced), marks nondominated combos, and exposes `results.get_pareto_front()` and `results.plot_pareto()` (error on the y-axis; ideal corner at 0).

```python
selector = ModelSelector(
agent=MyAgent,
models=models,
eval_fn=eval_fn,
dataset=dataset,
...,
method="matrix_ucb",
lambda_cost=0.3, # optional — omit for accuracy-only
lambda_latency=0.2,
model_prices={ # recommended when lambda_cost > 0
"gpt-4o": {"input_price": 2.5, "output_price": 10.0},
"gpt-4o-mini": {"input_price": 0.15, "output_price": 0.6},
},
objective_mode="pareto",
)
results = selector.select_best(parallel=True)
results.print_summary() # ranks by combined_objective when lambdas are set
results = selector.select_best()
results.get_pareto_front()
results.plot_pareto()
```
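
To make "nondominated" concrete, here is a minimal sketch of frontier marking over `(error, latency, cost)` tuples, all minimized. The function names, the third combo name, and the numbers are illustrative assumptions; this is not the library's implementation.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]  # (error, latency_s, cost_usd), all minimized


def dominates(a: Point, b: Point) -> bool:
    """True if a is no worse than b on every axis and strictly better on one."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points: Dict[str, Point]) -> List[str]:
    """Names of combos that no other combo dominates, in input order."""
    return [
        name
        for name, point in points.items()
        if not any(
            dominates(other_point, point)
            for other, other_point in points.items()
            if other != name
        )
    ]


combos = {
    "gpt-4o": (0.05, 2.0, 0.010),
    "gpt-4o-mini": (0.20, 1.0, 0.002),
    "slow-and-wrong": (0.25, 2.5, 0.012),  # hypothetical combo, worse on every axis
}
front = pareto_front(combos)
```

Neither of the first two combos dominates the other (one is more accurate, the other cheaper and faster), so both stay on the front, while the third is dominated and dropped.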

### How each method uses the weights
For `matrix_ucb` / `matrix_ucb_lrf`, exploration uses **Chebyshev scalarization** over normalized gaps to the ideal point (0 error, 0 s latency, $0 cost). Tradeoff directions rotate automatically, so there are no extra knobs to tune.
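
For readers unfamiliar with Chebyshev scalarization, the sketch below shows the general technique: each candidate is scored by its largest weighted gap to the ideal point, and the weight vector changes between rounds. The rotation schedule, weight vectors, and candidate data are assumptions for illustration only; the source does not specify how the library rotates directions.

```python
from itertools import cycle
from typing import Dict, List, Sequence, Tuple


def chebyshev_cost(gaps: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted Chebyshev distance to the ideal point (every gap equal to 0)."""
    return max(w * g for w, g in zip(weights, gaps))


# Assumed rotation schedule: cycle over a fixed set of tradeoff directions.
# Each weight vector is ordered as (error, latency, cost).
directions = cycle([
    (1.0, 0.0, 0.0),
    (0.5, 0.25, 0.25),
    (1 / 3, 1 / 3, 1 / 3),
])

# Normalized gaps in [0, 1] for two hypothetical combos.
candidates: Dict[str, Tuple[float, float, float]] = {
    "a": (0.1, 0.9, 0.8),  # accurate but slow and expensive
    "b": (0.3, 0.2, 0.1),  # less accurate but fast and cheap
}

picks: List[str] = []
for _ in range(3):
    weights = next(directions)
    # Lower Chebyshev cost is better; a bandit could use its negative as reward.
    picks.append(
        min(candidates, key=lambda name: chebyshev_cost(candidates[name], weights))
    )
```

Unlike a linear sum, the max term only rewards a candidate when its worst weighted gap shrinks, which lets rotating directions reach points on non-convex parts of a frontier.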

| Methods | During search | Final `is_best` |
| Methods | Weighted search | Pareto search |
|:---|:---|:---|
| `matrix_ucb`, `matrix_ucb_lrf` | UCB rewards use per-cell combined objective | `_find_best` on `combined_objective` |
| `arm_elimination`, `epsilon_lucb`, `threshold` | Elimination / LUCB stats on combined per-sample objectives | same |
| `hill_climbing`, `bayesian` | Move / surrogate target uses combined objective | same |
| `brute_force`, `random` | Does not steer *which* combos to try | same |
| `lm_proposal` | Proposer uses `objective=` **text**, not these lambdas | `combined_objective` on the one evaluated combo only |

After `select_best()`, a final pass recomputes every result’s `combined_objective` against the **full-run** normalizer so rankings are comparable.
| `matrix_ucb`, `matrix_ucb_lrf` | Per-cell linear combined objective | Chebyshev cell reward |
| Other bandits | Combined per-sample stats where applicable | Full eval → frontier marking |
| `brute_force`, `random` | Final rank only | Final frontier only |

!!! note "`lm_proposal` vs lambdas"
`LMProposalModelSelector(objective="...")` is a natural-language hint to the **proposer LLM**. It is separate from `lambda_cost` / `lambda_latency`, which only affect the scalar reward used for ranking and bandit methods.
`LMProposalModelSelector(objective="...")` is a natural-language hint to the **proposer LLM**. It is separate from `objective_mode` and `lambda_*`.

## `select_best()`

1 change: 1 addition & 0 deletions examples/selection/daemon/basic.py
@@ -91,6 +91,7 @@ def eval_fn(expected: str, actual: str) -> float:
eval_fn=eval_fn,
dataset=dataset,
method="brute_force",
objective_mode="pareto",
)
results = selector.select_best(parallel=False)
results.print_summary()
24 changes: 23 additions & 1 deletion examples/selection/local/advanced_algorithms.py
@@ -98,7 +98,13 @@ def eval_fn(expected, actual):
def run_auto():
"""method="auto" — automatically finds the best combination (default; wired to arm_elimination — strong best-arm identification, cheaper than brute_force)."""
selector = ModelSelector(
agent=MyAgent, models=models, eval_fn=eval_fn, dataset=dataset, method="auto",
agent=MyAgent,
models=models,
eval_fn=eval_fn,
dataset=dataset,
method="auto",
objective_mode="weighted",
lambda_latency=0.2,
)
return selector.select_best(parallel=True)

@@ -112,6 +118,8 @@ def run_random():
dataset=dataset,
method="random",
sample_fraction=0.25, # evaluate 25% of all combinations
objective_mode="weighted",
lambda_latency=0.2,
)
return selector.select_best(parallel=True)

@@ -125,6 +133,8 @@ def run_hill_climbing():
dataset=dataset,
method="hill_climbing",
batch_size=4, # number of neighbors to evaluate per step
objective_mode="weighted",
lambda_latency=0.2,
)
return selector.select_best(parallel=True)

@@ -137,6 +147,8 @@ def run_arm_elimination():
eval_fn=eval_fn,
dataset=dataset,
method="arm_elimination",
objective_mode="weighted",
lambda_latency=0.2,
)
return selector.select_best(parallel=True)

@@ -150,6 +162,8 @@ def run_epsilon_lucb():
dataset=dataset,
method="epsilon_lucb",
epsilon=0.01, # acceptable gap from the true best
objective_mode="weighted",
lambda_latency=0.2,
)
return selector.select_best(parallel=True)

@@ -163,6 +177,8 @@ def run_threshold():
dataset=dataset,
method="threshold",
threshold=0.75, # minimum acceptable accuracy
objective_mode="weighted",
lambda_latency=0.2,
)
return selector.select_best(parallel=True)

@@ -175,6 +191,8 @@ def run_lm_proposal():
eval_fn=eval_fn,
dataset=dataset,
method="lm_proposal",
objective_mode="weighted",
lambda_latency=0.2,
)
return selector.select_best(parallel=True)

@@ -189,6 +207,8 @@ def run_bayesian():
method="bayesian",
batch_size=4,
sample_fraction=0.25, # evaluate 25% of all combinations
objective_mode="weighted",
lambda_latency=0.2,
)
return selector.select_best(parallel=True)

@@ -203,6 +223,7 @@ def run_matrix_ucb():
method="matrix_ucb",
a=1.0,
sample_fraction=0.1,
objective_mode="pareto",
)
return selector.select_best(max_concurrent=4)

@@ -221,6 +242,7 @@ def run_matrix_ucb_lrf():
eta=5.0,
warmup_fraction=0.05,
sample_fraction=0.1,
objective_mode="pareto",
)
# Unlike matrix_ucb (which always uses async eval), LRF still uses parallel=True
# for concurrent cell evaluation; sequential path is sync-only.
1 change: 1 addition & 0 deletions examples/selection/local/ag2.py
@@ -97,6 +97,7 @@ def eval_fn(expected, actual):
eval_fn=eval_fn,
dataset=dataset,
method="brute_force", # or "auto" for smarter selection algorithms
objective_mode="pareto",
)

results = selector.select_best(parallel=True)
1 change: 1 addition & 0 deletions examples/selection/local/crewai.py
@@ -111,6 +111,7 @@ def eval_fn(expected, actual):
eval_fn=eval_fn,
dataset=dataset,
method="brute_force", # or "auto" for smarter selection algorithms
objective_mode="pareto",
)

results = selector.select_best(parallel=True)
1 change: 1 addition & 0 deletions examples/selection/local/custom_agent.py
@@ -111,6 +111,7 @@ def eval_fn(expected, actual):
eval_fn=eval_fn,
dataset=dataset,
method="brute_force", # or "auto" for smarter selection algorithms
objective_mode="pareto",
)

results = selector.select_best(parallel=True)
1 change: 1 addition & 0 deletions examples/selection/local/langchain.py
@@ -92,6 +92,7 @@ def eval_fn(expected, actual):
eval_fn=eval_fn,
dataset=dataset,
method="brute_force", # or "auto" for smarter selection algorithms
objective_mode="pareto",
)

results = selector.select_best(parallel=True)
1 change: 1 addition & 0 deletions examples/selection/local/langgraph.py
@@ -113,6 +113,7 @@ def eval_fn(expected, actual):
eval_fn=eval_fn,
dataset=dataset,
method="brute_force", # or "auto" for smarter selection algorithms
objective_mode="pareto",
)

results = selector.select_best(parallel=True)
1 change: 1 addition & 0 deletions examples/selection/local/llamaindex.py
@@ -103,6 +103,7 @@ def eval_fn(expected, actual):
eval_fn=eval_fn,
dataset=dataset,
method="brute_force", # or "auto" for smarter selection algorithms
objective_mode="pareto",
)

results = selector.select_best(parallel=True)
1 change: 1 addition & 0 deletions examples/selection/local/openai_sdk.py
@@ -88,6 +88,7 @@ def eval_fn(expected, actual):
eval_fn=eval_fn,
dataset=dataset,
method="brute_force", # or "auto" for smarter selection algorithms
objective_mode="pareto",
)

results = selector.select_best(parallel=True)
1 change: 1 addition & 0 deletions examples/shared/openclaw_agent.py
@@ -20,6 +20,7 @@
eval_fn=my_eval_fn,
dataset=my_dataset,
method="brute_force",
objective_mode="pareto",
)
results = selector.select_best(parallel=False)

9 changes: 4 additions & 5 deletions src/agentopt/__init__.py
@@ -135,11 +135,10 @@ def ModelSelector(
``"epsilon_lucb"``, ``"matrix_ucb"``, ``"matrix_ucb_lrf"``,
``"threshold"``,
``"lm_proposal"``, ``"bayesian"``.
**kwargs: Additional arguments passed to the selector
(e.g. ``epsilon``, ``threshold``, ``sample_fraction``, ``warmup_fraction``
for matrix UCB-LRF; ``lambda_cost``, ``lambda_latency`` for the optional
combined objective ``score - lambda_cost*norm_cost -
lambda_latency*norm_latency`` — both default to ``0.0`` / accuracy-only).
**kwargs: Additional arguments passed to the selector. Required:
``objective_mode``, either ``"weighted"`` (set ``lambda_cost`` and/or
``lambda_latency`` > 0) or ``"pareto"`` (returns the empirical frontier;
matrix UCB explores with Chebyshev scalarization).
Other options: ``epsilon``, ``threshold``, ``sample_fraction``, etc.

Returns:
A selector instance. Call ``.select_best()`` to run.
2 changes: 2 additions & 0 deletions src/agentopt/model_selection/__init__.py
@@ -9,6 +9,7 @@
from .random_search import RandomSearchModelSelector
from .threshold_successive_elimination import ThresholdBanditSEModelSelector
from .matrix_ucb import MatrixUCBLRFModelSelector, MatrixUCBModelSelector
from .objectives import ObjectiveMode

# Bayesian is optional (requires torch/botorch)
try:
@@ -31,4 +32,5 @@
"DatapointResult",
"ModelResult",
"SelectionResults",
"ObjectiveMode",
]
29 changes: 19 additions & 10 deletions src/agentopt/model_selection/arm_elimination.py
@@ -29,6 +29,7 @@ def __init__(
confidence: float = 1.0,
model_prices: Optional[Dict[str, Dict[str, float]]] = None,
tracker=None,
objective_mode: Optional[str] = None,
lambda_cost: float = 0.0,
lambda_latency: float = 0.0,
) -> None:
@@ -39,6 +40,7 @@ def __init__(
dataset=dataset,
model_prices=model_prices,
tracker=tracker,
objective_mode=objective_mode,
lambda_cost=lambda_cost,
lambda_latency=lambda_latency,
)
@@ -152,7 +154,9 @@ def _select_sequential(self) -> SelectionResults:
all_results = self._build_results(
all_combos, combo_scores, combo_latencies, combo_costs, combo_dp_ids
)
return SelectionResults(results=all_results)
return SelectionResults(
results=all_results, objective_mode=self.objective_mode,
)

async def _select_async(self, max_concurrent: int = 20) -> SelectionResults:
all_combos = self._all_combos()
@@ -274,7 +278,9 @@ async def _eval_batch(
all_results = self._build_results(
all_combos, combo_scores, combo_latencies, combo_costs, combo_dp_ids
)
return SelectionResults(results=all_results)
return SelectionResults(
results=all_results, objective_mode=self.objective_mode,
)

# ------------------------------------------------------------------
# Statistical helpers
@@ -341,13 +347,16 @@ def _build_results(
)

self._finalize_combined_objectives(all_results)
best_info = self._find_best(all_results)
if best_info is not None:
best_name, _ = best_info
for result in all_results:
if result.model_name == best_name:
result.is_best = True
break
if self.objective_mode == "pareto":
self._mark_pareto_optimal(all_results)
else:
print("\n No combinations succeeded.")
best_info = self._find_best(all_results)
if best_info is not None:
best_name, _ = best_info
for result in all_results:
if result.model_name == best_name:
result.is_best = True
break
else:
print("\n No combinations succeeded.")
return all_results