diff --git a/docs/api/selectors.md b/docs/api/selectors.md index d990203..61125bd 100644 --- a/docs/api/selectors.md +++ b/docs/api/selectors.md @@ -11,6 +11,8 @@ selector = ModelSelector( eval_fn=lambda expected, actual: float(actual == expected), dataset=[(inp, expected), ...], method="auto", # arm_elimination — strong + cheap + objective_mode="weighted", + lambda_latency=0.2, ) results = selector.select_best(parallel=True, max_concurrent=20) results.print_summary() @@ -24,9 +26,10 @@ results.print_summary() | `models` | `Dict[str, List]` | Maps node names to candidate model lists (e.g. `{"planner": ["gpt-4o", "gpt-4o-mini"]}`). | | `eval_fn` | `Callable` | `(expected, actual) -> float` score (higher is better). | | `dataset` | `Sequence[Tuple]` | `[(input_data, expected_answer), ...]`. | +| `objective_mode` | `str`, **required** | `"weighted"` — one recommended combo via `lambda_cost` / `lambda_latency`. `"pareto"` — empirical frontier (error, latency, cost); matrix UCB uses Chebyshev exploration internally. | | `model_prices` | `Dict`, optional | Custom pricing overrides: `{"model": {"input_price": x, "output_price": y}}` in $/MTok. Required for cost terms when `lambda_cost > 0`. | -| `lambda_cost` | `float`, optional | Weight on **normalized** per-sample cost in the combined objective. Default `0.0` (disabled). See [Combined objective](#combined-objective-optional-costlatency-weights) below. | -| `lambda_latency` | `float`, optional | Weight on **normalized** per-sample latency in the combined objective. Default `0.0` (disabled). | +| `lambda_cost` | `float` | Weight on **normalized** per-sample cost (**weighted** mode only). | +| `lambda_latency` | `float` | Weight on **normalized** per-sample latency (**weighted** mode only). | | `node_descriptions` | `Dict[str, str]`, optional | Human-readable descriptions per node — surfaced in `LMProposalModelSelector`. | | `tracker` | `LLMTracker`, optional | Bring your own. 
Defaults to a fresh `LLMTracker()` started in the constructor. Pass one in to share a cache across runs, route via a daemon (`AGENTOPT_GATEWAY_URL`), or post-process records after `select_best()` returns. | @@ -41,67 +44,55 @@ print(tracker.get_usage()) # tracker.stop() already called; records sti See [tracker.md](tracker.md) for the full tracker surface. -## Combined objective (optional cost/latency weights) - -By default, selectors optimize **`eval_fn` score only** (typically accuracy) and break ties with latency, then price. To trade accuracy against cost and latency in one scalar reward, pass optional weights on the constructor (or via `ModelSelector(..., **kwargs)`): - -| Parameter | Default | Effect | -|:---|:---|:---| -| `lambda_cost` | `0.0` | Penalizes normalized per-sample **token cost** (USD from the tracker, or `model_prices`). | -| `lambda_latency` | `0.0` | Penalizes normalized per-sample **wall-clock latency** (seconds). | +## Objective mode (required) -Omit both parameters (or leave them at `0.0`) for the original accuracy-centric behavior. Set one or both when you want multi-metric selection. +You must set `objective_mode` on every selector. -### Formula +### `objective_mode="weighted"` -For each datapoint, after observations are recorded: +Pass at least one of `lambda_cost > 0` or `lambda_latency > 0`. The library returns a single **`is_best`** combo using a linear scalar (accuracy minus weighted normalized cost/latency): ``` -combined = score - - lambda_cost * norm(cost) - - lambda_latency * norm(latency) +combined = score - lambda_cost * norm(cost) - lambda_latency * norm(latency) ``` -- **`score`** — return value of `eval_fn` (higher is better). -- **`norm(·)`** — min–max scale to `[0, 1]` using running min/max over **all** samples seen during that selector run (updated as more combos are evaluated). -- **Per combination** — mean of per-datapoint combined values → `ModelResult.combined_objective` (see [results.md](results.md)). 
+```python +selector = ModelSelector( + ..., + objective_mode="weighted", + lambda_cost=0.3, + lambda_latency=0.2, + model_prices={...}, +) +results = selector.select_best() +best = results.get_best() +``` -This is a **linear scalarization**, not Pareto exploration. Larger `lambda_*` penalize cost/latency more strongly relative to score. +### `objective_mode="pareto"` -### Example +Do **not** pass `lambda_cost` or `lambda_latency`. The library minimizes **error** (`1 - score`), **latency**, and **cost** (when priced), marks nondominated combos, and exposes `results.get_pareto_front()` and `results.plot_pareto()` (error on the y-axis; ideal corner at 0). ```python selector = ModelSelector( - agent=MyAgent, - models=models, - eval_fn=eval_fn, - dataset=dataset, + ..., method="matrix_ucb", - lambda_cost=0.3, # optional — omit for accuracy-only - lambda_latency=0.2, - model_prices={ # recommended when lambda_cost > 0 - "gpt-4o": {"input_price": 2.5, "output_price": 10.0}, - "gpt-4o-mini": {"input_price": 0.15, "output_price": 0.6}, - }, + objective_mode="pareto", ) -results = selector.select_best(parallel=True) -results.print_summary() # ranks by combined_objective when lambdas are set +results = selector.select_best() +results.get_pareto_front() +results.plot_pareto() ``` -### How each method uses the weights +For `matrix_ucb` / `matrix_ucb_lrf`, exploration uses **Chebyshev scalarization** over normalized gaps (ideal = 0 error, 0s, $0); tradeoff directions rotate automatically — no extra knobs. 
-| Methods | During search | Final `is_best` | +| Methods | Weighted search | Pareto search | |:---|:---|:---| -| `matrix_ucb`, `matrix_ucb_lrf` | UCB rewards use per-cell combined objective | `_find_best` on `combined_objective` | -| `arm_elimination`, `epsilon_lucb`, `threshold` | Elimination / LUCB stats on combined per-sample objectives | same | -| `hill_climbing`, `bayesian` | Move / surrogate target uses combined objective | same | -| `brute_force`, `random` | Does not steer *which* combos to try | same | -| `lm_proposal` | Proposer uses `objective=` **text**, not these lambdas | `combined_objective` on the one evaluated combo only | - -After `select_best()`, a final pass recomputes every result’s `combined_objective` against the **full-run** normalizer so rankings are comparable. +| `matrix_ucb`, `matrix_ucb_lrf` | Per-cell linear combined objective | Chebyshev cell reward | +| Other bandits | Combined per-sample stats where applicable | Full eval → frontier marking | +| `brute_force`, `random` | Final rank only | Final frontier only | !!! note "`lm_proposal` vs lambdas" - `LMProposalModelSelector(objective="...")` is a natural-language hint to the **proposer LLM**. It is separate from `lambda_cost` / `lambda_latency`, which only affect the scalar reward used for ranking and bandit methods. + `LMProposalModelSelector(objective="...")` is a natural-language hint to the **proposer LLM**. It is separate from `objective_mode` and `lambda_*`. 
## `select_best()` diff --git a/examples/selection/daemon/basic.py b/examples/selection/daemon/basic.py index c3f7085..447d09d 100644 --- a/examples/selection/daemon/basic.py +++ b/examples/selection/daemon/basic.py @@ -91,6 +91,7 @@ def eval_fn(expected: str, actual: str) -> float: eval_fn=eval_fn, dataset=dataset, method="brute_force", + objective_mode="pareto", ) results = selector.select_best(parallel=False) results.print_summary() diff --git a/examples/selection/local/advanced_algorithms.py b/examples/selection/local/advanced_algorithms.py index 0f9540b..e3693c8 100644 --- a/examples/selection/local/advanced_algorithms.py +++ b/examples/selection/local/advanced_algorithms.py @@ -98,7 +98,13 @@ def eval_fn(expected, actual): def run_auto(): """method="auto" — automatically finds the best combination (default; wired to arm_elimination — strong best-arm identification, cheaper than brute_force).""" selector = ModelSelector( - agent=MyAgent, models=models, eval_fn=eval_fn, dataset=dataset, method="auto", + agent=MyAgent, + models=models, + eval_fn=eval_fn, + dataset=dataset, + method="auto", + objective_mode="weighted", + lambda_latency=0.2, ) return selector.select_best(parallel=True) @@ -112,6 +118,8 @@ def run_random(): dataset=dataset, method="random", sample_fraction=0.25, # evaluate 25% of all combinations + objective_mode="weighted", + lambda_latency=0.2, ) return selector.select_best(parallel=True) @@ -125,6 +133,8 @@ def run_hill_climbing(): dataset=dataset, method="hill_climbing", batch_size=4, # number of neighbors to evaluate per step + objective_mode="weighted", + lambda_latency=0.2, ) return selector.select_best(parallel=True) @@ -137,6 +147,8 @@ def run_arm_elimination(): eval_fn=eval_fn, dataset=dataset, method="arm_elimination", + objective_mode="weighted", + lambda_latency=0.2, ) return selector.select_best(parallel=True) @@ -150,6 +162,8 @@ def run_epsilon_lucb(): dataset=dataset, method="epsilon_lucb", epsilon=0.01, # acceptable gap from the 
true best + objective_mode="weighted", + lambda_latency=0.2, ) return selector.select_best(parallel=True) @@ -163,6 +177,8 @@ def run_threshold(): dataset=dataset, method="threshold", threshold=0.75, # minimum acceptable accuracy + objective_mode="weighted", + lambda_latency=0.2, ) return selector.select_best(parallel=True) @@ -175,6 +191,8 @@ def run_lm_proposal(): eval_fn=eval_fn, dataset=dataset, method="lm_proposal", + objective_mode="weighted", + lambda_latency=0.2, ) return selector.select_best(parallel=True) @@ -189,6 +207,8 @@ def run_bayesian(): method="bayesian", batch_size=4, sample_fraction=0.25, # evaluate 25% of all combinations + objective_mode="weighted", + lambda_latency=0.2, ) return selector.select_best(parallel=True) @@ -203,6 +223,7 @@ def run_matrix_ucb(): method="matrix_ucb", a=1.0, sample_fraction=0.1, + objective_mode="pareto", ) return selector.select_best(max_concurrent=4) @@ -221,6 +242,7 @@ def run_matrix_ucb_lrf(): eta=5.0, warmup_fraction=0.05, sample_fraction=0.1, + objective_mode="pareto", ) # Unlike matrix_ucb (which always uses async eval), LRF still uses parallel=True # for concurrent cell evaluation; sequential path is sync-only. 
diff --git a/examples/selection/local/ag2.py b/examples/selection/local/ag2.py index d70f298..4d6a1f1 100644 --- a/examples/selection/local/ag2.py +++ b/examples/selection/local/ag2.py @@ -97,6 +97,7 @@ def eval_fn(expected, actual): eval_fn=eval_fn, dataset=dataset, method="brute_force", # or "auto" for smarter selection algorithms + objective_mode="pareto", ) results = selector.select_best(parallel=True) diff --git a/examples/selection/local/crewai.py b/examples/selection/local/crewai.py index e728894..3369806 100644 --- a/examples/selection/local/crewai.py +++ b/examples/selection/local/crewai.py @@ -111,6 +111,7 @@ def eval_fn(expected, actual): eval_fn=eval_fn, dataset=dataset, method="brute_force", # or "auto" for smarter selection algorithms + objective_mode="pareto", ) results = selector.select_best(parallel=True) diff --git a/examples/selection/local/custom_agent.py b/examples/selection/local/custom_agent.py index e33af97..7dcf425 100644 --- a/examples/selection/local/custom_agent.py +++ b/examples/selection/local/custom_agent.py @@ -111,6 +111,7 @@ def eval_fn(expected, actual): eval_fn=eval_fn, dataset=dataset, method="brute_force", # or "auto" for smarter selection algorithms + objective_mode="pareto", ) results = selector.select_best(parallel=True) diff --git a/examples/selection/local/langchain.py b/examples/selection/local/langchain.py index 27b48c1..e39cc5a 100644 --- a/examples/selection/local/langchain.py +++ b/examples/selection/local/langchain.py @@ -92,6 +92,7 @@ def eval_fn(expected, actual): eval_fn=eval_fn, dataset=dataset, method="brute_force", # or "auto" for smarter selection algorithms + objective_mode="pareto", ) results = selector.select_best(parallel=True) diff --git a/examples/selection/local/langgraph.py b/examples/selection/local/langgraph.py index d9ace3f..134b6b1 100644 --- a/examples/selection/local/langgraph.py +++ b/examples/selection/local/langgraph.py @@ -113,6 +113,7 @@ def eval_fn(expected, actual): eval_fn=eval_fn, 
dataset=dataset, method="brute_force", # or "auto" for smarter selection algorithms + objective_mode="pareto", ) results = selector.select_best(parallel=True) diff --git a/examples/selection/local/llamaindex.py b/examples/selection/local/llamaindex.py index 26d41bf..e8299d2 100644 --- a/examples/selection/local/llamaindex.py +++ b/examples/selection/local/llamaindex.py @@ -103,6 +103,7 @@ def eval_fn(expected, actual): eval_fn=eval_fn, dataset=dataset, method="brute_force", # or "auto" for smarter selection algorithms + objective_mode="pareto", ) results = selector.select_best(parallel=True) diff --git a/examples/selection/local/openai_sdk.py b/examples/selection/local/openai_sdk.py index 6f67a50..38d33b5 100644 --- a/examples/selection/local/openai_sdk.py +++ b/examples/selection/local/openai_sdk.py @@ -88,6 +88,7 @@ def eval_fn(expected, actual): eval_fn=eval_fn, dataset=dataset, method="brute_force", # or "auto" for smarter selection algorithms + objective_mode="pareto", ) results = selector.select_best(parallel=True) diff --git a/examples/shared/openclaw_agent.py b/examples/shared/openclaw_agent.py index a2eb9f3..01d05b6 100644 --- a/examples/shared/openclaw_agent.py +++ b/examples/shared/openclaw_agent.py @@ -20,6 +20,7 @@ eval_fn=my_eval_fn, dataset=my_dataset, method="brute_force", + objective_mode="pareto", ) results = selector.select_best(parallel=False) diff --git a/src/agentopt/__init__.py b/src/agentopt/__init__.py index 0790714..3058a8d 100644 --- a/src/agentopt/__init__.py +++ b/src/agentopt/__init__.py @@ -135,11 +135,10 @@ def ModelSelector( ``"epsilon_lucb"``, ``"matrix_ucb"``, ``"matrix_ucb_lrf"``, ``"threshold"``, ``"lm_proposal"``, ``"bayesian"``. - **kwargs: Additional arguments passed to the selector - (e.g. 
``epsilon``, ``threshold``, ``sample_fraction``, ``warmup_fraction`` - for matrix UCB-LRF; ``lambda_cost``, ``lambda_latency`` for the optional - combined objective ``score - lambda_cost*norm_cost - - lambda_latency*norm_latency`` — both default to ``0.0`` / accuracy-only). + **kwargs: Additional arguments passed to the selector. Required: + ``objective_mode`` — ``"weighted"`` (pass ``lambda_cost`` and/or + ``lambda_latency`` > 0) or ``"pareto"`` (frontier; Chebyshev matrix UCB). + Other options: ``epsilon``, ``threshold``, ``sample_fraction``, etc. Returns: A selector instance. Call ``.select_best()`` to run. diff --git a/src/agentopt/model_selection/__init__.py b/src/agentopt/model_selection/__init__.py index 56feae6..589a10c 100644 --- a/src/agentopt/model_selection/__init__.py +++ b/src/agentopt/model_selection/__init__.py @@ -9,6 +9,7 @@ from .random_search import RandomSearchModelSelector from .threshold_successive_elimination import ThresholdBanditSEModelSelector from .matrix_ucb import MatrixUCBLRFModelSelector, MatrixUCBModelSelector +from .objectives import ObjectiveMode # Bayesian is optional (requires torch/botorch) try: @@ -31,4 +32,5 @@ "DatapointResult", "ModelResult", "SelectionResults", + "ObjectiveMode", ] diff --git a/src/agentopt/model_selection/arm_elimination.py b/src/agentopt/model_selection/arm_elimination.py index 35df3e2..9bb1452 100644 --- a/src/agentopt/model_selection/arm_elimination.py +++ b/src/agentopt/model_selection/arm_elimination.py @@ -29,6 +29,7 @@ def __init__( confidence: float = 1.0, model_prices: Optional[Dict[str, Dict[str, float]]] = None, tracker=None, + objective_mode: Optional[str] = None, lambda_cost: float = 0.0, lambda_latency: float = 0.0, ) -> None: @@ -39,6 +40,7 @@ def __init__( dataset=dataset, model_prices=model_prices, tracker=tracker, + objective_mode=objective_mode, lambda_cost=lambda_cost, lambda_latency=lambda_latency, ) @@ -152,7 +154,9 @@ def _select_sequential(self) -> SelectionResults: all_results = 
self._build_results( all_combos, combo_scores, combo_latencies, combo_costs, combo_dp_ids ) - return SelectionResults(results=all_results) + return SelectionResults( + results=all_results, objective_mode=self.objective_mode, + ) async def _select_async(self, max_concurrent: int = 20) -> SelectionResults: all_combos = self._all_combos() @@ -274,7 +278,9 @@ async def _eval_batch( all_results = self._build_results( all_combos, combo_scores, combo_latencies, combo_costs, combo_dp_ids ) - return SelectionResults(results=all_results) + return SelectionResults( + results=all_results, objective_mode=self.objective_mode, + ) # ------------------------------------------------------------------ # Statistical helpers @@ -341,13 +347,16 @@ def _build_results( ) self._finalize_combined_objectives(all_results) - best_info = self._find_best(all_results) - if best_info is not None: - best_name, _ = best_info - for result in all_results: - if result.model_name == best_name: - result.is_best = True - break + if self.objective_mode == "pareto": + self._mark_pareto_optimal(all_results) else: - print("\n No combinations succeeded.") + best_info = self._find_best(all_results) + if best_info is not None: + best_name, _ = best_info + for result in all_results: + if result.model_name == best_name: + result.is_best = True + break + else: + print("\n No combinations succeeded.") return all_results diff --git a/src/agentopt/model_selection/base.py b/src/agentopt/model_selection/base.py index 177487f..374038b 100644 --- a/src/agentopt/model_selection/base.py +++ b/src/agentopt/model_selection/base.py @@ -23,6 +23,14 @@ validate_dataset, ) from ..model_price import compute_price +from .objectives import ( + ObjectiveMode, + chebyshev_scalar, + chebyshev_weight_at, + pareto_mask_3d, + score_to_error, + validate_objective_config, +) logger = logging.getLogger(__name__) @@ -47,6 +55,10 @@ class ModelResult(BaseModel): output_tokens: Dict[str, int] = Field(default_factory=dict) attribute: str 
is_best: bool = False + is_pareto_optimal: bool = Field( + default=False, + description="True when this combo lies on the empirical Pareto frontier.", + ) datapoint_results: List[DatapointResult] = Field(default_factory=list) combined_objective: Optional[float] = Field( default=None, @@ -58,6 +70,11 @@ class ModelResult(BaseModel): ) _custom_prices: Optional[Dict[str, Tuple[float, float]]] = PrivateAttr(default=None) + @property + def error(self) -> float: + """Per-combo error rate ``1 - accuracy`` (for Pareto plots; scores in [0, 1]).""" + return score_to_error(self.accuracy) + @property def num_samples(self) -> int: """Number of datapoints evaluated; falls back to 1 for failed combos.""" @@ -102,6 +119,10 @@ class SelectionResults(BaseModel): """Results from model selection.""" results: List[ModelResult] = Field(default_factory=list) + objective_mode: Optional[ObjectiveMode] = Field( + default=None, + description="``weighted`` (single best) or ``pareto`` (frontier).", + ) selection_wall_time_seconds: Optional[float] = None selection_cost: Optional[float] = Field( default=None, description="Total selection cost in USD.", @@ -142,6 +163,37 @@ def get_by_attribute(self, attribute: str) -> List[ModelResult]: """Get all results for a specific attribute.""" return [r for r in self.results if r.attribute == attribute] + def get_pareto_front(self) -> List[ModelResult]: + """Return Pareto-optimal combinations (error, latency, cost all minimized). + + Uses the same deduplication and final-layer filtering as :meth:`plot_pareto`. 
+ """ + unique = self._comparable_results() + if not unique: + return [] + return [r for r in unique if r.is_pareto_optimal] + + def _comparable_results(self) -> List[ModelResult]: + """Deduplicated results at the fullest evaluation depth.""" + seen: Dict[str, ModelResult] = {} + for r in self.results: + if r.model_name not in seen or ( + r.is_best and not seen[r.model_name].is_best + ): + seen[r.model_name] = r + unique = list(seen.values()) + if not unique: + return [] + if any(r.datapoint_results for r in unique): + max_samples = max( + r.num_samples for r in unique if r.datapoint_results + ) + unique = [ + r for r in unique + if r.datapoint_results and r.num_samples == max_samples + ] + return unique + def export_config( self, output_path: str, api_key_env_vars: Optional[Dict[str, str]] = None, ) -> None: @@ -487,8 +539,22 @@ def row( lines.append(sep) + pareto_results = self.get_pareto_front() if self.objective_mode == "pareto" else [] best_result = next((r for r in unique if r.is_best), None) - if best_result: + if self.objective_mode == "pareto" and pareto_results: + lines.append("") + lines.append( + f"{pad} Pareto-optimal: {len(pareto_results)} combination(s) " + f"(ideal: 0% error, 0s latency, $0)" + ) + for r in pareto_results[:8]: + lines.append( + f"{pad} {r.model_name} — error {r.error:.2%}, " + f"{r.latency_seconds:.2f}s, {fmt_price(r)}" + ) + if len(pareto_results) > 8: + lines.append(f"{pad} ... and {len(pareto_results) - 8} more") + elif best_result: lines.append( f"{pad} Best: {best_result.model_name} " f"(accuracy: {best_result.accuracy:.2%}, " @@ -544,14 +610,15 @@ def _pareto_mask( break return mask - def plot_pareto(self, path: Optional[str] = None) -> None: - """Generate two pairwise Pareto frontier plots. + def plot_pareto( + self, path: Optional[str] = None, *, show_ideal: bool = True, + ) -> None: + """Generate pairwise Pareto frontier plots (error, latency, price). - Subplots: Accuracy vs Latency, Accuracy vs Price. 
+ Both axes are **lower is better**. The ideal wish corner is + (0 error, 0s latency, $0) when ``show_ideal`` is True. Requires ``matplotlib`` (install with ``pip install agentopt-py[plot]``). - If *path* is given the figure is saved to that file, otherwise - ``plt.show()`` is called. """ try: import matplotlib.pyplot as plt @@ -561,114 +628,97 @@ def plot_pareto(self, path: Optional[str] = None) -> None: "Install it with: pip install agentopt-py[plot]" ) - # Deduplicate (same logic as __str__). - seen: Dict[str, "ModelResult"] = {} - for r in self.results: - if r.model_name not in seen or ( - r.is_best and not seen[r.model_name].is_best - ): + unique = self._comparable_results() + if not unique: + seen: Dict[str, ModelResult] = {} + for r in self.results: seen[r.model_name] = r - all_unique = [r for r in seen.values() if r.price is not None] + unique = list(seen.values()) - # For bandit algorithms, only plot the final layer (combos with the - # most datapoints) so all plotted combos are directly comparable. - if all_unique: - max_samples = max(r.num_samples for r in all_unique) - unique = [r for r in all_unique if r.num_samples == max_samples] - else: - unique = all_unique + # Prefer combos with price for the cost panel; keep all for error/latency. + with_price = [r for r in unique if r.price is not None] + plot_set = with_price if len(with_price) >= 2 else unique - # Sort so numbering matches the final results table rank order. 
- unique.sort(key=lambda r: (-r.accuracy, r.latency_seconds)) + plot_set = list(plot_set) + plot_set.sort(key=lambda r: (r.error, r.latency_seconds)) - if len(unique) < 2: - print("Not enough results with pricing data to plot.") + if len(plot_set) < 2: + print("Not enough comparable results to plot (need at least 2).") return - names = [r.model_name for r in unique] - accs = [r.accuracy for r in unique] - lats = [r.latency_seconds for r in unique] - prices = [r.price for r in unique] # type: ignore[misc] - is_best = [r.is_best for r in unique] + names = [r.model_name for r in plot_set] + errors = [r.error for r in plot_set] + lats = [r.latency_seconds for r in plot_set] + prices = [r.price for r in plot_set] + has_price = [p is not None for p in prices] + is_pareto = [r.is_pareto_optimal for r in plot_set] + is_best = [r.is_best for r in plot_set] - # Build numbered labels: (1), (2), ... - num_labels = [f"({i})" for i in range(1, len(unique) + 1)] + num_labels = [f"({i})" for i in range(1, len(plot_set) + 1)] - pairs = [ - (lats, accs, "Latency (s)", "Accuracy", True, False), - (prices, accs, "Price ($)", "Accuracy", True, False), + pairs: List[Tuple[List[float], List[float], str, str]] = [ + (lats, errors, "Latency (s)", "Error"), ] + if any(has_price): + px = [p for p, ok in zip(prices, has_price) if ok] # type: ignore[misc] + ey = [e for e, ok in zip(errors, has_price) if ok] + if len(px) >= 2: + pairs.append((px, ey, "Price ($)", "Error")) + + fig = plt.figure(figsize=(7 * len(pairs), 5)) + gs = fig.add_gridspec(1, len(pairs), left=0.06, right=0.68, wspace=0.3) + axes = [fig.add_subplot(gs[0, i]) for i in range(len(pairs))] + title = "Pareto Frontiers (lower is better)" + if self.objective_mode == "weighted": + title += " — weighted run" + fig.suptitle(title, fontsize=14, fontweight="bold") + + for ax, (xs, ys, xlabel, ylabel) in zip(axes, pairs): + mask = self._pareto_mask(xs, ys, True, True) - fig = plt.figure(figsize=(14, 5)) - # Reserve right margin for the 
legend. - gs = fig.add_gridspec(1, 2, left=0.06, right=0.68, wspace=0.3) - axes = [fig.add_subplot(gs[0, i]) for i in range(2)] - fig.suptitle("Pareto Frontiers", fontsize=14, fontweight="bold") - - for ax, (xs, ys, xlabel, ylabel, x_min, y_min) in zip(axes, pairs): - mask = self._pareto_mask(xs, ys, x_min, y_min) - - # Non-Pareto points. np_x = [x for x, m in zip(xs, mask) if not m] np_y = [y for y, m in zip(ys, mask) if not m] ax.scatter( - np_x, - np_y, - c="lightgray", - edgecolors="gray", - s=60, - zorder=2, - label="Dominated", + np_x, np_y, c="lightgray", edgecolors="gray", s=60, + zorder=2, label="Dominated", ) - # Pareto-optimal points. p_x = [x for x, m in zip(xs, mask) if m] p_y = [y for y, m in zip(ys, mask) if m] ax.scatter( - p_x, - p_y, - c="steelblue", - edgecolors="navy", - s=80, - zorder=3, - label="Pareto-optimal", + p_x, p_y, c="steelblue", edgecolors="navy", s=80, + zorder=3, label="Pareto-optimal", ) - # Connect frontier with a line (sorted by x). if p_x: order = sorted(range(len(p_x)), key=lambda i: p_x[i]) ax.plot( - [p_x[i] for i in order], - [p_y[i] for i in order], - c="steelblue", - linewidth=1.5, - alpha=0.6, - zorder=2, + [p_x[i] for i in order], [p_y[i] for i in order], + c="steelblue", linewidth=1.5, alpha=0.6, zorder=2, + ) + + if show_ideal: + ax.scatter( + [0.0], [0.0], c="none", edgecolors="green", s=120, + linewidths=2, marker="o", zorder=1, label="Ideal (0, 0)", ) - # Highlight best combo. - for x, y, b in zip(xs, ys, is_best): + for x, y, b, po in zip(xs, ys, is_best, is_pareto): if b: ax.scatter( - [x], - [y], - c="gold", - edgecolors="darkorange", - s=140, - zorder=4, - marker="*", - label="Best", + [x], [y], c="gold", edgecolors="darkorange", s=140, + zorder=5, marker="*", label="Best (weighted)", + ) + elif po and self.objective_mode == "pareto": + ax.scatter( + [x], [y], facecolors="none", edgecolors="darkorange", + s=120, zorder=4, linewidths=1.5, label="Frontier", ) - # Number labels on points. 
for x, y, lbl in zip(xs, ys, num_labels): ax.annotate( - lbl, - (x, y), - textcoords="offset points", - xytext=(5, 5), - fontsize=7, - fontweight="bold", + lbl, (x, y), textcoords="offset points", xytext=(5, 5), + fontsize=7, fontweight="bold", ) ax.set_xlabel(xlabel) @@ -718,6 +768,7 @@ def __init__( model_prices: Optional[Dict[str, Dict[str, float]]] = None, node_descriptions: Optional[Dict[str, str]] = None, tracker: Optional[LLMTracker] = None, + objective_mode: Optional[ObjectiveMode] = None, lambda_cost: float = 0.0, lambda_latency: float = 0.0, ) -> None: @@ -744,20 +795,23 @@ def __init__( ``{"planner": "Decomposes queries into sub-tasks"}``. tracker: Optional :class:`LLMTracker` instance. If not provided, one is created and started automatically. - lambda_cost: Weight on normalized per-sample cost in the combined - objective ``score - lambda_cost*norm_cost - lambda_latency*norm_latency``. - Cost is normalized adaptively against the running min/max of all - samples observed during this selector's lifetime. Default 0.0 - (pure accuracy, backwards compatible). - lambda_latency: Weight on normalized per-sample latency in the - combined objective. Default 0.0. + objective_mode: Required. ``"weighted"`` — pass ``lambda_cost`` + and/or ``lambda_latency`` > 0 for a single recommended combo. + ``"pareto"`` — omit lambdas; explore the error/latency/cost + frontier (Chebyshev-driven matrix UCB; full eval otherwise). + lambda_cost: Weight on normalized per-sample cost (weighted mode only). + lambda_latency: Weight on normalized per-sample latency (weighted mode only). 
""" if agent is None: raise TypeError("'agent' is required") if models is None or eval_fn is None or dataset is None: raise TypeError("'models', 'eval_fn', and 'dataset' are required") - if float(lambda_cost) < 0 or float(lambda_latency) < 0: - raise ValueError("lambda_cost and lambda_latency must be non-negative") + + self.objective_mode = validate_objective_config( + objective_mode, lambda_cost, lambda_latency, + ) + self.lambda_cost = float(lambda_cost) + self.lambda_latency = float(lambda_latency) validate_dataset(dataset) @@ -777,14 +831,15 @@ def __init__( self._node_names = list(models.keys()) self.model_prices = model_prices self.node_descriptions = node_descriptions - self.lambda_cost = float(lambda_cost) - self.lambda_latency = float(lambda_latency) - # Running min/max for adaptive [0,1] normalization of cost/latency. + # Running min/max for adaptive [0,1] normalization of cost/latency/gaps. self._cost_min: float = float("inf") self._cost_max: float = float("-inf") self._latency_min: float = float("inf") self._latency_max: float = float("-inf") + self._error_min: float = float("inf") + self._error_max: float = float("-inf") + self._chebyshev_step: int = 0 # Detect whether agent.run() is async run_method = getattr(agent, "run", None) @@ -997,8 +1052,17 @@ def _fetch_tokens_by_datapoint( @property def _has_combined_objective(self) -> bool: - """True when at least one of the cost/latency lambdas is nonzero.""" - return self.lambda_cost > 0.0 or self.lambda_latency > 0.0 + """True in weighted mode with cost/latency lambdas configured.""" + return self.objective_mode == "weighted" + + @property + def _pareto_exploration(self) -> bool: + return self.objective_mode == "pareto" + + @property + def _uses_matrix_scalar_refresh(self) -> bool: + """Matrix UCB cells need recomputation when the normalizer moves.""" + return self._has_combined_objective or self._pareto_exploration def _per_sample_costs(self, dp_ids: List[str]) -> List[Optional[float]]: """Look up 
per-datapoint cost from tracked tokens; ``None`` if unpriced.""" @@ -1013,11 +1077,48 @@ def _per_sample_costs(self, dp_ids: List[str]) -> List[Optional[float]]: ) return costs + def _absorb_pareto_gaps( + self, scores: List[float], latencies: List[float], costs: List[Optional[float]], + ) -> None: + """Update running min/max for error, latency, and cost gaps (Pareto mode).""" + if not self._pareto_exploration: + return + for sc in scores: + err = score_to_error(sc) + if err < self._error_min: + self._error_min = err + if err > self._error_max: + self._error_max = err + self._absorb_observations(latencies, costs) + + def _normalized_pareto_gaps( + self, score: float, latency: float, cost: Optional[float], + ) -> Tuple[float, float, float]: + """Min-max normalized (error, latency, cost) gaps in [0, 1] (ideal = 0).""" + err = score_to_error(score) + ne = self._minmax_norm(err, self._error_min, self._error_max) + nl = self._minmax_norm(latency, self._latency_min, self._latency_max) + nc = self._minmax_norm(cost, self._cost_min, self._cost_max) + return ne, nl, nc + + def _chebyshev_cell_scalar( + self, score: float, latency: float, cost: Optional[float], + weights: Optional[Tuple[float, float, float]] = None, + ) -> float: + """Chebyshev achievement scalar for one cell (lower is better).""" + w = weights if weights is not None else chebyshev_weight_at(self._chebyshev_step) + ne, nl, nc = self._normalized_pareto_gaps(score, latency, cost) + use_cost = cost is not None and math.isfinite(cost) + return chebyshev_scalar(ne, nl, nc, w, use_cost=use_cost) + + def _advance_chebyshev_direction(self) -> None: + self._chebyshev_step += 1 + def _absorb_observations( self, latencies: List[float], costs: List[Optional[float]], ) -> None: """Update the running min/max with new samples.""" - if not self._has_combined_objective: + if not self._has_combined_objective and not self._pareto_exploration: return for lat in latencies: if lat is None or not math.isfinite(lat): @@ -1097,7 
+1198,10 @@ def _observe_combo( recomputation when the normalizer changes. """ costs = self._per_sample_costs(dp_ids) - self._absorb_observations(latencies, costs) + if self._pareto_exploration: + self._absorb_pareto_gaps(scores, latencies, costs) + else: + self._absorb_observations(latencies, costs) return costs def _recover_costs(self, result: ModelResult) -> List[Optional[float]]: @@ -1142,6 +1246,50 @@ def _finalize_combined_objectives(self, results: List[ModelResult]) -> None: for r, scores, lats, costs in cached: r.combined_objective = self._mean_objective(scores, lats, costs) + def _mark_pareto_optimal(self, results: List[ModelResult]) -> None: + """Set ``is_pareto_optimal`` on results at the fullest evaluation depth.""" + for r in results: + r.is_pareto_optimal = False + candidates = [ + r for r in results + if r.datapoint_results and r.accuracy >= 0.0 + ] + if not candidates: + return + max_samples = max(r.num_samples for r in candidates) + layer = [r for r in candidates if r.num_samples == max_samples] + errors = [r.error for r in layer] + lats = [r.latency_seconds for r in layer] + costs: List[Optional[float]] = [r.price for r in layer] + mask = pareto_mask_3d(errors, lats, costs) + names = {layer[i].model_name for i, m in enumerate(mask) if m} + for r in results: + if r.model_name in names and r.num_samples == max_samples: + r.is_pareto_optimal = True + + def _finalize_selection_outcomes( + self, results: List[ModelResult], + ) -> SelectionResults: + """Apply weighted best pick or Pareto marking; wrap :class:`SelectionResults`.""" + self._finalize_combined_objectives(results) + if self.objective_mode == "pareto": + self._mark_pareto_optimal(results) + return SelectionResults( + results=results, objective_mode="pareto", + ) + best_info = self._find_best(results) + if best_info is not None: + best_name, _ = best_info + for r in results: + if r.model_name == best_name: + r.is_best = True + break + else: + print("\n No combinations succeeded.") + return 
SelectionResults( + results=results, objective_mode="weighted", + ) + # ------------------------------------------------------------------ # Result helpers # ------------------------------------------------------------------ @@ -1180,9 +1328,15 @@ def _build_combo_result( ) if costs is None: costs = self._observe_combo(scores, latencies, dp_ids) + elif self._pareto_exploration: + self._absorb_pareto_gaps(scores, latencies, costs) else: self._absorb_observations(latencies, costs) - combined = self._mean_objective(scores, latencies, costs) + combined = ( + self._mean_objective(scores, latencies, costs) + if self._has_combined_objective + else None + ) return self._make_result( model_name=combo_name, accuracy=accuracy, @@ -1359,6 +1513,8 @@ def select_best( result.selection_cost = compute_price( input_tokens, output_tokens, custom_prices=self._custom_prices, ) + if result.objective_mode is None: + result.objective_mode = self.objective_mode return result diff --git a/src/agentopt/model_selection/bayesian_optimization.py b/src/agentopt/model_selection/bayesian_optimization.py index 2fb9c26..b22b662 100644 --- a/src/agentopt/model_selection/bayesian_optimization.py +++ b/src/agentopt/model_selection/bayesian_optimization.py @@ -42,6 +42,7 @@ def __init__( sample_fraction: float = 0.25, model_prices: Optional[Dict[str, Dict[str, float]]] = None, tracker=None, + objective_mode: Optional[str] = None, lambda_cost: float = 0.0, lambda_latency: float = 0.0, ) -> None: @@ -52,6 +53,7 @@ def __init__( dataset=dataset, model_prices=model_prices, tracker=tracker, + objective_mode=objective_mode, lambda_cost=lambda_cost, lambda_latency=lambda_latency, ) @@ -193,17 +195,9 @@ def _bo_fit_and_acquire( return [unseen[i] for i in topk] def _bo_finalize(self, all_results: List[ModelResult]) -> SelectionResults: - self._finalize_combined_objectives(all_results) - best_info = self._find_best(all_results) - if best_info is not None: - best_name, _ = best_info - for result in all_results: - 
if result.model_name == best_name: - result.is_best = True - break - else: + if not any(r.datapoint_results for r in all_results): logger.warning("No successful evaluations.") - return SelectionResults(results=all_results) + return self._finalize_selection_outcomes(all_results) def _bo_target_from_result(self, result: ModelResult) -> float: """BO target: combined objective if lambdas set, else accuracy.""" diff --git a/src/agentopt/model_selection/brute_force.py b/src/agentopt/model_selection/brute_force.py index 35a8005..bf92220 100644 --- a/src/agentopt/model_selection/brute_force.py +++ b/src/agentopt/model_selection/brute_force.py @@ -25,6 +25,7 @@ def __init__( dataset: Dataset = None, model_prices: Optional[Dict[str, Dict[str, float]]] = None, tracker=None, + objective_mode: Optional[str] = None, lambda_cost: float = 0.0, lambda_latency: float = 0.0, ) -> None: @@ -35,6 +36,7 @@ def __init__( dataset=dataset, model_prices=model_prices, tracker=tracker, + objective_mode=objective_mode, lambda_cost=lambda_cost, lambda_latency=lambda_latency, ) @@ -83,19 +85,7 @@ def _select_sequential(self) -> SelectionResults: ) ) - self._finalize_combined_objectives(all_results) - best_info = self._find_best(all_results) - if best_info is not None: - best_name, _ = best_info - for result in all_results: - if result.model_name == best_name: - result.is_best = True - break - else: - print("\n No combinations succeeded") - - results = SelectionResults(results=all_results) - return results + return self._finalize_selection_outcomes(all_results) async def _select_async(self, max_concurrent: int = 20) -> SelectionResults: all_combos = self._all_combos() @@ -151,16 +141,4 @@ async def _eval_combo( _, result = res all_results.append(result) - self._finalize_combined_objectives(all_results) - best_info = self._find_best(all_results) - if best_info is not None: - best_name, _ = best_info - for r in all_results: - if r.model_name == best_name: - r.is_best = True - break - else: - 
print("\n No combinations succeeded") - - results = SelectionResults(results=all_results) - return results + return self._finalize_selection_outcomes(all_results) diff --git a/src/agentopt/model_selection/epsilon_lucb.py b/src/agentopt/model_selection/epsilon_lucb.py index 52cd05c..ae3bb44 100644 --- a/src/agentopt/model_selection/epsilon_lucb.py +++ b/src/agentopt/model_selection/epsilon_lucb.py @@ -29,6 +29,7 @@ def __init__( confidence: float = 1.0, model_prices: Optional[Dict[str, Dict[str, float]]] = None, tracker=None, + objective_mode: Optional[str] = None, lambda_cost: float = 0.0, lambda_latency: float = 0.0, ) -> None: @@ -39,6 +40,7 @@ def __init__( dataset=dataset, model_prices=model_prices, tracker=tracker, + objective_mode=objective_mode, lambda_cost=lambda_cost, lambda_latency=lambda_latency, ) @@ -327,15 +329,4 @@ def _build_results( ) ) - self._finalize_combined_objectives(all_results) - best_info = self._find_best(all_results) - if best_info is not None: - best_name, _ = best_info - for result in all_results: - if result.model_name == best_name: - result.is_best = True - break - else: - print("\n No combinations succeeded.") - - return SelectionResults(results=all_results) + return self._finalize_selection_outcomes(all_results) diff --git a/src/agentopt/model_selection/hill_climbing.py b/src/agentopt/model_selection/hill_climbing.py index b6b8ab9..9768f29 100644 --- a/src/agentopt/model_selection/hill_climbing.py +++ b/src/agentopt/model_selection/hill_climbing.py @@ -31,6 +31,7 @@ def __init__( batch_size: int = 1, model_prices: Optional[Dict[str, Dict[str, float]]] = None, tracker=None, + objective_mode: Optional[str] = None, lambda_cost: float = 0.0, lambda_latency: float = 0.0, ) -> None: @@ -41,6 +42,7 @@ def __init__( dataset=dataset, model_prices=model_prices, tracker=tracker, + objective_mode=objective_mode, lambda_cost=lambda_cost, lambda_latency=lambda_latency, ) @@ -255,33 +257,14 @@ def _hc_finalize( global_best_combo: 
Optional[Dict[str, ModelCandidate]], global_best_value: float, ) -> SelectionResults: - """Finalize combined objectives, mark the best result, return results.""" - self._finalize_combined_objectives(all_results) + """Finalize weighted best or Pareto frontier marking.""" + del global_best_value if global_best_combo is None: print("\nNo combinations succeeded\n") - return SelectionResults(results=all_results) - - # Prefer the combined-objective-aware _find_best when lambdas are set; - # otherwise honor the within-search global best to preserve the - # original tie-breaking semantics. - if self._has_combined_objective: - best_info = self._find_best(all_results) - best_name = best_info[0] if best_info else self._combo_name(global_best_combo) - else: - best_name = self._combo_name(global_best_combo) - - tol = 1e-9 - for result in all_results: - if result.model_name != best_name: - continue - if self._has_combined_objective: - result.is_best = True - break - # Accuracy-mode: match by name AND the tracked best value. 
- if abs(result.accuracy - global_best_value) < tol: - result.is_best = True - break - return SelectionResults(results=all_results) + return SelectionResults( + results=all_results, objective_mode=self.objective_mode, + ) + return self._finalize_selection_outcomes(all_results) # ------------------------------------------------------------------ # Single restart (sequential) diff --git a/src/agentopt/model_selection/lm_proposal.py b/src/agentopt/model_selection/lm_proposal.py index addf44d..2fb2f10 100644 --- a/src/agentopt/model_selection/lm_proposal.py +++ b/src/agentopt/model_selection/lm_proposal.py @@ -48,6 +48,7 @@ def __init__( dataset_preview_size: int = 10, model_prices: Optional[Dict[str, Dict[str, float]]] = None, node_descriptions: Optional[Dict[str, str]] = None, + objective_mode: Optional[str] = None, lambda_cost: float = 0.0, lambda_latency: float = 0.0, ) -> None: @@ -58,6 +59,7 @@ def __init__( dataset=dataset, model_prices=model_prices, node_descriptions=node_descriptions, + objective_mode=objective_mode, lambda_cost=lambda_cost, lambda_latency=lambda_latency, ) @@ -131,9 +133,7 @@ def _run_selection( is_best=True, ) - results = [result] - self._finalize_combined_objectives(results) - return SelectionResults(results=results) + return self._finalize_selection_outcomes([result]) # ------------------------------------------------------------------ # Prompt construction diff --git a/src/agentopt/model_selection/matrix_ucb.py b/src/agentopt/model_selection/matrix_ucb.py index b6f45b8..2b1ed91 100644 --- a/src/agentopt/model_selection/matrix_ucb.py +++ b/src/agentopt/model_selection/matrix_ucb.py @@ -18,6 +18,10 @@ ``observation_budget_fraction`` (default ``1.0``) limits how much of the grid is observed; below ``1.0`` the run stops once that fraction (ceiling) of cells is filled. 
+ +In ``objective_mode="pareto"``, cell rewards use **Chebyshev scalarization** over +normalized error, latency, and cost (ideal corner 0); tradeoff directions rotate +automatically. """ from __future__ import annotations @@ -32,6 +36,7 @@ from ..base_models import Dataset, EvalFn, ModelCandidate from .base import BaseModelSelector, ModelResult, SelectionResults +from .objectives import chebyshev_scalar, chebyshev_weight_at logger = logging.getLogger(__name__) @@ -104,6 +109,90 @@ def _ucb_plain_next_batch( return np.stack([combos, dps.astype(np.int64)]) +def _ucb_chebyshev_next_batch( + selector: BaseModelSelector, + cell_data: Dict[Tuple[int, int], Tuple[float, float, Optional[float], str]], + n_combos: int, + n_datapoints: int, + observed_mask: np.ndarray, + a: float, + max_cells: int, + rng: np.random.Generator, +) -> Optional[np.ndarray]: + """Pick cells via pessimistic Chebyshev bounds (lower is better).""" + weights = chebyshev_weight_at(selector._chebyshev_step) + bounds = np.full(n_combos, np.inf, dtype=np.float64) + counts = np.zeros(n_combos, dtype=np.int64) + + for combo_i in range(n_combos): + scores: List[float] = [] + lats: List[float] = [] + costs: List[Optional[float]] = [] + for dp_i in range(n_datapoints): + if not observed_mask[combo_i, dp_i]: + continue + t = cell_data.get((combo_i, dp_i)) + if not t or t[3].startswith("missing::"): + continue + sc, lat, cost, _ = t + scores.append(sc) + lats.append(lat) + costs.append(cost) + n = len(scores) + counts[combo_i] = n + if n == 0: + continue + mean_sc = sum(scores) / n + mean_lat = sum(lats) / n + finite_costs = [c for c in costs if c is not None and math.isfinite(c)] + mean_cost = ( + sum(finite_costs) / len(finite_costs) if finite_costs else None + ) + bonus = math.sqrt(a / n) + # Pessimistic gaps: high error, high latency, high cost. 
+ score_lcb = max(0.0, min(1.0, mean_sc - bonus)) + lat_pess = mean_lat + bonus + cost_pess = ( + (mean_cost + bonus) if mean_cost is not None else None + ) + ne, nl, nc = selector._normalized_pareto_gaps( + score_lcb, lat_pess, cost_pess, + ) + use_cost = cost_pess is not None and math.isfinite(cost_pess) + bounds[combo_i] = chebyshev_scalar( + ne, nl, nc, weights, use_cost=use_cost, + ) + + fully_observed_combo = counts == n_datapoints + unobserved_combo = counts == 0 + bounds[fully_observed_combo] = np.inf + bounds[unobserved_combo] = -np.inf + if bool(np.all(fully_observed_combo)): + return None + best_combo = int(np.argmin(bounds)) + unobserved_dp = np.where(~observed_mask[best_combo])[0] + n_unobserved = int(unobserved_dp.size) + if n_unobserved <= 0: + return None + k = min(max(max_cells, 1), n_unobserved) + pick = rng.permutation(n_unobserved)[:k] + dps = unobserved_dp[pick] + combos = np.full(k, best_combo, dtype=np.int64) + return np.stack([combos, dps.astype(np.int64)]) + + +def _cell_reward( + selector: BaseModelSelector, + score: float, + latency: float, + cost: Optional[float], +) -> float: + """Scalar stored in the UCB matrix (higher is better for plain UCB).""" + if selector._pareto_exploration: + return -selector._chebyshev_cell_scalar(score, latency, cost) + return selector._combined_objective(score, latency, cost) + + def _build_selection_results( selector: BaseModelSelector, all_combos: List[Dict[str, ModelCandidate]], @@ -143,18 +232,7 @@ def _build_selection_results( ) ) - selector._finalize_combined_objectives(all_results) - best_info = selector._find_best(all_results) - if best_info is not None: - best_name, _ = best_info - for result in all_results: - if result.model_name == best_name: - result.is_best = True - break - else: - print("\n No combinations succeeded.") - - return SelectionResults(results=all_results) + return selector._finalize_selection_outcomes(all_results) def _record_cells( @@ -185,18 +263,13 @@ def _refresh_observed_np( 
observed: np.ndarray, cell_data: Dict[Tuple[int, int], Tuple[float, float, Optional[float], str]], ) -> None: - """Rewrite ``observed`` from ``cell_data`` against the selector's current normalizer. - - Called after absorbing new observations so plain-UCB row means use the - latest combined objective rather than stale values. No-op (preserves prior - contents) when no lambdas are configured. - """ - if not selector._has_combined_objective: + """Rewrite ``observed`` from ``cell_data`` against the selector's current normalizer.""" + if not selector._uses_matrix_scalar_refresh: return for (ci, di), (sc, lat, cost, dp_id) in cell_data.items(): if dp_id.startswith("missing::"): continue - observed[ci, di] = selector._combined_objective(sc, lat, cost) + observed[ci, di] = _cell_reward(selector, sc, lat, cost) def _refresh_observed_t( @@ -205,12 +278,12 @@ def _refresh_observed_t( cell_data: Dict[Tuple[int, int], Tuple[float, float, Optional[float], str]], ) -> None: """Torch counterpart of :func:`_refresh_observed_np` for the LRF variant.""" - if not selector._has_combined_objective: + if not selector._uses_matrix_scalar_refresh: return for (ci, di), (sc, lat, cost, dp_id) in cell_data.items(): if dp_id.startswith("missing::"): continue - observed_t[ci, di] = float(selector._combined_objective(sc, lat, cost)) + observed_t[ci, di] = float(_cell_reward(selector, sc, lat, cost)) class MatrixUCBModelSelector(BaseModelSelector): @@ -224,6 +297,9 @@ class MatrixUCBModelSelector(BaseModelSelector): here, **fraction of matrix cells** to observe) caps evaluations. ``1.0`` fills the full grid; ``0.1`` stops after about 10% of cells. If both are passed, ``sample_fraction`` wins. + + Requires ``objective_mode``: ``"weighted"`` (linear scalar + ``lambda_*``) or + ``"pareto"`` (Chebyshev exploration, frontier output). 
""" def __init__( @@ -238,6 +314,7 @@ def __init__( seed: Optional[int] = None, model_prices: Optional[Dict[str, Dict[str, float]]] = None, tracker=None, + objective_mode: Optional[str] = None, lambda_cost: float = 0.0, lambda_latency: float = 0.0, ) -> None: @@ -248,6 +325,7 @@ def __init__( dataset=dataset, model_prices=model_prices, tracker=tracker, + objective_mode=objective_mode, lambda_cost=lambda_cost, lambda_latency=lambda_latency, ) @@ -286,10 +364,13 @@ async def _select_async(self, max_concurrent: int = 20) -> SelectionResults: ) total_cells = n_combos * n_datapoints + mode_label = ( + "Chebyshev Pareto" if self._pareto_exploration else "weighted scalar" + ) print(f"\n{'='*60}") print( - f"Matrix UCB (async): {n_combos} combinations × {n_datapoints} datapoints, " - f"a={self.a}, max {mc} concurrent" + f"Matrix UCB (async, {mode_label}): {n_combos} combinations × " + f"{n_datapoints} datapoints, a={self.a}, max {mc} concurrent" + ( f", observe up to {target_n}/{total_cells} cells " f"({self.observation_budget_fraction:.0%} budget)" @@ -327,7 +408,20 @@ async def _one( mc_step = min(mc, target_n - filled) if mc_step <= 0: break - batch = _ucb_plain_next_batch(observed, self.a, mc_step, self._rng) + observed_mask = ~np.isnan(observed) + if self._pareto_exploration: + batch = _ucb_chebyshev_next_batch( + self, + cell_data, + n_combos, + n_datapoints, + observed_mask, + self.a, + mc_step, + self._rng, + ) + else: + batch = _ucb_plain_next_batch(observed, self.a, mc_step, self._rng) if batch is None: break combo_row, dp_row = batch[0], batch[1] @@ -348,10 +442,12 @@ async def _one( costs = self._observe_combo(sc, lat, ids) if sc else [] _record_cells(cell_data, combo_i, dp_i, sc, lat, costs, ids) if sc: - observed[combo_i, dp_i] = self._combined_objective( - sc[0], lat[0], costs[0], + observed[combo_i, dp_i] = _cell_reward( + self, sc[0], lat[0], costs[0] if costs else None, ) _refresh_observed_np(self, observed, cell_data) + if self._pareto_exploration: + 
self._advance_chebyshev_direction() return _build_selection_results(self, all_combos, cell_data, n_datapoints) @@ -364,6 +460,8 @@ class MatrixUCBLRFModelSelector(BaseModelSelector): :class:`MatrixUCBModelSelector`). Warmup threshold: ``warmup_percentage`` or ``warmup_fraction`` — random probes until this **fraction of the full grid** is observed, then LRF+UCB (banditeval-style). + + Supports the same ``objective_mode`` values as :class:`MatrixUCBModelSelector`. """ def __init__( @@ -386,6 +484,7 @@ def __init__( seed: Optional[int] = None, model_prices: Optional[Dict[str, Dict[str, float]]] = None, tracker=None, + objective_mode: Optional[str] = None, lambda_cost: float = 0.0, lambda_latency: float = 0.0, ) -> None: @@ -413,6 +512,7 @@ def __init__( dataset=dataset, model_prices=model_prices, tracker=tracker, + objective_mode=objective_mode, lambda_cost=lambda_cost, lambda_latency=lambda_latency, ) @@ -573,9 +673,13 @@ def _select_sequential(self, max_concurrent: int = 20) -> SelectionResults: ) if sc: observed_t[int(combo_i), int(dp_i)] = float( - self._combined_objective(sc[0], lat[0], costs[0]) + _cell_reward( + self, sc[0], lat[0], costs[0] if costs else None, + ) ) _refresh_observed_t(self, observed_t, cell_data) + if self._pareto_exploration: + self._advance_chebyshev_direction() return _build_selection_results(self, all_combos, cell_data, n_datapoints) @@ -671,8 +775,12 @@ async def _one( _record_cells(cell_data, combo_i, dp_i, sc, lat, costs, ids) if sc: observed_t[combo_i, dp_i] = float( - self._combined_objective(sc[0], lat[0], costs[0]) + _cell_reward( + self, sc[0], lat[0], costs[0] if costs else None, + ) ) _refresh_observed_t(self, observed_t, cell_data) + if self._pareto_exploration: + self._advance_chebyshev_direction() return _build_selection_results(self, all_combos, cell_data, n_datapoints) diff --git a/src/agentopt/model_selection/objectives.py b/src/agentopt/model_selection/objectives.py new file mode 100644 index 0000000..eedcd0d --- 
/dev/null +++ b/src/agentopt/model_selection/objectives.py @@ -0,0 +1,131 @@ +"""Multi-objective configuration helpers (weighted scalar vs Pareto exploration).""" + +from __future__ import annotations + +import math +from typing import List, Literal, Optional, Sequence, Tuple + +ObjectiveMode = Literal["weighted", "pareto"] + +# Fixed tradeoff directions on the (error, latency, cost) simplex — not user-facing. +CHEBYSHEV_WEIGHTS: Tuple[Tuple[float, float, float], ...] = ( + (1.0, 0.0, 0.0), + (0.0, 1.0, 0.0), + (0.0, 0.0, 1.0), + (1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0), + (0.5, 0.5, 0.0), + (0.5, 0.0, 0.5), + (0.0, 0.5, 0.5), + (0.6, 0.2, 0.2), + (0.2, 0.6, 0.2), + (0.2, 0.2, 0.6), + (0.25, 0.25, 0.5), + (0.25, 0.5, 0.25), +) + + +def validate_objective_config( + objective_mode: Optional[str], + lambda_cost: float, + lambda_latency: float, +) -> ObjectiveMode: + """Validate ``objective_mode`` and ``lambda_*``; return normalized mode.""" + if objective_mode is None: + raise ValueError( + "objective_mode is required: use 'weighted' (with lambda_cost and/or " + "lambda_latency > 0) or 'pareto' (omit lambdas; returns a Pareto frontier)." + ) + mode = str(objective_mode).strip().lower() + if mode not in ("weighted", "pareto"): + raise ValueError( + f"objective_mode must be 'weighted' or 'pareto', got {objective_mode!r}." + ) + lc = float(lambda_cost) + ll = float(lambda_latency) + if lc < 0 or ll < 0: + raise ValueError("lambda_cost and lambda_latency must be non-negative") + if mode == "weighted": + if lc <= 0.0 and ll <= 0.0: + raise ValueError( + "objective_mode='weighted' requires lambda_cost > 0 and/or " + "lambda_latency > 0." + ) + return "weighted" + if lc > 0.0 or ll > 0.0: + raise ValueError( + "objective_mode='pareto' does not accept lambda_cost or " + "lambda_latency; use objective_mode='weighted' instead." 
+ ) + return "pareto" + + +def score_to_error(score: float) -> float: + """Map eval score (higher is better, typically in [0, 1]) to error (lower is better).""" + s = float(score) + if not math.isfinite(s): + return 1.0 + s = max(0.0, min(1.0, s)) + return 1.0 - s + + +def chebyshev_scalar( + norm_error: float, + norm_latency: float, + norm_cost: float, + weights: Tuple[float, float, float], + *, + use_cost: bool, +) -> float: + """Weighted Chebyshev achievement scalar (lower is better).""" + we, wl, wc = weights + terms = [we * norm_error, wl * norm_latency] + if use_cost: + terms.append(wc * norm_cost) + else: + # Renormalize error/latency weights when cost is absent. + s = we + wl + if s > 0: + terms = [we / s * norm_error, wl / s * norm_latency] + return max(terms) + + +def chebyshev_weight_at(step: int) -> Tuple[float, float, float]: + """Rotate through the fixed weight grid.""" + grid = CHEBYSHEV_WEIGHTS + return grid[int(step) % len(grid)] + + +def pareto_mask_3d( + errors: Sequence[float], + latencies: Sequence[float], + costs: Sequence[Optional[float]], +) -> List[bool]: + """Nondominated mask for minimize (error, latency, cost). + + When cost is ``None`` for a point, only error and latency are used for + dominance. Points without price are not compared on cost. 
+ """ + n = len(errors) + mask = [True] * n + for i in range(n): + if not mask[i]: + continue + for j in range(n): + if i == j or not mask[j]: + continue + ci, cj = costs[i], costs[j] + use_cost = ci is not None and cj is not None + e_ok = errors[j] <= errors[i] + l_ok = latencies[j] <= latencies[i] + e_strict = errors[j] < errors[i] + l_strict = latencies[j] < latencies[i] + if use_cost: + c_ok = cj <= ci + c_strict = cj < ci + if e_ok and l_ok and c_ok and (e_strict or l_strict or c_strict): + mask[i] = False + break + elif e_ok and l_ok and (e_strict or l_strict): + mask[i] = False + break + return mask diff --git a/src/agentopt/model_selection/random_search.py b/src/agentopt/model_selection/random_search.py index 59b22ba..6e4b71c 100644 --- a/src/agentopt/model_selection/random_search.py +++ b/src/agentopt/model_selection/random_search.py @@ -30,6 +30,7 @@ def __init__( seed: Optional[int] = None, model_prices: Optional[Dict[str, Dict[str, float]]] = None, tracker=None, + objective_mode: Optional[str] = None, lambda_cost: float = 0.0, lambda_latency: float = 0.0, ) -> None: @@ -40,6 +41,7 @@ def __init__( dataset=dataset, model_prices=model_prices, tracker=tracker, + objective_mode=objective_mode, lambda_cost=lambda_cost, lambda_latency=lambda_latency, ) @@ -113,19 +115,7 @@ def _select_sequential(self) -> SelectionResults: ) ) - self._finalize_combined_objectives(all_results) - best_info = self._find_best(all_results) - if best_info is not None: - best_name, _ = best_info - for result in all_results: - if result.model_name == best_name: - result.is_best = True - break - else: - print("\n No sampled combinations succeeded") - - results = SelectionResults(results=all_results) - return results + return self._finalize_selection_outcomes(all_results) async def _select_async(self, max_concurrent: int = 20) -> SelectionResults: all_combos, sampled = self._get_sampled_combinations() @@ -182,16 +172,4 @@ async def _eval_combo( _, result = res all_results.append(result) 
- self._finalize_combined_objectives(all_results) - best_info = self._find_best(all_results) - if best_info is not None: - best_name, _ = best_info - for r in all_results: - if r.model_name == best_name: - r.is_best = True - break - else: - print("\n No sampled combinations succeeded") - - results = SelectionResults(results=all_results) - return results + return self._finalize_selection_outcomes(all_results) diff --git a/src/agentopt/model_selection/threshold_successive_elimination.py b/src/agentopt/model_selection/threshold_successive_elimination.py index 5c60d48..a7d57e2 100644 --- a/src/agentopt/model_selection/threshold_successive_elimination.py +++ b/src/agentopt/model_selection/threshold_successive_elimination.py @@ -29,6 +29,7 @@ def __init__( confidence: float = 1.0, model_prices: Optional[Dict[str, Dict[str, float]]] = None, tracker=None, + objective_mode: Optional[str] = None, lambda_cost: float = 0.0, lambda_latency: float = 0.0, ) -> None: @@ -39,6 +40,7 @@ def __init__( dataset=dataset, model_prices=model_prices, tracker=tracker, + objective_mode=objective_mode, lambda_cost=lambda_cost, lambda_latency=lambda_latency, ) @@ -402,15 +404,4 @@ def _build_results( ) ) - self._finalize_combined_objectives(all_results) - best_info = self._find_best(all_results) - if best_info is not None: - best_name, _ = best_info - for result in all_results: - if result.model_name == best_name: - result.is_best = True - break - else: - print("\n No combinations succeeded.") - - return SelectionResults(results=all_results) + return self._finalize_selection_outcomes(all_results) diff --git a/tests/test_combined_objective.py b/tests/test_combined_objective.py index 3e7691a..c83b581 100644 --- a/tests/test_combined_objective.py +++ b/tests/test_combined_objective.py @@ -45,16 +45,19 @@ def _eval_fn(expected, actual): return 1.0 if str(expected).lower() in str(actual).lower() else 0.0 -def _selector(lambda_cost: float = 0.0, lambda_latency: float = 0.0): - """Build a 
BruteForceModelSelector with stub agent/eval/dataset. - - Used to access the BaseModelSelector helper methods under test. - """ +def _selector( + lambda_cost: float = 0.0, + lambda_latency: float = 0.1, + *, + objective_mode: str = "weighted", +): + """Build a BruteForceModelSelector with stub agent/eval/dataset.""" return BruteForceModelSelector( agent=_NoopAgent, models={"node": ["m"]}, eval_fn=_eval_fn, dataset=[("x", "ok")], + objective_mode=objective_mode, lambda_cost=lambda_cost, lambda_latency=lambda_latency, ) @@ -106,27 +109,27 @@ def test_linear_scaling(self): assert BruteForceModelSelector._minmax_norm(2.0, 1.0, 5.0) == pytest.approx(0.25) -class TestCombinedObjectiveZeroLambdas: - """When both lambdas are zero, behaviour must match raw accuracy.""" +class TestCombinedObjectiveParetoVsWeighted: + """Pareto mode skips linear scalar; weighted mode uses lambdas.""" - def test_returns_score_unchanged(self): - sel = _selector(0.0, 0.0) + def test_pareto_returns_score_from_combined_helper(self): + sel = _selector(0.0, 0.0, objective_mode="pareto") assert sel._combined_objective(0.42, 100.0, 5.0) == 0.42 - def test_has_combined_objective_false(self): - assert _selector(0.0, 0.0)._has_combined_objective is False - assert _selector(0.0, 0.1)._has_combined_objective is True - assert _selector(0.1, 0.0)._has_combined_objective is True + def test_has_combined_objective_by_mode(self): + assert _selector(0.0, 0.0, objective_mode="pareto")._has_combined_objective is False + assert _selector(0.0, 0.1, objective_mode="weighted")._has_combined_objective is True + assert _selector(0.1, 0.0, objective_mode="weighted")._has_combined_objective is True - def test_compute_objectives_returns_score_copy(self): - sel = _selector(0.0, 0.0) + def test_pareto_compute_objectives_returns_score_copy(self): + sel = _selector(0.0, 0.0, objective_mode="pareto") scores = [1.0, 0.0] out = sel._compute_objectives(scores, [10.0, 1.0], [0.5, 0.01]) assert out == scores - assert out is not 
scores # defensive copy + assert out is not scores - def test_mean_objective_none_without_lambdas(self): - sel = _selector(0.0, 0.0) + def test_pareto_mean_objective_none(self): + sel = _selector(0.0, 0.0, objective_mode="pareto") assert sel._mean_objective([1.0, 0.0], [1.0, 1.0], [0.0, 0.0]) is None @@ -139,12 +142,11 @@ def test_updates_running_min_max_for_lat_and_cost(self): assert sel._cost_min == 0.001 assert sel._cost_max == 0.005 - def test_noop_when_no_lambdas(self): - sel = _selector(0.0, 0.0) + def test_noop_in_pareto_absorb_observations_only(self): + sel = _selector(0.0, 0.0, objective_mode="pareto") sel._absorb_observations([1.0, 5.0], [0.001, 0.005]) - # Sentinel unchanged. - assert sel._latency_min == float("inf") - assert sel._cost_max == float("-inf") + assert sel._latency_min == 1.0 + assert sel._latency_max == 5.0 def test_skips_none_cost(self): sel = _selector(lambda_cost=0.1) @@ -219,8 +221,8 @@ def test_returns_none_for_empty_list(self): class TestFinalizeCombinedObjectives: - def test_noop_when_no_lambdas(self): - sel = _selector(0.0, 0.0) + def test_noop_when_pareto_mode(self): + sel = _selector(0.0, 0.0, objective_mode="pareto") r = _result("a", 1.0, 1.0, [_dp(0, 1.0, 1.0)]) sel._finalize_combined_objectives([r]) assert r.combined_objective is None @@ -272,19 +274,18 @@ def run(self, input_data): class TestBruteForceLatencyWeighting: - def test_zero_lambdas_preserves_accuracy_pick(self): + def test_pareto_mode_marks_frontier_not_single_best(self): sel = BruteForceModelSelector( agent=_LatencyTunedAgent, models={"node": ["fast", "slow"]}, eval_fn=_eval_fn, dataset=[("?", "correct"), ("?", "correct")], + objective_mode="pareto", ) results = sel.select_best(parallel=False) - best = results.get_best() - # Both are equally accurate; latency tiebreak picks 'fast'. - assert best is not None - assert best.model_name == "node=fast" - # combined_objective stays unset when no lambdas configured. 
+ assert results.objective_mode == "pareto" + assert results.get_best() is None + assert len(results.get_pareto_front()) >= 1 assert all(r.combined_objective is None for r in results.results) def test_lambda_latency_picks_fast_when_accuracy_tied(self): @@ -293,6 +294,7 @@ def test_lambda_latency_picks_fast_when_accuracy_tied(self): models={"node": ["fast", "slow"]}, eval_fn=_eval_fn, dataset=[("?", "correct"), ("?", "correct")], + objective_mode="weighted", lambda_latency=0.5, ) results = sel.select_best(parallel=False) diff --git a/tests/test_objective_mode.py b/tests/test_objective_mode.py new file mode 100644 index 0000000..95f1787 --- /dev/null +++ b/tests/test_objective_mode.py @@ -0,0 +1,110 @@ +"""Tests for objective_mode (weighted vs pareto) and Pareto helpers.""" + +import pytest + +from agentopt.model_selection import BruteForceModelSelector +from agentopt.model_selection.base import DatapointResult, ModelResult, SelectionResults +from agentopt.model_selection.objectives import ( + pareto_mask_3d, + score_to_error, + validate_objective_config, +) + + +class _NoopAgent: + def __init__(self, models): + self.models = models + + def run(self, input_data): + return "ok" + + +def _eval_fn(expected, actual): + return 1.0 if str(expected).lower() in str(actual).lower() else 0.0 + + +class TestValidateObjectiveConfig: + def test_requires_mode(self): + with pytest.raises(ValueError, match="objective_mode is required"): + validate_objective_config(None, 0.0, 0.0) + + def test_weighted_requires_lambda(self): + with pytest.raises(ValueError, match="lambda"): + validate_objective_config("weighted", 0.0, 0.0) + + def test_pareto_rejects_lambdas(self): + with pytest.raises(ValueError, match="does not accept"): + validate_objective_config("pareto", 0.1, 0.0) + + def test_valid_modes(self): + assert validate_objective_config("weighted", 0.2, 0.0) == "weighted" + assert validate_objective_config("pareto", 0.0, 0.0) == "pareto" + + +class TestScoreToError: + def 
test_perfect_score_zero_error(self): + assert score_to_error(1.0) == 0.0 + + def test_zero_score_full_error(self): + assert score_to_error(0.0) == 1.0 + + +class TestParetoMask3d: + def test_nondominated_corner(self): + errors = [0.1, 0.5, 0.5] + lats = [1.0, 0.5, 2.0] + costs = [0.01, 0.02, 0.01] + mask = pareto_mask_3d(errors, lats, costs) + assert mask[0] is True + assert mask.count(True) >= 2 + + +class TestModelResultError: + def test_error_property(self): + r = ModelResult( + model_name="a", + accuracy=0.8, + latency_seconds=1.0, + attribute="combination", + ) + assert r.error == pytest.approx(0.2) + + +class TestSelectionResultsPareto: + def _results(self) -> SelectionResults: + a = ModelResult( + model_name="fast", + accuracy=0.9, + latency_seconds=1.0, + attribute="combination", + datapoint_results=[ + DatapointResult( + datapoint_index=0, score=0.9, latency_seconds=1.0, + ), + ], + ) + b = ModelResult( + model_name="slow", + accuracy=0.95, + latency_seconds=5.0, + attribute="combination", + datapoint_results=[ + DatapointResult( + datapoint_index=0, score=0.95, latency_seconds=5.0, + ), + ], + ) + sel = BruteForceModelSelector( + agent=_NoopAgent, + models={"n": ["a"]}, + eval_fn=_eval_fn, + dataset=[("x", "y")], + objective_mode="pareto", + ) + sel._mark_pareto_optimal([a, b]) + return SelectionResults(results=[a, b], objective_mode="pareto") + + def test_get_pareto_front(self): + res = self._results() + front = res.get_pareto_front() + assert len(front) >= 1
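
Note on the two Pareto helpers this diff adds in `objectives.py`: the sketch below is a standalone illustration, not the library code. It assumes all three metrics are already normalized to [0, 1] and that every point has a price, so it omits the `use_cost` / `None`-cost branches. The names `dominates` and `pareto_mask` are hypothetical and exist only for this example. It shows how a Chebyshev scalar takes the worst weighted gap, and how a nondominated mask falls out of pairwise dominance checks, using the same three points as `TestParetoMask3d`.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]  # (error, latency, cost), all minimized


def chebyshev_scalar(
    norm_error: float,
    norm_latency: float,
    norm_cost: float,
    weights: Tuple[float, float, float],
) -> float:
    """Weighted Chebyshev scalar: the largest weighted gap (lower is better)."""
    we, wl, wc = weights
    return max(we * norm_error, wl * norm_latency, wc * norm_cost)


def dominates(a: Point, b: Point) -> bool:
    """True if `a` is no worse than `b` on every axis and strictly better on one."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_mask(points: Sequence[Point]) -> List[bool]:
    """Mark each point that no other point dominates."""
    return [not any(dominates(q, p) for q in points) for p in points]


# Same data as TestParetoMask3d in this diff.
points = [(0.1, 1.0, 0.01), (0.5, 0.5, 0.02), (0.5, 2.0, 0.01)]
print(pareto_mask(points))  # -> [True, True, False]

# An error-only direction ignores latency and cost entirely.
print(chebyshev_scalar(0.2, 0.8, 0.5, (1.0, 0.0, 0.0)))  # -> 0.2
```

The third point is dominated by the first (lower error, lower latency, equal cost), which is why the test in the diff can only assert `mask.count(True) >= 2` as a lower bound; the exact count for this data is 2. Rotating the weight tuple, as `chebyshev_weight_at` does, changes which axis drives the scalar and so which combos look attractive at each step.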