
feat(experiment): cplex mip solver #773

Draft
michael-johnston wants to merge 38 commits into main from maj_cplex_mip

Conversation

Member

@michael-johnston michael-johnston commented Mar 31, 2026

Adds an experiment that can solve Mixed Integer Programming (MIP) problems using CPLEX (Community Edition CPLEX by default, which limits the number of variables and constraints).

The experiment allows setting various CPLEX parameters and returns metrics such as mip_gap, dual, and primal, along with time series of various quantities sampled as the solver proceeds.

It also includes the to-be-merged changes from #670.

The trainable only checked for critical errors at the very start of execution. Hence, if a critical error stopped delivery of a measurement a trainable was waiting on, the trainable would never exit.
If a custom experiment execution results in InvalidMeasurementResult (indicating an exception from a custom experiment or no results returned), the request should have status failed. Instead it had status success.
Long-running experiments can exceed the MySQL connection idle timeout, causing the DB connection to be closed. Once closed, subsequent use of the connection results in failures.
The code assumed the entity contained the just-measured property values, and hence could be used for the aggregation; however, this was not the case.

At some stage the code was changed to work directly with the MeasurementRequests, and updating this piece was missed.
numpy can't aggregate None in an array.

However, None is naturally used as a placeholder when an experiment could not return a value for some reason. In this case those values should be dropped from the aggregation rather than raising a numpy error.
It is scoped to the current operation. The previous code used the entity, which could contain matching properties from other experiments.
…ndant TCP handshake

Without caching, every component that calls engine_for_sql_store for the same
database URL (the metastore SQLResourceStore and the samplestore SQLSampleStore
being the primary pair) invokes sqlalchemy.create_engine, which creates a
separate connection pool and therefore opens a separate TCP+TLS+MySQL auth
connection.

Over a high-latency link the full handshake can cost 1-3s. With two independent
engines, both the metastore and samplestore paid this cost, adding an additional
1-3s to every CLI command.

_engine_cache (a process-level dict keyed by connection URL) ensures that the
second call for the same URL returns the already-initialised Engine and reuses
its warm connection pool, reducing the connection cost from 2x to 1x handshake per process lifetime.

Locally measured saving on `ado show requests/results/entities`: ~2,360ms.
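The caching described above amounts to a module-level memo over sqlalchemy.create_engine. A minimal sketch (the real engine_for_sql_store presumably takes richer configuration than a bare URL):

```python
import sqlalchemy

# Process-level cache keyed by connection URL, as described above.
_engine_cache: dict[str, sqlalchemy.engine.Engine] = {}

def engine_for_sql_store(url: str) -> sqlalchemy.engine.Engine:
    """Return one shared Engine per URL, so the TCP+TLS+MySQL handshake
    and its warm connection pool are paid once per process, not once
    per store that happens to point at the same database."""
    engine = _engine_cache.get(url)
    if engine is None:
        engine = sqlalchemy.create_engine(url)
        _engine_cache[url] = engine
    return engine
```

Because SQLAlchemy engines are thread-safe and pool their connections, returning the same Engine to both the metastore and samplestore is safe and gives the 2x-to-1x handshake reduction.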
Previously, SQLSampleStore.__init__ called _create_source_table()
unconditionally on every construction.  _create_source_table() calls
meta.create_all(checkfirst=True), which issues a separate table-existence
query for each of the four backing tables even when
the tables already exist — which they always do on read-only CLI commands.
Locally this cost was ~1.45s. The changes reduce this to 0.24s.

The main fix is to probe for table existence instead of blindly issuing _create_source_tables(). This removes the 1.45s cost and adds the probe. We use a direct SQL probe (taking 240ms, measured locally) instead of sqlalchemy.inspect(), which took 800ms locally.

The PR also introduces a process-level _source_tables_verified set that records tablename prefixes whose backing tables have already been confirmed to exist.
On the second and subsequent SQLSampleStore constructions for the same store within the same process, the probe is skipped entirely (0 round-trips).
The main impact of this is in the tests.
…or_to_pydantic

Previously, _measurement_requests_cursor_to_pydantic called
entityWithIdentifier in a loop for each entity not already in the local
cache, issuing one DB round-trip (~240ms) per unique entity. For an
operation touching N distinct entities this was an N+1 query pattern.
This is reached by initializing the store and then calling measurement_requests_for_operation or related methods, which happens with `ado show X operation`.

It might be expected that self._entities is pre-populated before this method runs. There are two reasons it is not:

1. The bulk-load mechanism is bypassed. The public `entities` property
   triggers refresh(force_fetch_all_entities=True) when self._entities is
   empty, loading all entities in one query. But
   _measurement_requests_cursor_to_pydantic accesses the private
   self._entities dict directly, never touching the property, so the
   lazy-load never fires.

2. Nothing in the construction path populates the cache. SQLSampleStore
   starts with self._entities = {}. By the time
   _measurement_requests_cursor_to_pydantic runs, the cache is still
   empty and every entity lookup falls through to entityWithIdentifier.

PyMySQL already buffers the full result set in memory before the Python
loop begins, so consuming the cursor up-front with fetchall() adds no
network round-trip. The fix uses this to collect all missing entity IDs
before any entity is resolved, then issues a single _fetch_entities
query for all of them at once regardless of N, ensuring the cache is populated
with the required entities.

The entityWithIdentifier fallback is retained and still fires for entities
added by a concurrent distributed process in the window between the batch
fetch and the cursor scan — the intended use of that code path.

Measured on an operation with 4 distinct entities (197ms RTT to DB):
  _measurement_requests_cursor_to_pydantic: 3,378ms → 859ms (-2,519ms)

Saving scales linearly with the number of distinct entities touched by
the operation: N round-trips collapsed to 1.
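The batch pattern can be sketched independently of the store internals. Here resolve_entities is a hypothetical helper; fetch_entities and fetch_one stand in for _fetch_entities and entityWithIdentifier:

```python
def resolve_entities(cursor_rows, cache: dict, fetch_entities, fetch_one):
    """Collapse the N+1 pattern: one batch query for all missing
    entity IDs, then a cursor scan served from the cache.

    cursor_rows: fully buffered rows (PyMySQL buffers before iteration,
                 so fetchall() adds no extra round-trip).
    cache:       self._entities-style dict of identifier -> entity.
    fetch_entities: batch loader, one query for many identifiers.
    fetch_one:   per-identifier fallback for concurrent writers.
    """
    missing = {row["entity_id"] for row in cursor_rows} - cache.keys()
    if missing:
        # One round-trip regardless of how many entities are missing.
        cache.update(fetch_entities(sorted(missing)))
    resolved = []
    for row in cursor_rows:
        entity = cache.get(row["entity_id"])
        if entity is None:
            # Entity added by a concurrent process after the batch fetch.
            entity = fetch_one(row["entity_id"])
            cache[row["entity_id"]] = entity
        resolved.append(entity)
    return resolved
```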
Profiling shows using sqlalchemy.inspect to probe table existence is 3x more expensive (750ms locally) than a direct SQL probe (250ms).
…B lookup

_perform_preflight_checks_for_sample_store_methods decorates every
sample store method (measurement_requests_for_operation,
complete_measurement_request_with_results_timeseries, etc.) and
verifies that the supplied operation_id belongs to the space before
allowing the call through. It did this by calling getResource(operation)
on every decorated call — a full DB round-trip each time.

For the common path of `ado show X operation`, which goes through
from_operation_id, this check is redundant: the operation was already fetched
to locate the space, so ownership is proven by construction.

The fix adds _verified_operation_ids: set[str] to DiscoverySpace.__init__
and guards the getResource call in the preflight decorator behind a set
membership test. from_operation_id pre-populates the set with the
operation_id immediately after the space is built, so any subsequent
decorated call on that space skips the DB round-trip entirely.

The full ownership check (getResource + uri comparison) is still
executed for operation IDs not in the set, preserving correctness for
spaces constructed via other paths (e.g. from_stored_configuration
called directly) where ownership has not yet been established.

Measured saving per decorated method call: ~653ms (this is the saving in an `ado show X operation` call).
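The guard can be sketched as follows; DiscoverySpaceSketch and _preflight are shortened stand-ins for the real class and for _perform_preflight_checks_for_sample_store_methods, and the resource shape is assumed:

```python
import functools

class DiscoverySpaceSketch:
    """Minimal sketch of the preflight guard described above."""

    def __init__(self, uri: str, get_resource):
        self.uri = uri
        self._get_resource = get_resource  # stand-in for getResource
        # from_operation_id pre-populates this set after building the space.
        self._verified_operation_ids: set[str] = set()

    def _preflight(method):
        @functools.wraps(method)
        def wrapper(self, operation_id, *args, **kwargs):
            if operation_id not in self._verified_operation_ids:
                # Full ownership check: one DB round-trip.
                resource = self._get_resource(operation_id)
                if resource["uri"] != self.uri:
                    raise ValueError(f"operation {operation_id} not in space")
                self._verified_operation_ids.add(operation_id)
            # Known-good IDs skip the round-trip entirely.
            return method(self, operation_id, *args, **kwargs)
        return wrapper

    @_preflight
    def measurement_requests_for_operation(self, operation_id):
        return f"requests for {operation_id}"
```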
…d-only paths

from_configuration unconditionally called experimentCatalog() on every
construction of a DiscoverySpace. This method queries the samplestore for
any externally registered experiment plugins and loads them into the
actuator registry so they are available for replay during an active
operation run.

For read-only CLI commands (show requests, show results, show entities)
this work is entirely unnecessary: these commands inspect stored
measurement data and never resolve or replay experiment references.
The catalog load was measured at ~852ms — a full DB query round-trip
plus plugin loading overhead — paid on every invocation with no benefit.

The commit adds a boolean load_experiment_catalog parameter (default True) to
__init__, from_configuration, and from_stored_configuration. from_operation_id,
whose callers are exclusively read-only show commands, passes
load_experiment_catalog=False, bypassing the experimentCatalog() call
and the actuator registry update entirely.

The default remains True so all other construction paths (from_configuration
called directly, space creation during an active run) retain the existing
behaviour without requiring changes at call sites.

Measured saving on ado show requests / show results / show entities: ~852ms.
…_resource_and_producers

from_operation_id previously made three sequential getResource calls to
assemble a DiscoverySpace from an operation ID:

  1. getResource(operation_id, OPERATION)   → extract spaces[0]
  2. getResource(space_id, DISCOVERYSPACE)  → extract sampleStoreIdentifier
  3. getResource(samplestore_id, SAMPLESTORE)

Each call is a separate network round-trip (~240ms RTT each, plus
SQLAlchemy query overhead). Because each call depends on the identifier
returned by the previous one they cannot be parallelised; total cost was
measured at ~2,520ms (1,194ms + 704ms + 622ms).

A new SQLResourceStore method, get_resource_and_producers, replaces the
three calls with a single SQL statement that walks the parent-child
identifier chain using JSON path extraction (data->>'$.path').

The method is generic: callers supply the root resource and an ordered
chain of (json_path, kind) pairs, making it reusable for other
parent-child resource hierarchies.

from_operation_id passes the three fetched resources directly into
from_stored_configuration and from_configuration via new optional
parameters (space_resource, samplestore_resource), so neither method
re-fetches what has already been retrieved.

Measured saving on ado show requests / show results: ~1.9s
(three sequential round-trips collapsed to one query).
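The single-statement chain walk might look like the following. The resources table schema and column names are assumed; the JSON paths come from the chain above. json_extract() is used so the sketch runs on SQLite, whereas the PR targets MySQL's data->>'$.path' syntax:

```python
import json
import sqlite3

# Assumed schema: resources(identifier, kind, data JSON). The joins walk
# operation -> space -> samplestore in one round-trip.
QUERY = """
SELECT op.data, sp.data, ss.data
FROM resources AS op
JOIN resources AS sp
  ON sp.identifier = json_extract(op.data, '$.spaces[0]')
 AND sp.kind = 'DISCOVERYSPACE'
JOIN resources AS ss
  ON ss.identifier = json_extract(sp.data, '$.sampleStoreIdentifier')
 AND ss.kind = 'SAMPLESTORE'
WHERE op.identifier = ? AND op.kind = 'OPERATION'
"""

def fetch_operation_chain(conn: sqlite3.Connection, operation_id: str):
    """One query in place of three sequential getResource calls."""
    row = conn.execute(QUERY, (operation_id,)).fetchone()
    return tuple(json.loads(col) for col in row) if row else None
```

The generic get_resource_and_producers would build the JOIN clauses from the supplied (json_path, kind) pairs rather than hard-coding them as above.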
This is across multi-value metrics returned by a single experiment.
…hly-built source wheels

- Add `identify_and_copy_local_wheels` to detect local wheel paths in the
  `fromPyPI` list, copy them to the Ray working directory, and rewrite
  entries to use `RAY_RUNTIME_ENV_CREATE_WORKING_DIR` so wheels are
  available on all Ray nodes (not just the head node).
- Add `_newest_source_mtime` helper and freshness checks in
  `_build_source_wheels` to raise an error when a cached/reused wheel is
  detected instead of a freshly-built one, preventing stale wheels from
  being shipped between remote executions.
- Surface wheel-build progress to the console via `console_print` and
  pause the spinner during the build step for cleaner output.
- Thread `cwd`, `working_dir`, and `seen_basenames` through to
  `_write_runtime_env` to support the local-wheel copy logic.
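The freshness check in the second bullet can be sketched with mtimes; _newest_source_mtime is named above, while check_wheel_freshness and the exact comparison are assumptions about the implementation:

```python
import pathlib

def _newest_source_mtime(source_dir: str) -> float:
    """Newest modification time of any file under the source tree."""
    return max(
        p.stat().st_mtime
        for p in pathlib.Path(source_dir).rglob("*")
        if p.is_file()
    )

def check_wheel_freshness(wheel_path: str, source_dir: str) -> None:
    """Raise if the wheel predates the newest source file, i.e. a
    cached/reused wheel was picked up instead of a fresh build."""
    wheel_mtime = pathlib.Path(wheel_path).stat().st_mtime
    if wheel_mtime < _newest_source_mtime(source_dir):
        raise RuntimeError(
            f"stale wheel detected: {wheel_path} is older than its sources"
        )
```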

logger = logging.getLogger(__name__)

DEFAULT_MPS_FILE = "/Users/michaelj/tmp/miplib_2017_benchmarks/bab6.mps.gz"
Member


This probably shouldn't be our default

"description": (
"CPLEX time limit per seed run in seconds (CPX_PARAM_TILIM). "
"Default is 1e75 (no limit); CPLEX runs until the optimal solution is found. "
"No domain range is enforced — any positive value including 1e75 is accepted."
Member


do we check somewhere that it's positive?



def _align_to_grid(
samples: list[dict[str, Any]],
Member


I'm very surprised ruff didn't complain about Any being used here...


from orchestrator.schema.experiment import Experiment

DEFAULT_MPS = "/Users/michaelj/tmp/miplib_2017_benchmarks/bab6.mps.gz"
Member


We should have a better default, if we need one

description = "ado custom experiment for benchmarking CPLEX MIP solver performance across parameter configurations"
dependencies = [
"ado-core",
# "cplex",
Member


is this needed or not?

Member Author


Yes you need it - the reason it's commented here is so after upgrading it (removing community restrictions), the upgraded version doesn't get overwritten by a pip install.

…ve_mip.py

Co-authored-by: Alessandro Pomponio <10339005+AlessandroPomponio@users.noreply.github.com>
Signed-off-by: Michael Johnston <66301584+michael-johnston@users.noreply.github.com>
