
feat(experiment): cplex mip solver #773

Draft
michael-johnston wants to merge 38 commits into main from maj_cplex_mip

Conversation

Member

@michael-johnston michael-johnston commented Mar 31, 2026

Adds an experiment that can solve Mixed Integer Programming (MIP) problems using CPLEX (Community Edition CPLEX by default, which limits the number of variables and constraints).

The experiment allows setting various CPLEX parameters and returns metrics such as mip_gap, dual, and primal, along with time series of various quantities sampled as the solver proceeds.

It also includes the to-be-merged changes from #670.

The trainable only checked for critical errors at the very start of execution. Hence, if a critical error stopped delivery of a measurement a trainable was waiting on, the trainable would never exit.
If a custom experiment execution results in InvalidMeasurementResult (indicating an exception from a custom experiment or no results returned), the request should have status failed. Instead it had status success.
Long-running experiments can exceed the MySQL connection idle timeout, causing the DB connection to be closed. Once closed, subsequent use of the connection results in failures.
The code assumed the entity contained the just-measured property values, and hence could be used for the aggregation; however, this was not the case.

At some stage the code was changed to work directly with the MeasurementRequests, and updating this piece was missed.
numpy can't aggregate None in an array.

However, None is naturally used as a placeholder when an experiment could not return a value for some reason. In this case those values should be dropped from the aggregation rather than raising a numpy error.
It is scoped to the current operation. The previous code used the entity, which could contain matching properties from other experiments.
…ndant TCP handshake

Without caching, every component that calls engine_for_sql_store for the same
database URL (the metastore SQLResourceStore and the samplestore SQLSampleStore
being the primary pair) invokes sqlalchemy.create_engine, which creates a
separate connection pool and therefore opens a separate TCP+TLS+MySQL auth
connection.

Over a high-latency link the full handshake can cost 1-3s. With two independent
engines, both the metastore and samplestore paid this cost, adding an additional
1-3s to every CLI command.

_engine_cache (a process-level dict keyed by connection URL) ensures that the
second call for the same URL returns the already-initialised Engine and reuses
its warm connection pool, reducing the connection cost from 2x to 1x handshake per process lifetime.

Locally measured saving on `ado show requests/results/entities`: ~2,360ms.
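The caching described above amounts to a module-level memo over sqlalchemy.create_engine. A minimal sketch (the real engine_for_sql_store presumably takes richer configuration than a bare URL):

```python
import sqlalchemy

# Process-level cache keyed by connection URL, as described above.
_engine_cache: dict[str, sqlalchemy.engine.Engine] = {}

def engine_for_sql_store(url: str) -> sqlalchemy.engine.Engine:
    """Return one shared Engine per URL, so the TCP+TLS+MySQL handshake
    and its warm connection pool are paid once per process, not once
    per store that happens to point at the same database."""
    engine = _engine_cache.get(url)
    if engine is None:
        engine = sqlalchemy.create_engine(url)
        _engine_cache[url] = engine
    return engine
```

Because SQLAlchemy engines are thread-safe and pool their connections, returning the same Engine to both the metastore and samplestore is safe and gives the 2x-to-1x handshake reduction.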
Previously, SQLSampleStore.__init__ called _create_source_table()
unconditionally on every construction.  _create_source_table() calls
meta.create_all(checkfirst=True), which issues a separate table-existence
query for each of the four backing tables even when
the tables already exist — which they always do on read-only CLI commands.
Locally this cost was ~1.45s. The changes reduce this to 0.24s.

The main fix is to probe for table existence instead of blindly issuing _create_source_tables(). This removes the 1.45s cost and adds the probe. We use a direct SQL probe (taking 240ms, measured locally) instead of sqlalchemy.inspect(), which took 800ms locally.

The PR also introduces a process-level _source_tables_verified set that records tablename prefixes whose backing tables have already been confirmed to exist.
On the second and subsequent SQLSampleStore constructions for the same store within the same process, the probe is skipped entirely (0 round-trips).
The main impact of this is in the tests.
…or_to_pydantic

Previously, _measurement_requests_cursor_to_pydantic called
entityWithIdentifier in a loop for each entity not already in the local
cache, issuing one DB round-trip (~240ms) per unique entity. For an
operation touching N distinct entities this was an N+1 query pattern.
This is reached by initializing the store and then calling measurement_requests_for_operation or related methods, which happens with `ado show X operation`.

It might be expected that self._entities is pre-populated before this method runs. There are two reasons it is not:

1. The bulk-load mechanism is bypassed. The public `entities` property
   triggers refresh(force_fetch_all_entities=True) when self._entities is
   empty, loading all entities in one query. But
   _measurement_requests_cursor_to_pydantic accesses the private
   self._entities dict directly, never touching the property, so the
   lazy-load never fires.

2. Nothing in the construction path populates the cache. SQLSampleStore
   starts with self._entities = {}. By the time
   _measurement_requests_cursor_to_pydantic runs, the cache is still
   empty and every entity lookup falls through to entityWithIdentifier.

PyMySQL already buffers the full result set in memory before the Python
loop begins, so consuming the cursor up-front with fetchall() adds no
network round-trip. The fix uses this to collect all missing entity IDs
before any entity is resolved, then issues a single _fetch_entities
query for all of them at once regardless of N, ensuring the cache is populated
with the required entities.

The entityWithIdentifier fallback is retained and still fires for entities
added by a concurrent distributed process in the window between the batch
fetch and the cursor scan — the intended use of that code path.

Measured on an operation with 4 distinct entities (197ms RTT to DB):
  _measurement_requests_cursor_to_pydantic: 3,378ms → 859ms (-2,519ms)

Saving scales linearly with the number of distinct entities touched by
the operation: N round-trips collapsed to 1.
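The batch pattern can be sketched independently of the store internals. Here resolve_entities is a hypothetical helper; fetch_entities and fetch_one stand in for _fetch_entities and entityWithIdentifier:

```python
def resolve_entities(cursor_rows, cache: dict, fetch_entities, fetch_one):
    """Collapse the N+1 pattern: one batch query for all missing
    entity IDs, then a cursor scan served from the cache.

    cursor_rows: fully buffered rows (PyMySQL buffers before iteration,
                 so fetchall() adds no extra round-trip).
    cache:       self._entities-style dict of identifier -> entity.
    fetch_entities: batch loader, one query for many identifiers.
    fetch_one:   per-identifier fallback for concurrent writers.
    """
    missing = {row["entity_id"] for row in cursor_rows} - cache.keys()
    if missing:
        # One round-trip regardless of how many entities are missing.
        cache.update(fetch_entities(sorted(missing)))
    resolved = []
    for row in cursor_rows:
        entity = cache.get(row["entity_id"])
        if entity is None:
            # Entity added by a concurrent process after the batch fetch.
            entity = fetch_one(row["entity_id"])
            cache[row["entity_id"]] = entity
        resolved.append(entity)
    return resolved
```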
Profiling shows using sqlalchemy.inspect to probe table existence is 3x more expensive (750ms locally) than a direct SQL probe (250ms).
…B lookup

_perform_preflight_checks_for_sample_store_methods decorates every
sample store method (measurement_requests_for_operation,
complete_measurement_request_with_results_timeseries, etc.) and
verifies that the supplied operation_id belongs to the space before
allowing the call through. It did this by calling getResource(operation)
on every decorated call — a full DB round-trip each time.

For the common path of `ado show X operation`, which goes through
from_operation_id, this check is redundant: the operation was already fetched
to locate the space, so ownership is proven by construction.

The fix adds _verified_operation_ids: set[str] to DiscoverySpace.__init__
and guards the getResource call in the preflight decorator behind a set
membership test. from_operation_id pre-populates the set with the
operation_id immediately after the space is built, so any subsequent
decorated call on that space skips the DB round-trip entirely.

The full ownership check (getResource + uri comparison) is still
executed for operation IDs not in the set, preserving correctness for
spaces constructed via other paths (e.g. from_stored_configuration
called directly) where ownership has not yet been established.

Measured saving per decorated method call: ~653ms (this is the saving in an `ado show X operation` call).
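The guard can be sketched as follows; DiscoverySpaceSketch and _preflight are shortened stand-ins for the real class and for _perform_preflight_checks_for_sample_store_methods, and the resource shape is assumed:

```python
import functools

class DiscoverySpaceSketch:
    """Minimal sketch of the preflight guard described above."""

    def __init__(self, uri: str, get_resource):
        self.uri = uri
        self._get_resource = get_resource  # stand-in for getResource
        # from_operation_id pre-populates this set after building the space.
        self._verified_operation_ids: set[str] = set()

    def _preflight(method):
        @functools.wraps(method)
        def wrapper(self, operation_id, *args, **kwargs):
            if operation_id not in self._verified_operation_ids:
                # Full ownership check: one DB round-trip.
                resource = self._get_resource(operation_id)
                if resource["uri"] != self.uri:
                    raise ValueError(f"operation {operation_id} not in space")
                self._verified_operation_ids.add(operation_id)
            # Known-good IDs skip the round-trip entirely.
            return method(self, operation_id, *args, **kwargs)
        return wrapper

    @_preflight
    def measurement_requests_for_operation(self, operation_id):
        return f"requests for {operation_id}"
```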
…d-only paths

from_configuration unconditionally called experimentCatalog() on every
construction of a DiscoverySpace. This method queries the samplestore for
any externally registered experiment plugins and loads them into the
actuator registry so they are available for replay during an active
operation run.

For read-only CLI commands (show requests, show results, show entities)
this work is entirely unnecessary: these commands inspect stored
measurement data and never resolve or replay experiment references.
The catalog load was measured at ~852ms — a full DB query round-trip
plus plugin loading overhead — paid on every invocation with no benefit.

The commit adds a boolean load_experiment_catalog parameter (default True) to
__init__, from_configuration, and from_stored_configuration. from_operation_id,
whose callers are exclusively read-only show commands, passes
load_experiment_catalog=False, bypassing the experimentCatalog() call
and the actuator registry update entirely.

The default remains True so all other construction paths (from_configuration
called directly, space creation during an active run) retain the existing
behaviour without requiring changes at call sites.

Measured saving on ado show requests / show results / show entities: ~852ms.
…_resource_and_producers

from_operation_id previously made three sequential getResource calls to
assemble a DiscoverySpace from an operation ID:

  1. getResource(operation_id, OPERATION)   → extract spaces[0]
  2. getResource(space_id, DISCOVERYSPACE)  → extract sampleStoreIdentifier
  3. getResource(samplestore_id, SAMPLESTORE)

Each call is a separate network round-trip (~240ms RTT each, plus
SQLAlchemy query overhead). Because each call depends on the identifier
returned by the previous one they cannot be parallelised; total cost was
measured at ~2,520ms (1,194ms + 704ms + 622ms).

A new SQLResourceStore method, get_resource_and_producers, replaces the
three calls with a single SQL statement that walks the parent-child
identifier chain using JSON path extraction (data->>'$.path').

The method is generic: callers supply the root resource and an ordered
chain of (json_path, kind) pairs, making it reusable for other
parent-child resource hierarchies.

from_operation_id passes the three fetched resources directly into
from_stored_configuration and from_configuration via new optional
parameters (space_resource, samplestore_resource), so neither method
re-fetches what has already been retrieved.

Measured saving on ado show requests / show results: ~1.9s
(three sequential round-trips collapsed to one query).
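The single-statement chain walk might look like the following. The resources table schema and column names are assumed; the JSON paths come from the chain above. json_extract() is used so the sketch runs on SQLite, whereas the PR targets MySQL's data->>'$.path' syntax:

```python
import json
import sqlite3

# Assumed schema: resources(identifier, kind, data JSON). The joins walk
# operation -> space -> samplestore in one round-trip.
QUERY = """
SELECT op.data, sp.data, ss.data
FROM resources AS op
JOIN resources AS sp
  ON sp.identifier = json_extract(op.data, '$.spaces[0]')
 AND sp.kind = 'DISCOVERYSPACE'
JOIN resources AS ss
  ON ss.identifier = json_extract(sp.data, '$.sampleStoreIdentifier')
 AND ss.kind = 'SAMPLESTORE'
WHERE op.identifier = ? AND op.kind = 'OPERATION'
"""

def fetch_operation_chain(conn: sqlite3.Connection, operation_id: str):
    """One query in place of three sequential getResource calls."""
    row = conn.execute(QUERY, (operation_id,)).fetchone()
    return tuple(json.loads(col) for col in row) if row else None
```

The generic get_resource_and_producers would build the JOIN clauses from the supplied (json_path, kind) pairs rather than hard-coding them as above.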
This is across multi-value metrics returned by a single experiment.
…hly-built source wheels

- Add `identify_and_copy_local_wheels` to detect local wheel paths in the
  `fromPyPI` list, copy them to the Ray working directory, and rewrite
  entries to use `RAY_RUNTIME_ENV_CREATE_WORKING_DIR` so wheels are
  available on all Ray nodes (not just the head node).
- Add `_newest_source_mtime` helper and freshness checks in
  `_build_source_wheels` to raise an error when a cached/reused wheel is
  detected instead of a freshly-built one, preventing stale wheels from
  being shipped between remote executions.
- Surface wheel-build progress to the console via `console_print` and
  pause the spinner during the build step for cleaner output.
- Thread `cwd`, `working_dir`, and `seen_basenames` through to
  `_write_runtime_env` to support the local-wheel copy logic.
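The freshness check in the second bullet can be sketched with mtimes; _newest_source_mtime is named above, while check_wheel_freshness and the exact comparison are assumptions about the implementation:

```python
import pathlib

def _newest_source_mtime(source_dir: str) -> float:
    """Newest modification time of any file under the source tree."""
    return max(
        p.stat().st_mtime
        for p in pathlib.Path(source_dir).rglob("*")
        if p.is_file()
    )

def check_wheel_freshness(wheel_path: str, source_dir: str) -> None:
    """Raise if the wheel predates the newest source file, i.e. a
    cached/reused wheel was picked up instead of a fresh build."""
    wheel_mtime = pathlib.Path(wheel_path).stat().st_mtime
    if wheel_mtime < _newest_source_mtime(source_dir):
        raise RuntimeError(
            f"stale wheel detected: {wheel_path} is older than its sources"
        )
```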

logger = logging.getLogger(__name__)

DEFAULT_MPS_FILE = "/Users/michaelj/tmp/miplib_2017_benchmarks/bab6.mps.gz"
Member


This probably shouldn't be our default

"description": (
"CPLEX time limit per seed run in seconds (CPX_PARAM_TILIM). "
"Default is 1e75 (no limit); CPLEX runs until the optimal solution is found. "
"No domain range is enforced — any positive value including 1e75 is accepted."
Member


do we check somewhere that it's positive?



def _align_to_grid(
samples: list[dict[str, Any]],
Member


I'm very surprised ruff didn't complain about Any being used here...


from orchestrator.schema.experiment import Experiment

DEFAULT_MPS = "/Users/michaelj/tmp/miplib_2017_benchmarks/bab6.mps.gz"
Member


We should have a better default, if we need one

description = "ado custom experiment for benchmarking CPLEX MIP solver performance across parameter configurations"
dependencies = [
"ado-core",
# "cplex",
Member


is this needed or not?

Member Author


Yes you need it - the reason it's commented here is so after upgrading it (removing community restrictions), the upgraded version doesn't get overwritten by a pip install.

…ve_mip.py

Co-authored-by: Alessandro Pomponio <10339005+AlessandroPomponio@users.noreply.github.com>
Signed-off-by: Michael Johnston <66301584+michael-johnston@users.noreply.github.com>
