The trainable only checked for critical errors at the very start of execution. As a result, if a critical error stopped delivery of the measurement a trainable was waiting on, the trainable would never exit.
If a custom experiment execution results in InvalidMeasurementResult (indicating an exception from a custom experiment or no results returned), the request should have status failed. Instead it had status success.
Long-running experiments can exceed the MySQL connection idle timeout, causing the DB connection to be closed. Once closed, subsequent use of the connection results in failures.
The code assumed the entity contained the just-measured property values, and hence could be used to do the aggregation; however, this was not the case. At some stage the code was changed to work directly with the MeasurementRequests, and updating this piece was missed.
numpy can't aggregate "None" in an array. However, None is naturally used as a placeholder when an experiment could not return a value for some reason. In this case those values should be dropped from the aggregation rather than raising a numpy error.
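A minimal sketch of the drop-None-before-aggregating behaviour described above (the helper name is an assumption for illustration, not the project's actual function):

```python
import numpy as np


def aggregate_values(values):
    """Aggregate measured property values, dropping None placeholders.

    None marks values an experiment could not produce; numpy cannot reduce
    an object array containing None, so filter the placeholders out first.
    """
    present = [v for v in values if v is not None]
    if not present:
        return None  # nothing measurable to aggregate
    return float(np.mean(present))
```

Filtering rather than substituting keeps the aggregate an honest mean over the values that were actually measured.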
It is scoped to the current operation. Previous code used entity which could contain matching properties from other experiments.
…ndant TCP handshake
Without caching, every component that calls engine_for_sql_store for the same database URL (the metastore SQLResourceStore and the samplestore SQLSampleStore being the primary pair) each calls sqlalchemy.create_engine, which creates a separate connection pool and therefore opens a separate TCP+TLS+MySQL auth connection. Over a high-latency link the full handshake can cost 1-3s. With two independent engines, both the metastore and the samplestore paid this cost, adding an additional 1-3s to every CLI command. _engine_cache (a process-level dict keyed by connection URL) ensures that the second call for the same URL returns the already-initialised Engine and reuses its warm connection pool, reducing the connection cost from 2x to 1x handshake per process lifetime. Locally measured saving on `ado show requests/results/entities`: ~2,360ms.
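The caching idea can be sketched as below. To keep the sketch dependency-free the engine factory is injected; the real code presumably calls sqlalchemy.create_engine directly, and the function signature here is an assumption:

```python
from typing import Any, Callable, Dict

# Process-level cache keyed by connection URL: the second caller for the
# same URL reuses the first caller's Engine and its warm connection pool.
_engine_cache: Dict[str, Any] = {}


def engine_for_sql_store(url: str, create: Callable[[str], Any]) -> Any:
    """Return a shared engine per URL so the TCP/TLS/auth handshake is paid
    once per process, not once per calling component."""
    engine = _engine_cache.get(url)
    if engine is None:
        engine = create(url)  # stands in for sqlalchemy.create_engine(url)
        _engine_cache[url] = engine
    return engine
```

A dict keyed by URL is enough here because engines are process-local and long-lived; no eviction is needed for a CLI process lifetime.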
Rather than NaN.
Previously, SQLSampleStore.__init__ called _create_source_table() unconditionally on every construction. _create_source_table() calls meta.create_all(checkfirst=True), which issues a separate table-existence query for each of the four backing tables even when the tables already exist (which they always do on read-only CLI commands). Locally this cost was ~1.45s; the changes reduce it to 0.24s. The main fix is to probe for table existence instead of blindly calling _create_source_table(): this removes the 1.45s and adds the probe. We use a direct SQL probe (240ms measured locally) instead of sqlalchemy.inspect(), which took 800ms locally. The PR also introduces a process-level set, _source_tables_verified, that records tablename prefixes whose backing tables have already been confirmed to exist. On the second and subsequent SQLSampleStore constructions for the same store within the same process the probe is skipped entirely (0 round-trips). The main impact of this is in the tests.
…or_to_pydantic
Previously, _measurement_requests_cursor_to_pydantic called
entityWithIdentifier in a loop for each entity not already in the local
cache, issuing one DB round-trip (~240ms) per unique entity. For an
operation touching N distinct entities this was an N+1 query pattern.
This path is reached by initializing the store and then calling measurement_requests_for_operation or related methods, which happens with `ado show X operation`.
It might be expected that self._entities is pre-populated before this method runs. There are two reasons it is not:
1. The bulk-load mechanism is bypassed. The public `entities` property
triggers refresh(force_fetch_all_entities=True) when self._entities is
empty, loading all entities in one query. But
_measurement_requests_cursor_to_pydantic accesses the private
self._entities dict directly, never touching the property, so the
lazy-load never fires.
2. Nothing in the construction path populates the cache. SQLSampleStore
starts with self._entities = {}. By the time
_measurement_requests_cursor_to_pydantic runs, the cache is still
empty and every entity lookup falls through to entityWithIdentifier.
PyMySQL already buffers the full result set in memory before the Python
loop begins, so consuming the cursor up-front with fetchall() adds no
network round-trip. The fix uses this to collect all missing entity IDs
before any entity is resolved, then issues a single _fetch_entities
query for all of them at once regardless of N, ensuring the cache is populated
with the required entities.
The entityWithIdentifier fallback is retained and still fires for entities
added by a concurrent distributed process in the window between the batch
fetch and the cursor scan — the intended use of that code path.
Measured on an operation with 4 distinct entities (197ms RTT to DB):
_measurement_requests_cursor_to_pydantic: 3,378ms → 859ms (-2,519ms)
Saving scales linearly with the number of distinct entities touched by
the operation: N round-trips collapsed to 1.
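The batch-then-fallback shape described above can be sketched as follows. The function and field names are assumptions chosen for illustration; fetch_entities stands in for the single _fetch_entities query and fetch_one for the retained entityWithIdentifier fallback:

```python
def resolve_entities(rows, cache, fetch_entities, fetch_one):
    """Resolve the entity for each cursor row using one batched query.

    Collect the IDs missing from the cache, fetch them all in a single
    round-trip, then fall back to a per-ID fetch only for entities that
    appeared concurrently after the batch fetch.
    """
    missing = {r["entity_id"] for r in rows} - set(cache)
    if missing:
        # One round-trip regardless of how many entities are missing.
        for entity in fetch_entities(sorted(missing)):
            cache[entity["id"]] = entity
    resolved = []
    for r in rows:
        entity = cache.get(r["entity_id"])
        if entity is None:
            # Rare path: entity added by a concurrent process in the
            # window between the batch fetch and the cursor scan.
            entity = fetch_one(r["entity_id"])
            cache[entity["id"]] = entity
        resolved.append(entity)
    return resolved
```

Because the cursor is fully buffered client-side, scanning it twice (once to collect IDs, once to resolve) costs no extra network traffic.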
Profiling shows using sqlalchemy.inspect to probe table existence is 3x more expensive (750ms locally) than a direct SQL probe (250ms).
…B lookup
_perform_preflight_checks_for_sample_store_methods decorates every sample store method (measurement_requests_for_operation, complete_measurement_request_with_results_timeseries, etc.) and verifies that the supplied operation_id belongs to the space before allowing the call through. It did this by calling getResource(operation) on every decorated call: a full DB round-trip each time. For the common path of `ado show X operation`, which goes through from_operation_id, this check is redundant: the operation was already fetched to locate the space, so ownership is proven by construction.
The fix adds _verified_operation_ids: set[str] to DiscoverySpace.__init__ and guards the getResource call in the preflight decorator behind a set membership test. from_operation_id pre-populates the set with the operation_id immediately after the space is built, so any subsequent decorated call on that space skips the DB round-trip entirely. The full ownership check (getResource + uri comparison) is still executed for operation IDs not in the set, preserving correctness for spaces constructed via other paths (e.g. from_stored_configuration called directly) where ownership has not yet been established.
Measured saving per decorated method call: ~653ms (this saving applies to the `ado show X operation` call).
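A minimal sketch of the guarded preflight decorator, assuming a stubbed-down DiscoverySpace (the get_resource callable and method body here are illustrative stand-ins, not the project's code):

```python
import functools


class DiscoverySpace:
    """Sketch: memoise the per-operation ownership check in a set."""

    def __init__(self, get_resource):
        self._get_resource = get_resource  # stands in for getResource
        self._verified_operation_ids = set()

    def _preflight(method):
        @functools.wraps(method)
        def wrapper(self, operation_id, *args, **kwargs):
            if operation_id not in self._verified_operation_ids:
                # Full ownership check, only for unverified operation IDs.
                if self._get_resource(operation_id) is None:
                    raise ValueError(f"operation {operation_id} not in space")
                self._verified_operation_ids.add(operation_id)
            return method(self, operation_id, *args, **kwargs)
        return wrapper

    @_preflight
    def measurement_requests_for_operation(self, operation_id):
        return f"requests for {operation_id}"
```

A factory like from_operation_id would add the already-fetched operation's ID to _verified_operation_ids right after construction, so every decorated call on that space skips the lookup.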
…d-only paths
from_configuration unconditionally called experimentCatalog() on every construction of a DiscoverySpace. This method queries the samplestore for any externally registered experiment plugins and loads them into the actuator registry so they are available for replay during an active operation run. For read-only CLI commands (show requests, show results, show entities) this work is entirely unnecessary: these commands inspect stored measurement data and never resolve or replay experiment references. The catalog load was measured at ~852ms (a full DB query round-trip plus plugin loading overhead) paid on every invocation with no benefit.
The commit adds a boolean load_experiment_catalog parameter to __init__ (default True), from_configuration and from_stored_configuration. from_operation_id, whose callers are exclusively read-only show commands, passes load_experiment_catalog=False, bypassing the experimentCatalog() call and the actuator registry update entirely. The default remains True so all other construction paths (from_configuration called directly, space creation during an active run) retain the existing behaviour without requiring changes at call sites.
Measured saving on ado show requests / show results / show entities: ~852ms.
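The opt-out flag can be sketched as below. Only the parameter name load_experiment_catalog comes from the commit; the class shape, constructor arguments, and stub store are assumptions:

```python
class DiscoverySpace:
    """Sketch: catalog loading is skippable on read-only construction paths."""

    def __init__(self, sample_store, load_experiment_catalog: bool = True):
        self._sample_store = sample_store
        self._catalog = None
        if load_experiment_catalog:
            # ~852ms of DB query + plugin loading; only needed when an
            # operation may replay externally registered experiments.
            self._catalog = sample_store.experimentCatalog()

    @classmethod
    def from_operation_id(cls, operation_id, sample_store):
        # Read-only `ado show` paths never replay experiments: skip the load.
        return cls(sample_store, load_experiment_catalog=False)
```

Defaulting the flag to True keeps every existing call site behaviourally unchanged; only the read-only factory opts out.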
…_resource_and_producers
from_operation_id previously made three sequential getResource calls to assemble a DiscoverySpace from an operation ID:
1. getResource(operation_id, OPERATION) → extract spaces[0]
2. getResource(space_id, DISCOVERYSPACE) → extract sampleStoreIdentifier
3. getResource(samplestore_id, SAMPLESTORE)
Each call is a separate network round-trip (~240ms RTT each, plus SQLAlchemy query overhead). Because each call depends on the identifier returned by the previous one they cannot be parallelised; total cost was measured at ~2,520ms (1,194ms + 704ms + 622ms).
A new SQLResourceStore method, get_resource_and_producers, replaces the three calls with a single SQL statement that walks the parent-child identifier chain using JSON path extraction (data->>'$.path'). The method is generic: callers supply the root resource and an ordered chain of (json_path, kind) pairs, making it reusable for other parent-child resource hierarchies.
from_operation_id passes the three fetched resources directly into from_stored_configuration and from_configuration via new optional parameters (space_resource, samplestore_resource), so neither method re-fetches what has already been retrieved.
Measured saving on ado show requests / show results: ~1.9s (three sequential round-trips collapsed to one query).
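One way to build such a chain-walking statement is to generate self-joins on the resource table, extracting each child identifier with a JSON path. This is a sketch of the idea only: the table name, column names, and SQL dialect details (MySQL's `->>` operator) are assumptions, and real code would bind the kind values as parameters rather than interpolating them:

```python
def build_chain_query(root_kind, chain):
    """Build one SELECT that walks a parent->child identifier chain.

    `chain` is an ordered list of (json_path, kind) pairs: each child is
    joined on the identifier extracted from its parent's JSON `data`.
    Replaces N sequential lookups with a single round-trip.
    """
    query = (
        "SELECT r0.data"
        + "".join(f", r{i + 1}.data" for i in range(len(chain)))
        + " FROM resources r0"
    )
    for i, (json_path, kind) in enumerate(chain):
        query += (
            f" JOIN resources r{i + 1}"
            f" ON r{i + 1}.identifier = r{i}.data->>'{json_path}'"
            f" AND r{i + 1}.kind = '{kind}'"
        )
    query += f" WHERE r0.identifier = :root_id AND r0.kind = '{root_kind}'"
    return query
```

With the operation → space → samplestore chain from the commit, this emits one statement with two joins instead of three dependent round-trips.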
This is across multi-valued metrics returned by a single experiment.
…hly-built source wheels
- Add `identify_and_copy_local_wheels` to detect local wheel paths in the `fromPyPI` list, copy them to the Ray working directory, and rewrite entries to use `RAY_RUNTIME_ENV_CREATE_WORKING_DIR` so wheels are available on all Ray nodes (not just the head node).
- Add `_newest_source_mtime` helper and freshness checks in `_build_source_wheels` to raise an error when a cached/reused wheel is detected instead of a freshly-built one, preventing stale wheels from being shipped between remote executions.
- Surface wheel-build progress to the console via `console_print` and pause the spinner during the build step for cleaner output.
- Thread `cwd`, `working_dir`, and `seen_basenames` through to `_write_runtime_env` to support the local-wheel copy logic.
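The freshness check can be sketched with mtime comparison as below. The helper names echo the commit, but the bodies, the *.py selection rule, and the error message are assumptions:

```python
import pathlib


def newest_source_mtime(source_dir: str) -> float:
    """Newest modification time across Python sources under a tree."""
    return max(
        (p.stat().st_mtime for p in pathlib.Path(source_dir).rglob("*.py")),
        default=0.0,
    )


def check_wheel_freshness(wheel_path: str, source_dir: str) -> None:
    """Raise if the wheel predates the newest source file, i.e. a cached or
    reused wheel was picked up instead of a freshly built one."""
    wheel_mtime = pathlib.Path(wheel_path).stat().st_mtime
    if wheel_mtime < newest_source_mtime(source_dir):
        raise RuntimeError(
            f"{wheel_path} is older than the sources; refusing to ship a stale wheel"
        )
```

Failing loudly here is preferable to silently shipping a stale wheel to remote Ray nodes, where the mismatch would surface as confusing runtime behaviour.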
Review comment on:
DEFAULT_MPS_FILE = "/Users/michaelj/tmp/miplib_2017_benchmarks/bab6.mps.gz"
Reviewer: This probably shouldn't be our default.
| "description": ( | ||
| "CPLEX time limit per seed run in seconds (CPX_PARAM_TILIM). " | ||
| "Default is 1e75 (no limit); CPLEX runs until the optimal solution is found. " | ||
| "No domain range is enforced — any positive value including 1e75 is accepted." |
Member
There was a problem hiding this comment.
do we check somewhere that it's positive?
Review comment on:
def _align_to_grid(
    samples: list[dict[str, Any]],
Reviewer: I'm very surprised ruff didn't complain about Any being used here...
Review comment on:
from orchestrator.schema.experiment import Experiment

DEFAULT_MPS = "/Users/michaelj/tmp/miplib_2017_benchmarks/bab6.mps.gz"
Reviewer: We should have a better default, if we need one.
| description = "ado custom experiment for benchmarking CPLEX MIP solver performance across parameter configurations" | ||
| dependencies = [ | ||
| "ado-core", | ||
| # "cplex", |
Member
There was a problem hiding this comment.
is this needed or not?
Member
Author
There was a problem hiding this comment.
Yes you need it - the reason it's commented here is so after upgrading it (removing community restrictions), the upgraded version doesn't get overwritten by a pip install.
…ve_mip.py Co-authored-by: Alessandro Pomponio <10339005+AlessandroPomponio@users.noreply.github.com> Signed-off-by: Michael Johnston <66301584+michael-johnston@users.noreply.github.com>
Adds an experiment that can solve Mixed Integer Programming problems using CPLEX (community CPLEX by default, which has limits on the number of variables and constraints).
The experiment allows setting various CPLEX parameters and returns metrics like mip_gap, dual, and primal, along with timeseries of various quantities captured as the solver proceeds.
It also includes the to-be-merged changes in #670