Skip to content

[Autoscaler][v2] Skip instances of unknown type in AutoscalerMetricsReporter to avoid KeyError#64250

Open
daiping8 wants to merge 2 commits into
ray-project:masterfrom
daiping8:fix/autoscaler-keyerror-unknown-instance-type
Open

[Autoscaler][v2] Skip instances of unknown type in AutoscalerMetricsReporter to avoid KeyError#64250
daiping8 wants to merge 2 commits into
ray-project:masterfrom
daiping8:fix/autoscaler-keyerror-unknown-instance-type

Conversation

@daiping8

@daiping8 daiping8 commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

Description

In KubeRay deployments running autoscaler v2, a RayWorkerGroup CR can be removed dynamically (e.g. by editing the RayCluster spec). Instances that were created under that group linger in the Instance Manager state until they reach TERMINATED, but their instance_type is no longer present in the active node_type_configs once the config is rebuilt from the updated spec.

The bug

AutoscalerMetricsReporter.report_instances and report_resources (in python/ray/autoscaler/v2/metrics_reporter.py) assume a closed-world invariant: every instance.instance_type is a key in the current node_type_configs. That invariant does not hold during the drain window of a removed worker group:

  • report_instances initializes status_count_by_type only from node_type_configs.keys(), then unconditionally does
    status_count_by_type[instance.instance_type][...] += 1
  • report_resources calls _add_resources(..., node_type_configs, instance.instance_type, 1), which dereferences
    node_type_configs[node_type]

So the next reporter tick raises:

KeyError: '<removed-group-name>'

Impact

The exception propagates out of the reporting loops and aborts the rest of that tick's metric updates. As a result, the following Prometheus gauges go stale for every node type (not just the removed one) until the lingering instances are fully garbage-collected:

  • autoscaler_pending_nodes
  • autoscaler_active_nodes
  • autoscaler_recently_failed_nodes
  • autoscaler_pending_resources
  • autoscaler_cluster_resources

Operators see missing or stale autoscaler metrics precisely during the scale-down event they most want visibility into, and may also see autoscaler error-loop noise from the recurring KeyError.

Reproduction

  1. Deploy a KubeRay cluster with autoscaler v2 enabled (RAY_enable_autoscaler_v2=1).
  2. Configure the RayCluster with at least one worker group, e.g. worker-group-a.
  3. Scale worker-group-a up so at least one instance reaches a non-terminal state (e.g. ALLOCATED, RAY_INSTALLING, RAY_RUNNING, TERMINATING).
  4. Remove worker-group-a from the RayCluster spec (dynamic CR update), which drops "worker-group-a" from node_type_configs.
  5. Wait for the next autoscaler tick → KeyError: 'worker-group-a' is raised inside report_instances / report_resources, and the gauges above are not updated for that tick.

The fix

Treat an unknown instance_type as a transient condition rather than a hard error:

  1. New helper _filter_active_instances — a @staticmethod on AutoscalerMetricsReporter that returns only instances whose instance_type is still present in node_type_configs, logging at INFO for each skipped instance (so operators can still see the drain happening):
    logger.info(
        "Skipping instance %s with unknown type %r, active types: %s",
        instance.instance_id,
        instance.instance_type,
        list(active_types.keys()),
    )
  2. Apply the filter in both reporting pathsreport_instances and report_resources now run instances = self._filter_active_instances(instances, node_type_configs) before their counting/aggregation loops, so a removed node type can no longer poison the loop.

Verification

Added test_report_skips_unknown_instance_types in python/ray/autoscaler/v2/tests/test_metrics_reporter.py:

  • Constructs node_type_configs containing only type_1.
  • Builds an instance list mixing type_1 and a removed_type instance.
  • Calls both report_instances and report_resources; both must complete without raising KeyError.
  • Asserts the resulting Prometheus samples for type_1 reflect only the type_1 instances (1 active, 1 pending, 1 CPU cluster resource, 1 CPU pending resource) and that the removed_type instance is silently dropped.
# Run the metrics reporter tests.
pytest -xvs python/ray/autoscaler/v2/tests/test_metrics_reporter.py

When a RayWorkerGroup CR is dynamically removed, the autoscaler's
instance manager may still hold instances whose instance_type is no
longer present in the active node_type_configs. Both
`report_instances` and `report_resources` index status_count_by_type
and node_type_configs by `instance.instance_type`, which raises
KeyError for those lingering instances and stops metric updates
mid-loop.

This change adds a `_filter_active_instances` helper that drops
instances whose type is missing from the active config (logging at
INFO level) before both reporting loops, so the autoscaler can
continue to publish metrics for the remaining active node types
instead of failing.

Signed-off-by: daiping8 <daiping8@zte.com.cn>
@daiping8 daiping8 requested a review from a team as a code owner June 22, 2026 10:25

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a mechanism to filter out instances with unknown types in the autoscaler metrics reporter, preventing potential KeyError exceptions when a RayWorkerGroup CR is dynamically removed. Feedback suggests changing the log level from INFO to DEBUG when skipping unknown instances to avoid log spam, and refactoring a duplicated helper function _get_metrics in the test file to a module-level helper.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread python/ray/autoscaler/v2/metrics_reporter.py Outdated
Comment thread python/ray/autoscaler/v2/tests/test_metrics_reporter.py Outdated

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Fix All in Cursor

Reviewed by Cursor Bugbot for commit 31a38df. Configure here.

resource_map[resource_name] += resource_value * count

instances = self._filter_active_instances(instances, node_type_configs)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stale resource gauges after filter

Medium Severity

When every instance is dropped by _filter_active_instances, report_resources exits without calling .set() on any gauges, while report_instances still zeros node-type gauges in the same tick. During a removed worker-group drain, Prometheus can show zero active nodes but unchanged autoscaler_cluster_resources / autoscaler_pending_resources from the prior scrape.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 31a38df. Configure here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a pre-existing behavior, not introduced by this PR.

report_resources only calls .set() on gauges for resources present in pending_resources / cluster_resources, both of which are defaultdict(float) populated only when at least one instance is pending or running. So whenever every instance is TERMINATED (or every instance is filtered out), both dicts are empty and no .set() is called — the gauges retain the value from the prior scrape. That's the case on master today, independent of this change.

This PR's _filter_active_instances adds one more path that can produce an empty dict (instances skipped because their type was removed), but doesn't change the empty-dict → no-.set() semantics. Fixing it properly means seeding pending_resources / cluster_resources from node_type_configs (so every configured resource is always reset to 0 before accumulation) — that's a behavior change to the reporter's contract and out of scope for a KeyError fix.

I'll file a follow-up issue for the stale-gauge behavior so it can be addressed separately.

…lper

- Use ray.util.debug.log_once keyed by instance_type so the unknown-type
  skip message is emitted at most once per type, instead of twice per
  tick per instance. Keeps INFO visibility for persistent config
  mismatches while avoiding log spam during transient drain windows.
- Extract the duplicated _get_metrics test helper to module level so
  both test_report_nodes_resources and test_report_skips_unknown_instance_types
  reuse it.

Signed-off-by: daiping8 <daiping8@zte.com.cn>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant