[Autoscaler][v2] Skip instances of unknown type in AutoscalerMetricsReporter to avoid KeyError by daiping8 · Pull Request #64250 · ray-project/ray

daiping8 · 2026-06-22T10:25:12Z

Description

In KubeRay deployments running autoscaler v2, a RayWorkerGroup CR can be removed dynamically (e.g. by editing the RayCluster spec). Instances that were created under that group linger in the Instance Manager state until they reach TERMINATED, but their instance_type is no longer present in the active node_type_configs once the config is rebuilt from the updated spec.

The bug

AutoscalerMetricsReporter.report_instances and report_resources (in python/ray/autoscaler/v2/metrics_reporter.py) assume a closed-world invariant: every instance.instance_type is a key in the current node_type_configs. That invariant does not hold during the drain window of a removed worker group:

report_instances initializes status_count_by_type only from node_type_configs.keys(), then unconditionally does
status_count_by_type[instance.instance_type][...] += 1
report_resources calls _add_resources(..., node_type_configs, instance.instance_type, 1), which dereferences
node_type_configs[node_type]

So the next reporter tick raises:

KeyError: '<removed-group-name>'

Impact

The exception propagates out of the reporting loops and aborts the rest of that tick's metric updates. As a result, the following Prometheus gauges go stale for every node type (not just the removed one) until the lingering instances are fully garbage-collected:

autoscaler_pending_nodes
autoscaler_active_nodes
autoscaler_recently_failed_nodes
autoscaler_pending_resources
autoscaler_cluster_resources

Operators see missing or stale autoscaler metrics precisely during the scale-down event they most want visibility into, and may also see autoscaler error-loop noise from the recurring KeyError.

Reproduction

Deploy a KubeRay cluster with autoscaler v2 enabled (RAY_enable_autoscaler_v2=1).
Configure the RayCluster with at least one worker group, e.g. worker-group-a.
Scale worker-group-a up so at least one instance reaches a non-terminal state (e.g. ALLOCATED, RAY_INSTALLING, RAY_RUNNING, TERMINATING).
Remove worker-group-a from the RayCluster spec (dynamic CR update), which drops "worker-group-a" from node_type_configs.
Wait for the next autoscaler tick → KeyError: 'worker-group-a' is raised inside report_instances / report_resources, and the gauges above are not updated for that tick.

The fix

Treat an unknown instance_type as a transient condition rather than a hard error:

New helper _filter_active_instances — a @staticmethod on AutoscalerMetricsReporter that returns only instances whose instance_type is still present in node_type_configs, logging at INFO for each skipped instance (so operators can still see the drain happening):
```
logger.info(
    "Skipping instance %s with unknown type %r, active types: %s",
    instance.instance_id,
    instance.instance_type,
    list(active_types.keys()),
)
```
Apply the filter in both reporting paths — report_instances and report_resources now run instances = self._filter_active_instances(instances, node_type_configs) before their counting/aggregation loops, so a removed node type can no longer poison the loop.

Verification

Added test_report_skips_unknown_instance_types in python/ray/autoscaler/v2/tests/test_metrics_reporter.py:

Constructs node_type_configs containing only type_1.
Builds an instance list mixing type_1 and a removed_type instance.
Calls both report_instances and report_resources; both must complete without raising KeyError.
Asserts the resulting Prometheus samples for type_1 reflect only the type_1 instances (1 active, 1 pending, 1 CPU cluster resource, 1 CPU pending resource) and that the removed_type instance is silently dropped.

# Run the metrics reporter tests.
pytest -xvs python/ray/autoscaler/v2/tests/test_metrics_reporter.py

When a RayWorkerGroup CR is dynamically removed, the autoscaler's instance manager may still hold instances whose instance_type is no longer present in the active node_type_configs. Both `report_instances` and `report_resources` index status_count_by_type and node_type_configs by `instance.instance_type`, which raises KeyError for those lingering instances and stops metric updates mid-loop. This change adds a `_filter_active_instances` helper that drops instances whose type is missing from the active config (logging at INFO level) before both reporting loops, so the autoscaler can continue to publish metrics for the remaining active node types instead of failing. Signed-off-by: daiping8 <daiping8@zte.com.cn>

gemini-code-assist

Code Review

This pull request introduces a mechanism to filter out instances with unknown types in the autoscaler metrics reporter, preventing potential KeyError exceptions when a RayWorkerGroup CR is dynamically removed. Feedback suggests changing the log level from INFO to DEBUG when skipping unknown instances to avoid log spam, and refactoring a duplicated helper function _get_metrics in the test file to a module-level helper.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

cursor

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

^{Reviewed by Cursor Bugbot for commit 31a38df. Configure here.}

cursor · 2026-06-22T10:26:54Z

                resource_map[resource_name] += resource_value * count

+        instances = self._filter_active_instances(instances, node_type_configs)
+


Stale resource gauges after filter

Medium Severity

When every instance is dropped by _filter_active_instances, report_resources exits without calling .set() on any gauges, while report_instances still zeros node-type gauges in the same tick. During a removed worker-group drain, Prometheus can show zero active nodes but unchanged autoscaler_cluster_resources / autoscaler_pending_resources from the prior scrape.

^{Reviewed by Cursor Bugbot for commit 31a38df. Configure here.}

This is a pre-existing behavior, not introduced by this PR.

report_resources only calls .set() on gauges for resources present in pending_resources / cluster_resources, both of which are defaultdict(float) populated only when at least one instance is pending or running. So whenever every instance is TERMINATED (or every instance is filtered out), both dicts are empty and no .set() is called — the gauges retain the value from the prior scrape. That's the case on master today, independent of this change.

This PR's _filter_active_instances adds one more path that can produce an empty dict (instances skipped because their type was removed), but doesn't change the empty-dict → no-.set() semantics. Fixing it properly means seeding pending_resources / cluster_resources from node_type_configs (so every configured resource is always reset to 0 before accumulation) — that's a behavior change to the reporter's contract and out of scope for a KeyError fix.

I'll file a follow-up issue for the stale-gauge behavior so it can be addressed separately.

…lper - Use ray.util.debug.log_once keyed by instance_type so the unknown-type skip message is emitted at most once per type, instead of twice per tick per instance. Keeps INFO visibility for persistent config mismatches while avoiding log spam during transient drain windows. - Extract the duplicated _get_metrics test helper to module level so both test_report_nodes_resources and test_report_skips_unknown_instance_types reuse it. Signed-off-by: daiping8 <daiping8@zte.com.cn>

daiping8 requested a review from a team as a code owner June 22, 2026 10:25

gemini-code-assist Bot reviewed Jun 22, 2026

View reviewed changes

Comment thread python/ray/autoscaler/v2/metrics_reporter.py Outdated

Comment thread python/ray/autoscaler/v2/tests/test_metrics_reporter.py Outdated

cursor Bot reviewed Jun 22, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Autoscaler][v2] Skip instances of unknown type in AutoscalerMetricsReporter to avoid KeyError#64250

[Autoscaler][v2] Skip instances of unknown type in AutoscalerMetricsReporter to avoid KeyError#64250
daiping8 wants to merge 2 commits into
ray-project:masterfrom
daiping8:fix/autoscaler-keyerror-unknown-instance-type

daiping8 commented Jun 22, 2026 •

edited

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

Uh oh!

cursor Bot left a comment

Uh oh!

cursor Bot Jun 22, 2026

Uh oh!

daiping8 Jun 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		resource_map[resource_name] += resource_value * count

		instances = self._filter_active_instances(instances, node_type_configs)

Conversation

daiping8 commented Jun 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

The bug

Impact

Reproduction

The fix

Verification

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

cursor Bot Jun 22, 2026

Choose a reason for hiding this comment

Stale resource gauges after filter

Uh oh!

daiping8 Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

daiping8 commented Jun 22, 2026 •

edited

Loading