[Autoscaler][v2] Skip instances of unknown type in AutoscalerMetricsReporter to avoid KeyError #64250
When a RayWorkerGroup CR is dynamically removed, the autoscaler's instance manager may still hold instances whose instance_type is no longer present in the active node_type_configs. Both `report_instances` and `report_resources` index status_count_by_type and node_type_configs by `instance.instance_type`, which raises KeyError for those lingering instances and stops metric updates mid-loop. This change adds a `_filter_active_instances` helper that drops instances whose type is missing from the active config (logging at INFO level) before both reporting loops, so the autoscaler can continue to publish metrics for the remaining active node types instead of failing.

Signed-off-by: daiping8 <daiping8@zte.com.cn>
Code Review
This pull request introduces a mechanism to filter out instances with unknown types in the autoscaler metrics reporter, preventing potential KeyError exceptions when a RayWorkerGroup CR is dynamically removed. Feedback suggests changing the log level from INFO to DEBUG when skipping unknown instances to avoid log spam, and refactoring a duplicated helper function _get_metrics in the test file to a module-level helper.
Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 31a38df.
Diff context: `resource_map[resource_name] += resource_value * count`, followed by `instances = self._filter_active_instances(instances, node_type_configs)`.
Stale resource gauges after filter
Medium Severity
When every instance is dropped by _filter_active_instances, report_resources exits without calling .set() on any gauges, while report_instances still zeros node-type gauges in the same tick. During a removed worker-group drain, Prometheus can show zero active nodes but unchanged autoscaler_cluster_resources / autoscaler_pending_resources from the prior scrape.
This is a pre-existing behavior, not introduced by this PR.
report_resources only calls .set() on gauges for resources present in pending_resources / cluster_resources, both of which are defaultdict(float) populated only when at least one instance is pending or running. So whenever every instance is TERMINATED (or every instance is filtered out), both dicts are empty and no .set() is called — the gauges retain the value from the prior scrape. That's the case on master today, independent of this change.
This PR's _filter_active_instances adds one more path that can produce an empty dict (instances skipped because their type was removed), but doesn't change the empty-dict → no-.set() semantics. Fixing it properly means seeding pending_resources / cluster_resources from node_type_configs (so every configured resource is always reset to 0 before accumulation) — that's a behavior change to the reporter's contract and out of scope for a KeyError fix.
I'll file a follow-up issue for the stale-gauge behavior so it can be addressed separately.
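The stale-gauge behavior and the seeding idea mentioned in the reply can be illustrated with a minimal sketch. `FakeGauge`, `report_unseeded`, and `report_seeded` are hypothetical names invented for this illustration, and the gauge is simplified to a plain dict keyed by resource name; this is not the reporter's actual code.

```python
from collections import defaultdict


class FakeGauge:
    """Hypothetical stand-in for a Prometheus gauge keyed by resource name."""

    def __init__(self):
        self.values = {}

    def set(self, resource, value):
        self.values[resource] = value


def report_unseeded(gauge, running_types, node_type_configs):
    # Mirrors the behavior described in the reply: only resources that
    # were accumulated during this tick receive a .set() call.
    cluster_resources = defaultdict(float)
    for node_type in running_types:
        for name, amount in node_type_configs[node_type].items():
            cluster_resources[name] += amount
    for name, total in cluster_resources.items():
        gauge.set(name, total)


def report_seeded(gauge, running_types, node_type_configs):
    # Possible follow-up (an assumption, not the merged code): start every
    # configured resource at 0 so it is reset even when nothing contributes.
    cluster_resources = {
        name: 0.0 for cfg in node_type_configs.values() for name in cfg
    }
    for node_type in running_types:
        for name, amount in node_type_configs[node_type].items():
            cluster_resources[name] += amount
    for name, total in cluster_resources.items():
        gauge.set(name, total)


configs = {"type_1": {"CPU": 1.0}}

stale = FakeGauge()
report_unseeded(stale, ["type_1"], configs)  # tick 1: one running node
report_unseeded(stale, [], configs)          # tick 2: nothing running

fresh = FakeGauge()
report_seeded(fresh, ["type_1"], configs)
report_seeded(fresh, [], configs)
```

After the second tick, the unseeded gauge still holds the value from the first tick, while the seeded one has been reset to zero.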
…lper

- Use `ray.util.debug.log_once` keyed by `instance_type` so the unknown-type skip message is emitted at most once per type, instead of twice per tick per instance. Keeps INFO visibility for persistent config mismatches while avoiding log spam during transient drain windows.
- Extract the duplicated `_get_metrics` test helper to module level so both `test_report_nodes_resources` and `test_report_skips_unknown_instance_types` reuse it.

Signed-off-by: daiping8 <daiping8@zte.com.cn>
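The once-per-type logging described in this commit can be sketched roughly as follows. To keep the example runnable without Ray installed, `log_once` here is a local stand-in that only approximates `ray.util.debug.log_once`; the key format, the `warn_unknown_types` helper, and the log message are assumptions made for illustration.

```python
import logging

logger = logging.getLogger("metrics_reporter_sketch")
_seen_keys = set()


def log_once(key):
    """Local stand-in: returns True only the first time a key is seen."""
    if key in _seen_keys:
        return False
    _seen_keys.add(key)
    return True


def warn_unknown_types(instance_types, node_type_configs):
    """Log one INFO line per unknown type; return how many lines were emitted."""
    emitted = 0
    for instance_type in instance_types:
        if instance_type in node_type_configs:
            continue
        # Keying by instance_type deduplicates across instances and ticks.
        if log_once(f"unknown_instance_type_{instance_type}"):
            logger.info("Skipping instance of unknown type %s", instance_type)
            emitted += 1
    return emitted


configs = {"type_1": {}}
first_tick = warn_unknown_types(["type_1", "removed_type", "removed_type"], configs)
second_tick = warn_unknown_types(["removed_type"], configs)
```

With this keying, two lingering instances of the same removed type produce a single log line, and later ticks produce none.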


Description
In KubeRay deployments running autoscaler v2, a `RayWorkerGroup` CR can be removed dynamically (e.g. by editing the `RayCluster` spec). Instances that were created under that group linger in the Instance Manager state until they reach `TERMINATED`, but their `instance_type` is no longer present in the active `node_type_configs` once the config is rebuilt from the updated spec.

The bug

`AutoscalerMetricsReporter.report_instances` and `report_resources` (in `python/ray/autoscaler/v2/metrics_reporter.py`) assume a closed-world invariant: every `instance.instance_type` is a key in the current `node_type_configs`. That invariant does not hold during the drain window of a removed worker group:

- `report_instances` initializes `status_count_by_type` only from `node_type_configs.keys()`, then unconditionally does `status_count_by_type[instance.instance_type][...] += 1`.
- `report_resources` calls `_add_resources(..., node_type_configs, instance.instance_type, 1)`, which dereferences `node_type_configs[node_type]`.

So the next reporter tick raises a `KeyError` for the removed instance type.
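The failure mode can be reproduced in isolation with a small sketch. `FakeInstance` and `count_by_type` are hypothetical names that only mimic the indexing pattern described above; they are not the reporter's real classes or methods.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class FakeInstance:
    """Hypothetical stand-in for an Instance Manager record."""

    instance_type: str
    status: str


def count_by_type(instances, node_type_configs):
    # Counters are seeded only from the keys of the active config.
    status_count_by_type = {
        node_type: defaultdict(int) for node_type in node_type_configs
    }
    for instance in instances:
        # Raises KeyError when instance_type was removed from the config.
        status_count_by_type[instance.instance_type][instance.status] += 1
    return status_count_by_type


node_type_configs = {"worker-group-b": {"CPU": 1}}
instances = [
    FakeInstance("worker-group-b", "RAY_RUNNING"),
    FakeInstance("worker-group-a", "TERMINATING"),  # lingering, type removed
]

try:
    count_by_type(instances, node_type_configs)
    raised = None
except KeyError as exc:
    raised = exc.args[0]
```

Here `raised` ends up holding the name of the removed worker group, which is the key the lookup could not find.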
Impact
The exception propagates out of the reporting loops and aborts the rest of that tick's metric updates. As a result, the following Prometheus gauges go stale for every node type (not just the removed one) until the lingering instances are fully garbage-collected:

- `autoscaler_pending_nodes`
- `autoscaler_active_nodes`
- `autoscaler_recently_failed_nodes`
- `autoscaler_pending_resources`
- `autoscaler_cluster_resources`

Operators see missing or stale autoscaler metrics precisely during the scale-down event they most want visibility into, and may also see autoscaler error-loop noise from the recurring `KeyError`.

Reproduction
1. Enable autoscaler v2 (`RAY_enable_autoscaler_v2=1`).
2. Create a `RayCluster` with at least one worker group, e.g. `worker-group-a`.
3. Scale `worker-group-a` up so at least one instance reaches a non-terminal state (e.g. `ALLOCATED`, `RAY_INSTALLING`, `RAY_RUNNING`, `TERMINATING`).
4. Remove `worker-group-a` from the `RayCluster` spec (dynamic CR update), which drops `"worker-group-a"` from `node_type_configs`.
5. `KeyError: 'worker-group-a'` is raised inside `report_instances` / `report_resources`, and the gauges above are not updated for that tick.

The fix
Treat an unknown `instance_type` as a transient condition rather than a hard error:

- `_filter_active_instances`: a `@staticmethod` on `AutoscalerMetricsReporter` that returns only instances whose `instance_type` is still present in `node_type_configs`, logging at `INFO` for each skipped instance (so operators can still see the drain happening).
- `report_instances` and `report_resources` now run `instances = self._filter_active_instances(instances, node_type_configs)` before their counting/aggregation loops, so a removed node type can no longer poison the loop.

Verification
Added `test_report_skips_unknown_instance_types` in `python/ray/autoscaler/v2/tests/test_metrics_reporter.py`:

- Builds `node_type_configs` containing only `type_1`.
- Creates instances of `type_1` and a `removed_type` instance.
- Calls `report_instances` and `report_resources`; both must complete without raising `KeyError`.
- Asserts that the metrics for `type_1` reflect only the `type_1` instances (1 active, 1 pending, 1 CPU cluster resource, 1 CPU pending resource) and that the `removed_type` instance is silently dropped.

To run the metrics reporter tests:

    pytest -xvs python/ray/autoscaler/v2/tests/test_metrics_reporter.py
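The filtering step and the scenario the test exercises can be approximated with a standalone sketch. `FakeInstance`, `filter_active_instances`, and `count_statuses` are hypothetical names for illustration, and the status strings are assumptions; the real helper is a `@staticmethod` on `AutoscalerMetricsReporter` and also logs skipped instances.

```python
from dataclasses import dataclass


@dataclass
class FakeInstance:
    """Hypothetical stand-in for an Instance Manager record."""

    instance_type: str
    status: str


def filter_active_instances(instances, node_type_configs):
    # Keep only instances whose type still exists in the active config.
    return [i for i in instances if i.instance_type in node_type_configs]


def count_statuses(instances, node_type_configs):
    # Filter first so the indexing below can never hit a removed type.
    instances = filter_active_instances(instances, node_type_configs)
    counts = {node_type: {} for node_type in node_type_configs}
    for instance in instances:
        by_status = counts[instance.instance_type]
        by_status[instance.status] = by_status.get(instance.status, 0) + 1
    return counts


node_type_configs = {"type_1": {"CPU": 1}}
instances = [
    FakeInstance("type_1", "RAY_RUNNING"),
    FakeInstance("type_1", "ALLOCATED"),
    FakeInstance("removed_type", "RAY_RUNNING"),  # type no longer configured
]
counts = count_statuses(instances, node_type_configs)
```

The `removed_type` instance contributes nothing to `counts`, and no exception is raised, which matches the behavior the test asserts.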