feat(admission): enable hpa for KAI Admission Service by faizan-exe · Pull Request #911 · NVIDIA/KAI-Scheduler

faizan-exe · 2026-01-21T10:45:04Z

Description

This PR enables Horizontal Pod Autoscaling (HPA) for the admission pods. Previously, users could only specify a fixed number of replicas for the admission pods in kai-config.yaml; now they can enable autoscaling and avoid manual configuration.

The metrics used for autoscaling are:

controller_runtime_webhook_requests_total
process_cpu_seconds_total

Prometheus Adapter is also introduced to interact with the Custom Metrics API.
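Both metrics listed above are cumulative counters, so an autoscaler would normally act on their per-second rate rather than on the raw value. The sketch below shows that conversion in plain Go. It is a simplified illustration, not Prometheus's exact `rate()` implementation, and the `sample` type and `perSecondRate` function are hypothetical names that do not appear in this PR.

```go
package main

import "fmt"

// sample is one scrape of a monotonically increasing counter.
// (Hypothetical type, used only for this illustration.)
type sample struct {
	seconds float64 // scrape time, in seconds
	value   float64 // counter value at that time
}

// perSecondRate returns the average per-second increase across the samples.
// A drop in value is treated as a counter reset (for example a pod restart),
// so the post-reset value is counted as the increase for that interval.
func perSecondRate(samples []sample) float64 {
	if len(samples) < 2 {
		return 0
	}
	increase := 0.0
	for i := 1; i < len(samples); i++ {
		delta := samples[i].value - samples[i-1].value
		if delta < 0 {
			delta = samples[i].value
		}
		increase += delta
	}
	elapsed := samples[len(samples)-1].seconds - samples[0].seconds
	if elapsed <= 0 {
		return 0
	}
	return increase / elapsed
}

func main() {
	// Three scrapes of a request counter, 30 seconds apart.
	scrapes := []sample{{0, 100}, {30, 160}, {60, 220}}
	fmt.Printf("%.1f requests/s\n", perSecondRate(scrapes))
}
```

For `process_cpu_seconds_total`, the same calculation yields CPU-seconds consumed per second, which is the number of cores in use.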

Related Issues

Issue #901

Checklist

[Yes] Self-reviewed
[Yes] Added/updated tests (if needed)
[Yes] Updated documentation (if needed)

Breaking Changes

No breaking changes. Backward compatibility is preserved.

pkg/apis/kai/v1/admission/admission.go

pkg/operator/operands/admission/resources.go

pkg/operator/operands/common/common.go

github-actions · 2026-01-25T15:35:27Z

Merging this branch changes the coverage (1 decrease, 1 increase)

Impacted Packages	Coverage Δ	🤖
github.com/NVIDIA/KAI-scheduler/pkg/apis/kai/v1/admission	20.37% (-0.68%)	👎
github.com/NVIDIA/KAI-scheduler/pkg/operator/operands/admission	87.77% (+0.57%)	👍

Coverage by file

Changed files (no unit tests)

Changed File	Coverage Δ	Total	Covered	Missed	🤖
github.com/NVIDIA/KAI-scheduler/pkg/apis/kai/v1/admission/admission.go	100.00% (ø)	22 (+6)	22 (+6)	0
github.com/NVIDIA/KAI-scheduler/pkg/apis/kai/v1/admission/zz_generated.deepcopy.go	0.00% (ø)	86 (+26)	0	86 (+26)
github.com/NVIDIA/KAI-scheduler/pkg/operator/operands/admission/admission.go	72.00% (ø)	25	18	7
github.com/NVIDIA/KAI-scheduler/pkg/operator/operands/admission/resources.go	91.23% (+0.23%)	114 (+14)	104 (+13)	10 (+1)	👍

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

github.com/NVIDIA/KAI-scheduler/pkg/apis/kai/v1/admission/admission_test.go
github.com/NVIDIA/KAI-scheduler/pkg/operator/operands/admission/resources_test.go

Signed-off-by: Ahmed Faizan <faizanofficial120@gmail.com>

itsomri · 2026-02-02T17:44:13Z

docs/metrics/README.md

+
+Add the Prometheus Adapter Helm repository:
+```bash
+helm repo add prometheus-community https://prometheus-community.github.io/helm-charts


I think we can drop these two lines as they're duplicates from lines 4-5 in this file

itsomri · 2026-02-02T17:52:01Z

docs/metrics/service-monitors.yaml

+  name: admission
+  namespace: kai-scheduler
+  labels:
+    accounting: kai-scheduler


I think we should have an "accounting" prometheus that's separate from a general prometheus for metrics (and autoscaling).
The reasons are:

When using time aware fairness, the accounting prometheus is in the critical path of the scheduler. While the scheduler will work even if it's down, it could make very different scheduling decisions, causing unexpected preemptions. So we should keep it very minimal and keep load off of it.

The accounting prometheus, in some expected use cases, needs to be persistent, even for a year or more of data. Since you can't choose which metrics are persisted and which aren't, every metric it collects will take up that space. But the more operational metrics (like the ones used for HPA, or other metrics like scheduler metrics) might not need this persistence, which wastes storage space.

The accounting prometheus can potentially have different sampling intervals than the general prometheus

Maybe we can find some design where the separation is configurable, or maybe it will just be easier to have them always separate.

What do you think?

github-actions · 2026-02-02T18:18:20Z

Merging this branch changes the coverage (1 decrease, 1 increase)

Impacted Packages	Coverage Δ	🤖
github.com/NVIDIA/KAI-scheduler/pkg/apis/kai/v1/admission	20.35% (-0.70%)	👎
github.com/NVIDIA/KAI-scheduler/pkg/operator/operands/admission	87.77% (+0.57%)	👍

Coverage by file

Changed files (no unit tests)

Changed File	Coverage Δ	Total	Covered	Missed	🤖
github.com/NVIDIA/KAI-scheduler/pkg/apis/kai/v1/admission/admission.go	100.00% (ø)	23 (+7)	23 (+7)	0
github.com/NVIDIA/KAI-scheduler/pkg/apis/kai/v1/admission/zz_generated.deepcopy.go	0.00% (ø)	90 (+30)	0	90 (+30)
github.com/NVIDIA/KAI-scheduler/pkg/operator/operands/admission/admission.go	72.00% (ø)	25	18	7
github.com/NVIDIA/KAI-scheduler/pkg/operator/operands/admission/resources.go	91.23% (+0.23%)	114 (+14)	104 (+13)	10 (+1)	👍

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

github.com/NVIDIA/KAI-scheduler/pkg/apis/kai/v1/admission/admission_test.go
github.com/NVIDIA/KAI-scheduler/pkg/operator/operands/admission/resources_test.go

enoodle · 2026-02-03T15:51:40Z

pkg/operator/operands/admission/resources.go

+				},
+				Target: autoscalingv2.MetricTarget{
+					Type:         autoscalingv2.AverageValueMetricType,
+					AverageValue: resource.NewMilliQuantity(int64(*config.Autoscaling.CPUUtilizationPercent)*10, resource.DecimalSI),


Why is this multiplied by 10? What does this AverageValue mean here, and why would it go down when scaling?
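One possible reading of the arithmetic in question, offered as an assumption rather than as the author's stated intent: if `CPUUtilizationPercent` means a percentage of one CPU core, then multiplying by 10 converts it to millicores (1 core = 1000m, so 1% = 10m). With an `AverageValue` target, the HPA compares the per-pod average of the metric against that target, so the average falls as the same load is spread over more replicas. The sketch below uses the replica formula from the Kubernetes HPA documentation and ignores its tolerance band and stabilization window. The function names are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// targetMilliCPU converts a percentage of one CPU core into millicores.
// 1 core = 1000m, so 1% of a core = 10m. (Assumed interpretation.)
func targetMilliCPU(percent int32) int64 {
	return int64(percent) * 10
}

// desiredReplicas applies the documented HPA rule
//   desired = ceil(currentReplicas * currentValue / targetValue)
// where currentValue is the metric averaged over the current pods.
func desiredReplicas(current int32, avgMilli, targetMilli int64) int32 {
	ratio := float64(avgMilli) / float64(targetMilli)
	return int32(math.Ceil(float64(current) * ratio))
}

func main() {
	target := targetMilliCPU(80) // 800m per pod

	// Two pods averaging 1600m each are over target, so the HPA scales up.
	scaled := desiredReplicas(2, 1600, target)
	fmt.Println("target (m):", target, "desired replicas:", scaled)

	// The same total load (3200m) spread over the new replica count
	// brings the per-pod average back down to the target.
	fmt.Println("new per-pod average (m):", int64(2*1600)/int64(scaled))
}
```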

enoodle · 2026-02-03T15:54:20Z

pkg/operator/operands/admission/resources.go

 	deployment.Spec.Strategy.RollingUpdate = nil
-	deployment.Spec.Replicas = config.Replicas
+	if config.Autoscaling == nil || !*config.Autoscaling.Enabled {
+		deployment.Spec.Replicas = config.Replicas


Maybe it should be set to nil otherwise? Note that it reads the current deployment and only updates those fields.
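A minimal sketch of this suggestion, using hypothetical stand-in types instead of the operator's real config and Deployment structs. It also guards against a nil `Enabled` pointer, which the condition in the diff would dereference. Whether a nil value actually leaves the live replica count alone depends on how the operator writes the object back, which this thread does not show.

```go
package main

import "fmt"

// Autoscaling and Config are hypothetical stand-ins for the operator's
// admission config types; only the fields relevant here are modeled.
type Autoscaling struct {
	Enabled *bool
}

type Config struct {
	Replicas    *int32
	Autoscaling *Autoscaling
}

// autoscalingEnabled is nil-safe: a missing block or a missing Enabled
// field both count as "disabled".
func autoscalingEnabled(cfg Config) bool {
	return cfg.Autoscaling != nil &&
		cfg.Autoscaling.Enabled != nil &&
		*cfg.Autoscaling.Enabled
}

// replicasFor returns the value to place in the Deployment's replica field:
// the configured count when autoscaling is off, and nil when it is on so the
// operator does not pin a replica count that the HPA is managing.
func replicasFor(cfg Config) *int32 {
	if autoscalingEnabled(cfg) {
		return nil
	}
	return cfg.Replicas
}

func main() {
	on := true
	three := int32(3)

	fixed := replicasFor(Config{Replicas: &three})
	scaled := replicasFor(Config{Replicas: &three, Autoscaling: &Autoscaling{Enabled: &on}})

	fmt.Println("fixed replicas set:", fixed != nil)
	fmt.Println("autoscaled replicas set:", scaled != nil)
}
```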

enoodle · 2026-02-03T15:58:36Z

docs/metrics/prometheus-adapter-values.yaml

+# SPDX-License-Identifier: Apache-2.0
+
+prometheus:
+  url: http://prometheus-operated.kai-scheduler.svc.cluster.local


We should use another instance

enoodle requested changes Jan 21, 2026

pkg/apis/kai/v1/admission/admission.go (outdated)

pkg/operator/operands/admission/resources.go (outdated)

pkg/operator/operands/common/common.go (outdated)

faizan-exe force-pushed the feat/issue-901 branch from f312be0 to 0043f41 on January 24, 2026 14:51

faizan-exe requested a review from enoodle January 24, 2026 14:53

faizan-exe force-pushed the feat/issue-901 branch from a2448b8 to 0e468f1 on February 1, 2026 14:52

faizan-exe added 5 commits February 1, 2026 21:08

feat(admission): enable hpa for KAI Admission Service

dedeafd

Signed-off-by: Ahmed Faizan <faizanofficial120@gmail.com>

chore: append CHANELOG.md and resolve comments

ec660d2

Signed-off-by: Ahmed Faizan <faizanofficial120@gmail.com>

chore: revert comment

9c5bd34

Signed-off-by: Ahmed Faizan <faizanofficial120@gmail.com>

chore: run make validate

f47fa81

Signed-off-by: Ahmed Faizan <faizanofficial120@gmail.com>

chore: add cpu utilization

f5fef74

Signed-off-by: Ahmed Faizan <faizanofficial120@gmail.com>

faizan-exe force-pushed the feat/issue-901 branch from 0e468f1 to f5fef74 on February 1, 2026 16:10

chore: update kai-scheduler configs

5c4ac02

Signed-off-by: Ahmed Faizan <faizanofficial120@gmail.com>

itsomri reviewed Feb 2, 2026

enoodle reviewed Feb 3, 2026

faizan-exe wants to merge 6 commits into NVIDIA:main from faizan-exe:feat/issue-901

3 participants
