feat(admission): enable hpa for KAI Admission Service #911

Open
faizan-exe wants to merge 6 commits into NVIDIA:main from faizan-exe:feat/issue-901

faizan-exe (Contributor) commented Jan 21, 2026

Description

This PR enables Horizontal Pod Autoscaling (HPA) for the admission pods. Previously, users could only specify a fixed number of replicas for the admission pods in kai-config.yaml. They can now enable autoscaling instead and avoid manual configuration.

The metrics used for autoscaling are:

  • controller_runtime_webhook_requests_total
  • process_cpu_seconds_total

Prometheus Adapter is also introduced to expose these metrics through the Custom Metrics API.
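Both metric names end in `_total`, so they are counters, and a counter is usually turned into a per-second rate before it can drive scaling. The adapter rules are not shown in this description, so the following is only a sketch of that conversion under the assumption that a rate is what the HPA ends up seeing. The `sample` type and `perSecondRate` function are hypothetical names, and the reset handling is simpler than what Prometheus itself does.

```go
package main

import "fmt"

// sample is one scrape of a monotonically increasing counter (hypothetical type).
type sample struct {
	value   float64 // counter value at scrape time
	seconds float64 // scrape timestamp in seconds
}

// perSecondRate returns the average increase per second between two scrapes.
// It returns 0 when the interval is not positive or the counter went backwards
// (for example after a pod restart), which is a simplification.
func perSecondRate(prev, curr sample) float64 {
	dt := curr.seconds - prev.seconds
	dv := curr.value - prev.value
	if dt <= 0 || dv < 0 {
		return 0
	}
	return dv / dt
}

func main() {
	// Illustrative numbers only: 300 webhook requests arrive within 60 seconds.
	requests := perSecondRate(sample{1200, 0}, sample{1500, 60})
	fmt.Println("webhook requests per second:", requests)

	// For process_cpu_seconds_total the same rate reads as "cores in use".
	cpu := perSecondRate(sample{40, 0}, sample{64, 60})
	fmt.Println("cores in use:", cpu)
}
```

Read this way, the rate of `process_cpu_seconds_total` is the number of CPU cores a pod is consuming, which is the quantity an average-value CPU target would be compared against.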

Related Issues

Issue #901

Checklist

  • [Yes] Self-reviewed
  • [Yes] Added/updated tests (if needed)
  • [Yes] Updated documentation (if needed)

Breaking Changes

No breaking changes. Backward compatibility is preserved.

@github-actions

Merging this branch changes the coverage (1 decrease, 1 increase)

| Impacted Packages | Coverage Δ | 🤖 |
|---|---|---|
| github.com/NVIDIA/KAI-scheduler/pkg/apis/kai/v1/admission | 20.37% (-0.68%) | 👎 |
| github.com/NVIDIA/KAI-scheduler/pkg/operator/operands/admission | 87.77% (+0.57%) | 👍 |

Coverage by file

Changed files (no unit tests)

| Changed File | Coverage Δ | Total | Covered | Missed | 🤖 |
|---|---|---|---|---|---|
| github.com/NVIDIA/KAI-scheduler/pkg/apis/kai/v1/admission/admission.go | 100.00% (ø) | 22 (+6) | 22 (+6) | 0 | |
| github.com/NVIDIA/KAI-scheduler/pkg/apis/kai/v1/admission/zz_generated.deepcopy.go | 0.00% (ø) | 86 (+26) | 0 | 86 (+26) | |
| github.com/NVIDIA/KAI-scheduler/pkg/operator/operands/admission/admission.go | 72.00% (ø) | 25 | 18 | 7 | |
| github.com/NVIDIA/KAI-scheduler/pkg/operator/operands/admission/resources.go | 91.23% (+0.23%) | 114 (+14) | 104 (+13) | 10 (+1) | 👍 |

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/NVIDIA/KAI-scheduler/pkg/apis/kai/v1/admission/admission_test.go
  • github.com/NVIDIA/KAI-scheduler/pkg/operator/operands/admission/resources_test.go

Signed-off-by: Ahmed Faizan <faizanofficial120@gmail.com>

Add the Prometheus Adapter Helm repository:
```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
```
Collaborator

I think we can drop these two lines, as they duplicate lines 4-5 in this file.

name: admission
namespace: kai-scheduler
labels:
  accounting: kai-scheduler
Collaborator

I think we should have an "accounting" Prometheus that is separate from the general Prometheus used for metrics (and autoscaling).
The reasons are:

  1. When using time aware fairness, the accounting prometheus is in the critical path of the scheduler. While the scheduler will work even if it's down, it could make very different scheduling decisions, causing unexpected preemptions. So we should keep it very minimal and keep load off of it.
  2. The accounting prometheus, in some expected use cases, needs to be persistent, holding a year or more of data. Since you can't choose which metrics are persisted and which aren't, every metric it collects takes up that space. The more operational metrics (like the ones used for HPA, or scheduler metrics) might not need this persistence, so storing them there wastes storage space.
  3. The accounting prometheus can potentially have different sampling intervals than the general prometheus.

Maybe we can find some design where the separation is configurable, or maybe it will just be easier to have them always separate.

What do you think?
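One way the "configurable separation" idea could look, purely as a sketch: a config with a required accounting endpoint and an optional metrics endpoint that falls back to the accounting one when unset. `PrometheusEndpoints`, its fields, and the second URL are hypothetical and not part of this PR. Only the first URL appears in the diff under review.

```go
package main

import "fmt"

// PrometheusEndpoints is a hypothetical config shape, not a type from this PR.
type PrometheusEndpoints struct {
	// AccountingURL is the instance used for time-aware fairness.
	AccountingURL string
	// MetricsURL is an optional separate instance for operational metrics
	// such as the HPA inputs. Empty means "share the accounting instance".
	MetricsURL string
}

// metricsURL returns the instance that operational queries should target.
func (p PrometheusEndpoints) metricsURL() string {
	if p.MetricsURL != "" {
		return p.MetricsURL
	}
	return p.AccountingURL
}

func main() {
	shared := PrometheusEndpoints{
		AccountingURL: "http://prometheus-operated.kai-scheduler.svc.cluster.local",
	}
	split := shared
	split.MetricsURL = "http://prometheus-metrics.example.svc" // hypothetical second instance

	fmt.Println("shared:", shared.metricsURL())
	fmt.Println("split: ", split.metricsURL())
}
```

A fallback like this keeps small installs on a single instance, while the always-separate option mentioned above would drop the fallback and require both fields.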

@github-actions

github-actions bot commented Feb 2, 2026

Merging this branch changes the coverage (1 decrease, 1 increase)

| Impacted Packages | Coverage Δ | 🤖 |
|---|---|---|
| github.com/NVIDIA/KAI-scheduler/pkg/apis/kai/v1/admission | 20.35% (-0.70%) | 👎 |
| github.com/NVIDIA/KAI-scheduler/pkg/operator/operands/admission | 87.77% (+0.57%) | 👍 |

Coverage by file

Changed files (no unit tests)

| Changed File | Coverage Δ | Total | Covered | Missed | 🤖 |
|---|---|---|---|---|---|
| github.com/NVIDIA/KAI-scheduler/pkg/apis/kai/v1/admission/admission.go | 100.00% (ø) | 23 (+7) | 23 (+7) | 0 | |
| github.com/NVIDIA/KAI-scheduler/pkg/apis/kai/v1/admission/zz_generated.deepcopy.go | 0.00% (ø) | 90 (+30) | 0 | 90 (+30) | |
| github.com/NVIDIA/KAI-scheduler/pkg/operator/operands/admission/admission.go | 72.00% (ø) | 25 | 18 | 7 | |
| github.com/NVIDIA/KAI-scheduler/pkg/operator/operands/admission/resources.go | 91.23% (+0.23%) | 114 (+14) | 104 (+13) | 10 (+1) | 👍 |

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/NVIDIA/KAI-scheduler/pkg/apis/kai/v1/admission/admission_test.go
  • github.com/NVIDIA/KAI-scheduler/pkg/operator/operands/admission/resources_test.go

},
Target: autoscalingv2.MetricTarget{
	Type:         autoscalingv2.AverageValueMetricType,
	AverageValue: resource.NewMilliQuantity(int64(*config.Autoscaling.CPUUtilizationPercent)*10, resource.DecimalSI),
Collaborator

Why is this multiplied by 10? What does this AverageValue mean here, and why would it go down when scaling?
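To make the arithmetic behind the question concrete, here is a minimal sketch. It assumes the intent of `*10` is to express a percentage of one full core in milli-cores (100 maps to 1000m), which the diff does not confirm. It also uses the documented HPA rule for average-value targets, desired replicas = ceil(total metric across pods / per-pod target), and ignores tolerance, min/max bounds, and stabilization. The function names and the 2400m load figure are hypothetical.

```go
package main

import "fmt"

// targetMilliCores mirrors the expression in the diff: a percentage times 10
// gives milli-cores, so 100 maps to 1000m (one full core).
func targetMilliCores(cpuPercent int32) int64 {
	return int64(cpuPercent) * 10
}

// desiredReplicas applies the simplified average-value rule:
// ceil(total usage across pods / per-pod target), never below 1.
func desiredReplicas(totalMilliCores, targetPerPod int64) int64 {
	if targetPerPod <= 0 {
		return 1
	}
	n := (totalMilliCores + targetPerPod - 1) / targetPerPod // integer ceiling
	if n < 1 {
		return 1
	}
	return n
}

func main() {
	target := targetMilliCores(80) // 80 "percent" becomes 800m per pod
	total := int64(2400)           // hypothetical combined CPU usage of all admission pods

	// With the total load held constant, the per-pod average falls as replicas are added.
	for _, replicas := range []int64{2, 3, 4} {
		fmt.Printf("replicas=%d average=%dm target=%dm\n", replicas, total/replicas, target)
	}
	fmt.Println("desired replicas:", desiredReplicas(total, target))
}
```

Under these assumptions the average drops on scale-up only because the same total load is spread over more pods. If per-pod usage does not depend on replica count, an average-value target would not converge, which may be the concern behind the question.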

deployment.Spec.Strategy.RollingUpdate = nil
deployment.Spec.Replicas = config.Replicas
if config.Autoscaling == nil || !*config.Autoscaling.Enabled {
	deployment.Spec.Replicas = config.Replicas
Collaborator

Maybe it should be set to nil otherwise? Note that the code reads in the current deployment and only updates those fields.
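A sketch of what this suggestion might look like, using trimmed-down stand-in types rather than the operator's real ones (`Config`, `Autoscaling`, and both helper functions are hypothetical). It also adds a nil check on `Enabled` that the diff does not have, on the assumption that the flag may be left unset.

```go
package main

import "fmt"

// Autoscaling and Config are hypothetical stand-ins for the operator's config types.
type Autoscaling struct {
	Enabled *bool
}

type Config struct {
	Replicas    *int32
	Autoscaling *Autoscaling
}

// autoscalingEnabled is nil-safe for both the struct and its Enabled flag.
func autoscalingEnabled(c Config) bool {
	return c.Autoscaling != nil && c.Autoscaling.Enabled != nil && *c.Autoscaling.Enabled
}

// desiredReplicasField returns the value to write into the Deployment's
// replicas field: nil when the HPA should own the count, otherwise the
// fixed value from the config.
func desiredReplicasField(c Config) *int32 {
	if autoscalingEnabled(c) {
		return nil
	}
	return c.Replicas
}

func main() {
	on, three := true, int32(3)
	fixed := Config{Replicas: &three}
	scaled := Config{Replicas: &three, Autoscaling: &Autoscaling{Enabled: &on}}

	fmt.Println("fixed replicas:", *desiredReplicasField(fixed))
	fmt.Println("autoscaled leaves replicas unset:", desiredReplicasField(scaled) == nil)
}
```

Whether an unset value is safe depends on how the operator writes the object back to the cluster, which this thread does not settle.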

# SPDX-License-Identifier: Apache-2.0

prometheus:
  url: http://prometheus-operated.kai-scheduler.svc.cluster.local
Collaborator

We should use another Prometheus instance here.

3 participants