fix: exclude Releasing pods from max pods predicate count #964

Open
itsomri wants to merge 1 commit into main from fix/max-pods-exclude-releasing

itsomri (Collaborator) commented Feb 4, 2026

Description

This fixes a bug where the max pods predicate counted pods in the Releasing state (being preempted or evicted) toward the node's max pod limit, which prevented new pods from scheduling even though slots were freeing up.

Checklist

  • Self-reviewed
  • Added/updated tests (if needed)
  • Updated documentation (if needed)

Solution: Added an AllocatedPodCount field to NodeInfo that counts only
actively allocated pods, excluding those in Releasing state. The predicate
reads the counter in O(1), with no map lookups or event handler overhead.

- Added NodeInfo.AllocatedPodCount field
- Incremented in AddTask for actively allocated pods
- Decremented in RemoveTask for actively allocated pods
- Updated max pods predicate to use AllocatedPodCount
- Added unit test for preemption scenario with 110 pods
- Updated all test expectations to validate AllocatedPodCount
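
The bookkeeping listed above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the PR's actual code: the `Task` and `TaskStatus` types, the `isActivelyAllocated` helper, the `FitsMaxPods` method, and the rule that only `Running` counts as actively allocated are all assumptions made for the example. Only the names `NodeInfo`, `AllocatedPodCount`, `AddTask`, and `RemoveTask` come from the description.

```go
package main

import "fmt"

// TaskStatus is a hypothetical, simplified stand-in for the scheduler's pod status.
type TaskStatus int

const (
	Pending TaskStatus = iota
	Running
	Releasing // being preempted or evicted; its slot is about to free up
)

// Task is a hypothetical, trimmed-down pod record.
type Task struct {
	Name   string
	Status TaskStatus
}

// isActivelyAllocated decides whether a task counts toward max pods.
// Assumption: only Running counts; the real scheduler may treat more
// statuses as actively allocated.
func isActivelyAllocated(s TaskStatus) bool { return s == Running }

// NodeInfo keeps a running counter next to the task map so the predicate
// never has to iterate over the tasks.
type NodeInfo struct {
	MaxPods           int
	Tasks             map[string]Task // stored by value so the recorded status cannot drift
	AllocatedPodCount int
}

func NewNodeInfo(maxPods int) *NodeInfo {
	return &NodeInfo{MaxPods: maxPods, Tasks: make(map[string]Task)}
}

// AddTask records the task and increments the counter only for actively
// allocated tasks. Re-adding an existing name is ignored.
func (n *NodeInfo) AddTask(t Task) {
	if _, exists := n.Tasks[t.Name]; exists {
		return
	}
	n.Tasks[t.Name] = t
	if isActivelyAllocated(t.Status) {
		n.AllocatedPodCount++
	}
}

// RemoveTask mirrors AddTask, using the status recorded at add time.
func (n *NodeInfo) RemoveTask(name string) {
	t, ok := n.Tasks[name]
	if !ok {
		return
	}
	delete(n.Tasks, name)
	if isActivelyAllocated(t.Status) {
		n.AllocatedPodCount--
	}
}

// FitsMaxPods is the sketch's max pods predicate: a single integer comparison.
func (n *NodeInfo) FitsMaxPods() bool { return n.AllocatedPodCount < n.MaxPods }

func main() {
	node := NewNodeInfo(110)
	for i := 0; i < 109; i++ {
		node.AddTask(Task{Name: fmt.Sprintf("run-%d", i), Status: Running})
	}
	// The 110th pod is already being evicted, so it should not block scheduling.
	node.AddTask(Task{Name: "victim", Status: Releasing})
	fmt.Println(len(node.Tasks), node.AllocatedPodCount, node.FitsMaxPods())
}
```

In this sketch a status change (for example Running to Releasing) would be modeled as a RemoveTask followed by an AddTask with the new status. Whether the real code handles transitions that way is not stated in the description.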

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

github-actions bot commented Feb 4, 2026

📊 Performance Benchmark Results

Comparing PR (fix/max-pods-exclude-releasing) vs main branch:

goos: linux
goarch: amd64
pkg: github.com/NVIDIA/KAI-scheduler/pkg/scheduler/actions
cpu: AMD EPYC 7763 64-Core Processor                
                                    │ main-bench.txt │            pr-bench.txt            │
                                    │     sec/op     │    sec/op     vs base              │
AllocateAction_SmallCluster-4           108.4m ±  1%   108.4m ±  7%       ~ (p=0.937 n=6)
AllocateAction_MediumCluster-4          136.5m ±  1%   137.7m ±  2%  +0.83% (p=0.026 n=6)
AllocateAction_LargeCluster-4           224.5m ± 11%   225.8m ± 13%       ~ (p=0.937 n=6)
ReclaimAction_SmallCluster-4            102.7m ±  0%   102.9m ±  0%  +0.17% (p=0.015 n=6)
ReclaimAction_MediumCluster-4           105.4m ±  0%   105.7m ±  0%  +0.25% (p=0.002 n=6)
PreemptAction_SmallCluster-4            103.6m ±  0%   103.6m ±  0%       ~ (p=0.818 n=6)
PreemptAction_MediumCluster-4           112.7m ±  0%   113.0m ±  0%  +0.30% (p=0.026 n=6)
ConsolidationAction_SmallCluster-4      113.5m ±  0%   113.7m ±  0%       ~ (p=0.065 n=6)
ConsolidationAction_MediumCluster-4     201.4m ±  1%   201.9m ±  2%       ~ (p=0.589 n=6)
FullSchedulingCycle_SmallCluster-4      105.1m ±  0%   105.3m ±  0%       ~ (p=0.589 n=6)
FullSchedulingCycle_MediumCluster-4     119.1m ±  1%   119.2m ±  1%       ~ (p=0.180 n=6)
FullSchedulingCycle_LargeCluster-4      158.0m ±  2%   159.4m ±  1%       ~ (p=0.093 n=6)
ManyQueues_MediumCluster-4              139.9m ±  0%   140.5m ±  1%  +0.41% (p=0.041 n=6)
GangScheduling_MediumCluster-4          157.1m ±  2%   159.1m ±  1%       ~ (p=0.394 n=6)
geomean                                 130.6m         131.0m        +0.38%

                                    │ main-bench.txt │            pr-bench.txt            │
                                    │      B/op      │     B/op      vs base              │
AllocateAction_SmallCluster-4           2.153Mi ± 0%   2.153Mi ± 1%       ~ (p=0.699 n=6)
AllocateAction_MediumCluster-4          11.84Mi ± 0%   11.84Mi ± 0%       ~ (p=0.310 n=6)
AllocateAction_LargeCluster-4           41.54Mi ± 0%   41.54Mi ± 0%       ~ (p=0.065 n=6)
ReclaimAction_SmallCluster-4            890.8Ki ± 1%   889.0Ki ± 1%       ~ (p=0.937 n=6)
ReclaimAction_MediumCluster-4           2.832Mi ± 0%   2.833Mi ± 0%  +0.02% (p=0.026 n=6)
PreemptAction_SmallCluster-4            1.007Mi ± 1%   1.005Mi ± 0%       ~ (p=0.485 n=6)
PreemptAction_MediumCluster-4           4.016Mi ± 0%   4.017Mi ± 0%  +0.02% (p=0.026 n=6)
ConsolidationAction_SmallCluster-4      5.604Mi ± 0%   5.606Mi ± 0%       ~ (p=0.589 n=6)
ConsolidationAction_MediumCluster-4     46.89Mi ± 0%   46.88Mi ± 0%  -0.02% (p=0.015 n=6)
FullSchedulingCycle_SmallCluster-4      1.372Mi ± 1%   1.373Mi ± 1%       ~ (p=0.937 n=6)
FullSchedulingCycle_MediumCluster-4     6.836Mi ± 0%   6.837Mi ± 0%  +0.01% (p=0.041 n=6)
FullSchedulingCycle_LargeCluster-4      22.83Mi ± 0%   22.83Mi ± 0%  +0.01% (p=0.009 n=6)
ManyQueues_MediumCluster-4              16.30Mi ± 0%   16.31Mi ± 0%       ~ (p=0.310 n=6)
GangScheduling_MediumCluster-4          17.17Mi ± 0%   17.17Mi ± 0%       ~ (p=0.310 n=6)
geomean                                 6.331Mi        6.330Mi       -0.02%

                                    │ main-bench.txt │           pr-bench.txt            │
                                    │   allocs/op    │  allocs/op   vs base              │
AllocateAction_SmallCluster-4            36.20k ± 0%   36.20k ± 0%       ~ (p=0.762 n=6)
AllocateAction_MediumCluster-4           325.2k ± 0%   325.2k ± 0%       ~ (p=0.838 n=6)
AllocateAction_LargeCluster-4            1.394M ± 0%   1.394M ± 0%       ~ (p=0.253 n=6)
ReclaimAction_SmallCluster-4             8.396k ± 0%   8.396k ± 0%       ~ (p=0.474 n=6)
ReclaimAction_MediumCluster-4            26.54k ± 0%   26.54k ± 0%       ~ (p=0.773 n=6)
PreemptAction_SmallCluster-4             11.19k ± 0%   11.19k ± 0%       ~ (p=0.883 n=6)
PreemptAction_MediumCluster-4            38.77k ± 0%   38.77k ± 0%       ~ (p=0.210 n=6)
ConsolidationAction_SmallCluster-4       73.57k ± 0%   73.56k ± 0%       ~ (p=0.937 n=6)
ConsolidationAction_MediumCluster-4      685.9k ± 0%   685.8k ± 0%  -0.02% (p=0.009 n=6)
FullSchedulingCycle_SmallCluster-4       21.36k ± 0%   21.36k ± 0%       ~ (p=0.920 n=6)
FullSchedulingCycle_MediumCluster-4      174.7k ± 0%   174.7k ± 0%       ~ (p=0.714 n=6)
FullSchedulingCycle_LargeCluster-4       727.3k ± 0%   727.2k ± 0%       ~ (p=0.288 n=6)
ManyQueues_MediumCluster-4               363.3k ± 0%   363.3k ± 0%       ~ (p=0.485 n=6)
GangScheduling_MediumCluster-4           597.0k ± 0%   597.0k ± 0%       ~ (p=0.491 n=6)
geomean                                  111.7k        111.7k       -0.00%

Legend

  • 📉 Negative delta = Performance improvement (faster)
  • 📈 Positive delta = Performance regression (slower)
  • p-value < 0.05 indicates a statistically significant change

Raw benchmark data

PR branch:

goos: linux
goarch: amd64
pkg: github.com/NVIDIA/KAI-scheduler/pkg/scheduler/actions
cpu: AMD EPYC 7763 64-Core Processor                
BenchmarkAllocateAction_SmallCluster-4         	       9	 116477692 ns/op	 2282848 B/op	   36222 allocs/op
BenchmarkAllocateAction_SmallCluster-4         	      10	 108451592 ns/op	 2257153 B/op	   36208 allocs/op
BenchmarkAllocateAction_SmallCluster-4         	      10	 108585555 ns/op	 2257381 B/op	   36205 allocs/op
BenchmarkAllocateAction_SmallCluster-4         	      10	 108358498 ns/op	 2254578 B/op	   36201 allocs/op
BenchmarkAllocateAction_SmallCluster-4         	      10	 108080776 ns/op	 2257502 B/op	   36204 allocs/op
BenchmarkAllocateAction_SmallCluster-4         	      10	 107960171 ns/op	 2255484 B/op	   36201 allocs/op
BenchmarkAllocateAction_MediumCluster-4        	       8	 138224614 ns/op	12431492 B/op	  325193 allocs/op
BenchmarkAllocateAction_MediumCluster-4        	       8	 137495678 ns/op	12420813 B/op	  325196 allocs/op
BenchmarkAllocateAction_MediumCluster-4        	       8	 140181365 ns/op	12418333 B/op	  325194 allocs/op
BenchmarkAllocateAction_MediumCluster-4        	       8	 137756104 ns/op	12417619 B/op	  325194 allocs/op
BenchmarkAllocateAction_MediumCluster-4        	       8	 137549127 ns/op	12417387 B/op	  325188 allocs/op
BenchmarkAllocateAction_MediumCluster-4        	       8	 135514726 ns/op	12417348 B/op	  325186 allocs/op
BenchmarkAllocateAction_LargeCluster-4         	       5	 223819990 ns/op	43559267 B/op	 1394296 allocs/op
BenchmarkAllocateAction_LargeCluster-4         	       5	 228968568 ns/op	43558366 B/op	 1394290 allocs/op
BenchmarkAllocateAction_LargeCluster-4         	       5	 211002789 ns/op	43559625 B/op	 1394296 allocs/op
BenchmarkAllocateAction_LargeCluster-4         	       5	 255925724 ns/op	43558620 B/op	 1394296 allocs/op
BenchmarkAllocateAction_LargeCluster-4         	       5	 207813262 ns/op	43559196 B/op	 1394298 allocs/op
BenchmarkAllocateAction_LargeCluster-4         	       5	 227740660 ns/op	43559363 B/op	 1394303 allocs/op
BenchmarkReclaimAction_SmallCluster-4          	      10	 102953222 ns/op	  901252 B/op	    8362 allocs/op
BenchmarkReclaimAction_SmallCluster-4          	      10	 102889098 ns/op	  907135 B/op	    8386 allocs/op
BenchmarkReclaimAction_SmallCluster-4          	      10	 102810247 ns/op	  910437 B/op	    8396 allocs/op
BenchmarkReclaimAction_SmallCluster-4          	      10	 102899512 ns/op	  910298 B/op	    8396 allocs/op
BenchmarkReclaimAction_SmallCluster-4          	      10	 102929332 ns/op	  910279 B/op	    8395 allocs/op
BenchmarkReclaimAction_SmallCluster-4          	      10	 102945522 ns/op	  915410 B/op	    8396 allocs/op
BenchmarkReclaimAction_MediumCluster-4         	      10	 105661902 ns/op	 2970191 B/op	   26539 allocs/op
BenchmarkReclaimAction_MediumCluster-4         	      10	 105609791 ns/op	 2966258 B/op	   26537 allocs/op
BenchmarkReclaimAction_MediumCluster-4         	      10	 105682363 ns/op	 2970372 B/op	   26540 allocs/op
BenchmarkReclaimAction_MediumCluster-4         	      10	 105718262 ns/op	 2970144 B/op	   26540 allocs/op
BenchmarkReclaimAction_MediumCluster-4         	      10	 105723356 ns/op	 2970279 B/op	   26540 allocs/op
BenchmarkReclaimAction_MediumCluster-4         	      10	 105691566 ns/op	 2970316 B/op	   26540 allocs/op
BenchmarkPreemptAction_SmallCluster-4          	      10	 103538601 ns/op	 1052204 B/op	   11187 allocs/op
BenchmarkPreemptAction_SmallCluster-4          	      10	 103655463 ns/op	 1056189 B/op	   11189 allocs/op
BenchmarkPreemptAction_SmallCluster-4          	      10	 103675773 ns/op	 1055983 B/op	   11188 allocs/op
BenchmarkPreemptAction_SmallCluster-4          	      10	 103671617 ns/op	 1048183 B/op	   11185 allocs/op
BenchmarkPreemptAction_SmallCluster-4          	      10	 103570434 ns/op	 1054695 B/op	   11185 allocs/op
BenchmarkPreemptAction_SmallCluster-4          	      10	 103549994 ns/op	 1052149 B/op	   11187 allocs/op
BenchmarkPreemptAction_MediumCluster-4         	       9	 112738702 ns/op	 4211675 B/op	   38769 allocs/op
BenchmarkPreemptAction_MediumCluster-4         	       9	 113038225 ns/op	 4216146 B/op	   38771 allocs/op
BenchmarkPreemptAction_MediumCluster-4         	       9	 113090850 ns/op	 4211588 B/op	   38768 allocs/op
BenchmarkPreemptAction_MediumCluster-4         	       9	 112868662 ns/op	 4215659 B/op	   38769 allocs/op
BenchmarkPreemptAction_MediumCluster-4         	       9	 113212489 ns/op	 4211716 B/op	   38770 allocs/op
BenchmarkPreemptAction_MediumCluster-4         	       9	 113058395 ns/op	 4211624 B/op	   38769 allocs/op
BenchmarkConsolidationAction_SmallCluster-4    	       9	 113640958 ns/op	 5877709 B/op	   73572 allocs/op
BenchmarkConsolidationAction_SmallCluster-4    	       9	 113686303 ns/op	 5875288 B/op	   73558 allocs/op
BenchmarkConsolidationAction_SmallCluster-4    	       9	 113847678 ns/op	 5880102 B/op	   73597 allocs/op
BenchmarkConsolidationAction_SmallCluster-4    	       9	 113776719 ns/op	 5879933 B/op	   73523 allocs/op

Main branch:

goos: linux
goarch: amd64
pkg: github.com/NVIDIA/KAI-scheduler/pkg/scheduler/actions
cpu: AMD EPYC 7763 64-Core Processor                
BenchmarkAllocateAction_SmallCluster-4         	      10	 109092445 ns/op	 2258392 B/op	   36212 allocs/op
BenchmarkAllocateAction_SmallCluster-4         	      10	 108293003 ns/op	 2255216 B/op	   36202 allocs/op
BenchmarkAllocateAction_SmallCluster-4         	      10	 108689298 ns/op	 2258016 B/op	   36208 allocs/op
BenchmarkAllocateAction_SmallCluster-4         	      10	 108072238 ns/op	 2262378 B/op	   36204 allocs/op
BenchmarkAllocateAction_SmallCluster-4         	      10	 108561805 ns/op	 2257364 B/op	   36205 allocs/op
BenchmarkAllocateAction_SmallCluster-4         	      10	 108326175 ns/op	 2256340 B/op	   36204 allocs/op
BenchmarkAllocateAction_MediumCluster-4        	       8	 134948224 ns/op	12417936 B/op	  325200 allocs/op
BenchmarkAllocateAction_MediumCluster-4        	       8	 136493837 ns/op	12419451 B/op	  325195 allocs/op
BenchmarkAllocateAction_MediumCluster-4        	       8	 136838766 ns/op	12418757 B/op	  325199 allocs/op
BenchmarkAllocateAction_MediumCluster-4        	       8	 137306307 ns/op	12415803 B/op	  325185 allocs/op
BenchmarkAllocateAction_MediumCluster-4        	       8	 136543496 ns/op	12415624 B/op	  325183 allocs/op
BenchmarkAllocateAction_MediumCluster-4        	       8	 134544389 ns/op	12416804 B/op	  325193 allocs/op
BenchmarkAllocateAction_LargeCluster-4         	       5	 229189559 ns/op	43557131 B/op	 1394292 allocs/op
BenchmarkAllocateAction_LargeCluster-4         	       5	 218804148 ns/op	43557569 B/op	 1394301 allocs/op
BenchmarkAllocateAction_LargeCluster-4         	       5	 212386013 ns/op	43578441 B/op	 1394298 allocs/op
BenchmarkAllocateAction_LargeCluster-4         	       5	 223304957 ns/op	43554096 B/op	 1394275 allocs/op
BenchmarkAllocateAction_LargeCluster-4         	       5	 225605787 ns/op	43556265 B/op	 1394289 allocs/op
BenchmarkAllocateAction_LargeCluster-4         	       5	 248509541 ns/op	43555804 B/op	 1394284 allocs/op
BenchmarkReclaimAction_SmallCluster-4          	      10	 102900908 ns/op	  901440 B/op	    8366 allocs/op
BenchmarkReclaimAction_SmallCluster-4          	      10	 102720276 ns/op	  906709 B/op	    8384 allocs/op
BenchmarkReclaimAction_SmallCluster-4          	      10	 102742075 ns/op	  914301 B/op	    8398 allocs/op
BenchmarkReclaimAction_SmallCluster-4          	      10	 102719330 ns/op	  910264 B/op	    8396 allocs/op
BenchmarkReclaimAction_SmallCluster-4          	      10	 102745312 ns/op	  914151 B/op	    8397 allocs/op
BenchmarkReclaimAction_SmallCluster-4          	      10	 102778780 ns/op	  915149 B/op	    8396 allocs/op
BenchmarkReclaimAction_MediumCluster-4         	      10	 105508231 ns/op	 2969530 B/op	   26539 allocs/op
BenchmarkReclaimAction_MediumCluster-4         	      10	 105274372 ns/op	 2969469 B/op	   26540 allocs/op
BenchmarkReclaimAction_MediumCluster-4         	      10	 105600151 ns/op	 2969519 B/op	   26540 allocs/op
BenchmarkReclaimAction_MediumCluster-4         	      10	 105428269 ns/op	 2969562 B/op	   26540 allocs/op
BenchmarkReclaimAction_MediumCluster-4         	      10	 105254826 ns/op	 2965466 B/op	   26537 allocs/op
BenchmarkReclaimAction_MediumCluster-4         	      10	 105426192 ns/op	 2965653 B/op	   26538 allocs/op
BenchmarkPreemptAction_SmallCluster-4          	      10	 103542641 ns/op	 1048039 B/op	   11185 allocs/op
BenchmarkPreemptAction_SmallCluster-4          	      10	 103608297 ns/op	 1055973 B/op	   11188 allocs/op
BenchmarkPreemptAction_SmallCluster-4          	      10	 103736091 ns/op	 1055984 B/op	   11188 allocs/op
BenchmarkPreemptAction_SmallCluster-4          	      10	 103503597 ns/op	 1055866 B/op	   11188 allocs/op
BenchmarkPreemptAction_SmallCluster-4          	      10	 103636695 ns/op	 1062116 B/op	   11187 allocs/op
BenchmarkPreemptAction_SmallCluster-4          	      10	 103588976 ns/op	 1055522 B/op	   11186 allocs/op
BenchmarkPreemptAction_MediumCluster-4         	       9	 112684503 ns/op	 4210950 B/op	   38770 allocs/op
BenchmarkPreemptAction_MediumCluster-4         	       9	 112739209 ns/op	 4210923 B/op	   38770 allocs/op
BenchmarkPreemptAction_MediumCluster-4         	       9	 112669001 ns/op	 4210776 B/op	   38770 allocs/op
BenchmarkPreemptAction_MediumCluster-4         	       9	 112891741 ns/op	 4210816 B/op	   38769 allocs/op
BenchmarkPreemptAction_MediumCluster-4         	       9	 112816978 ns/op	 4210964 B/op	   38770 allocs/op
BenchmarkPreemptAction_MediumCluster-4         	       9	 112670878 ns/op	 4215154 B/op	   38771 allocs/op
BenchmarkConsolidationAction_SmallCluster-4    	       9	 113243861 ns/op	 5875451 B/op	   73550 allocs/op
BenchmarkConsolidationAction_SmallCluster-4    	       9	 113538681 ns/op	 5876221 B/op	   73575 allocs/op
BenchmarkConsolidationAction_SmallCluster-4    	       9	 113442688 ns/op	 5876520 B/op	   73581 allocs/op
BenchmarkConsolidationAction_SmallCluster-4    	       9	 113696673 ns/op	 5888507 B/op	   73585 allocs/op
