
feat: Expected Runtime Plugin for Soft Eviction via Requeue Action #941

Open
rich7420 wants to merge 6 commits into NVIDIA:main from rich7420:KAI-904


@rich7420
Contributor

Description

Adds the Expected Runtime plugin: running jobs that exceed their configured expected runtime are nominated as requeue candidates. The plugin only nominates; eviction is performed by the Requeue action, which is not part of this PR. Eviction is soft: a job becomes eligible once its runtime ≥ its expected runtime, but it is evicted only when a higher-priority workload needs the slot.

Why: This gives time-aware fairness, since jobs are requeued only when there is contention. It is opt-in via kai.scheduler/expected-runtime, and a cooldown via kai.scheduler/requeue-not-before avoids thrashing.

What changed:

  • Plugin expectedruntime: registers RequeueCandidateNominationFn, nominates jobs that pass checks (running, preemptible, valid expected-runtime, runtime ≥ expected, cooldown expired).
  • Session API: RequeueCandidateNominationFn, AddRequeueCandidateNominationFn, CollectRequeueCandidates() (dedup by PodGroup UID).
  • Annotations: kai.scheduler/expected-runtime, requeue-delay, requeue-not-before.
  • Metrics: kai_requeue_nominations_total, kai_requeue_nomination_skipped_total (prefix from --metrics-namespace, default kai).
  • Operator: expectedruntime in default plugin list; docs in docs/plugins/expectedruntime.md.

Uses existing LastStartTimestamp; MinRuntime stays in Requeue action filters.
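
As a rough illustration of the Session API item above, the following is a minimal sketch of how registered nomination functions could be collected and deduplicated by PodGroup UID. The `PodGroup`, `Session`, and `NominationFn` types and their signatures here are simplified stand-ins invented for this example; they are not the actual types or signatures in this PR.

```go
package main

import "fmt"

// PodGroup is a hypothetical, trimmed-down stand-in for the scheduler's job type.
type PodGroup struct {
	UID  string
	Name string
}

// NominationFn plays the role of RequeueCandidateNominationFn in this sketch:
// given the running jobs, it returns the ones it nominates for requeue.
type NominationFn func(running []PodGroup) []PodGroup

// Session is a hypothetical session that holds the registered nomination functions.
type Session struct {
	nominationFns []NominationFn
}

// AddNominationFn registers a nomination function, as a plugin would on session open.
func (s *Session) AddNominationFn(fn NominationFn) {
	s.nominationFns = append(s.nominationFns, fn)
}

// CollectCandidates runs every registered function and keeps only the first
// occurrence of each PodGroup UID, preserving nomination order.
func (s *Session) CollectCandidates(running []PodGroup) []PodGroup {
	seen := make(map[string]struct{})
	var out []PodGroup
	for _, fn := range s.nominationFns {
		for _, pg := range fn(running) {
			if _, dup := seen[pg.UID]; dup {
				continue
			}
			seen[pg.UID] = struct{}{}
			out = append(out, pg)
		}
	}
	return out
}

func main() {
	s := &Session{}
	nominateAll := func(running []PodGroup) []PodGroup { return running }
	// Two plugins nominating the same jobs must not produce duplicates.
	s.AddNominationFn(nominateAll)
	s.AddNominationFn(nominateAll)

	running := []PodGroup{{UID: "a", Name: "train"}, {UID: "b", Name: "eval"}}
	for _, pg := range s.CollectCandidates(running) {
		fmt.Println(pg.UID, pg.Name)
	}
}
```

Deduplicating in the collection step rather than in each plugin means that several plugins can nominate the same job independently, and the Requeue action still sees each PodGroup once.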

Related Issues

Closes #904

Checklist

Note: Ensure your PR title follows the Conventional Commits format (e.g., feat(scheduler): add new feature)

  • Self-reviewed
  • Added/updated tests (if needed)
  • Updated documentation (if needed)

Breaking Changes

Additional Notes

@rich7420
Contributor Author

cc @itsomri , @romanbaron

Comment on lines 167 to 168
- [Expected Runtime Plugin Design](../designs/expected-runtime-requeue/expected-runtime-plugin.md)
- [Requeue Flow Design](../designs/expected-runtime-requeue/expected-runtime-requeue-flow.md)
Collaborator


These don't exist yet.
I think we should first design the requeue action so that we will be able to design this plugin and only then implement it. Now it is a bit "in the air".
For example I am not sure we will need "requeue-delay" - it depends on the requeue action design, and I think should be introduced there. I am also not sure why we need both cooldown and expected runtime, are they not the same? And shouldn't we also look at the queue fair share to make sure we are not removing jobs that should keep running?
I think all those questions and more should be asked and discussed before we implement it.

Contributor Author


Agreed, those links pointed to design docs that don’t exist yet. I’ve removed them from “See Also” in this PR.
I agree we should design the requeue action first (when it runs, try/commit/rollback, victim selection, how/whether to set requeue-not-before and cooldown), then align this plugin with that. This PR only adds the nomination API + this plugin; the Requeue action itself isn’t implemented here.
About cooldown and expected runtime, I think they’re different: expected runtime = “when does this job become eligible for eviction?” (time since start). Cooldown = “after we evicted it once, how long before we can nominate it again?” (stops thrashing).

Contributor Author

@rich7420 rich7420 Feb 1, 2026


> I think all those questions and more should be asked and discussed before we implement it.

You're right, thanks!


Development

Successfully merging this pull request may close these issues.

Expected Runtime Plugin for Soft Eviction via Requeue Action

2 participants