Add model folder pre-validation for inference sessions in Manager scheduler #9556

@HyeockJinKim

Description

Currently, when an INFERENCE session (vLLM, TGI, NIM, SGLang, etc.) is created without a model virtual folder (VFolderUsageMode.MODEL), the error is caught only on the Agent side (agent.py:3303, ModelFolderNotSpecifiedError), after the RPC has already been dispatched.

This causes unnecessary RPC traffic and repeated failures. In Dogbowl, this resulted in ~900 failed RPC calls per hour sustained over 2 days.

The Manager's SessionValidator (sokovan/scheduling_controller/validators/) has rules for container limits, service ports, resource limits, and mount names, but no rule to validate that inference sessions include at least one model-type vfolder.

Implementation:

  • Add a new SessionValidatorRule (e.g. InferenceModelFolderRule) in validators/inference.py
  • When session_type == INFERENCE and runtime_variant != CUSTOM, require at least one mount with usage_mode MODEL
  • SessionCreationSpec already has session_type (line 86) and creation_spec with mounts/runtime_variant (lines 172-176)
  • Register the new rule in scheduling_controller.py alongside the existing rules (lines 117-122)
  • Export from validators/__init__.py
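
The steps above could be sketched roughly as follows. This is a minimal, standalone illustration and not the actual Manager code: the `SessionSpec`, `Mount`, and enum definitions are simplified stand-ins for `SessionCreationSpec` and the real types, the `name()`/`validate()` interface is an assumed shape for `SessionValidatorRule`, and the `ModelFolderNotSpecified` exception is hypothetical.

```python
from __future__ import annotations

import enum
from dataclasses import dataclass, field


# Simplified stand-ins for the real Manager types (assumptions for illustration).
class SessionTypes(enum.Enum):
    INTERACTIVE = "interactive"
    BATCH = "batch"
    INFERENCE = "inference"


class RuntimeVariant(enum.Enum):
    VLLM = "vllm"
    CUSTOM = "custom"


class VFolderUsageMode(enum.Enum):
    GENERAL = "general"
    MODEL = "model"


@dataclass
class Mount:
    name: str
    usage_mode: VFolderUsageMode


@dataclass
class SessionSpec:
    """Hypothetical, flattened view of the fields the rule needs."""

    session_type: SessionTypes
    runtime_variant: RuntimeVariant | None = None
    mounts: list[Mount] = field(default_factory=list)


class ModelFolderNotSpecified(ValueError):
    """Hypothetical Manager-side counterpart of the Agent-side error."""


class InferenceModelFolderRule:
    """Reject inference sessions that mount no model-type vfolder."""

    def name(self) -> str:
        return "inference_model_folder"

    def validate(self, spec: SessionSpec) -> None:
        # Only inference sessions are subject to this rule.
        if spec.session_type is not SessionTypes.INFERENCE:
            return
        # CUSTOM runtimes are exempt, as described in the issue.
        if spec.runtime_variant is RuntimeVariant.CUSTOM:
            return
        # Require at least one mount whose usage mode is MODEL.
        if not any(m.usage_mode is VFolderUsageMode.MODEL for m in spec.mounts):
            raise ModelFolderNotSpecified(
                "Inference sessions require at least one model-type vfolder mount."
            )
```

With a rule of this shape, a request such as `SessionSpec(SessionTypes.INFERENCE, RuntimeVariant.VLLM, [])` would be rejected in the Manager before any RPC reaches the Agent, while non-inference sessions and CUSTOM runtimes pass through unchanged.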

Files:

  • NEW: sokovan/scheduling_controller/validators/inference.py
  • MOD: sokovan/scheduling_controller/validators/__init__.py
  • MOD: sokovan/scheduling_controller/scheduling_controller.py

JIRA Issue: BA-4816
