Currently, when an INFERENCE session (vLLM, TGI, NIM, SGLang, etc.) is created without a model virtual folder (VFolderUsageMode.MODEL), the error is only caught on the Agent side (agent.py:3303 ModelFolderNotSpecifiedError) after the RPC has already been dispatched.
This causes unnecessary RPC traffic and repeated failures. In Dogbowl, this resulted in ~900 failed RPC calls per hour sustained over 2 days.
The Manager's SessionValidator (sokovan/scheduling_controller/validators/) has rules for container limits, service ports, resource limits, and mount names, but no rule to validate that inference sessions include at least one model-type vfolder.
Implementation:
- Add a new SessionValidatorRule (e.g. InferenceModelFolderRule) in validators/inference.py
- When session_type == INFERENCE and runtime_variant != CUSTOM, require at least one mount with usage_mode MODEL
- SessionCreationSpec already has session_type (line 86) and creation_spec with mounts/runtime_variant (lines 172-176)
- Register the new rule in scheduling_controller.py alongside the existing rules (lines 117-122)
- Export from validators/`__init__.py`
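
The steps above could be sketched roughly as follows. This is a self-contained illustration, not the actual Manager code: the enums, `MountSpec`, `SessionSpec`, `ModelFolderRequired`, and the `SessionValidator` runner are hypothetical stand-ins for the real types, and the real `SessionValidatorRule` interface may take different arguments.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


# Hypothetical stand-ins for the real Manager enums and spec types.
class SessionTypes(Enum):
    INTERACTIVE = "interactive"
    INFERENCE = "inference"


class RuntimeVariant(Enum):
    CUSTOM = "custom"
    VLLM = "vllm"


class VFolderUsageMode(Enum):
    GENERAL = "general"
    MODEL = "model"


@dataclass
class MountSpec:
    name: str
    usage_mode: VFolderUsageMode


@dataclass
class SessionSpec:
    session_type: SessionTypes
    runtime_variant: RuntimeVariant | None = None
    mounts: list[MountSpec] = field(default_factory=list)


class ModelFolderRequired(ValueError):
    """Hypothetical error raised before any RPC is dispatched."""


class InferenceModelFolderRule:
    """Require a MODEL-mode mount for non-CUSTOM inference sessions."""

    def validate(self, spec: SessionSpec) -> None:
        # Only inference sessions are subject to this rule.
        if spec.session_type is not SessionTypes.INFERENCE:
            return
        # CUSTOM runtimes are exempt, as stated in the description.
        if spec.runtime_variant is RuntimeVariant.CUSTOM:
            return
        has_model = any(
            m.usage_mode is VFolderUsageMode.MODEL for m in spec.mounts
        )
        if not has_model:
            raise ModelFolderRequired(
                "Inference sessions require at least one model-type vfolder."
            )


class SessionValidator:
    """Hypothetical runner: applies each registered rule in order."""

    def __init__(self, rules: list) -> None:
        self._rules = rules

    def validate(self, spec: SessionSpec) -> None:
        for rule in self._rules:
            rule.validate(spec)


if __name__ == "__main__":
    validator = SessionValidator([InferenceModelFolderRule()])
    bad = SessionSpec(SessionTypes.INFERENCE, RuntimeVariant.VLLM, [])
    try:
        validator.validate(bad)
    except ModelFolderRequired as e:
        print(f"rejected on the Manager side: {e}")
```

Raising from the rule means the request fails during validation in the Manager, so the Agent RPC is never dispatched for a session that would fail with ModelFolderNotSpecifiedError anyway.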
Files:
- NEW: sokovan/scheduling_controller/validators/inference.py
- MOD: sokovan/scheduling_controller/validators/`__init__.py`
- MOD: sokovan/scheduling_controller/scheduling_controller.py
JIRA Issue: BA-4816