Status & Conditions Model for AIM Resources
In order to provide a consistent, predictable, and machine-readable status surface across all AIM resources, we need a clear definition of:
- How phases (status.status) behave
- What condition types exist
- How child resource health is surfaced
- How various classes of issues (spec, auth, infra, capacity) are exposed
This issue proposes a unified model.
1. Phases (status.status)
All AIM resources standardize on constants.AIMStatus as their phase:
- Pending
- Starting
- Ready / Running
- Degraded
- Failed
- NotAvailable
Rules:
- Phases are computed centrally by a State Engine.
- Domains should not set this by hand except in advanced/override scenarios.
- If a resource uses only a subset (e.g., no Running), we constrain via Kubebuilder enums.
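As a rough sketch of the phase type described above: the phase values come from the list in this section, while the Go constant names and the IsValid helper are assumptions made for illustration, not the actual contents of the constants package. The comment marker is Kubebuilder's standard enum validation marker; a resource that never reports Running would omit that value from its own marker.

```go
package main

import "fmt"

// AIMStatus is the phase stored in status.status.
// +kubebuilder:validation:Enum=Pending;Starting;Ready;Running;Degraded;Failed;NotAvailable
type AIMStatus string

// Constant names are hypothetical; only the string values come from the proposal.
const (
	AIMStatusPending      AIMStatus = "Pending"
	AIMStatusStarting     AIMStatus = "Starting"
	AIMStatusReady        AIMStatus = "Ready"
	AIMStatusRunning      AIMStatus = "Running"
	AIMStatusDegraded     AIMStatus = "Degraded"
	AIMStatusFailed       AIMStatus = "Failed"
	AIMStatusNotAvailable AIMStatus = "NotAvailable"
)

// IsValid reports whether s is one of the phases listed above.
// (Hypothetical helper, useful for defensive checks outside the API server.)
func (s AIMStatus) IsValid() bool {
	switch s {
	case AIMStatusPending, AIMStatusStarting, AIMStatusReady, AIMStatusRunning,
		AIMStatusDegraded, AIMStatusFailed, AIMStatusNotAvailable:
		return true
	}
	return false
}

func main() {
	fmt.Println(AIMStatusReady.IsValid())
	fmt.Println(AIMStatus("Bogus").IsValid())
}
```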
2. Conditions Model
AIM resources expose two categories of conditions:
- Per-child conditions
- Parent-level conditions
The model is designed to:
- Give users detailed breadcrumbs when debugging
- Provide machine-readable signals for automation
- Keep naming consistent across controllers
2.1 Per-Child Conditions {ChildName}Ready
For every child resource a parent tracks (e.g. Model, Template, InferenceService, Cache), we expose a {ChildName}Ready condition, defined as follows:
Behavior
- Status=True → The child is fully ready.
- Status=False → The child is progressing, degraded, failed, or its health cannot currently be determined.
Reason
The reason field reflects the child's current health state. It should carry the generic phase when everything is going as expected (e.g. Starting, Ready) and a more specific value when an issue has been identified (e.g. ImagePullError).
Message
- The message is taken directly from the child resource's own conditions/messages wherever possible.
- This allows users to trace issues without kubectl-hunting pod logs.
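The per-child rules above could be sketched roughly as follows. The Condition struct here is a local stand-in carrying the same four fields as a Kubernetes metav1.Condition, and ChildHealth and ChildCondition are hypothetical names introduced only for this example; the real controller may derive child health differently.

```go
package main

import "fmt"

// Condition is a local stand-in for the Type/Status/Reason/Message
// fields of a Kubernetes condition.
type Condition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

// ChildHealth is a hypothetical summary of what the parent observed on a child.
type ChildHealth struct {
	Name    string // child kind, e.g. "Model" or "Cache"
	Ready   bool   // true only when the child is fully ready
	Reason  string // e.g. "Starting", "ImagePullError"; empty if health is unknown
	Message string // copied from the child's own condition where possible
}

// ChildCondition builds the {ChildName}Ready condition for one child.
func ChildCondition(h ChildHealth) Condition {
	status := "False"
	if h.Ready {
		status = "True"
	}
	reason := h.Reason
	if reason == "" {
		// No reason observed: report the generic phase when ready,
		// otherwise Unknown because health could not be determined.
		if h.Ready {
			reason = "Ready"
		} else {
			reason = "Unknown"
		}
	}
	return Condition{
		Type:    h.Name + "Ready",
		Status:  status,
		Reason:  reason,
		Message: h.Message,
	}
}

func main() {
	c := ChildCondition(ChildHealth{
		Name:    "InferenceService",
		Reason:  "ImagePullError",
		Message: "message copied from the child resource",
	})
	fmt.Printf("%+v\n", c)
}
```

Passing the child's message through unchanged keeps the parent's status a faithful pointer to the underlying problem rather than a paraphrase of it.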
2.2 Parent-Level Conditions
Every AIM resource also publishes a small, standardized set of top-level conditions:
1) Ready
Represents overall readiness of the resource.
- Ready=True → The resource is usable.
- Ready=False → Look at reason + other conditions for details.
This is the primary high-level signal.
2) ConfigValid
Represents validity of the resource's spec.
ConfigValid=False → Invalid spec fields, incompatible combinations, etc.
The phase will be Failed.
This isolates user errors from runtime/system errors.
3) AuthValid
Represents validity of authentication/authorization, including:
- Missing or malformed secrets
- Invalid registry credentials
- Unauthorized remote API calls
- Image pull authentication failures
AuthValid=False → The phase becomes Degraded until fixed.
4) DependenciesReachable
Represents whether the controller can reach its required external dependencies.
This includes:
- Remote APIs (model registry, inference APIs, template providers)
- Storage endpoints
- Registry endpoints (excluding pure auth failures)
- Any external service the controller must fetch state from
Interpretation:
- DependenciesReachable=True → All required external dependencies responded normally.
- DependenciesReachable=False → Connectivity-like issues (timeouts, DNS failures, 5xx errors, refused connections).
Degradation logic for this condition
Transient issues do not immediately degrade the resource:
- If the resource is in Pending, this stays informational.
- Once in Starting/Ready, a timer-based threshold (e.g. 10 seconds) must pass with DependenciesReachable=False before the phase transitions into Degraded.
This prevents short-lived network hiccups from causing noisy readiness flapping.
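The threshold rule above might look something like this in code. This is a sketch only: Phase, degradeAfter, and NextPhase are hypothetical names, the 10-second value is the example figure from the text, and recovery from Degraded back to Ready is not modeled because the proposal does not specify it.

```go
package main

import (
	"fmt"
	"time"
)

// Phase is a local stand-in for the resource phase.
type Phase string

const (
	Pending  Phase = "Pending"
	Starting Phase = "Starting"
	Ready    Phase = "Ready"
	Degraded Phase = "Degraded"
)

// degradeAfter is the example threshold from the text, not a confirmed default.
const degradeAfter = 10 * time.Second

// NextPhase returns the phase after accounting for dependency reachability.
// unreachableFor is how long DependenciesReachable has been False.
func NextPhase(current Phase, reachable bool, unreachableFor time.Duration) Phase {
	if reachable {
		return current
	}
	switch current {
	case Starting, Ready:
		// Only degrade once the outage has lasted at least the threshold.
		if unreachableFor >= degradeAfter {
			return Degraded
		}
	}
	// Pending (and any other phase) treats the condition as informational.
	return current
}

func main() {
	fmt.Println(NextPhase(Ready, false, 3*time.Second))
	fmt.Println(NextPhase(Ready, false, 15*time.Second))
	fmt.Println(NextPhase(Pending, false, 15*time.Second))
}
```

Taking the elapsed duration as an argument, rather than reading the clock inside the function, keeps the rule deterministic and easy to test.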
3. How Issue Classes Surface Across Conditions
This is the core of the model:
3.1 Spec errors
- ConfigValid=False
- Ready=False
- Phase=Failed
- Relevant child conditions (if applicable): {ChildName}Ready=False, Reason=InvalidSpec
3.2 Auth errors
- AuthValid=False
- Ready=False, Reason=AuthFailed
- Phase=Degraded
- Child: {ChildName}Ready=False, Reason=AuthFailed
3.3 External connectivity errors
- DependenciesReachable=False
- Ready=False, Reason=DependenciesUnreachable (or similar)
- Phase=Starting → Degraded after threshold
- Child health: State=Unknown for components whose health could not be evaluated → {ChildName}Ready=False, Reason=Unknown
3.4 Capacity/pending issues (e.g., no GPUs)
- Not an error; a form of progressing.
- Ready=False, Reason=PendingResources or InsufficientCapacity
- Phase=Starting (or Degraded if persistently blocked)
- Child shows: {ChildName}Ready=False, Reason=InsufficientCapacity
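To make the mapping in sections 3.1 to 3.4 concrete, here is a minimal sketch of how a central evaluator could turn the signals into a phase and a Ready reason. Several things are assumptions: the Signals, Outcome, and Evaluate names; the precedence order (spec, then auth, then connectivity, then capacity), which the proposal does not state; the use of InvalidSpec as the parent Ready reason, which the text names only for the child condition; and the simplification that a resource with unreachable dependencies was in Starting.

```go
package main

import "fmt"

// Signals is a hypothetical snapshot of what the controller observed.
type Signals struct {
	ConfigValid           bool
	AuthValid             bool
	DependenciesReachable bool
	UnreachablePastLimit  bool // outage has exceeded the degradation threshold
	InsufficientCapacity  bool // e.g. no GPUs available yet
}

// Outcome is the resulting phase plus the parent-level Ready condition.
type Outcome struct {
	Phase       string
	ReadyStatus string
	ReadyReason string
}

// Evaluate applies the issue classes in an assumed precedence order.
func Evaluate(s Signals) Outcome {
	switch {
	case !s.ConfigValid:
		// 3.1 Spec errors: terminal until the user fixes the spec.
		return Outcome{Phase: "Failed", ReadyStatus: "False", ReadyReason: "InvalidSpec"}
	case !s.AuthValid:
		// 3.2 Auth errors.
		return Outcome{Phase: "Degraded", ReadyStatus: "False", ReadyReason: "AuthFailed"}
	case !s.DependenciesReachable:
		// 3.3 Connectivity: degrade only after the threshold has passed.
		phase := "Starting"
		if s.UnreachablePastLimit {
			phase = "Degraded"
		}
		return Outcome{Phase: phase, ReadyStatus: "False", ReadyReason: "DependenciesUnreachable"}
	case s.InsufficientCapacity:
		// 3.4 Capacity: a form of progressing, not an error.
		return Outcome{Phase: "Starting", ReadyStatus: "False", ReadyReason: "InsufficientCapacity"}
	}
	return Outcome{Phase: "Ready", ReadyStatus: "True", ReadyReason: "Ready"}
}

func main() {
	healthy := Signals{ConfigValid: true, AuthValid: true, DependenciesReachable: true}
	fmt.Printf("%+v\n", Evaluate(healthy))

	badAuth := healthy
	badAuth.AuthValid = false
	fmt.Printf("%+v\n", Evaluate(badAuth))
}
```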
4. Summary of Final Conditions Set
Child-level
- {ChildName}Ready (True/False)
Parent-level
- Ready
- ConfigValid
- AuthValid
- DependenciesReachable
Everything else (e.g. workload behavior, missing dependencies, infra health) is expressed through:
- Phase transitions (Pending → Starting → Ready → Degraded → Failed)
- Reason / Message fields of these standard conditions
- {ChildName}Ready conditions reflecting detailed component-level health
5. Benefits of This Model
- Fully machine-parseable structure
- Minimal cognitive overhead for clients (just 4 parent-level conditions)
- Consistent diagnostics across all AIM resources
- Clear separation between:
  - user mistakes (ConfigValid)
  - auth issues (AuthValid)
  - remote problems (DependenciesReachable)
  - capacity/progress issues (child Ready conditions)
- No phase flapping from short-lived infra issues, thanks to the degradation threshold
- Fits naturally with a centralized State Engine design