[AIM] Consistent conditions for all resources #411

# Status & Conditions Model for AIM Resources

In order to provide a **consistent, predictable, and machine-readable status surface** across all AIM resources, we need a clear definition of:

- How phases (`status.status`) behave
- What condition types exist
- How child resource health is surfaced
- How various classes of issues (spec, auth, infra, capacity) are exposed

This issue proposes a unified model.

---

## 1. Phases (`status.status`)

All AIM resources standardize on `constants.AIMStatus` as their **phase**:

- `Pending`
- `Starting`
- `Ready` / `Running`
- `Degraded`
- `Failed`
- `NotAvailable`

Rules:

- Phases are computed centrally by a State Engine.
- Domains **should not** set this by hand except in advanced/override scenarios.
- If a resource uses only a subset (e.g., no `Running`), we constrain via Kubebuilder enums.
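A minimal sketch of what the phase type and its enum constraint could look like. The constant names and the `IsValid` helper are assumptions for illustration; only the phase values and the use of Kubebuilder enums come from this proposal, and the actual `constants.AIMStatus` definition may differ.

```go
package main

import "fmt"

// AIMStatus is the phase reported in status.status.
// The marker below is a Kubebuilder enum validation. A resource that uses
// only a subset of phases would declare a narrower list (assumption: how
// the subset is declared is not specified in this proposal).
// +kubebuilder:validation:Enum=Pending;Starting;Ready;Running;Degraded;Failed;NotAvailable
type AIMStatus string

// Hypothetical constant names; the values are the phases listed above.
const (
	AIMStatusPending      AIMStatus = "Pending"
	AIMStatusStarting     AIMStatus = "Starting"
	AIMStatusReady        AIMStatus = "Ready"
	AIMStatusRunning      AIMStatus = "Running"
	AIMStatusDegraded     AIMStatus = "Degraded"
	AIMStatusFailed       AIMStatus = "Failed"
	AIMStatusNotAvailable AIMStatus = "NotAvailable"
)

// IsValid reports whether s is one of the standardized phases.
func (s AIMStatus) IsValid() bool {
	switch s {
	case AIMStatusPending, AIMStatusStarting, AIMStatusReady, AIMStatusRunning,
		AIMStatusDegraded, AIMStatusFailed, AIMStatusNotAvailable:
		return true
	}
	return false
}

func main() {
	fmt.Println(AIMStatusReady.IsValid())
	fmt.Println(AIMStatus("Bogus").IsValid())
}
```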

---

## 2. Conditions Model

AIM resources expose **two categories** of conditions:

1. **Per-child conditions**
2. **Parent-level conditions**

The model is designed to:

- Give users detailed breadcrumbs when debugging
- Provide machine-readable signals for automation
- Keep naming consistent across controllers

---

## 2.1 Per-Child Conditions `{ChildName}Ready`

For every child resource a parent tracks (e.g. Model, Template, InferenceService, Cache), we expose:

```
{ChildName}Ready
```

### Behavior

- `Status=True`
  → The child is fully ready.

- `Status=False`
  → The child is progressing, degraded, failed, or its health cannot currently be determined.

### Reason

The `reason` field reflects the **child's current health state**. When everything is progressing as expected, it carries the generic phase (e.g. `Starting`, `Ready`). When an issue has been identified, it carries a more specific reason (e.g. `ImagePullError`).

### Message

- The `message` is taken directly from the child resource's own conditions/messages wherever possible.
- This allows users to trace issues without hunting through pod logs with kubectl.
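The behavior, reason, and message rules above could be sketched roughly as follows. `Condition` is a simplified local stand-in for `metav1.Condition`, and `ChildState` and `ChildReadyCondition` are hypothetical names, not part of the existing codebase.

```go
package main

import "fmt"

// Condition is a simplified stand-in for metav1.Condition (assumption:
// only the fields discussed in this section are modelled).
type Condition struct {
	Type    string
	Status  string // "True" or "False"
	Reason  string
	Message string
}

// ChildState is a hypothetical summary of what the parent observed on a child.
type ChildState struct {
	Name    string // e.g. "Model", "Template", "Cache"
	Phase   string // generic phase, e.g. "Starting", "Ready"
	Issue   string // specific reason once a problem is identified, e.g. "ImagePullError"
	Message string // taken from the child's own conditions where possible
}

// ChildReadyCondition builds the {ChildName}Ready condition for one child.
func ChildReadyCondition(c ChildState) Condition {
	cond := Condition{
		Type:    c.Name + "Ready",
		Status:  "False",
		Reason:  c.Phase,
		Message: c.Message,
	}
	switch {
	case c.Issue != "":
		// An identified issue overrides the generic phase as the reason.
		cond.Reason = c.Issue
	case c.Phase == "Ready" || c.Phase == "Running":
		cond.Status = "True"
	}
	if cond.Reason == "" {
		// Health could not be determined: Status stays False.
		cond.Reason = "Unknown"
	}
	return cond
}

func main() {
	fmt.Printf("%+v\n", ChildReadyCondition(ChildState{Name: "Model", Phase: "Ready"}))
	fmt.Printf("%+v\n", ChildReadyCondition(ChildState{
		Name: "Cache", Phase: "Starting", Issue: "ImagePullError",
		Message: "message copied from the child",
	}))
}
```

Treating `Running` as ready alongside `Ready` is an assumption based on the phase list in section 1.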

---

## 2.2 Parent-Level Conditions

Every AIM resource also publishes a small, **standardized** set of top-level conditions:

### 1) Ready

Represents **overall readiness** of the resource.

```
Ready=True
→ The resource is usable.
```

```
Ready=False
→ Look at reason + other conditions for details.
```

This is the primary high-level signal.

---

### 2) ConfigValid

Represents **validity of the resource's spec**.

- `ConfigValid=False` → Invalid spec fields, incompatible combinations, etc.
  The phase will be `Failed`.

This isolates *user errors* from runtime/system errors.

---

### 3) AuthValid

Represents **validity of authentication/authorization**, including:

- Missing or malformed secrets
- Invalid registry credentials
- Unauthorized remote API calls
- Image pull authentication failures

`AuthValid=False` → The phase becomes `Degraded` until fixed.

---

### 4) DependenciesReachable

Represents whether the controller can reach its required **external dependencies**.

This includes:

- Remote APIs (model registry, inference APIs, template providers)
- Storage endpoints
- Registry endpoints (excluding pure auth failures)
- Any external service the controller must fetch state from

**Interpretation:**

- `DependenciesReachable=True`
  → All required external dependencies responded normally.

- `DependenciesReachable=False`
  → Connectivity-like issues (timeouts, DNS failures, 5xx errors, refused connections).

### Degradation logic for this condition

Transient issues do *not* immediately degrade the resource:

- If the resource is in `Pending`, this stays informational.
- Once in `Starting`/`Ready`, a timer-based threshold (e.g. 10 seconds) must pass with `DependenciesReachable=False` before phase transitions into `Degraded`.

This prevents short-lived network hiccups from causing noisy readiness flapping.
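One way the threshold rule could be expressed, as a sketch only. `NextPhase` and `degradeAfter` are hypothetical names, and the 10-second value is just the example figure from above. The caller is assumed to compute `unreachableFor` from the condition's last transition time.

```go
package main

import (
	"fmt"
	"time"
)

// degradeAfter is the illustrative threshold (the example value from the text).
const degradeAfter = 10 * time.Second

// NextPhase applies the threshold rule for DependenciesReachable.
// unreachableFor is how long DependenciesReachable has been False.
// Recovery out of Degraded is not modelled in this sketch.
func NextPhase(phase string, reachable bool, unreachableFor time.Duration) string {
	if reachable {
		return phase
	}
	switch phase {
	case "Starting", "Ready", "Running":
		if unreachableFor >= degradeAfter {
			return "Degraded"
		}
	}
	// Pending (and any other phase): the condition stays informational.
	return phase
}

func main() {
	fmt.Println(NextPhase("Ready", false, 3*time.Second))
	fmt.Println(NextPhase("Ready", false, 12*time.Second))
	fmt.Println(NextPhase("Pending", false, time.Minute))
}
```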

---

## 3. How Issue Classes Surface Across Conditions

This is the core of the model:

### 3.1 Spec errors

- `ConfigValid=False`
- `Ready=False`
- `Phase=Failed`
- Relevant child conditions (if applicable): `{ChildName}Ready=False`, `Reason=InvalidSpec`

### 3.2 Auth errors

- `AuthValid=False`
- `Ready=False`, `Reason=AuthFailed`
- `Phase=Degraded`
- Child: `{ChildName}Ready=False`, `Reason=AuthFailed`

### 3.3 External connectivity errors

- `DependenciesReachable=False`
- `Ready=False`, `Reason=DependenciesUnreachable` (or similar)
- Phase moves from `Starting`/`Ready` to `Degraded` once the threshold elapses (no change while `Pending`)
- Child health:
  - `State=Unknown` for components whose health could not be evaluated
    → `{ChildName}Ready=False`, `Reason=Unknown`

### 3.4 Capacity/pending issues (e.g., no GPUs)

- Not an error; a form of progressing.
- `Ready=False`, `Reason=PendingResources` or `InsufficientCapacity`
- `Phase=Starting` (or `Degraded` if persistently blocked)
- Child shows:
  - `{ChildName}Ready=False`, `Reason=InsufficientCapacity`
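The four issue classes above could be folded into a single phase and `Ready` reason along these lines. This is a sketch under stated assumptions: `Observed`, `Summary`, and `Summarize` are hypothetical names, the precedence order (spec, then auth, then connectivity, then capacity) is an assumption, and so is the `InvalidSpec` reason on the parent `Ready` condition, since the proposal only names it for child conditions.

```go
package main

import "fmt"

// Observed is a hypothetical snapshot of the parent-level signals.
type Observed struct {
	ConfigValid           bool
	AuthValid             bool
	DependenciesReachable bool
	PastThreshold         bool // DependenciesReachable=False for longer than the threshold
	InsufficientCapacity  bool
	AllChildrenReady      bool
}

// Summary is the resulting phase plus the Ready condition's status and reason.
type Summary struct {
	Phase  string
	Ready  bool
	Reason string
}

// Summarize maps the observed signals onto phase and Ready reason.
func Summarize(o Observed) Summary {
	switch {
	case !o.ConfigValid: // 3.1 Spec errors
		return Summary{Phase: "Failed", Reason: "InvalidSpec"}
	case !o.AuthValid: // 3.2 Auth errors
		return Summary{Phase: "Degraded", Reason: "AuthFailed"}
	case !o.DependenciesReachable: // 3.3 External connectivity errors
		phase := "Starting"
		if o.PastThreshold {
			phase = "Degraded"
		}
		return Summary{Phase: phase, Reason: "DependenciesUnreachable"}
	case o.InsufficientCapacity: // 3.4 Capacity/pending issues
		return Summary{Phase: "Starting", Reason: "InsufficientCapacity"}
	case !o.AllChildrenReady:
		return Summary{Phase: "Starting", Reason: "Starting"}
	}
	return Summary{Phase: "Ready", Ready: true, Reason: "Ready"}
}

func main() {
	healthy := Observed{ConfigValid: true, AuthValid: true,
		DependenciesReachable: true, AllChildrenReady: true}
	fmt.Printf("%+v\n", Summarize(healthy))

	authBroken := healthy
	authBroken.AuthValid = false
	fmt.Printf("%+v\n", Summarize(authBroken))
}
```

The "persistently blocked" capacity case, where the phase may become `Degraded`, is left out because the proposal does not define when a block counts as persistent.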

---

## 4. Summary of Final Conditions Set

### Child-level

```
{ChildName}Ready   (True/False)
```

### Parent-level

```
Ready
ConfigValid
AuthValid
DependenciesReachable
```

Everything else (e.g. workload behavior, missing dependencies, infra health) is expressed through:

- Phase transitions (Pending → Starting → Ready → Degraded → Failed)
- `Reason` / `Message` fields of these standard conditions
- `{ChildName}Ready` conditions reflecting detailed component-level health

---

## 5. Benefits of This Model

- Fully machine-parseable structure
- Minimal cognitive overhead for clients (just 4 parent-level conditions)
- Consistent diagnostics across all AIM resources
- Clear separation between:
  - user mistakes (ConfigValid)
  - auth issues (AuthValid)
  - remote problems (DependenciesReachable)
  - capacity/progress issues (child Ready conditions)
- Less readiness flapping from transient infra issues, thanks to the degradation threshold
- Fits a centralized State Engine design
