Pulse API Contract

Protocol Version: 2

This document defines the REST and WebSocket protocol between the Pulse UI and the Pulse Agent. Both repos must implement the same protocol version for compatibility.

This file is the source of truth for message schemas. When adding or changing a message type, update this file first, then implement the change in both repos.

REST Endpoints

Method	Path	Auth	Description
`GET`	`/healthz`	public	Liveness probe. Returns `{"status": "ok"}`
`GET`	`/version`	public	Protocol version, agent version (dynamic from package), tool count, feature flags
`GET`	`/health`	token	Circuit breaker state, error summary, investigation stats, autofix_paused status
`GET`	`/tools`	token	All tools grouped by mode (sre, security) with `requires_confirmation` flags
`GET`	`/fix-history`	token	Paginated fix history with filters (`status`, `category`, `since`, `search`)
`GET`	`/fix-history/{id}`	token	Single action detail with before/after state
`POST`	`/fix-history/{id}/rollback`	token	Attempt rollback (returns error — rollback not currently supported)
`GET`	`/eval/status`	token	Cached quality gate snapshot (release, safety, integration, outcomes)
`GET`	`/predictions`	token	Returns empty — predictions are WebSocket-only (`/ws/monitor`)
`GET`	`/memory/export`	token	Export learned runbooks and patterns as JSON
`POST`	`/memory/import`	token	Import runbooks and patterns from another pod's export
`GET`	`/monitor/capabilities`	token	Max trust level and supported auto-fix categories
`POST`	`/monitor/pause`	token	Emergency kill switch — pause all auto-fix actions
`POST`	`/monitor/resume`	token	Resume auto-fix actions after a pause
`GET`	`/context`	token	View recent shared context bus entries across all agents
`GET`	`/skills`	token	List all skills with routing rules and metadata
`GET`	`/skills/{name}`	token	Get skill detail (prompt, tools, routing, versions)
`PUT`	`/admin/skills/{name}`	token	Edit skill (prompt, tools, routing rules)
`DELETE`	`/admin/skills/{name}`	token	Delete a skill
`POST`	`/admin/skills/{name}/clone`	token	Clone a skill with a new name
`POST`	`/admin/skills/test`	token	Test routing — returns which skill matches a given query
`GET`	`/admin/skills/{name}/versions`	token	Version history for a skill
`GET`	`/admin/skills/{name}/diff`	token	Diff between two skill versions
`POST`	`/admin/mcp/toolsets`	token	Toggle MCP toolsets on/off
`GET`	`/components`	token	Component registry — list all 19 component kinds with schemas

Authentication: Token-authenticated endpoints accept either an `Authorization: Bearer <token>` header or a `?token=<token>` query parameter. The token is `PULSE_AGENT_WS_TOKEN`. Unauthenticated requests return 401.
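
The two ways of passing the token can be sketched in Python as follows. This is only an illustration: the helper name `build_request` and the base URL used in the usage note are hypothetical, and the request is built but never sent.

```python
import urllib.request
from urllib.parse import urlencode


def build_request(base_url: str, path: str, token: str,
                  use_header: bool = True) -> urllib.request.Request:
    """Build (but do not send) a GET request for a token-protected endpoint."""
    url = base_url.rstrip("/") + path
    if use_header:
        # Header form: Authorization: Bearer <token>
        return urllib.request.Request(
            url, headers={"Authorization": f"Bearer {token}"}
        )
    # Query form: ?token=<token> (the value is URL-encoded)
    return urllib.request.Request(url + "?" + urlencode({"token": token}))
```

For example, `build_request("http://localhost:8080", "/health", token)` carries the token in the header, while passing `use_header=False` appends it to the URL instead.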

`/version` Response

{
  "protocol": "2",
  "agent": "1.5.0",
  "tools": 109,
  "features": ["component_specs", "ws_token_auth", "rate_limiting", "monitor", "fix_history", "predictions"]
}

The `agent` version is read dynamically from the installed package metadata. The `tools` count is the sum of the SRE and Security tool counts.

`/health` Response

{
  "status": "ok",
  "circuit_breaker": {
    "state": "closed",
    "failure_count": 0,
    "recovery_timeout": 60
  },
  "errors": {
    "total": 0,
    "by_category": {},
    "recent": []
  },
  "investigations": {},
  "autofix_paused": false
}

`/tools` Response

{
  "sre": [
    {"name": "list_pods", "description": "...", "requires_confirmation": false},
    {"name": "delete_pod", "description": "...", "requires_confirmation": true}
  ],
  "security": [
    {"name": "scan_pod_security", "description": "...", "requires_confirmation": false}
  ],
  "write_tools": ["apply_yaml", "cordon_node", "delete_pod", "..."]
}
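
As a rough illustration of consuming this payload, the Python sketch below collects the names of tools flagged with `requires_confirmation`. The function name `tools_requiring_confirmation` is hypothetical, and the sketch assumes that every key other than `write_tools` maps to a list of tool objects shaped like the example above.

```python
from typing import Dict, List


def tools_requiring_confirmation(payload: Dict[str, list]) -> List[str]:
    """Return the sorted names of tools flagged with requires_confirmation."""
    names = set()
    for mode, tools in payload.items():
        if mode == "write_tools":
            # write_tools is a flat list of names, not a list of tool objects.
            continue
        for tool in tools:
            if tool.get("requires_confirmation", False):
                names.add(tool["name"])
    return sorted(names)
```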

WebSocket Endpoints

Path	Auth	Description
`/ws/sre?token=...`	token	SRE agent chat
`/ws/security?token=...`	token	Security scanner chat
`/ws/monitor?token=...`	token	Autonomous cluster monitoring (Protocol v2)
`/ws/agent?token=...`	token	Auto-routing orchestrated agent — classifies intent per message and routes to SRE or Security

All WebSocket endpoints require PULSE_AGENT_WS_TOKEN via the token query parameter. Connections without a valid token are closed with code 4001.
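
A minimal Python sketch of building a connection URL and reacting to the 4001 close code is shown below. The helper names `ws_url` and `should_reconnect` are hypothetical, and the policy of not retrying after an authentication failure is an assumption rather than part of this contract.

```python
from urllib.parse import urlencode

# From the contract: connections without a valid token are closed with 4001.
AUTH_FAILURE_CLOSE_CODE = 4001


def ws_url(host: str, endpoint: str, token: str, secure: bool = False) -> str:
    """Build a WebSocket URL that carries the token as a query parameter."""
    scheme = "wss" if secure else "ws"
    return f"{scheme}://{host}{endpoint}?{urlencode({'token': token})}"


def should_reconnect(close_code: int) -> bool:
    """Assumed policy: retrying with the same token cannot fix an auth failure."""
    return close_code != AUTH_FAILURE_CLOSE_CODE
```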

Chat Protocol (`/ws/sre`, `/ws/security`, `/ws/agent`)

Client-to-Server Messages

`message` — Send a chat message

{
  "type": "message",
  "content": "Why are pods crash-looping in production?",
  "context": {
    "kind": "Deployment",
    "name": "api-server",
    "namespace": "production",
    "gvr": "apps~v1~deployments"
  },
  "fleet": false
}

Field	Type	Required	Description
`type`	`"message"`	yes
`content`	`string`	yes	User's message text
`context`	`ResourceContext`	no	Resource the user is viewing
`fleet`	`boolean`	no	Enable fleet/multi-cluster mode

`ResourceContext`

Field	Type	Required	Description
`kind`	`string`	yes	K8s resource kind (e.g., `"Deployment"`)
`name`	`string`	yes	Resource name
`namespace`	`string`	no	Resource namespace (omit for cluster-scoped)
`gvr`	`string`	no	GVR key (`group~version~plural`)
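
The required and optional fields above can be sketched as a small Python builder. The function name `build_message` is hypothetical; the only checks it performs are the ones stated in the two tables (required `kind` and `name` inside `context`), and optional fields are omitted when not supplied.

```python
import json
from typing import Optional


def build_message(content: str, context: Optional[dict] = None,
                  fleet: Optional[bool] = None) -> str:
    """Serialize a chat `message`, leaving out optional fields that are unset."""
    payload = {"type": "message", "content": content}
    if context is not None:
        # ResourceContext requires `kind` and `name`; the rest is optional.
        missing = [key for key in ("kind", "name") if key not in context]
        if missing:
            raise ValueError(f"context is missing required fields: {missing}")
        payload["context"] = context
    if fleet is not None:
        payload["fleet"] = fleet
    return json.dumps(payload)
```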

`confirm_response` — Respond to a confirmation request

{
  "type": "confirm_response",
  "approved": true,
  "nonce": "abc123..."
}

Field	Type	Required	Description
`type`	`"confirm_response"`	yes
`approved`	`boolean`	yes	Whether the user approved the action
`nonce`	`string`	yes	Must match the nonce from `confirm_request` (replay prevention)

`clear` — Clear conversation history

{
  "type": "clear"
}

Server-to-Client Events

`text_delta` — Streaming text chunk

{
  "type": "text_delta",
  "text": "The pods are crash-looping because"
}

`thinking_delta` — Streaming thinking/reasoning chunk

{
  "type": "thinking_delta",
  "thinking": "Let me check the pod logs first..."
}

`tool_use` — Tool execution started

{
  "type": "tool_use",
  "tool": "get_pod_logs"
}

`component` — Structured UI component from tool result

{
  "type": "component",
  "tool": "list_pods",
  "spec": {
    "kind": "data_table",
    "title": "Pods in production",
    "columns": [
      {"id": "name", "header": "Name"},
      {"id": "status", "header": "Status"}
    ],
    "rows": [
      {"name": "api-server-abc", "status": "Running"}
    ]
  }
}

See Component Specs for all spec.kind values.

`confirm_request` — Request user confirmation for a dangerous action

{
  "type": "confirm_request",
  "tool": "delete_resource",
  "input": {"kind": "Pod", "name": "my-pod", "namespace": "default"},
  "nonce": "abc123..."
}

Field	Type	Description
`tool`	`string`	Tool name requiring confirmation
`input`	`object`	Tool input parameters (shown to user)
`nonce`	`string`	JIT nonce for replay prevention — client must echo this back
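
The nonce round trip can be sketched in Python as follows. The helper name `build_confirm_response` is hypothetical; the sketch simply echoes the nonce from the incoming `confirm_request` into the outgoing `confirm_response`.

```python
def build_confirm_response(confirm_request: dict, approved: bool) -> dict:
    """Answer a confirm_request, echoing its nonce back unchanged."""
    if confirm_request.get("type") != "confirm_request":
        raise ValueError("expected a confirm_request event")
    if "nonce" not in confirm_request:
        raise ValueError("confirm_request carries no nonce to echo")
    return {
        "type": "confirm_response",
        "approved": approved,
        "nonce": confirm_request["nonce"],
    }
```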

`done` — Agent turn complete

{
  "type": "done",
  "full_response": "The pods are crash-looping because..."
}

`error` — Error message

{
  "type": "error",
  "message": "Rate limited. Max 10 messages per minute."
}

`cleared` — Conversation history cleared

{
  "type": "cleared"
}
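
One way a client might consume a single agent turn is sketched below in Python. The function name `collect_turn` is hypothetical, and several choices are assumptions made for brevity: an `error` event is raised as an exception, `done` ends the turn, and `thinking_delta`, `confirm_request` and `cleared` are ignored.

```python
from typing import Dict, Iterable, List


def collect_turn(events: Iterable[dict]) -> Dict[str, object]:
    """Consume server events for one agent turn, stopping at `done`."""
    streamed: List[str] = []
    tools: List[str] = []
    components: List[dict] = []
    for event in events:
        kind = event.get("type")
        if kind == "text_delta":
            streamed.append(event["text"])
        elif kind == "tool_use":
            tools.append(event["tool"])
        elif kind == "component":
            components.append(event["spec"])
        elif kind == "error":
            raise RuntimeError(event["message"])
        elif kind == "done":
            return {
                "streamed_text": "".join(streamed),
                "full_response": event["full_response"],
                "tools": tools,
                "components": components,
            }
        # Other event types are ignored in this sketch.
    raise RuntimeError("event stream ended before a done event")
```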

Monitor Protocol (`/ws/monitor`)

Client-to-Server Messages

`subscribe_monitor` — Subscribe to cluster monitoring

Sent as the first message after connecting to /ws/monitor. Configures the monitoring session.

{
  "type": "subscribe_monitor",
  "trustLevel": 1,
  "autoFixCategories": ["crash_loop", "resource_pressure"]
}

Field	Type	Required	Description
`type`	`"subscribe_monitor"`	yes
`trustLevel`	`integer`	no	Autonomous action trust level (0-4). Clamped to server-configured max. Default: `1`
`autoFixCategories`	`string[]`	no	Categories the agent may auto-fix without prompting

`trigger_scan` — Trigger an immediate cluster scan

{
  "type": "trigger_scan"
}

Triggers an immediate cluster scan. If a scan is already in progress, the agent returns an error instead. Results are pushed as `finding` and `monitor_status` events.

`action_response` — Respond to an autonomous action proposal

{
  "type": "action_response",
  "actionId": "abc123",
  "approved": true
}

Field	Type	Required	Description
`type`	`"action_response"`	yes
`actionId`	`string`	yes	ID of the proposed action
`approved`	`boolean`	yes	Whether the user approved the action

`get_fix_history` — Request fix history

{
  "type": "get_fix_history",
  "page": 1,
  "filters": {"status": "applied", "category": "crash_loop"}
}

Field	Type	Required	Description
`type`	`"get_fix_history"`	yes
`page`	`integer`	no	Page number (default: `1`)
`filters`	`object`	no	Optional filters (`status`, `category`, `since`, `search`)

Server-to-Client Events

`finding` — Cluster issue detected

{
  "type": "finding",
  "id": "f-abc123",
  "severity": "warning",
  "category": "crash_loop",
  "resource": {"kind": "Pod", "name": "api-server-xyz", "namespace": "production"},
  "summary": "Pod crash-looping: CrashLoopBackOff (5 restarts in 10m)",
  "details": "...",
  "timestamp": 1711540800
}

`prediction` — Predicted future issue

{
  "type": "prediction",
  "id": "p-abc123",
  "category": "resource_pressure",
  "resource": {"kind": "Node", "name": "worker-03"},
  "summary": "Node memory predicted to exceed 90% within 2 hours",
  "confidence": 0.87,
  "horizon": "2h",
  "timestamp": 1711540800
}

`action_report` — Result of an autonomous or approved action

{
  "type": "action_report",
  "actionId": "a-abc123",
  "findingId": "f-abc123",
  "action": "restart_pod",
  "status": "applied",
  "summary": "Restarted pod api-server-xyz",
  "before": {},
  "after": {},
  "timestamp": 1711540800
}

action_report may include optional verification fields once post-fix verification completes:

Field	Type
`verificationStatus`	`"verified" | "still_failing"`
`verificationEvidence`	`string`
`verificationTimestamp`	`number`

`investigation_report` — Proactive root-cause analysis for critical findings

{
  "type": "investigation_report",
  "id": "i-abc123",
  "findingId": "f-abc123",
  "category": "crashloop",
  "status": "completed",
  "summary": "Crashloop due to missing ConfigMap key",
  "suspectedCause": "ConfigMap key removed in recent rollout",
  "recommendedFix": "Restore key and restart deployment",
  "confidence": 0.82,
  "timestamp": 1711540800
}

`verification_report` — Next-scan validation after a fix action

{
  "type": "verification_report",
  "id": "v-abc123",
  "actionId": "a-abc123",
  "findingId": "f-abc123",
  "status": "verified",
  "evidence": "No active crashloop findings for affected resources",
  "timestamp": 1711540800
}

`findings_snapshot` — Active findings reconciliation

Sent after each scan cycle. Contains the IDs of all currently active findings. The UI removes any locally held findings whose IDs are not in `activeIds`, which prevents stale entries from accumulating after issues are resolved.

{
  "type": "findings_snapshot",
  "activeIds": ["f-abc123", "f-def456"],
  "timestamp": 1711540800
}

Field	Type	Description
`activeIds`	`string[]`	IDs of all findings that are still active
`timestamp`	`number`	Unix timestamp of the snapshot
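
The reconciliation step described above can be sketched in Python as follows. The function name `reconcile_findings` is hypothetical, and the local store is assumed to be a dictionary keyed by finding ID.

```python
from typing import Dict


def reconcile_findings(local: Dict[str, dict], snapshot: dict) -> Dict[str, dict]:
    """Keep only the locally held findings that the snapshot lists as active."""
    active = set(snapshot["activeIds"])
    return {fid: finding for fid, finding in local.items() if fid in active}
```

Findings that appear in `activeIds` but are not yet held locally are left alone here; they are expected to arrive as separate `finding` events.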

`monitor_status` — Scan cycle status update

{
  "type": "monitor_status",
  "activeWatches": ["crashloop", "pending", "workloads", "nodes", "cert_expiry", "alerts", "oom", "image_pull", "operators", "daemonsets", "hpa"],
  "lastScan": 1711540800,
  "findingsCount": 3,
  "nextScan": 1711540860
}

`fix_history` — Response to `get_fix_history`

{
  "type": "fix_history",
  "items": [],
  "total": 0,
  "page": 1,
  "pageSize": 20
}
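
A client that wants to render pagination controls could derive the number of pages from `total` and `pageSize`. The Python sketch below is illustrative only; the function name `page_count` is hypothetical, and rejecting a non-positive `pageSize` is an assumption.

```python
import math


def page_count(fix_history: dict) -> int:
    """Number of pages implied by a fix_history response."""
    page_size = fix_history["pageSize"]
    if page_size <= 0:
        raise ValueError("pageSize must be positive")
    return math.ceil(fix_history["total"] / page_size)
```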

`error` — Rate limit or other errors

{
  "type": "error",
  "message": "Rate limited. Max 10 messages per minute."
}

Agent Protocol (`/ws/agent`)

The /ws/agent endpoint uses the same client-to-server and server-to-client message types as the chat protocol (/ws/sre, /ws/security). The difference is that each incoming message is classified by an intent classifier (orchestrator.py) and automatically routed to the appropriate agent (SRE or Security) with the correct system prompt and tool set.

Client-to-Server Messages

message: {type, content, context?, fleet?} — same as chat protocol
confirm_response: {type, approved, nonce} — same as chat protocol
clear: {type} — clears conversation history

Server-to-Client Events

text_delta, thinking_delta, tool_use, component, confirm_request (with nonce), done, error, cleared — same as chat protocol

Component Specs

Structured UI components returned by agent tools via the component event. The UI renders these inline in the chat.

`kind`	Description	Key Fields
`data_table`	Sortable table	`columns[]`, `rows[]`
`info_card_grid`	Metric cards	`cards[]{label, value, sub?}`
`badge_list`	Colored badges	`badges[]{text, variant}`
`status_list`	Health status items	`items[]{name, status, detail?}`
`key_value`	Key-value pairs	`pairs[]{key, value}`
`chart`	Time-series chart	`series[]{label, data[][], color?}`
`tabs`	Tabbed content	`tabs[]{label, content: ComponentSpec}`
`grid`	Grid layout	`columns`, `items: ComponentSpec[]`
`section`	Titled section	`title`, `content: ComponentSpec`

Badge Variants

success | warning | error | info | default

Status Values

healthy | warning | error | pending | unknown
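
Because `tabs`, `grid` and `section` embed further `ComponentSpec` values, a renderer has to walk specs recursively. The Python sketch below lists every `kind` in a spec in depth-first order; the function name `collect_kinds` is hypothetical, and the nesting fields are taken from the Key Fields column above.

```python
from typing import List


def collect_kinds(spec: dict) -> List[str]:
    """Return every `kind` found in a spec, parents before children."""
    kind = spec["kind"]
    kinds = [kind]
    if kind == "tabs":
        # tabs[]{label, content: ComponentSpec}
        for tab in spec.get("tabs", []):
            kinds.extend(collect_kinds(tab["content"]))
    elif kind == "grid":
        # items: ComponentSpec[]
        for item in spec.get("items", []):
            kinds.extend(collect_kinds(item))
    elif kind == "section":
        # content: ComponentSpec
        kinds.extend(collect_kinds(spec["content"]))
    return kinds
```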

Constraints

Constraint	Value	Enforced By
Max message size	1 MB	Agent
Rate limit	10 messages/minute per connection	Agent
Confirmation timeout	120 seconds	Agent
Pending confirmation TTL	5 minutes	Agent
Context field validation	`^[a-zA-Z0-9\-._/: ]{0,253}$`	Agent
Reconnect attempts	5 max, exponential backoff + jitter	UI
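
Two of these constraints can be sketched in Python: the context field pattern and the reconnect schedule. The names `is_valid_context_field` and `reconnect_delays` are hypothetical, and the 1-second base, 30-second cap and "full jitter" strategy are assumptions; the contract only specifies 5 attempts with exponential backoff and jitter.

```python
import random
import re
from typing import List, Optional

CONTEXT_FIELD_RE = re.compile(r"^[a-zA-Z0-9\-._/: ]{0,253}$")
MAX_RECONNECT_ATTEMPTS = 5


def is_valid_context_field(value: str) -> bool:
    """Check a context field against the pattern from the table above."""
    # fullmatch rejects a trailing newline, which `$` alone tolerates in Python.
    return CONTEXT_FIELD_RE.fullmatch(value) is not None


def reconnect_delays(base: float = 1.0, cap: float = 30.0,
                     rng: Optional[random.Random] = None) -> List[float]:
    """Delays in seconds for each reconnect attempt (assumed base and cap)."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(MAX_RECONNECT_ATTEMPTS):
        # The ceiling doubles per attempt; jitter picks a point below it.
        ceiling = min(cap, base * 2 ** attempt)
        delays.append(rng.uniform(0, ceiling))
    return delays
```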

Version Compatibility

The UI sends a GET /version request before connecting. If the agent's protocol field doesn't match the UI's EXPECTED_PROTOCOL, the UI shows a warning but still connects (graceful degradation).
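
The check can be sketched in Python as follows, although the real UI is not necessarily written in Python. The function name `protocol_warning` and the warning text are hypothetical; `EXPECTED_PROTOCOL` is set to `"2"` here only to match the `/version` example in this document.

```python
import json
from typing import Optional

EXPECTED_PROTOCOL = "2"


def protocol_warning(version_body: str,
                     expected: str = EXPECTED_PROTOCOL) -> Optional[str]:
    """Return a warning string on protocol mismatch, or None when it matches."""
    payload = json.loads(version_body)
    # Compare as strings, since the /version example reports protocol as "2".
    actual = str(payload.get("protocol"))
    if actual == expected:
        return None
    return f"Agent speaks protocol {actual}, UI expects {expected}"
```

In line with the graceful degradation described above, a caller would display the returned warning and then connect anyway.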

Protocol Version History

Version	Changes	UI Version	Agent Version
`2`	`/ws/monitor` for autonomous scanning, `/ws/agent` for auto-routing orchestration, `subscribe_monitor` / `trigger_scan` / `action_response` / `get_fix_history` client messages, `finding` / `prediction` / `action_report` / `investigation_report` / `verification_report` / `findings_snapshot` / `monitor_status` server events, fix history / predictions / memory / context REST endpoints, monitor pause/resume, nonce-based confirmation replay prevention	v5.12.0+	v1.4.0+
`1`	Initial protocol: text/thinking streaming, tool use, components, confirmations	v5.0.0+	v1.0.0+

Release Compatibility Matrix

UI Version	Agent Version	Protocol	Status
v6.2.0	v2.3.0	2	Current
v6.1.0	v2.2.0	2	Compatible
v6.0.0	v2.0.0	2	Compatible
v5.21.0	v1.16.0	2	Compatible
v5.13.0	v1.5.0	2	Compatible
v5.12.0	v1.4.0	2	Compatible
v5.11.0	v1.3.0	1	Compatible
v5.10.0	v1.3.0	1	Compatible
v5.8.0	v1.2.0	1	Compatible
v5.0.0-v5.7.0	v1.0.0-v1.1.0	1	Compatible

Both repos should tag releases together when protocol changes occur. Minor UI/Agent releases within the same protocol version are always compatible.
