1 change: 1 addition & 0 deletions CLAUDE.md
@@ -277,6 +277,7 @@ See `docs/concepts/operating-modes.md` for the full guide, VS Code config, and M
| `packages/sdk/src/audit/index.ts` | Audit rules engine |
| `packages/sdk/src/generate/index.ts` | Adapter registry |
| `packages/adapter-langgraph/src/index.ts` | LangGraph adapter (auto-registers) |
| `packages/otel/src/provider.ts` | OTel TracerProvider + OTLP exporter setup |
| `packages/cli/src/cli.ts` | CLI entrypoint |
| `examples/gymcoach/agent.yaml` | Full GymCoach manifest example |

4 changes: 3 additions & 1 deletion docs/.vitepress/config.mts
@@ -47,6 +47,7 @@ export default defineConfig({
{ text: 'Runtime Introspection', link: '/concepts/runtime-introspection' },
{ text: 'Compliance', link: '/concepts/compliance' },
{ text: 'Probe Coverage', link: '/concepts/probe-coverage' },
{ text: 'Observability', link: '/concepts/observability' },
{ text: 'OPA Policies', link: '/concepts/opa' },
{ text: 'Adapters', link: '/concepts/adapters' },
],
@@ -60,7 +61,8 @@ export default defineConfig({
items: [
{ text: 'Add Tools', link: '/guides/add-tools' },
{ text: 'Add Memory', link: '/guides/add-memory' },
{ text: 'Add Guardrails', link: '/guides/add-guardrails' },
{ text: 'Add Guardrails', link: '/guides/add-guardrails' },
{ text: 'Add Observability', link: '/guides/add-observability' },
],
},
{
109 changes: 109 additions & 0 deletions docs/concepts/observability.md
@@ -0,0 +1,109 @@
# Observability

AgentSpec provides native OpenTelemetry trace export so you can see exactly what your agent infrastructure is doing -- every proxied request, health check, and heartbeat -- in your existing observability stack.

## How It Works

```
               agent.yaml
spec.observability.tracing.backend: otel
                    |
                    v
             @agentspec/otel
         (shared TracerProvider)
              /            \
          v                    v
    Sidecar Proxy        SDK Reporter
   (request spans)      (health spans)
          |                    |
          +-- traceparent ---->+
          |                    |
          v                    v
    OTLP Exporter --> Jaeger / Grafana / Datadog
```

The **sidecar** creates a root span for every proxied request and injects a W3C `traceparent` header into the upstream call. The **SDK reporter** creates spans around health refreshes and heartbeat pushes. When both are running, spans from both layers join the same distributed trace.

## Declare Tracing in Your Manifest

```yaml
observability:
  tracing:
    backend: otel
    endpoint: $env:OTEL_EXPORTER_OTLP_ENDPOINT
    sampleRate: 1.0
  metrics:
    serviceName: my-agent
```

When `backend` is `otel`, the sidecar and SDK reporter automatically start exporting spans via OTLP. No extra configuration or environment-variable toggle is needed.

## What Gets Traced

### Sidecar spans

| Span name | When | Attributes |
|-----------|------|------------|
| `proxy:request` | Every proxied request | `http.method`, `http.url`, `http.status_code`, `http.request_id`, `agentspec.agent.name` |
| `cp:GET /health/ready`, `cp:GET /explore`, `cp:GET /gap` | Control plane requests | `http.method`, `http.url`, `http.status_code` |
| `events:ingest` | POST /events batch | `agentspec.events.request_id`, `agentspec.events.count`, `agentspec.events.opa_violations` |
| `explain:generate` | GET /explain/:id | `agentspec.explain.request_id` |

### SDK reporter spans

| Span name | When | Attributes |
|-----------|------|------------|
| `reporter:health-refresh` | Every health check cycle | `agentspec.health.status`, `agentspec.health.passed`, `agentspec.health.failed` |
| `reporter:heartbeat` | Every push-mode heartbeat | `agentspec.heartbeat.payload_bytes`, `http.status_code` |
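
As an illustration of how the attribute names in these tables fit together, the sketch below builds the attribute set for a `proxy:request` span in plain TypeScript. The `ProxiedRequest` shape and the `proxyRequestAttributes` helper are hypothetical and are not part of the sidecar's code; only the attribute keys come from the table above.

```typescript
// Illustrative only: the input shape and helper name are assumptions.
// The attribute keys are the ones documented for `proxy:request`.

type AttributeValue = string | number | boolean

interface ProxiedRequest {
  method: string
  url: string
  statusCode: number
  requestId: string
}

function proxyRequestAttributes(
  req: ProxiedRequest,
  agentName: string,
): Record<string, AttributeValue> {
  return {
    'http.method': req.method,
    'http.url': req.url,
    'http.status_code': req.statusCode,
    'http.request_id': req.requestId,
    'agentspec.agent.name': agentName,
  }
}
```

These keys are also what you would filter on in your tracing backend, for example searching for spans where `agentspec.agent.name` equals your agent's name.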

## Distributed Tracing

The sidecar injects a W3C `traceparent` header into every request it forwards to your agent. If your agent uses the SDK reporter, its spans automatically become children of the sidecar's proxy span:

```
[proxy:request] POST /chat (sidecar, root)
+-- [upstream] POST http://agent:8000/chat
+-- [reporter:health-refresh] (agent process, child)
```

This gives you a single trace spanning both the infrastructure layer and the agent's internal operations.
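
For reference, a `traceparent` header follows the W3C Trace Context layout `version-traceid-parentid-flags`. The sketch below formats and parses a version `00` header in plain TypeScript. The helper names are illustrative and not exported by any AgentSpec package; a real deployment would normally rely on an OpenTelemetry propagator rather than hand-rolled parsing.

```typescript
// Illustrative only: hand-rolled W3C Trace Context helpers (version 00).
// `formatTraceparent` and `parseTraceparent` are hypothetical names.

interface TraceContext {
  traceId: string  // 32 lowercase hex characters, not all zeros
  spanId: string   // 16 lowercase hex characters, not all zeros
  sampled: boolean // bit 0 of the trace-flags byte
}

const TRACEPARENT_RE = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/

function formatTraceparent(ctx: TraceContext): string {
  const flags = ctx.sampled ? '01' : '00'
  return `00-${ctx.traceId}-${ctx.spanId}-${flags}`
}

function parseTraceparent(header: string): TraceContext | null {
  const match = TRACEPARENT_RE.exec(header.trim())
  if (match === null) return null
  const [, traceId = '', spanId = '', flags = '00'] = match
  // All-zero trace or span IDs are invalid per the specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 0x01) === 1 }
}
```

A downstream process that parses the header can start its own spans with the same `traceId`, which is what lets both layers appear in one trace.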

## Configuration Reference

| Manifest field | Maps to | Default |
|---|---|---|
| `spec.observability.tracing.endpoint` | OTLP collector URL | `OTEL_EXPORTER_OTLP_ENDPOINT` env var, then `http://localhost:4318` |
| `spec.observability.tracing.sampleRate` | Fraction of requests traced (0.0-1.0) | `1.0` |
| `spec.observability.metrics.serviceName` | OTel resource `service.name` | `metadata.name` |

**Protocol**: Both HTTP/protobuf and gRPC are supported. Set `OTEL_EXPORTER_OTLP_PROTOCOL=grpc` for gRPC; the default is `http/protobuf`.

**Endpoint references**: The `endpoint` field supports `$env:`, `$secret:`, and `$file:` references, resolved via the standard SDK `resolveRef()`.
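
The fallback order in the table can be sketched as a small pure function. This is a minimal illustration under stated assumptions, not the actual `@agentspec/otel` implementation: `ManifestLike`, `resolveEnvRef`, and `resolveTracingConfig` are hypothetical names, and only `$env:` references are handled (the SDK's `resolveRef()` also covers `$secret:` and `$file:`).

```typescript
// Hypothetical sketch of the documented defaults. None of these names
// are the SDK's real API.

type Env = Record<string, string | undefined>

interface ManifestLike {
  metadata: { name: string }
  spec: {
    observability?: {
      tracing?: { backend?: string; endpoint?: string; sampleRate?: number }
      metrics?: { serviceName?: string }
    }
  }
}

interface TracingConfig {
  endpoint: string
  sampleRate: number
  serviceName: string
}

// Treat undefined and empty strings alike so the fallbacks still apply.
function nonEmpty(value: string | undefined): string | undefined {
  return value === undefined || value === '' ? undefined : value
}

// Resolves `$env:NAME` only; other reference kinds are out of scope here.
function resolveEnvRef(value: string | undefined, env: Env): string | undefined {
  if (value === undefined || !value.startsWith('$env:')) return nonEmpty(value)
  return nonEmpty(env[value.slice('$env:'.length)])
}

function resolveTracingConfig(manifest: ManifestLike, env: Env): TracingConfig | null {
  const observability = manifest.spec.observability
  const tracing = observability?.tracing
  // Tracing is only enabled when the backend is declared as `otel`.
  if (tracing === undefined || tracing.backend !== 'otel') return null

  return {
    // Manifest field, then the env var, then the local collector default.
    endpoint:
      resolveEnvRef(tracing.endpoint, env) ??
      nonEmpty(env['OTEL_EXPORTER_OTLP_ENDPOINT']) ??
      'http://localhost:4318',
    sampleRate: tracing.sampleRate ?? 1.0,
    serviceName: observability?.metrics?.serviceName ?? manifest.metadata.name,
  }
}
```

Calling `resolveTracingConfig(manifest, process.env)` with a manifest that omits `endpoint`, `sampleRate`, and `serviceName` would yield the three defaults listed in the table.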

## Graceful Degradation

The SDK uses `@opentelemetry/api` directly, which returns no-op tracers when no provider is registered. This means:

- Agents that don't declare `backend: otel` pay effectively no tracing overhead, because every span call resolves to a no-op
- The SDK remains usable without `@agentspec/otel` installed (it's an optional dependency)
- No conditional imports or runtime feature detection needed
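
The pattern behind these points can be modelled in a few lines. The sketch below is a simplified stand-in for the "no-op until a provider is registered" behaviour, not the `@opentelemetry/api` source; every name in it is illustrative, and the `'healthy'` status value is a placeholder.

```typescript
// Simplified model of a tracer registry with a no-op fallback.
// This is NOT @opentelemetry/api; all names here are hypothetical.

interface Span {
  setAttribute(key: string, value: string | number | boolean): void
  end(): void
}

interface Tracer {
  startSpan(name: string): Span
}

const NOOP_SPAN: Span = {
  setAttribute() {},
  end() {},
}

const NOOP_TRACER: Tracer = {
  startSpan: () => NOOP_SPAN,
}

let registeredTracer: Tracer | undefined

function registerTracer(tracer: Tracer): void {
  registeredTracer = tracer
}

// Falls back to the no-op tracer when nothing has been registered.
function getTracer(): Tracer {
  return registeredTracer ?? NOOP_TRACER
}

// Instrumented code is identical whether or not tracing is enabled.
function refreshHealth(): string {
  const span = getTracer().startSpan('reporter:health-refresh')
  span.setAttribute('agentspec.health.status', 'healthy') // placeholder value
  span.end()
  return 'healthy'
}
```

Because the call sites never branch on whether tracing is enabled, the instrumented code needs no feature detection; registering a tracer is the only switch.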

## Compliance Rules

Three audit rules check observability configuration:

| Rule | Check | Severity |
|------|-------|----------|
| OBS-01 | Tracing backend declared | Medium |
| OBS-02 | Structured logging enabled | Low |
| OBS-03 | Sensitive fields redacted from logs | Medium |

Run `agentspec audit agent.yaml` to check compliance.

## See also

- [Runtime Introspection](/concepts/runtime-introspection) -- live health reporting from inside your agent
- [Health Checks](/concepts/health-checks) -- pre-flight CLI checks
- [Add Runtime Health](/guides/add-runtime-health) -- integrate the SDK reporter
119 changes: 119 additions & 0 deletions docs/guides/add-observability.md
@@ -0,0 +1,119 @@
# Add OpenTelemetry Tracing

Export distributed traces from your AgentSpec sidecar and SDK reporter to any OTLP-compatible backend (Jaeger, Grafana Tempo, Honeycomb, Datadog).

## Prerequisites

- [ ] An `agent.yaml` manifest
- [ ] The AgentSpec sidecar running (`agentspec-sidecar`)
- [ ] An OTLP collector or compatible backend (e.g. Jaeger, Grafana Tempo)

## 1. Declare tracing in your manifest

Add the `observability.tracing` section to your `agent.yaml`:

```yaml
observability:
  tracing:
    backend: otel
    endpoint: $env:OTEL_EXPORTER_OTLP_ENDPOINT
    sampleRate: 1.0
  metrics:
    serviceName: my-agent
```

| Field | Required | Description |
|-------|----------|-------------|
| `backend` | Yes | Set to `otel` for OpenTelemetry export |
| `endpoint` | No | OTLP collector URL. Falls back to `OTEL_EXPORTER_OTLP_ENDPOINT` env var, then `http://localhost:4318` |
| `sampleRate` | No | Fraction of requests to trace (0.0-1.0). Default: `1.0` |
| `serviceName` | No | OTel `service.name` resource attribute. Default: `metadata.name` |

## 2. Set the endpoint environment variable

```bash
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
```

For gRPC backends (port 4317):

```bash
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
export OTEL_EXPORTER_OTLP_PROTOCOL=grpc
```
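
A common mistake is pairing one protocol with the other protocol's port. The sketch below is a hypothetical pre-flight check, not something the sidecar is documented to perform: `protocolFromEnv` and `checkOtlpEndpoint` are made-up helper names, and the only facts assumed are the conventional OTLP ports (4318 for HTTP, 4317 for gRPC) and the `http/protobuf` default described above.

```typescript
// Hypothetical pre-flight check for protocol/port mismatches.

type OtlpProtocol = 'http/protobuf' | 'grpc'

// Conventional OTLP receiver ports.
const DEFAULT_PORT: Record<OtlpProtocol, string> = {
  'http/protobuf': '4318',
  grpc: '4317',
}

function protocolFromEnv(env: Record<string, string | undefined>): OtlpProtocol {
  return env['OTEL_EXPORTER_OTLP_PROTOCOL'] === 'grpc' ? 'grpc' : 'http/protobuf'
}

// Returns a warning string when the port looks like it belongs to the
// other protocol or the URL is invalid; returns null otherwise.
function checkOtlpEndpoint(endpoint: string, protocol: OtlpProtocol): string | null {
  let port: string
  try {
    port = new URL(endpoint).port
  } catch {
    return `"${endpoint}" is not a valid URL`
  }
  const other: OtlpProtocol = protocol === 'grpc' ? 'http/protobuf' : 'grpc'
  if (port === DEFAULT_PORT[other]) {
    return `port ${port} is the conventional ${other} port, but the protocol is ${protocol}`
  }
  return null
}
```

For example, `checkOtlpEndpoint('http://localhost:4317', protocolFromEnv({}))` returns a warning, because the endpoint uses the gRPC port while the protocol defaults to `http/protobuf`.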

## 3. Start the sidecar

```bash
MANIFEST_PATH=./agent.yaml agentspec-sidecar
```

You should see a line like this in the logs:

```json
{"ts":"2026-04-12T...","level":"info","msg":"otel tracing initialized","endpoint":"http://localhost:4318","serviceName":"my-agent","sampleRate":1}
```

The sidecar now exports spans for every proxied request and control plane query.

## 4. Add tracing to your agent (optional)

If your agent uses `@agentspec/sdk`, the reporter automatically emits spans for health checks and heartbeats:

```typescript
import { AgentSpecReporter } from '@agentspec/sdk'

const reporter = new AgentSpecReporter(manifest, { refreshIntervalMs: 30_000 })
reporter.start()
```

No additional OTel setup is needed in the agent. The reporter uses `@opentelemetry/api` directly, and its spans join the sidecar's trace via the `traceparent` header.

## 5. Verify traces

### With Jaeger

```bash
docker run -d --name jaeger \
  -p 16686:16686 \
  -p 4318:4318 \
  jaegertracing/all-in-one:latest
```

Open http://localhost:16686, select your service name, and find traces. You should see:

- `proxy:request` spans for every proxied request
- `cp:GET /health/ready`, `cp:GET /explore`, etc. for control plane calls
- `reporter:health-refresh` spans if the SDK reporter is running

### With Grafana Tempo

Point `OTEL_EXPORTER_OTLP_ENDPOINT` to your Tempo instance's OTLP receiver and query traces in Grafana.

## Tuning for production

### Reduce trace volume

Set `sampleRate` below 1.0 for high-traffic agents:

```yaml
observability:
  tracing:
    backend: otel
    sampleRate: 0.1  # trace 10% of requests
```
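
To make the effect of `sampleRate` concrete, here is a simplified sketch of ratio-based sampling keyed on the trace ID. It is an assumption about how such a sampler can work, not a description of the sampler AgentSpec actually configures, and `shouldSample` is a hypothetical name.

```typescript
// Simplified trace-ID ratio sampling. The decision depends only on the
// trace ID, so every process that sees the same trace decides the same way.

function shouldSample(traceId: string, sampleRate: number): boolean {
  if (sampleRate >= 1) return true
  if (sampleRate <= 0) return false
  // Interpret the first 8 hex characters as an integer in [0, 2^32).
  const bucket = parseInt(traceId.slice(0, 8), 16)
  return bucket < sampleRate * 2 ** 32
}
```

Deriving the decision from the trace ID rather than a random draw keeps traces whole: a trace is either kept by every participant or dropped by all of them.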

### Use gRPC for high throughput

gRPC is typically more efficient than HTTP/protobuf for large trace volumes, for example when exporting to an in-cluster collector on Kubernetes:

```bash
export OTEL_EXPORTER_OTLP_PROTOCOL=grpc
export OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
```

## See also

- [Observability concepts](/concepts/observability) -- architecture and span reference
- [Runtime Introspection](/concepts/runtime-introspection) -- live health reporting
- [Add Runtime Health](/guides/add-runtime-health) -- integrate the SDK reporter
41 changes: 41 additions & 0 deletions packages/otel/package.json
@@ -0,0 +1,41 @@
{
  "name": "@agentspec/otel",
  "version": "0.1.0",
  "description": "AgentSpec OpenTelemetry tracing - shared provider, exporter, and context propagation",
  "license": "Apache-2.0",
  "type": "module",
  "main": "./dist/index.cjs",
  "module": "./dist/index.js",
  "types": "./dist/index.d.ts",
  "exports": {
    ".": {
      "types": "./dist/index.d.ts",
      "import": "./dist/index.js",
      "require": "./dist/index.cjs"
    }
  },
  "files": ["dist"],
  "scripts": {
    "build": "tsup",
    "dev": "tsup --watch",
    "test": "vitest run",
    "typecheck": "tsc --noEmit",
    "lint": "tsc --noEmit",
    "clean": "rm -rf dist"
  },
  "dependencies": {
    "@opentelemetry/api": "^1.9.0",
    "@opentelemetry/core": "^1.30.0",
    "@opentelemetry/sdk-trace-base": "^1.30.0",
    "@opentelemetry/exporter-trace-otlp-http": "^0.57.0",
    "@opentelemetry/exporter-trace-otlp-grpc": "^0.57.0",
    "@opentelemetry/resources": "^1.30.0",
    "@opentelemetry/semantic-conventions": "^1.30.0"
  },
  "devDependencies": {
    "@types/node": "^20.17.0",
    "tsup": "^8.3.5",
    "typescript": "^5.7.2",
    "vitest": "^2.1.8"
  }
}