@@ -5,10 +5,14 @@ AgentSpec's compliance system scores your agent against security and quality bes
55## Running an Audit
66
77``` bash
8+ # Declaration checks only (no I/O)
89npx agentspec audit agent.yaml
10+
11+ # Declaration checks plus proof records from a running sidecar (dual score)
12+ npx agentspec audit agent.yaml --url http://localhost:4001
913```
1014
11- Output:
15+ Output without `--url`:
1216```
1317 AgentSpec Audit — my-agent
1418 ────────────────────────────────
@@ -23,89 +27,159 @@ Output:
2327
2428 Violations (4)
2529
26- [critical] SEC-LLM-06 — Sensitive data disclosure: PII scrub in memory hygiene
30+ [critical] [X] SEC-LLM-06 — Sensitive data disclosure: PII scrub in memory hygiene
2731 Long-term memory declared without piiScrubFields — PII may be persisted.
2832 Path: /spec/memory/hygiene/piiScrubFields
2933 → Add spec.memory.hygiene.piiScrubFields: [ssn, credit_card, bank_account]
30- https://owasp.org/www-project-top-10-for-large-language-model-applications/
34+ → Prove: Microsoft Presidio
35+ https://microsoft.github.io/presidio/
36+ ```
37+
38+ Output with `--url http://localhost:4001` (dual score):
39+ ```
40+ AgentSpec Audit — research-agent
41+ ══════════════════════════════════
42+ Declared score : C 65/100 — what your spec says
43+ Proved score : F 35/100 — what has been verified
44+ Pending proof : 4 rules — run external tools and POST to http://localhost:4001/proof/rule/:ruleId
45+ Rules : 18 passed / 4 failed / 22 total
3146```
3247
3348## Evidence Tiers
3449
35- Every audit rule and gap issue carries an ** evidence tier** label that tells you what kind of evidence backs the finding:
50+ Every audit rule carries an **evidence tier** label that tells you what kind of evidence backs the finding:
3651
37- | Badge | Tier | Meaning |
38- | -------| ------| ---------|
39- | ` [D] ` | Declarative | Manifest analysis only — we read the YAML, no I/O required |
40- | ` [P] ` | Probed | Health check verified at infrastructure level (` agentspec health ` ) |
41- | ` [B] ` | Behavioral | Runtime events confirmed actual execution (sdk-langgraph + EventPush) |
52+ | Badge | Tier | Meaning | How to prove |
53+ | ----- | ---- | ------- | ------------ |
54+ | `[D]` | Declarative | Manifest analysis only: reads the YAML, no I/O | (always available) |
55+ | `[P]` | Probed | Health check verified at infrastructure level | `agentspec health <file>` |
56+ | `[B]` | Behavioral | Runtime events confirmed actual execution | AgentSpec EventPush + sidecar |
57+ | `[X]` | External | Proved by an external CI tool (k6, Presidio, Promptfoo, LiteLLM) | POST to `/proof/rule/:ruleId` |
4258
43- All current audit rules are ` [D] ` — declarative. The grade (A–F) reflects manifest declarations only.
59+ ### Declared vs Proved
4460
45- The ` agentspec audit ` output shows ` [D] ` badges next to each violation :
61+ The **declared score** reflects what your `agent.yaml` says, so it shows only that the relevant fields are filled in. The **proved score** shows what has actually been verified:
4662
4763```
48- [critical] [D] SEC-LLM-06 — Sensitive data disclosure
49- Long-term memory declared without piiScrubFields
50- → Add spec.memory.hygiene.piiScrubFields: [ssn, credit_card, bank_account]
64+ Declared score: 65 C ← you said it; we checked the YAML
65+ Proved score: 35 F ← only this fraction has been independently verified
66+ Pending proof: 4 rules ← these pass declaratively but need external tool verification
67+ ```
5168
52- Evidence Breakdown
53- [D] Declarative 18/22 (manifest declarations)
54- [P] Probed N/A (run `agentspec health <file>` for live checks)
55- [B] Behavioral N/A (no runtime events — deploy with sdk-langgraph + EventPush)
69+ Use the sidecar proof endpoint to submit verification results:
70+ ``` bash
71+ # After k6 rate limit test passes
72+ curl -X POST http://localhost:4001/proof/rule/SEC-LLM-04 \
73+ -H 'Content-Type: application/json' \
74+ -d ' {"verifiedBy":"k6","method":"1200 req/min, 429 at 1000 — 100% enforced"}'
5675```
5776
58- See [ Probe Coverage] ( ./probe-coverage.md ) for a complete field-by-field matrix of what each tier verifies.
77+ See [Proof Integration Guide](../guides/proof-integration.md) for tool-by-tool instructions.
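
The curl call above can be wrapped in a small helper so a CI job submits a proof only after its external test passes. This is a minimal sketch, not part of AgentSpec: the function names `json_escape`, `proof_body`, and `submit_proof` and the `SIDECAR_URL` variable are hypothetical, and the body carries only the two fields shown in the example (`verifiedBy`, `method`).

```bash
#!/usr/bin/env bash
# Sketch only: helper names and SIDECAR_URL are hypothetical, not part of AgentSpec.

# Escape backslashes and double quotes so a value is safe inside a JSON string.
# Control characters and newlines are not handled here.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Build the proof body from the two fields used in the curl example.
proof_body() {
  printf '{"verifiedBy":"%s","method":"%s"}' "$(json_escape "$1")" "$(json_escape "$2")"
}

# POST a proof record for one rule; -f makes curl exit non-zero on HTTP errors.
submit_proof() {
  local rule_id=$1 verified_by=$2 method=$3
  local base_url=${SIDECAR_URL:-http://localhost:4001}
  curl -fsS -X POST "$base_url/proof/rule/$rule_id" \
    -H 'Content-Type: application/json' \
    -d "$(proof_body "$verified_by" "$method")"
}
```

A CI step could then run something like `k6 run ratelimit.js && submit_proof SEC-LLM-04 k6 '1200 req/min, 429 at 1000'`, where `ratelimit.js` is a placeholder script name. This assumes the k6 script defines thresholds, so a failed load test exits non-zero and no proof is posted.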
78+
79+ ## Rule Classification
80+
81+ All 25 rules are classified by evidence tier:
82+
83+ ### Probed — verified by ` agentspec health `
84+
85+ | Rule | Description | Severity |
86+ | ------| -------------| ----------|
87+ | SEC-LLM-03 | System prompt loaded from versioned ` $file: ` | medium |
88+ | SEC-LLM-05 | Model provider and version pinned | medium |
89+ | SEC-LLM-09 | Evaluation framework + CI gate configured | medium |
90+ | SEC-LLM-10 | API keys use ` $secret: ` not ` $env: ` | high |
91+ | MODEL-02 | Model version pinned (not "latest") | medium |
92+ | MEM-02 | TTL set for all memory backends | high |
93+ | MEM-03 | Audit log enabled | medium |
94+ | MEM-04 | Vector store namespace isolated | medium |
95+ | MEM-05 | Short-term memory max tokens bounded | low |
96+ | EVAL-01 | Evaluation dataset declared | medium |
97+ | EVAL-02 | CI gate enabled | medium |
98+ | EVAL-03 | Hallucination threshold configured | medium |
99+ | OBS-01 | Tracing backend declared | medium |
100+
101+ ### Behavioral — verified by runtime events
102+
103+ | Rule | Description | Severity | Proof tool |
104+ | ------| -------------| ----------| -----------|
105+ | SEC-LLM-01 | Input guardrail actually invoked | high | AgentSpec EventPush |
106+ | SEC-LLM-02 | Output guardrail actually invoked | high | AgentSpec EventPush |
107+ | OBS-02 | Log lines contain structured JSON | low | AgentSpec EventPush |
108+
109+ ### External — verified by dedicated CI tools
110+
111+ | Rule | Description | Severity | Proof tool |
112+ | ------| -------------| ----------| -----------|
113+ | SEC-LLM-04 | Rate limit enforced under load | medium | [k6](https://k6.io) |
114+ | SEC-LLM-06 | PII actually scrubbed from memory | **critical** | [Microsoft Presidio](https://microsoft.github.io/presidio/) |
115+ | SEC-LLM-07 | Tool annotations respected by agent | medium | [Promptfoo](https://promptfoo.dev) |
116+ | SEC-LLM-08 | Destructive tools flagged and constrained | high | [Promptfoo](https://promptfoo.dev) |
117+ | MODEL-01 | Fallback actually invoked on failure | high | [LiteLLM chaos test](https://docs.litellm.ai/docs/proxy/reliability) |
118+ | MODEL-03 | Cost controls enforced by spend tracker | medium | [LiteLLM Spend Tracking](https://docs.litellm.ai/docs/proxy/cost_tracking) |
119+ | MODEL-04 | Retry strategy works correctly | low | [pytest-mockllm](https://pypi.org/project/pytest-mockllm/) |
120+ | MEM-01 | PII scrub fields actually prevent PII storage | **critical** | [Microsoft Presidio](https://microsoft.github.io/presidio/) |
121+ | OBS-03 | Log redaction prevents PII in log aggregators | medium | [Microsoft Presidio](https://microsoft.github.io/presidio/) |
59122
60123## Compliance Packs
61124
62125### ` owasp-llm-top10 `
63126
6412710 rules aligned to [OWASP LLM Top 10 (2025)](https://owasp.org/www-project-top-10-for-large-language-model-applications/):
65128
66- | Rule ID | Description | Severity |
67- | ---------| -------------| ----------|
68- | SEC-LLM-01 | Input guardrail required (prompt injection) | high |
69- | SEC-LLM-02 | Output guardrail required (insecure output) | high |
70- | SEC-LLM-04 | Rate limiting + cost controls (model DoS) | medium |
71- | SEC-LLM-05 | Model provider and version pinned | medium |
72- | SEC-LLM-06 | PII scrub for long-term memory | ** critical** |
73- | SEC-LLM-07 | Tool annotations declared | medium |
74- | SEC-LLM-08 | destructiveHint on all tools | high |
75- | SEC-LLM-09 | Evaluation + CI gate | medium |
76- | SEC-LLM-10 | API keys use $secret not $env | high |
129+ | Rule ID | Description | Severity | Tier |
130+ | ---------| -------------| ----------| ------|
131+ | SEC-LLM-01 | Input guardrail required (prompt injection) | high | [B] |
132+ | SEC-LLM-02 | Output guardrail required (insecure output) | high | [B] |
133+ | SEC-LLM-03 | System prompt loaded from versioned file | medium | [P] |
134+ | SEC-LLM-04 | Rate limiting + cost controls (model DoS) | medium | [X] |
135+ | SEC-LLM-05 | Model provider and version pinned | medium | [P] |
136+ | SEC-LLM-06 | PII scrub for long-term memory | **critical** | [X] |
137+ | SEC-LLM-07 | Tool annotations declared | medium | [X] |
138+ | SEC-LLM-08 | destructiveHint on all tools | high | [X] |
139+ | SEC-LLM-09 | Evaluation + CI gate | medium | [P] |
140+ | SEC-LLM-10 | API keys use $secret not $env | high | [P] |
77141
78142### ` model-resilience `
79143
80- | Rule ID | Description | Severity |
81- | ---------| -------------| ----------|
82- | MODEL-01 | Fallback model declared | high |
83- | MODEL-02 | Model version pinned (not "latest") | medium |
84- | MODEL-03 | Cost controls declared | medium |
85- | MODEL-04 | Fallback retry strategy | low |
144+ | Rule ID | Description | Severity | Tier |
145+ | ---------| -------------| ----------| ------ |
146+ | MODEL-01 | Fallback model declared | high | [X] |
147+ | MODEL-02 | Model version pinned (not "latest") | medium | [P] |
148+ | MODEL-03 | Cost controls declared | medium | [X] |
149+ | MODEL-04 | Fallback retry strategy | low | [X] |
86150
87151### ` memory-hygiene `
88152
89- | Rule ID | Description | Severity |
90- | ---------| -------------| ----------|
91- | MEM-01 | PII scrub fields for long-term memory | critical |
92- | MEM-02 | TTL set for all memory backends | high |
93- | MEM-03 | Audit log enabled | medium |
94- | MEM-04 | Vector store namespace isolated | medium |
95- | MEM-05 | Short-term memory max tokens bounded | low |
153+ | Rule ID | Description | Severity | Tier |
154+ | ---------| -------------| ----------| ------ |
155+ | MEM-01 | PII scrub fields for long-term memory | **critical** | [X] |
156+ | MEM-02 | TTL set for all memory backends | high | [P] |
157+ | MEM-03 | Audit log enabled | medium | [P] |
158+ | MEM-04 | Vector store namespace isolated | medium | [P] |
159+ | MEM-05 | Short-term memory max tokens bounded | low | [P] |
96160
97161### ` evaluation-coverage `
98162
99- | Rule ID | Description | Severity |
100- | ---------| -------------| ----------|
101- | EVAL-01 | Evaluation dataset declared | medium |
102- | EVAL-02 | CI gate enabled | medium |
103- | EVAL-03 | Hallucination threshold configured | medium |
163+ | Rule ID | Description | Severity | Tier |
164+ | ---------| -------------| ----------| ------|
165+ | EVAL-01 | Evaluation dataset declared | medium | [P] |
166+ | EVAL-02 | CI gate enabled | medium | [P] |
167+ | EVAL-03 | Hallucination threshold configured | medium | [P] |
168+
169+ ### ` observability `
170+
171+ | Rule ID | Description | Severity | Tier |
172+ | ---------| -------------| ----------| ------|
173+ | OBS-01 | Tracing backend declared | medium | [P] |
174+ | OBS-02 | Structured logging enabled | low | [B] |
175+ | OBS-03 | Sensitive fields redacted from logs | medium | [X] |
104176
105177## Scoring
106178
107179- Each rule has a weight: critical=4, high=3, medium=2, low=1, info=0
108- - Score = (sum of passed weights) / (sum of total weights) × 100
180+ - **Declared score** = (sum of passed weights) / (sum of total weights) × 100
181+ - **Proved score** = (sum of proved weights) / (sum of total weights) × 100
182+ - Proved = `[P]` rules that pass + `[B]` rules observed + `[X]` rules with proof records
109183- Grades: A≥90, B≥75, C≥60, D≥45, F<45
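
The weights, formula, and grade bands listed above can be sketched as a few shell functions. This is an illustration of the documented arithmetic, not AgentSpec's implementation: the function names and the `<severity> <pass|fail>` input format are hypothetical, and truncating the score to an integer is an assumption (the real tool may round differently). For the proved score, the same calculation applies with `pass` meaning "proved".

```bash
#!/usr/bin/env bash
# Illustrative only: reimplements the documented formula, not AgentSpec's code.

# Map a severity to its documented weight.
weight_of() {
  case "$1" in
    critical) echo 4 ;;
    high)     echo 3 ;;
    medium)   echo 2 ;;
    low)      echo 1 ;;
    *)        echo 0 ;;  # info and anything unrecognised
  esac
}

# Read "<severity> <pass|fail>" lines on stdin and print the score (0-100).
# Integer truncation is an assumption about rounding.
score() {
  local sev status w earned=0 total=0
  while read -r sev status; do
    [ -z "$sev" ] && continue
    w=$(weight_of "$sev")
    total=$((total + w))
    if [ "$status" = pass ]; then earned=$((earned + w)); fi
  done
  if [ "$total" -eq 0 ]; then echo 0; else echo $((earned * 100 / total)); fi
}

# Map a score to the documented grade bands.
grade_of() {
  local s=$1
  if   [ "$s" -ge 90 ]; then echo A
  elif [ "$s" -ge 75 ]; then echo B
  elif [ "$s" -ge 60 ]; then echo C
  elif [ "$s" -ge 45 ]; then echo D
  else echo F
  fi
}
```

For example, four rules (one failing critical, plus a passing high, medium, and low) have total weight 10 and passed weight 6, so `printf 'critical fail\nhigh pass\nmedium pass\nlow pass\n' | score` prints `60`, which `grade_of 60` maps to `C`. A single failed critical rule costs as much as a failed high and low rule combined.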
110184
111185## Suppressing Rules
@@ -127,12 +201,15 @@ Suppressed rules are excluded from scoring but logged in the audit report.
127201## Running in CI
128202
129203```bash
130- # Fail CI if score drops below 70
204+ # Fail CI if declared score drops below 70
131205npx agentspec audit agent.yaml --fail-below 70
132206
133207# Run only security rules
134208npx agentspec audit agent.yaml --pack owasp-llm-top10
135209
210+ # Fetch proof records from sidecar + dual score in JSON
211+ npx agentspec audit agent.yaml --url http://localhost:4001 --json --output audit-report.json
212+
136213# Output JSON for processing
137214npx agentspec audit agent.yaml --json --output audit-report.json
138215```
@@ -149,3 +226,9 @@ spec:
149226```
150227
151228This is declarative — actual scheduling requires a cron job or CI workflow that runs ` agentspec audit`.
229+
230+ ## See also
231+
232+ - [Proof Integration Guide](../guides/proof-integration.md) — how to wire k6, LiteLLM, Promptfoo, and Presidio
233+ - [Probe Coverage](./probe-coverage.md) — field-by-field evidence tier matrix
234+ - [CLI Reference — agentspec audit](../reference/cli.md#agentspec-audit)