josephfung · May 12, 2026 · Apr 24, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -4,6 +4,14 @@ All notable changes to TrimKit are documented here.
 
 ## [0.5.1] - Unreleased
 
+### Sysops learnings
+- Per-deployment learnings persistence — the sysops agent writes a structured entry to `~/.claude/sysops/learnings.jsonl` whenever it discovers a server quirk, known-safe container, procedure deviation, or other deployment-specific context worth remembering
+- `trimkit-learnings-log` — bin script that appends a learning entry to the store; reads JSON from stdin, injects `ts`, validates required fields and type enum, and writes atomically
+- `trimkit-learnings-search` — bin script that reads the store, deduplicates by `(deployment, key)` pair (latest entry per pair wins), filters by deployment, and outputs JSONL or formatted text (`--human`)
+- `trimkit-sysops-log-search` — bin script extracted from SKILL.md that reads `audit.jsonl`, filters by deployment and entry count (`--last N`), and outputs JSONL or formatted text (`--human`)
+- `/sysops learnings` sub-command — view stored learnings for all deployments or a specific one
+- SKILL.md refactored from ~170 lines of inline Python to a ~30-line routing layer that delegates to the bin scripts
+
 ### CLAUDE.md guidance
 - Pull before branching — injected instruction to run `git pull --ff-only` before creating worktrees or branches
 - Issue tracker hygiene — injected instructions to apply pre-existing labels and include acceptance criteria when creating or editing issues
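
The "writes atomically" wording for `trimkit-learnings-log` rests on each entry being appended as one short line in append mode. The following is a minimal Python sketch of that idea, not the script itself (which uses `printf ... >>` in bash); the helper name `append_line` is hypothetical.

```python
import os


def append_line(path: str, line: str) -> None:
    """Append one newline-terminated line using a single write() call."""
    data = (line.rstrip("\n") + "\n").encode("utf-8")
    # O_APPEND asks the kernel to position every write at the current end of
    # file, so two writers do not overwrite each other's entries.
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        # Assumption: the entry is small, so one write() call covers it.
        os.write(fd, data)
    finally:
        os.close(fd)
```

Keeping each entry to a single short line is what makes this pattern reasonable for a JSONL store; larger payloads or network filesystems would need locking instead.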

diff --git a/agents/sysops.md b/agents/sysops.md
@@ -33,6 +33,8 @@ basename "$PWD"
 ```
 Note the output (e.g. `curia`). This is your PROJECT.
 
+Before working on each deployment, load its stored learnings (see each section below). At the end of each deployment's work, write any learnings you discovered.
+
 ## Intent detection
 
 Determine what the user wants based on how they invoked you:
@@ -46,6 +48,20 @@ For maintenance, determine scope:
 
 ## Status check
 
+### Load deployment learnings
+
+Before running checks on each deployment, query its stored learnings. Use the literal deployment name (e.g. `Pulse`):
+
+```bash
+if command -v trimkit-learnings-search > /dev/null 2>&1; then trimkit-learnings-search --deployment "<deployment_name>"; fi
+```
+
+If any learnings are returned, surface them briefly before the status report under a **Known quirks** heading, so they inform your interpretation of results. If the command produces output on stderr, include it as a warning under that heading.
+
+If `trimkit-learnings-search` is not on PATH, skip silently. If it is found but exits non-zero, note `(learnings unavailable — error loading store)` before the status report and continue; do not treat this as fatal.
+
+### Checks
+
 For each deployment, run these commands via SSH and collect results:
 
 ```bash
@@ -153,10 +169,66 @@ rm -f /tmp/trimkit-sysops-entry.json
 
 If `trimkit-sysops-log` is not found (trimkit not installed), skip logging and note it in the report output: `(audit log skipped — trimkit-sysops-log not on PATH)`. Non-fatal; must not block the report.
 
+### Write learnings (if applicable)
+
+If you observed something worth persisting for future sessions, write a learning entry. This does not require prompting the user — just write it and note `(learning recorded: <key>)` at the bottom of the report. Only write learnings that would be useful context in a future session; skip routine, uneventful checks.
+
+Triggers that warrant a learning:
+- User confirmed an unregistered container as known-safe (A) → `type: "known-safe"`, `confidence: 1.0`
+- A container was consistently unhealthy in a way that appears server-specific → `type: "quirk"`, `confidence: 0.8`
+- A persistent server condition (unusual disk state, always-high memory, etc.) that is expected and normal for this server → `type: "known-safe"`, `confidence: 0.9`
+
+Use a stable kebab-case `key` that describes the observation (e.g. `redis-cache-known-safe`, `high-mem-expected-on-pulse`).
+
+**Call 1** — build the JSON entry. Substitute all placeholders with actual values, including the **literal** SESSION UUID you captured above (not a shell variable — it does not persist across Bash calls). The session UUID in the temp file path prevents collision with concurrent agent invocations:
+```bash
+python3 -c "
+import json, sys
+print(json.dumps({
+    'deployment': sys.argv[1],
+    'key':        sys.argv[2],
+    'type':       sys.argv[3],
+    'insight':    sys.argv[4],
+    'confidence': float(sys.argv[5]),
+    'source':     'observed'
+}))
+" \
+  "<deployment name, e.g. Pulse>" \
+  "<kebab-case key, e.g. redis-cache-known-safe>" \
+  "<type: quirk|known-safe|procedure|warning>" \
+  "<human-readable insight describing what was learned>" \
+  "<confidence 0.0–1.0, e.g. 0.9>" \
+  > /tmp/trimkit-sysops-learning-<literal SESSION UUID, e.g. a1b2c3d4-e5f6-7890-abcd-ef1234567890>.json
+```
+
+**Call 2** — append to the learnings store. Only run if Call 1 exited successfully. Substitute the same literal SESSION UUID in the temp file path:
+```bash
+if command -v trimkit-learnings-log > /dev/null 2>&1; then trimkit-learnings-log < /tmp/trimkit-sysops-learning-<literal SESSION UUID>.json; fi
+```
+- If `trimkit-learnings-log` is not on PATH, note `(learning write skipped — trimkit-learnings-log not on PATH)` in the report instead of `(learning recorded: <key>)`.
+- If `trimkit-learnings-log` is found but returns a non-zero exit code, note `(learning write failed: <key>)` in the report instead of `(learning recorded: <key>)`.
+
+**Call 3** — clean up. Substitute the same literal SESSION UUID:
+```bash
+rm -f /tmp/trimkit-sysops-learning-<literal SESSION UUID>.json
+```
+
 ## Maintenance
 
 Run in sequence for each targeted deployment. Do not proceed to the next deployment until the current one is fully complete (or has failed).
 
+### 0. Load deployment learnings
+
+Before running maintenance on each deployment, query its stored learnings. Use the literal deployment name:
+
+```bash
+if command -v trimkit-learnings-search > /dev/null 2>&1; then trimkit-learnings-search --deployment "<deployment_name>"; fi
+```
+
+If any learnings are returned, surface them under a **Known quirks** heading before beginning work — they may describe procedures or quirks that affect how you should run maintenance. If the command produces output on stderr, include it as a warning under that heading.
+
+If `trimkit-learnings-search` is not on PATH, skip silently. If it is found but exits non-zero, note `(learnings unavailable — error loading store)` before the maintenance output and continue; do not treat this as fatal.
+
 ### 1. Apply updates
 
 ```bash
@@ -202,7 +274,20 @@ Containers after: <name: status> for each
 Notes: <any failures or issues>
 ```
 
-### 6. Write audit log entry
+### 6. Write learnings (if applicable)
+
+If you discovered something worth persisting for future sessions, write a learning entry — silently, without prompting the user. Note `(learning recorded: <key>)` in the maintenance report for that deployment. Skip if nothing notable was observed.
+
+Triggers that warrant a learning during maintenance:
+- A container did not auto-recover after reboot and required manual restart → `type: "quirk"`, `confidence: 0.9`
+- A package upgrade caused unexpected service behaviour → `type: "warning"`, `confidence: 0.8`
+- A procedure deviated from the standard playbook in a way that should be repeated next time → `type: "procedure"`, `confidence: 0.9`
+- User confirmed an unregistered container as known-safe (A) → `type: "known-safe"`, `confidence: 1.0`
+- Any other server-specific quirk that will save time in a future session → `type: "quirk"`, `confidence: 0.7–0.9`
+
+Use the same 3-call write pattern as in the status check's learnings section: same JSON schema, same `trimkit-learnings-log` invocation, same session-scoped temp file path (`/tmp/trimkit-sysops-learning-<literal SESSION UUID>.json`), same cleanup, and same Call 2 failure handling (`(learning write failed: <key>)` on non-zero exit, `(learning write skipped — trimkit-learnings-log not on PATH)` if tool not on PATH).
+
+### 7. Write audit log entry
 
 After completing maintenance on a deployment (success or failure), write an audit entry. Run these three commands in sequence:
 

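
The three-call write pattern in `agents/sysops.md` (build the entry, append it if the tool is available, clean up) can be sketched in Python as follows. This is an illustration only: the agent runs these steps as separate Bash calls, and the function name `record_learning`, its `tmp_dir` parameter, and the short status strings are assumptions made for this sketch.

```python
import json
import os
import shutil
import subprocess


def record_learning(entry: dict, session_id: str,
                    tool: str = "trimkit-learnings-log",
                    tmp_dir: str = "/tmp") -> str:
    """Return 'recorded', 'failed', or 'skipped' for the report note."""
    # The session id in the file name keeps concurrent agent runs apart.
    path = os.path.join(tmp_dir, f"trimkit-sysops-learning-{session_id}.json")
    # Call 1: build the JSON entry in a session-scoped temp file.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(entry, f)
    try:
        # Call 2: append to the store only if the tool is on PATH.
        if shutil.which(tool) is None:
            return "skipped"
        with open(path, "rb") as f:
            result = subprocess.run([tool], stdin=f)
        return "recorded" if result.returncode == 0 else "failed"
    finally:
        # Call 3: remove the temp file whatever the outcome.
        os.remove(path)
```

The three return values correspond to the three report notes described above: `(learning recorded: <key>)`, `(learning write failed: <key>)`, and the "write skipped" note.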
diff --git a/bin/trimkit-learnings-log b/bin/trimkit-learnings-log
@@ -0,0 +1,82 @@
+#!/usr/bin/env bash
+# trimkit-learnings-log — Append a sysops learning entry to the persistent JSONL store.
+#
+# Reads a JSON object from stdin, injects a "ts" field (current UTC timestamp),
+# and appends the entry as a single line to the learnings log.
+#
+# Required fields in the JSON input:
+#   deployment  string    deployment name (e.g. "Pulse")
+#   key         string    stable kebab-case identifier for this learning (e.g. "caddy-restart-required-after-upgrade")
+#   type        string    one of: quirk, known-safe, procedure, warning
+#   insight     string    human-readable description of what was learned
+#   confidence  number    0.0–1.0; how confident the agent is in this learning
+#   source      string    how the learning was discovered (e.g. "observed", "inferred")
+#
+# Deduplication is handled at read time by trimkit-learnings-search: the latest
+# entry for a given (deployment, key) pair wins. The same key on different
+# deployments is kept as independent entries. This script never modifies existing entries.
+#
+# The single printf append (>>) is effectively atomic for short one-line entries on local
+# filesystems; note that PIPE_BUF is a pipe limit, not a guarantee for regular files.
+#
+# Environment overrides (for testing):
+#   TRIMKIT_SYSOPS_LEARNINGS_DIR   override log directory   (default: ~/.claude/sysops)
+#   TRIMKIT_SYSOPS_LEARNINGS_FILE  override log file path   (default: $DIR/learnings.jsonl)
+#
+# Usage:
+#   echo '{"deployment":"Pulse","key":"caddy-restart","type":"quirk",...}' | trimkit-learnings-log
+set -euo pipefail
+
+LEARNINGS_DIR="${TRIMKIT_SYSOPS_LEARNINGS_DIR:-${HOME}/.claude/sysops}"
+LEARNINGS_FILE="${TRIMKIT_SYSOPS_LEARNINGS_FILE:-${LEARNINGS_DIR}/learnings.jsonl}"
+
+# Read JSON from stdin
+json="$(cat)"
+
+if [ -z "$json" ]; then
+  echo "trimkit-learnings-log: error: no JSON received on stdin" >&2
+  exit 1
+fi
+
+# Current UTC timestamp
+ts="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
+
+# Inject ts and validate required fields — pipe json into python3 via stdin to
+# avoid shell-argument quoting issues and ARG_MAX limits with large payloads.
+# Validated fields (insight, source, and confidence are also checked):
+#   key        — dedup anchor; an empty key corrupts the (deployment, key) dedup bucket
+#   deployment — scopes dedup; an empty value silently files the entry under no deployment
+#   type       — must be a known value to keep the store queryable and meaningful
+entry="$(printf '%s' "$json" | python3 -c "
+import json, sys
+try:
+    obj = json.load(sys.stdin)
+except (json.JSONDecodeError, ValueError) as e:
+    print(f'trimkit-learnings-log: error: invalid JSON on stdin: {e}', file=sys.stderr)
+    sys.exit(1)
+for field in ('deployment', 'key', 'type', 'insight', 'source'):
+    if not obj.get(field):
+        print(f'trimkit-learnings-log: error: \"{field}\" field is required and must be non-empty', file=sys.stderr)
+        sys.exit(1)
+valid_types = {'quirk', 'known-safe', 'procedure', 'warning'}
+if obj['type'] not in valid_types:
+    print(f'trimkit-learnings-log: error: \"type\" must be one of {sorted(valid_types)!r}, got {obj[\"type\"]!r}', file=sys.stderr)
+    sys.exit(1)
+conf = obj.get('confidence')
+if (
+    conf is None
+    or isinstance(conf, bool)
+    or not isinstance(conf, (int, float))
+    or not (0.0 <= conf <= 1.0)
+):
+    print(f'trimkit-learnings-log: error: \"confidence\" must be a number between 0.0 and 1.0, got {conf!r}', file=sys.stderr)
+    sys.exit(1)
+obj['ts'] = sys.argv[1]
+print(json.dumps(obj, separators=(',', ':')))
+" "$ts")"
+
+# Create log directory if needed
+mkdir -p "$LEARNINGS_DIR" || { echo "trimkit-learnings-log: error: cannot create log directory '$LEARNINGS_DIR'" >&2; exit 1; }
+
+# Append: a single short write in append mode, so concurrent entries should not interleave
+printf '%s\n' "$entry" >> "$LEARNINGS_FILE" || { echo "trimkit-learnings-log: error: cannot write to '$LEARNINGS_FILE'" >&2; exit 1; }
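
For readers who want to exercise the validation rules outside the shell wrapper, here is a sketch of the same checks as a standalone Python function. It mirrors the inline Python in `trimkit-learnings-log` but raises `ValueError` instead of printing and exiting; the function name `validate_entry` and the explicit "must be an object" guard are additions for this sketch, not part of the script.

```python
import json

VALID_TYPES = {"quirk", "known-safe", "procedure", "warning"}


def validate_entry(raw: str, ts: str) -> str:
    """Validate a learning entry and return it as one compact JSON line."""
    obj = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(obj, dict):
        # Extra guard (assumption): the script itself does not check this.
        raise ValueError("top-level JSON value must be an object")
    for field in ("deployment", "key", "type", "insight", "source"):
        if not obj.get(field):
            raise ValueError(f'"{field}" field is required and must be non-empty')
    if obj["type"] not in VALID_TYPES:
        raise ValueError(f'"type" must be one of {sorted(VALID_TYPES)!r}')
    conf = obj.get("confidence")
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if (conf is None or isinstance(conf, bool)
            or not isinstance(conf, (int, float))
            or not (0.0 <= conf <= 1.0)):
        raise ValueError('"confidence" must be a number between 0.0 and 1.0')
    obj["ts"] = ts
    return json.dumps(obj, separators=(",", ":"))
```

The explicit `bool` check matters because `True` would otherwise pass as the number 1 and be stored as a confidence value.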
diff --git a/bin/trimkit-learnings-search b/bin/trimkit-learnings-search
@@ -0,0 +1,144 @@
+#!/usr/bin/env bash
+# trimkit-learnings-search — Query the sysops learnings store.
+#
+# Reads ~/.claude/sysops/learnings.jsonl, deduplicates entries by (deployment, key)
+# pair (latest entry per pair wins), optionally filters by deployment, and outputs
+# the surviving entries to stdout.
+#
+# Usage:
+#   trimkit-learnings-search                         # JSONL output (deduplicated)
+#   trimkit-learnings-search --deployment Pulse      # filter to Pulse deployment
+#   trimkit-learnings-search --human                 # human-readable formatted output
+#   trimkit-learnings-search --human --deployment Pulse
+#
+# Arguments:
+#   --deployment <name>   Case-insensitive deployment filter (optional)
+#   --human               Output human-readable formatted text instead of JSONL
+#
+# Output:
+#   Default: One JSON object per line (JSONL), latest-wins deduplicated.
+#   --human: Formatted text with deployment, type, key, confidence, source, timestamp,
+#            and insight for each entry. Shows "No learnings found" when empty.
+#   Exits 0 with no output (JSONL) or a message (--human) if no matching entries exist
+#   or if the store doesn't exist yet.
+#
+# Environment overrides (for testing):
+#   TRIMKIT_SYSOPS_LEARNINGS_DIR   override log directory   (default: ~/.claude/sysops)
+#   TRIMKIT_SYSOPS_LEARNINGS_FILE  override log file path   (default: $DIR/learnings.jsonl)
+set -euo pipefail
+
+LEARNINGS_DIR="${TRIMKIT_SYSOPS_LEARNINGS_DIR:-${HOME}/.claude/sysops}"
+LEARNINGS_FILE="${TRIMKIT_SYSOPS_LEARNINGS_FILE:-${LEARNINGS_DIR}/learnings.jsonl}"
+
+# Parse arguments
+deployment_filter=""
+human_output="false"
+while [ $# -gt 0 ]; do
+  case "$1" in
+    --deployment)
+      # Require a non-empty value following the flag
+      if [ $# -lt 2 ] || [ -z "${2:-}" ]; then
+        echo "trimkit-learnings-search: error: --deployment requires an argument" >&2
+        echo "Usage: trimkit-learnings-search [--deployment <name>] [--human]" >&2
+        exit 1
+      fi
+      shift
+      deployment_filter="$1"
+      ;;
+    --human)
+      human_output="true"
+      ;;
+    *)
+      echo "trimkit-learnings-search: error: unknown argument '$1'" >&2
+      echo "Usage: trimkit-learnings-search [--deployment <name>] [--human]" >&2
+      exit 1
+      ;;
+  esac
+  shift
+done
+
+if [ ! -f "$LEARNINGS_FILE" ]; then
+  # Not an error — the store simply hasn't been written to yet
+  if [ "$human_output" = "true" ]; then
+    echo "No learnings stored yet at $LEARNINGS_FILE"
+    echo "Learnings are written by the sysops agent when it discovers server quirks."
+  fi
+  exit 0
+fi
+
+# Read, deduplicate by (deployment, key) pair (latest entry wins), filter by
+# deployment, and output surviving entries (JSONL or human-readable).
+# Dedup strategy: scan all lines in order; for each (deployment, key) pair, keep
+# overwriting with the latest seen entry. Scoping dedup to deployment prevents a
+# key written for one deployment from suppressing the same key on another.
+python3 -c "
+import json, sys
+
+learnings_file = sys.argv[1]
+deployment_filter = sys.argv[2]  # empty string means no filter
+human_output = sys.argv[3] == 'true'
+
+seen = {}    # (deployment, key) -> entry; latest entry per pair wins
+order = []   # insertion-order list of (deployment, key) pairs, first-seen only
+
+corrupt_count = 0
+try:
+    with open(learnings_file) as f:
+        for line in f:
+            line = line.strip()
+            if not line:
+                continue
+            try:
+                obj = json.loads(line)
+                dedup_key = (obj.get('deployment', ''), obj.get('key', ''))
+            except (json.JSONDecodeError, AttributeError):  # AttributeError: JSON value is not an object
+                corrupt_count += 1
+                continue
+            if dedup_key not in seen:
+                order.append(dedup_key)
+            seen[dedup_key] = obj
+except PermissionError:
+    print(f'trimkit-learnings-search: error: cannot read {learnings_file} (permission denied)', file=sys.stderr)
+    sys.exit(1)
+except OSError as e:
+    print(f'trimkit-learnings-search: error: cannot read learnings file: {e}', file=sys.stderr)
+    sys.exit(1)
+
+if corrupt_count:
+    print(f'trimkit-learnings-search: warning: {corrupt_count} corrupt line(s) skipped', file=sys.stderr)
+
+# Collect filtered entries in first-seen order
+entries = []
+for dedup_key in order:
+    entry = seen[dedup_key]
+    if deployment_filter and entry.get('deployment', '').lower() != deployment_filter.lower():
+        continue
+    entries.append(entry)
+
+if human_output:
+    if not entries:
+        msg = 'No learnings found'
+        if deployment_filter:
+            msg += f' for deployment {deployment_filter!r}'
+        print(msg + '.')
+        if corrupt_count:
+            print(f'Warning: {corrupt_count} corrupt line(s) skipped in learnings store.')
+        sys.exit(0)
+    for e in entries:
+        ts         = e.get('ts', 'unknown')
+        dep        = e.get('deployment', 'unknown')
+        key        = e.get('key', '?')
+        ltype      = e.get('type', '?')
+        insight    = e.get('insight', '')
+        confidence = e.get('confidence', '?')
+        source     = e.get('source', '?')
+        print(f'{dep}  [{ltype}]  {key}  (confidence: {confidence}, source: {source}, recorded: {ts})')
+        if insight:
+            print(f'  {insight}')
+        print()
+    if corrupt_count:
+        print(f'Warning: {corrupt_count} corrupt line(s) skipped in learnings store.')
+else:
+    for e in entries:
+        print(json.dumps(e, separators=(',', ':')))
+" "$LEARNINGS_FILE" "$deployment_filter" "$human_output"
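
The read-time deduplication in `trimkit-learnings-search` can be illustrated in isolation. The sketch below assumes the same rules described in the script's comments (latest entry per `(deployment, key)` pair wins, output in first-seen order, case-insensitive deployment filter); the function name `latest_per_pair` is hypothetical, and unlike the script it silently skips unparseable lines rather than counting them.

```python
import json


def latest_per_pair(lines, deployment=None):
    """Return the latest entry per (deployment, key), in first-seen order."""
    seen = {}   # (deployment, key) -> most recent entry
    order = []  # pairs in the order they first appeared
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip corrupt lines
        if not isinstance(obj, dict):
            continue  # skip JSON values that are not objects
        pair = (obj.get("deployment", ""), obj.get("key", ""))
        if pair not in seen:
            order.append(pair)
        seen[pair] = obj  # later entries overwrite earlier ones
    entries = [seen[pair] for pair in order]
    if deployment:
        wanted = deployment.lower()
        entries = [e for e in entries
                   if e.get("deployment", "").lower() == wanted]
    return entries
```

Because the key includes the deployment, an update to `caddy-restart` on one deployment leaves an entry with the same key on another deployment untouched.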