8 changes: 8 additions & 0 deletions CHANGELOG.md
@@ -4,6 +4,14 @@ All notable changes to TrimKit are documented here.

## [0.5.1] - Unreleased

### Sysops learnings
- Per-deployment learnings persistence — the sysops agent writes a structured entry to `~/.claude/sysops/learnings.jsonl` whenever it discovers a server quirk, known-safe container, procedure deviation, or other deployment-specific context worth remembering
- `trimkit-learnings-log` — bin script that appends a learning entry to the store; reads JSON from stdin, injects `ts`, validates required fields and type enum, and writes atomically
- `trimkit-learnings-search` — bin script that reads the store, deduplicates by `(deployment, key)` pair (latest entry per pair wins), filters by deployment, and outputs JSONL or formatted text (`--human`)
- `trimkit-sysops-log-search` — bin script extracted from SKILL.md that reads `audit.jsonl`, filters by deployment and entry count (`--last N`), and outputs JSONL or formatted text (`--human`)
- `/sysops learnings` sub-command — view stored learnings for all deployments or a specific one
- SKILL.md refactored from ~170 lines of inline Python to a ~30-line routing layer that delegates to the bin scripts
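
The latest-wins rule described for `trimkit-learnings-search` can be sketched in a few lines of Python. This is an illustration, not the script itself: `dedupe_latest` is a hypothetical helper name, the sample entries are invented, and the check that each line decodes to a JSON object is an added assumption.

```python
import json


def dedupe_latest(lines):
    """Keep the last entry seen for each (deployment, key) pair.

    A dict keeps the position of the first insertion when a value is
    overwritten, so results come back in first-seen order.
    """
    latest = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip corrupt lines instead of failing
        if not isinstance(entry, dict):
            continue  # assumption: only JSON objects are valid entries
        pair = (entry.get("deployment", ""), entry.get("key", ""))
        latest[pair] = entry
    return list(latest.values())


if __name__ == "__main__":
    sample = [
        '{"deployment": "Pulse", "key": "caddy-restart", "insight": "first note"}',
        '{"deployment": "Curia", "key": "caddy-restart", "insight": "other server"}',
        '{"deployment": "Pulse", "key": "caddy-restart", "insight": "revised note"}',
    ]
    for entry in dedupe_latest(sample):
        print(json.dumps(entry))
```

Because the pair includes the deployment, the same key written for two deployments survives as two independent entries.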

### CLAUDE.md guidance
- Pull before branching — injected instruction to run `git pull --ff-only` before creating worktrees or branches
- Issue tracker hygiene — injected instructions to apply pre-existing labels and include acceptance criteria when creating or editing issues
87 changes: 86 additions & 1 deletion agents/sysops.md
@@ -33,6 +33,8 @@ basename "$PWD"
```
Note the output (e.g. `curia`). This is your PROJECT.

Before working on each deployment, load its stored learnings (see each section below). At the end of each deployment's work, write any learnings you discovered.

## Intent detection

Determine what the user wants based on how they invoked you:
@@ -46,6 +48,20 @@ For maintenance, determine scope:

## Status check

### Load deployment learnings

Before running checks on each deployment, query its stored learnings. Use the literal deployment name (e.g. `Pulse`):

```bash
if command -v trimkit-learnings-search > /dev/null 2>&1; then trimkit-learnings-search --deployment "<deployment_name>"; fi
```

If any learnings are returned, surface them briefly before the status report under a **Known quirks** heading, so they inform your interpretation of results. If the command produces output on stderr, include it as a warning under that heading.

If `trimkit-learnings-search` is not on PATH, skip silently. If it is found but exits non-zero, note `(learnings unavailable — error loading store)` before the status report and continue; do not treat this as fatal.

### Checks

For each deployment, run these commands via SSH and collect results:

```bash
@@ -153,10 +169,66 @@ rm -f /tmp/trimkit-sysops-entry.json

If `trimkit-sysops-log` is not found (trimkit not installed), skip logging and note it in the report output: `(audit log skipped — trimkit-sysops-log not on PATH)`. Non-fatal; must not block the report.

### Write learnings (if applicable)

If you observed something worth persisting for future sessions, write a learning entry. This does not require prompting the user — just write it and note `(learning recorded: <key>)` at the bottom of the report. Only write learnings that would be useful context in a future session; skip routine, uneventful checks.

Triggers that warrant a learning:
- User confirmed an unregistered container as known-safe (A) → `type: "known-safe"`, `confidence: 1.0`
- A container was consistently unhealthy in a way that appears server-specific → `type: "quirk"`, `confidence: 0.8`
- A persistent server condition (unusual disk state, always-high memory, etc.) that is expected and normal for this server → `type: "known-safe"`, `confidence: 0.9`

Use a stable kebab-case `key` that describes the observation (e.g. `redis-cache-known-safe`, `high-mem-expected-on-pulse`).

**Call 1** — build the JSON entry. Substitute all placeholders with actual values, including the **literal** SESSION UUID you captured above (not a shell variable — it does not persist across Bash calls). The session UUID in the temp file path prevents collision with concurrent agent invocations:
```bash
python3 -c "
import json, sys
print(json.dumps({
    'deployment': sys.argv[1],
    'key': sys.argv[2],
    'type': sys.argv[3],
    'insight': sys.argv[4],
    'confidence': float(sys.argv[5]),
    'source': 'observed'
}))
" \
"<deployment name, e.g. Pulse>" \
"<kebab-case key, e.g. redis-cache-known-safe>" \
"<type: quirk|known-safe|procedure|warning>" \
"<human-readable insight describing what was learned>" \
"<confidence 0.0–1.0, e.g. 0.9>" \
> /tmp/trimkit-sysops-learning-<literal SESSION UUID, e.g. a1b2c3d4-e5f6-7890-abcd-ef1234567890>.json
```

**Call 2** — append to the learnings store. Only run if Call 1 exited successfully. Substitute the same literal SESSION UUID in the temp file path:
```bash
if command -v trimkit-learnings-log > /dev/null 2>&1; then trimkit-learnings-log < /tmp/trimkit-sysops-learning-<literal SESSION UUID>.json; fi
```
- If `trimkit-learnings-log` is not on PATH, note `(learning write skipped — trimkit-learnings-log not on PATH)` in the report instead of `(learning recorded: <key>)`.
- If `trimkit-learnings-log` is found but returns a non-zero exit code, note `(learning write failed: <key>)` in the report instead of `(learning recorded: <key>)`.

**Call 3** — clean up. Substitute the same literal SESSION UUID:
```bash
rm -f /tmp/trimkit-sysops-learning-<literal SESSION UUID>.json
```
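
Before piping an entry to `trimkit-learnings-log`, it can help to know what the script will reject. The sketch below restates the checks the script performs (required non-empty fields, the `type` enum, and a numeric `confidence` between 0.0 and 1.0 that is not a boolean). `validation_error` is a hypothetical helper name, and the check that the entry is a JSON object is an added assumption rather than something the script does.

```python
REQUIRED_FIELDS = ("deployment", "key", "type", "insight", "source")
VALID_TYPES = ("quirk", "known-safe", "procedure", "warning")


def validation_error(entry):
    """Return a message for the first problem found, or None if the entry passes."""
    if not isinstance(entry, dict):
        return "entry must be a JSON object"  # assumption: not checked by the script
    for field in REQUIRED_FIELDS:
        if not entry.get(field):
            return f'"{field}" field is required and must be non-empty'
    if entry["type"] not in VALID_TYPES:
        return f'"type" must be one of {sorted(VALID_TYPES)}'
    conf = entry.get("confidence")
    # bool is a subclass of int in Python, so it has to be excluded explicitly.
    if (
        isinstance(conf, bool)
        or not isinstance(conf, (int, float))
        or not (0.0 <= conf <= 1.0)
    ):
        return '"confidence" must be a number between 0.0 and 1.0'
    return None


if __name__ == "__main__":
    candidate = {
        "deployment": "Pulse",
        "key": "redis-cache-known-safe",
        "type": "known-safe",
        "insight": "User confirmed the redis-cache container is expected.",
        "confidence": 1.0,
        "source": "observed",
    }
    print(validation_error(candidate))
```

The `ts` field is deliberately absent from the candidate: the script injects it at write time.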

## Maintenance

Run in sequence for each targeted deployment. Do not proceed to the next deployment until the current one is fully complete (or has failed).

### 0. Load deployment learnings

Before running maintenance on each deployment, query its stored learnings. Use the literal deployment name:

```bash
if command -v trimkit-learnings-search > /dev/null 2>&1; then trimkit-learnings-search --deployment "<deployment_name>"; fi
```

If any learnings are returned, surface them under a **Known quirks** heading before beginning work — they may describe procedures or quirks that affect how you should run maintenance. If the command produces output on stderr, include it as a warning under that heading.

If `trimkit-learnings-search` is not on PATH, skip silently. If it is found but exits non-zero, note `(learnings unavailable — error loading store)` before the maintenance output and continue; do not treat this as fatal.
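
The three outcomes described above (tool missing, tool failing, tool succeeding) can be sketched as a small Python wrapper. This is an illustration only: the agent itself runs the bash one-liner, and `load_learnings` and its `tool` parameter are hypothetical names introduced here.

```python
import shutil
import subprocess


def load_learnings(deployment, tool="trimkit-learnings-search"):
    """Query stored learnings for one deployment.

    Returns (status, stdout, stderr), where status is:
      "skipped"  the tool is not on PATH (skip silently)
      "error"    the tool ran but exited non-zero (note it, then continue)
      "ok"       the tool exited 0; stdout holds the JSONL entries
    """
    path = shutil.which(tool)
    if path is None:
        return ("skipped", "", "")
    proc = subprocess.run(
        [path, "--deployment", deployment],
        capture_output=True,
        text=True,
        check=False,  # a non-zero exit is reported to the caller, not raised
    )
    status = "ok" if proc.returncode == 0 else "error"
    return (status, proc.stdout, proc.stderr)


if __name__ == "__main__":
    status, out, err = load_learnings("Pulse")
    print(status)
```

A caller would print `out` under the **Known quirks** heading when the status is `ok`, add any text in `err` as a warning under the same heading, and treat `error` as non-fatal.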

### 1. Apply updates

```bash
@@ -202,7 +274,20 @@ Containers after: <name: status> for each
Notes: <any failures or issues>
```

### 6. Write audit log entry
### 6. Write learnings (if applicable)

If you discovered something worth persisting for future sessions, write a learning entry — silently, without prompting the user. Note `(learning recorded: <key>)` in the maintenance report for that deployment. Skip if nothing notable was observed.

Triggers that warrant a learning during maintenance:
- A container did not auto-recover after reboot and required manual restart → `type: "quirk"`, `confidence: 0.9`
- A package upgrade caused unexpected service behaviour → `type: "warning"`, `confidence: 0.8`
- A procedure deviated from the standard playbook in a way that should be repeated next time → `type: "procedure"`, `confidence: 0.9`
- User confirmed an unregistered container as known-safe (A) → `type: "known-safe"`, `confidence: 1.0`
- Any other server-specific quirk that will save time in a future session → `type: "quirk"`, `confidence: 0.7–0.9`

Use the same 3-call write pattern as in the status check's learnings section: same JSON schema, same `trimkit-learnings-log` invocation, same session-scoped temp file path (`/tmp/trimkit-sysops-learning-<literal SESSION UUID>.json`), same cleanup, and same Call 2 failure handling (`(learning write failed: <key>)` on non-zero exit, `(learning write skipped — trimkit-learnings-log not on PATH)` if tool not on PATH).
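
As a rough illustration of how the maintenance triggers map onto entry fields, the sketch below pairs each trigger with the `type` and `confidence` listed above. The trigger identifiers, `MAINTENANCE_DEFAULTS`, and `build_entry` are hypothetical names; the catch-all quirk trigger is omitted because its confidence is a range (0.7–0.9) that calls for judgment.

```python
import json

# Hypothetical trigger identifiers; type and confidence values follow the list above.
MAINTENANCE_DEFAULTS = {
    "container-needed-manual-restart": ("quirk", 0.9),
    "upgrade-caused-unexpected-behaviour": ("warning", 0.8),
    "procedure-deviation-to-repeat": ("procedure", 0.9),
    "user-confirmed-known-safe": ("known-safe", 1.0),
}


def build_entry(trigger, deployment, key, insight):
    """Build the JSON object that would be piped to trimkit-learnings-log."""
    learning_type, confidence = MAINTENANCE_DEFAULTS[trigger]
    return {
        "deployment": deployment,
        "key": key,
        "type": learning_type,
        "insight": insight,
        "confidence": confidence,
        "source": "observed",
    }


if __name__ == "__main__":
    entry = build_entry(
        "container-needed-manual-restart",
        "Pulse",
        "caddy-restart-required-after-upgrade",
        "Caddy did not come back after reboot and needed a manual restart.",
    )
    print(json.dumps(entry))
```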

### 7. Write audit log entry

After completing maintenance on a deployment (success or failure), write an audit entry. Run these three commands in sequence:

82 changes: 82 additions & 0 deletions bin/trimkit-learnings-log
@@ -0,0 +1,82 @@
#!/usr/bin/env bash
# trimkit-learnings-log — Append a sysops learning entry to the persistent JSONL store.
#
# Reads a JSON object from stdin, injects a "ts" field (current UTC timestamp),
# and appends the entry as a single line to the learnings log.
#
# Required fields in the JSON input:
#   deployment  string  deployment name (e.g. "Pulse")
#   key         string  stable kebab-case identifier for this learning (e.g. "caddy-restart-required-after-upgrade")
#   type        string  one of: quirk, known-safe, procedure, warning
#   insight     string  human-readable description of what was learned
#   confidence  number  0.0–1.0; how confident the agent is in this learning
#   source      string  how the learning was discovered (e.g. "observed", "inferred")
#
# Deduplication is handled at read time by trimkit-learnings-search: the latest
# entry for a given (deployment, key) pair wins. The same key on different
# deployments is kept as independent entries. This script never modifies existing entries.
#
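# The entry is appended with a single printf to a file opened in append mode
# (O_APPEND), so each write lands at the current end of the file. Entries are
# one short line each, so concurrent writers are unlikely to interleave on a
# local filesystem. Note that the PIPE_BUF atomicity guarantee applies to pipes
# and FIFOs, not to regular files, so this is a practical property, not a
# POSIX guarantee.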
#
# Environment overrides (for testing):
# TRIMKIT_SYSOPS_LEARNINGS_DIR override log directory (default: ~/.claude/sysops)
# TRIMKIT_SYSOPS_LEARNINGS_FILE override log file path (default: $DIR/learnings.jsonl)
#
# Usage:
# echo '{"deployment":"Pulse","key":"caddy-restart","type":"quirk",...}' | trimkit-learnings-log
set -euo pipefail

LEARNINGS_DIR="${TRIMKIT_SYSOPS_LEARNINGS_DIR:-${HOME}/.claude/sysops}"
LEARNINGS_FILE="${TRIMKIT_SYSOPS_LEARNINGS_FILE:-${LEARNINGS_DIR}/learnings.jsonl}"

# Read JSON from stdin
json="$(cat)"

if [ -z "$json" ]; then
  echo "trimkit-learnings-log: error: no JSON received on stdin" >&2
  exit 1
fi

# Current UTC timestamp
ts="$(date -u +%Y-%m-%dT%H:%M:%SZ)"

# Inject ts and validate required fields — pipe json into python3 via stdin to
# avoid shell-argument quoting issues and ARG_MAX limits with large payloads.
# Validated fields:
# key — dedup anchor; an empty key corrupts the (deployment, key) dedup bucket
# deployment — scopes dedup; an empty value silently files the entry under no deployment
# type — must be a known value to keep the store queryable and meaningful
entry="$(printf '%s' "$json" | python3 -c "
import json, sys
try:
    obj = json.load(sys.stdin)
except (json.JSONDecodeError, ValueError) as e:
    print(f'trimkit-learnings-log: error: invalid JSON on stdin: {e}', file=sys.stderr)
    sys.exit(1)
for field in ('deployment', 'key', 'type', 'insight', 'source'):
    if not obj.get(field):
        print(f'trimkit-learnings-log: error: \"{field}\" field is required and must be non-empty', file=sys.stderr)
        sys.exit(1)
valid_types = {'quirk', 'known-safe', 'procedure', 'warning'}
if obj['type'] not in valid_types:
    print(f'trimkit-learnings-log: error: \"type\" must be one of {sorted(valid_types)!r}, got {obj[\"type\"]!r}', file=sys.stderr)
    sys.exit(1)
conf = obj.get('confidence')
if (
    conf is None
    or isinstance(conf, bool)
    or not isinstance(conf, (int, float))
    or not (0.0 <= conf <= 1.0)
):
    print(f'trimkit-learnings-log: error: \"confidence\" must be a number between 0.0 and 1.0, got {conf!r}', file=sys.stderr)
    sys.exit(1)
obj['ts'] = sys.argv[1]
print(json.dumps(obj, separators=(',', ':')))
" "$ts")"

# Create log directory if needed
mkdir -p "$LEARNINGS_DIR" || { echo "trimkit-learnings-log: error: cannot create log directory '$LEARNINGS_DIR'" >&2; exit 1; }

# Append the entry as a single line; >> opens the file in append mode (O_APPEND)
printf '%s\n' "$entry" >> "$LEARNINGS_FILE" || { echo "trimkit-learnings-log: error: cannot write to '$LEARNINGS_FILE'" >&2; exit 1; }
144 changes: 144 additions & 0 deletions bin/trimkit-learnings-search
@@ -0,0 +1,144 @@
#!/usr/bin/env bash
# trimkit-learnings-search — Query the sysops learnings store.
#
# Reads ~/.claude/sysops/learnings.jsonl, deduplicates entries by (deployment, key)
# pair (latest entry per pair wins), optionally filters by deployment, and outputs
# the surviving entries to stdout.
#
# Usage:
# trimkit-learnings-search # JSONL output (deduplicated)
# trimkit-learnings-search --deployment Pulse # filter to Pulse deployment
# trimkit-learnings-search --human # human-readable formatted output
# trimkit-learnings-search --human --deployment Pulse
#
# Arguments:
# --deployment <name> Case-insensitive deployment filter (optional)
# --human Output human-readable formatted text instead of JSONL
#
# Output:
# Default: One JSON object per line (JSONL), latest-wins deduplicated.
# --human: Formatted text with deployment, type, key, confidence, source, timestamp,
# and insight for each entry. Shows "No learnings found" when empty.
# Exits 0 with no output (JSONL) or a message (--human) if no matching entries exist
# or if the store doesn't exist yet.
#
# Environment overrides (for testing):
# TRIMKIT_SYSOPS_LEARNINGS_DIR override log directory (default: ~/.claude/sysops)
# TRIMKIT_SYSOPS_LEARNINGS_FILE override log file path (default: $DIR/learnings.jsonl)
set -euo pipefail

LEARNINGS_DIR="${TRIMKIT_SYSOPS_LEARNINGS_DIR:-${HOME}/.claude/sysops}"
LEARNINGS_FILE="${TRIMKIT_SYSOPS_LEARNINGS_FILE:-${LEARNINGS_DIR}/learnings.jsonl}"

# Parse arguments
deployment_filter=""
human_output="false"
while [ $# -gt 0 ]; do
  case "$1" in
    --deployment)
      # Require a non-empty value following the flag
      if [ $# -lt 2 ] || [ -z "${2:-}" ]; then
        echo "trimkit-learnings-search: error: --deployment requires an argument" >&2
        echo "Usage: trimkit-learnings-search [--deployment <name>] [--human]" >&2
        exit 1
      fi
      shift
      deployment_filter="$1"
      ;;
    --human)
      human_output="true"
      ;;
    *)
      echo "trimkit-learnings-search: error: unknown argument '$1'" >&2
      echo "Usage: trimkit-learnings-search [--deployment <name>] [--human]" >&2
      exit 1
      ;;
  esac
  shift
done

if [ ! -f "$LEARNINGS_FILE" ]; then
  # Not an error: the store simply hasn't been written to yet
  if [ "$human_output" = "true" ]; then
    echo "No learnings stored yet at $LEARNINGS_FILE"
    echo "Learnings are written by the sysops agent when it discovers server quirks."
  fi
  exit 0
fi

# Read, deduplicate by (deployment, key) pair (latest entry wins), filter by
# deployment, and output surviving entries (JSONL or human-readable).
# Dedup strategy: scan all lines in order; for each (deployment, key) pair, keep
# overwriting with the latest seen entry. Scoping dedup to deployment prevents a
# key written for one deployment from suppressing the same key on another.
python3 -c "
import json, sys

learnings_file = sys.argv[1]
deployment_filter = sys.argv[2] # empty string means no filter
human_output = sys.argv[3] == 'true'

seen = {} # (deployment, key) -> entry; latest entry per pair wins
order = [] # insertion-order list of (deployment, key) pairs, first-seen only

corrupt_count = 0
try:
    with open(learnings_file) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                obj = json.loads(line)
            except json.JSONDecodeError:
                corrupt_count += 1
                continue
            dedup_key = (obj.get('deployment', ''), obj.get('key', ''))
            if dedup_key not in seen:
                order.append(dedup_key)
            seen[dedup_key] = obj
except PermissionError:
    print(f'trimkit-learnings-search: error: cannot read {learnings_file} (permission denied)', file=sys.stderr)
    sys.exit(1)
except OSError as e:
    print(f'trimkit-learnings-search: error: cannot read learnings file: {e}', file=sys.stderr)
    sys.exit(1)

if corrupt_count:
    print(f'trimkit-learnings-search: warning: {corrupt_count} corrupt line(s) skipped', file=sys.stderr)

# Collect filtered entries in first-seen order
entries = []
for dedup_key in order:
    entry = seen[dedup_key]
    if deployment_filter and entry.get('deployment', '').lower() != deployment_filter.lower():
        continue
    entries.append(entry)

if human_output:
    if not entries:
        msg = 'No learnings found'
        if deployment_filter:
            msg += f' for deployment {deployment_filter!r}'
        print(msg + '.')
        if corrupt_count:
            print(f'Warning: {corrupt_count} corrupt line(s) skipped in learnings store.')
        sys.exit(0)
    for e in entries:
        ts = e.get('ts', 'unknown')
        dep = e.get('deployment', 'unknown')
        key = e.get('key', '?')
        ltype = e.get('type', '?')
        insight = e.get('insight', '')
        confidence = e.get('confidence', '?')
        source = e.get('source', '?')
        print(f'{dep} [{ltype}] {key} (confidence: {confidence}, source: {source}, recorded: {ts})')
        if insight:
            print(f' {insight}')
        print()
    if corrupt_count:
        print(f'Warning: {corrupt_count} corrupt line(s) skipped in learnings store.')
else:
    for e in entries:
        print(json.dumps(e, separators=(',', ':')))
" "$LEARNINGS_FILE" "$deployment_filter" "$human_output"