rlaope · May 28, 2026
diff --git a/internal/core/skills/builtin/ai-inference.md b/internal/core/skills/builtin/ai-inference.md
@@ -69,7 +69,7 @@ You are an AI inference-serving analyst. LLM serving latency is NOT one number
 
 ### Step 3 — Find the decode bottleneck: cache, batch, or GPU
 
-1. **KV-cache**: `prom.query_range` on `gpu_cache_usage_perc` (or `kv_cache_usage_perc` on newer vLLM) — riding near 100% with rising `num_preemptions_total` is the smoking gun for throughput that won't scale and ITL spikes. The fix is more GPU memory (bigger cache), shorter `max_model_len`, quantisation, or fewer concurrent sequences.
+1. **KV-cache**: `prom.query_range` on `gpu_cache_usage_perc` (or `kv_cache_usage_perc` on newer vLLM) — note that despite the `_perc` suffix vLLM emits this as a **fraction in [0,1]**, so "saturated" is ~0.95–1.0, NOT 95–100; write thresholds against 1.0 (e.g. `> 0.9`), not 90. Riding near 1.0 with rising `num_preemptions_total` is the smoking gun for throughput that won't scale and ITL spikes. The fix is more GPU memory (bigger cache), shorter `max_model_len`, quantisation, or fewer concurrent sequences.
 2. **Batch**: `num_requests_running` / `tgi_batch_current_size` at the configured max while the queue grows ⇒ batch-saturated; more replicas or a larger batch (if GPU has headroom) is the lever.
 3. **GPU**: correlate with DCGM. `GPU_UTIL` ~100% ⇒ compute-bound (tensor-parallel / more GPUs / smaller model). `FB_USED` near total ⇒ memory-capacity-bound (OOM risk — hand context to `gpu-saturation`). Thermal throttling (`GPU_TEMP` high with clocks dropping) is a separate, infra-level cause.
 

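A minimal sketch of the KV-cache check described in the `ai-inference.md` hunk above, assuming the usage samples have already been fetched as plain floats. The function name, the 0.8 "share of samples" cut-off, and the input shape are illustrative assumptions, not part of the skill or of vLLM. It mainly shows the unit guard: thresholds are written against 1.0, and a percent-scaled input is rejected rather than silently compared.

```python
from typing import Sequence


def kv_cache_pressure(
    usage: Sequence[float],
    preemptions_total: Sequence[float],
    threshold: float = 0.9,
    min_saturated_share: float = 0.8,  # assumption: "riding near 1.0" = 80% of samples
) -> bool:
    """Return True when cache usage stays above `threshold` and preemptions rise.

    `usage` holds samples of the cache-usage gauge as fractions in [0, 1].
    `preemptions_total` holds samples of the preemption counter over the same range.
    Counter resets are ignored in this sketch.
    """
    if not usage or len(preemptions_total) < 2:
        return False
    if max(usage) > 1.0:
        # A value such as 95.0 means the caller assumed percent scaling.
        raise ValueError("usage looks percent-scaled; expected a fraction in [0, 1]")
    saturated_share = sum(1 for u in usage if u > threshold) / len(usage)
    preemptions_rising = preemptions_total[-1] > preemptions_total[0]
    return saturated_share >= min_saturated_share and preemptions_rising
```

Passing `[0.95, 0.97, 0.99]` with a rising preemption counter reports pressure, while `[95.0, 97.0]` raises instead of quietly never matching a `> 0.9` rule written for fractions.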
diff --git a/internal/core/skills/builtin/capacity-scheduling.md b/internal/core/skills/builtin/capacity-scheduling.md
@@ -62,7 +62,7 @@ Every Pending pod resolves to one of: **(a) no node has room** (capacity — sca
 
 1. `k8s.top_nodes` and `k8s.list_nodes`: per node, compute allocatable vs. used. A node with less than 1 core left unrequested (e.g. a 16-core node at 95% CPU-requested, leaving 0.8 core) cannot take a 1-core request even if its real CPU usage is low: scheduling is on **requests**, not utilisation. Make this distinction explicit.
 2. Flag nodes with conditions `MemoryPressure`, `DiskPressure`, `PIDPressure`, or `Ready != True` — these are excluded from scheduling regardless of headroom.
-3. If `prom.query` is wired, cross-check `kube_node_status_allocatable` vs. `kube_pod_container_resource_requests` summed per node for a second opinion on the requested-capacity math.
+3. If `prom.query` is wired, cross-check per resource — these kube-state-metrics series carry a `resource` label, so you MUST filter or the sum mixes cores and bytes into a meaningless number. Compare `kube_node_status_allocatable{resource="cpu"}` vs. `sum by (node) (kube_pod_container_resource_requests{resource="cpu"})`, then again with `resource="memory"`, for a second opinion on the requested-capacity math.
 
 ### Step 3 — Check the downstream blockers
 
@@ -82,7 +82,7 @@ Autoscaler:    <TriggeredScaleUp pending | NotTriggerScaleUp: reason | n/a>
 HPA:           <name> <cur>/<max> replicas, metric <value> vs target
 PDB:           <name> disruptionsAllowed=<n>  (or n/a)
 Most likely:   <one-sentence cause>
-Recommend:     <add nodes | lower requests to <value> | fix taint/affinity <detail> | raise maxReplicas | relax PDB>
+Recommend (operator-applied, read-only): <add nodes | lower requests to <value> | fix taint/affinity <detail> | raise maxReplicas | relax PDB>
 ```
 
 ## Operating Constraints

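A rough sketch of the per-resource cross-check from the `capacity-scheduling.md` hunk above, assuming the two kube-state-metrics series have already been flattened into Python values. The function name and input shapes are hypothetical. The point it illustrates is that the sum is keyed by `(node, resource)`, so cores and bytes never land in the same bucket.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Key = Tuple[str, str]  # (node, resource), e.g. ("n1", "cpu")


def requested_ratio(
    allocatable: Dict[Key, float],
    requests: Iterable[Tuple[str, str, float]],
) -> Dict[Key, float]:
    """Return requested/allocatable per (node, resource).

    `allocatable` mirrors kube_node_status_allocatable (one value per key).
    `requests` mirrors kube_pod_container_resource_requests rows as
    (node, resource, amount); rows are summed per key before dividing.
    """
    summed: Dict[Key, float] = defaultdict(float)
    for node, resource, amount in requests:
        summed[(node, resource)] += amount
    return {
        key: summed[key] / total
        for key, total in allocatable.items()
        if total > 0  # skip keys with no allocatable capacity
    }
```

With 4 allocatable cores and container requests of 1.5 and 2.3 cores, the CPU ratio comes out at 0.95 independently of the memory ratio, which is the "second opinion" on requested capacity the step asks for.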
diff --git a/internal/core/skills/builtin/dotnet-runtime.md b/internal/core/skills/builtin/dotnet-runtime.md
@@ -38,7 +38,7 @@ You are a .NET / CLR runtime analyst. The CLR exposes rich runtime counters (Eve
 
 ## The mental model
 
-- **Generational GC.** Gen0/Gen1 collections are frequent and cheap; **Gen2** is a full collection and pauses longest. The **Large Object Heap (LOH)** holds allocations ≥ 85 KB, is collected with Gen2, and is *not compacted by default* — so LOH fragmentation drives both long Gen2 pauses and rising committed memory.
+- **Generational GC.** Gen0/Gen1 collections are frequent and cheap; **Gen2** is a full collection and pauses longest. The **Large Object Heap (LOH)** holds allocations ≥ 85,000 bytes (~83 KB, not 85 KB — an 84 KB array is already on the LOH), is collected with Gen2, and is *not compacted by default* — so LOH fragmentation drives both long Gen2 pauses and rising committed memory.
 - **Server GC vs. Workstation GC.** Server GC (default for ASP.NET, one heap+GC thread per core) maximises throughput but reserves more memory and assumes the box is dedicated. In a CPU-limited container with `<GCHeapCount>` mis-set, Server GC can oversubscribe; Workstation GC may behave better in tight single-core limits. `DOTNET_gcServer` / `DOTNET_GCHeapCount` are the levers.
 - **ThreadPool starvation** is the signature .NET latency cliff: blocking synchronous calls (`.Result` / `.Wait()` on async, sync-over-async) consume pool threads faster than the pool's slow injection — a hill-climbing controller, not a fixed cadence, but rule-of-thumb ≈1 thread / 500 ms past the min — can replace them. Symptom: latency degrades sharply under load, queue length climbs, then recovers when load drops — without CPU saturation.
 - **Tiered JIT.** .NET JITs to a quick Tier-0 first, then re-JITs hot methods to optimised Tier-1. Cold-start / post-deploy latency that settles after warmup is tiering, not a leak; ReadyToRun/`DOTNET_TieredPGO` affect it.

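The 85,000-byte LOH threshold from the `dotnet-runtime.md` hunk above can be checked with simple arithmetic. This is an illustrative sketch in Python rather than .NET code: the helper name is hypothetical, and `size_bytes` stands for the total object size, ignoring the small per-object header that the CLR adds.

```python
LOH_THRESHOLD_BYTES = 85_000  # threshold in bytes, not 85 * 1024


def lands_on_loh(size_bytes: int) -> bool:
    """Return True when an allocation of `size_bytes` meets the LOH threshold."""
    return size_bytes >= LOH_THRESHOLD_BYTES


KIB = 1024
sizes = {"83 KiB": 83 * KIB, "84 KiB": 84 * KIB, "85 KiB": 85 * KIB}
placement = {label: lands_on_loh(size) for label, size in sizes.items()}
```

Here 83 KiB is 84,992 bytes and stays below the threshold, while 84 KiB is 86,016 bytes and crosses it, so sizing buffers against "85 KB" misjudges which allocations reach the LOH.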
diff --git a/internal/core/skills/builtin/go-runtime.md b/internal/core/skills/builtin/go-runtime.md
@@ -39,8 +39,8 @@ You are a Go runtime analyst. Go's runtime exposes itself well: the `go_*` Prome
 ## The mental model
 
 - **Goroutines are cheap but not free.** A monotonically rising `go_goroutines` is the #1 Go leak: a goroutine blocked forever on a channel/lock/network read that never returns. Memory and scheduler overhead grow with it until OOM. A leak is *count never comes back down*, not *count is high*.
-- **GC is concurrent but allocation-paced.** Go's GC triggers on heap growth governed by **GOGC** (default 100 = next collection when the heap grows to 2× the live set retained after the last mark — relative to the live heap, not the total heap at the previous cycle's end). High allocation rate ⇒ frequent GC ⇒ CPU burned in `gc` and STW assist pauses. The lever is usually *allocate less* (reduce garbage), and only sometimes *raise GOGC* or set a `GOMEMLIMIT`.
-- **STW is short but real.** Modern Go STW pauses are sub-ms, but **mark-assist** (mutators forced to help GC when allocation outruns the background collector) shows up as latency on allocation-heavy paths. `go_gc_duration_seconds` quantiles capture the pause distribution.
+- **GC is concurrent but allocation-paced.** Go's GC triggers on heap growth governed by **GOGC** (default 100 = next collection when the heap grows to 2× the live set retained after the last mark — relative to the live heap, not the total heap at the previous cycle's end). High allocation rate ⇒ frequent GC ⇒ CPU burned in background mark workers and in mark-assist. The lever is usually *allocate less* (reduce garbage), and only sometimes *raise GOGC* or set a `GOMEMLIMIT`.
+- **STW is short; mark-assist is not STW.** Go's actual stop-the-world phases (sweep-termination, mark-termination) are sub-ms and are what `go_gc_duration_seconds` quantiles capture. **Mark-assist** is different: it is *concurrent* work the allocating goroutine is forced to do inline (paying down assist debt) when allocation outruns the background collector — the rest of the program keeps running. So assist cost does NOT show up in the STW pause quantiles; it surfaces as elevated per-request latency on allocation-heavy paths and as GC CPU. Don't clear GC just because `go_gc_duration_seconds` is tiny.
 - **Scheduler latency** (`/sched/latencies` in runtime/metrics, if exported) rises when GOMAXPROCS is throttled by the cgroup CPU limit — a container with a 1-core limit but GOMAXPROCS=many will oversubscribe and add scheduling delay. Set GOMAXPROCS to the limit (or use automaxprocs).
 
 ## Investigation Playbook
@@ -59,13 +59,13 @@ You are a Go runtime analyst. Go's runtime exposes itself well: the `go_*` Prome
 
 1. `prom.query_range` on `go_goroutines` over hours. **Monotonic rise that never recovers across GC cycles = leak.** Correlate with `go_threads` and heap; a goroutine leak usually drags memory up with it. This alone often closes the case; the usual cause is a missing `context` cancellation or a channel send with no receiver.
 2. GC pressure: `prom.query_range` on `rate(go_memstats_alloc_bytes_total[5m])` (bytes allocated/sec) and the GC pause quantiles. High alloc rate + rising GC CPU + latency on hot paths ⇒ allocation churn. Estimate GC CPU fraction; if it's a large share of the limit, the app is paying for garbage.
-3. CPU-bound: container CPU pinned at the limit with `rate(container_cpu_throttled_seconds_total[5m])` > 0.25 ⇒ throttled; verify GOMAXPROCS vs. the CPU limit (oversubscription adds scheduler latency and wasted context switches).
+3. CPU-bound: container CPU pinned at the limit with a throttle **ratio** `rate(container_cpu_cfs_throttled_periods_total[5m]) / rate(container_cpu_cfs_periods_total[5m])` > 0.25 (i.e. >25% of CFS periods throttled) ⇒ throttled; verify GOMAXPROCS vs. the CPU limit (oversubscription adds scheduler latency and wasted context switches).
 
 ### Step 3 — Name the hot path (pprof)
 
 When metrics point at CPU or allocation but not a function, capture a profile. **`perf.go_pprof_cpu` is RiskHigh — operator must approve, and the binary must serve `net/http/pprof` (the `/debug/pprof/` endpoint), port-forwarded.**
 
-1. `perf.go_pprof_cpu url=<pprof-host:port> seconds=15 top_n=20` (it hits `/debug/pprof/profile?seconds=N`).
+1. `perf.go_pprof_cpu name=<configured-pprof-endpoint> duration_seconds=15 top_n=20` (the endpoint is a named entry in config — omit `name` if exactly one is configured; the tool hits `/debug/pprof/profile?seconds=N` for you).
 2. Read top functions by flat and cumulative samples:
    - `runtime.mallocgc` / `runtime.gcBgMarkWorker` / `runtime.scanobject` high ⇒ confirms the allocation/GC story; the *caller* allocating is the target.
    - `runtime.gcAssistAlloc` high ⇒ mutators are mark-assisting — allocation is outrunning the collector; reduce allocs or raise GOGC/GOMEMLIMIT.

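A small sketch of the throttle ratio used in the CPU-bound step of the `go-runtime.md` hunk above, assuming two raw samples of each cAdvisor counter taken at the start and end of the same interval. The function name and argument layout are hypothetical. It computes the same quantity as dividing the two `rate(...)` expressions, because the interval length cancels out.

```python
from typing import Optional


def cfs_throttle_ratio(
    throttled_start: float,
    throttled_end: float,
    periods_start: float,
    periods_end: float,
) -> Optional[float]:
    """Fraction of CFS periods in the interval during which the container was throttled.

    Inputs are samples of container_cpu_cfs_throttled_periods_total and
    container_cpu_cfs_periods_total. Returns None when no periods elapsed
    or a counter reset makes the delta unusable.
    """
    periods = periods_end - periods_start
    throttled = throttled_end - throttled_start
    if periods <= 0 or throttled < 0:
        return None
    return throttled / periods
```

For example, 60 throttled periods out of 200 elapsed gives 0.3, which is above the 0.25 threshold in the step. A throttled-seconds rate has no such fixed scale, which is why the ratio is the quantity to compare.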
diff --git a/internal/core/skills/builtin/native-perf.md b/internal/core/skills/builtin/native-perf.md
@@ -55,7 +55,7 @@ A CPU-bound native hotspot is rarely "the function is just slow." It's usually o
 
 **`perf.linux_perf_record` is RiskHigh — operator must approve; it runs `perf record -g` on a PID for a duration, then renders the call graph via `perf report --stdio`. The node needs `perf` and adequate `perf_event_paranoid`.**
 
-1. `perf.linux_perf_record pid=<target-pid> duration=15` during the symptom.
+1. `perf.linux_perf_record pid=<target-pid> duration_seconds=15` during the symptom (optional `frequency_hz`, default 99).
 2. Read the call-graph report:
    - The function with the highest self (flat) percentage is the hotspot; its callers give the context.
    - Time in `memcpy`/`memmove`/allocator (`malloc`/`free`/`tcmalloc`) ⇒ allocation or copy churn — reduce copies, reuse buffers.

diff --git a/internal/core/skills/builtin/node-runtime.md b/internal/core/skills/builtin/node-runtime.md
@@ -2,9 +2,9 @@
 name: node-runtime
 description: Diagnose Node.js / V8 performance — event-loop lag, garbage-collection pauses (scavenge vs. mark-sweep-compact), TurboFan deoptimisation, and CPU-bound handlers — using Prometheus runtime metrics and on-demand V8 Inspector CPU profiles. Read-only.
 triggers:
-  - node
   - nodejs
   - node.js
+  - node runtime
   - event loop
   - event loop lag
   - libuv
@@ -54,21 +54,21 @@ You are a Node.js / V8 runtime analyst. Node is single-threaded for JavaScript:
    - `nodejs_eventloop_lag_seconds` / `nodejs_eventloop_lag_p99_seconds` — the headline signal.
    - `nodejs_gc_duration_seconds` (labelled by `kind`: `scavenge` / `markSweepCompact` / `incremental`).
    - `nodejs_heap_size_used_bytes`, `nodejs_heap_size_total_bytes`, `nodejs_external_memory_bytes`.
-   - `nodejs_active_handles_total`, `nodejs_active_requests_total` (a monotonic climb = handle/request leak).
+   - `nodejs_active_handles` / `nodejs_active_handles_total`, `nodejs_active_requests` / `nodejs_active_requests_total` — these are **gauges** (live count now, despite the `_total` suffix), so judge a *non-recovering* rise net of traffic, not raw growth.
 
 ### Step 2 — Event loop vs. GC vs. CPU
 
 1. `prom.query_range` on `nodejs_eventloop_lag_p99_seconds`. Sustained lag **> 100 ms** with container CPU **below** its limit ⇒ the loop is being blocked synchronously (not a capacity problem). This is the most common "fast but spiky" Node incident.
 2. `prom.query_range` on `rate(nodejs_gc_duration_seconds_sum{kind="markSweepCompact"}[5m])`. If mark-sweep time is a meaningful fraction of wall-clock, the latency IS the GC — correlate the pause timestamps with the lag spikes.
-3. Heap trend: `nodejs_heap_size_used_bytes` climbing toward `--max-old-space-size` (default ~1.5 GB on 64-bit unless set explicitly; Node does not reliably read the container memory limit on its own, so an unset flag in a large-limit container can still cap the old space at the V8 default) ⇒ leak or unbounded cache; expect lengthening mark-sweep pauses then OOM. Hand off to `oom-killed-triage` if it's already being killed.
-4. `nodejs_active_handles_total` / `nodejs_active_requests_total` rising without bound ⇒ unclosed sockets/timers — a leak that also degrades the loop.
+3. Heap trend: `nodejs_heap_size_used_bytes` climbing toward the old-space cap ⇒ leak or unbounded cache; expect lengthening mark-sweep pauses then OOM. The cap is whatever `--max-old-space-size` is set to; if unset, modern Node (≥ 12) derives a default from available/cgroup-visible memory (often ~2 GB+ on a sizeable container, not a fixed 1.4–1.5 GB), so confirm the actual `--max-old-space-size` flag and the container memory limit rather than assuming a number. Hand off to `oom-killed-triage` if it's already being killed.
+4. The `nodejs_active_handles*` / `nodejs_active_requests*` gauges climbing and *not receding* as traffic falls ⇒ unclosed sockets/timers — a leak that also degrades the loop.
 
 ### Step 3 — Name the hot function (V8 CPU profile)
 
 When metrics localise the problem to a process but not a call site, profile it. **`perf.v8_inspector_*` is RiskHigh — the operator must approve, and the process must expose the inspector (`node --inspect=0.0.0.0:9229`, port-forwarded).**
 
-1. `perf.v8_inspector_targets url=<inspector-host:port>` to enumerate debug targets and get the WebSocket debugger URL.
-2. `perf.v8_inspector_cpu_profile url=<ws-url> top_n=20` to capture a CPU profile. Read the top functions by `hitCount`:
+1. `perf.v8_inspector_targets name=<configured-inspector-endpoint>` (the endpoint is a named entry in `node_inspectors` config — omit `name` if exactly one is configured) to enumerate debug targets.
+2. `perf.v8_inspector_cpu_profile name=<endpoint> target_index=<n> duration_seconds=15 top_n=20` to capture a CPU profile (`target_index` picks the target from the enumerated list, default 0). Read the top functions by `hitCount`:
    - A user function dominating ⇒ a CPU-bound handler on the loop; that's the block.
    - `(garbage collector)` high ⇒ confirms the GC story from Step 2.
    - A function you'd expect to be fast sitting hot ⇒ suspect a TurboFan deopt; recommend stabilising its argument shapes / avoiding polymorphism on the hot path.

diff --git a/internal/core/skills/builtin/slo-burn.md b/internal/core/skills/builtin/slo-burn.md
@@ -62,10 +62,12 @@ For an SLO with target `T` (e.g. 99.9%) over a window `W` (e.g. 30 days):
 
 1. `prom.query_range` the error ratio over the full SLO window to integrate consumed budget: `consumed = Σ(bad) / Σ(total)` against the allowed `1 - T`.
 2. `budget_remaining_pct = (1 - consumed/(1-T)) * 100`.
-3. `time_to_exhaustion = remaining_budget / current_burn_rate`, expressed in hours/days at the *current* fast-window rate. If burn rate < 1, the budget is not draining within the window — say "not on track to breach".
+3. `time_to_exhaustion = W × remaining_budget_fraction / current_burn_rate` — the window `W` carries the time dimension, so it MUST be in the formula (burn rate is dimensionless). E.g. remaining 50% of a 30d budget at burn 4× → 30d × 0.5 / 4 = 3.75 days. If burn rate < 1, the budget is not draining within the window — say "not on track to breach".
 
 ### Step 4 — Verdict (fixed output shape)
 
+Emit this only once `T`, `W`, and the SLI type are known — from the operator or a recording rule. If they're still unknown after Step 1, do NOT fill the `<T>`/`<W>` slots with a guess; stop and ask instead (see the constraint below).
+
 ```
 SLO:           <service> <SLI type> target <T>% over <W>
 Burn rate:     5m=<x>×  1h=<y>×  | 30m=<a>×  6h=<b>×
@@ -81,4 +83,4 @@ Watch:         <one prom.query_range the on-call should keep open>
 - **Burn rate without a window pair is meaningless.** Never declare "page now" from a single short window — that's how you train alert fatigue. Require both halves of a pair.
 - **Don't invent the target.** If `T` and `W` aren't given and no recording rule encodes them, ask — a 99.9%/30d budget and a 99.99%/7d budget yield opposite verdicts on the same error ratio.
 - **Reuse the team's recording rules** when `alert.list_rules` exposes them; your ad-hoc PromQL must not contradict the alerts that actually page.
-- Read-only: you report the burn, you do not silence alerts or edit rules. Pivot to `incident-context` when a fast-burn coincides with firing alerts to find the proximate cause.
+- **Never silence or edit (read-only).** Do not recommend `amtool silence add`, acking/silencing the burn alert, or editing the SLO recording/alert rules to make the page stop — that hides budget loss instead of addressing it. You report the burn; a human decides what to mute. Pivot to `incident-context` when a fast-burn coincides with firing alerts to find the proximate cause.
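The budget arithmetic from Step 3 of the `slo-burn.md` hunk above can be sketched as follows. Function names and the `None` return for "not on track to breach" are illustrative assumptions. The formulas are the ones in the hunk: `budget_remaining = 1 - consumed/(1-T)` and `time_to_exhaustion = W × remaining / burn_rate`.

```python
from typing import Optional


def budget_remaining_fraction(bad: float, total: float, target: float) -> float:
    """Remaining error budget as a fraction of the allowed budget (1 - target)."""
    if total <= 0:
        return 1.0  # no traffic observed, so no budget consumed
    allowed = 1.0 - target
    consumed = bad / total
    return 1.0 - consumed / allowed


def time_to_exhaustion_days(
    window_days: float, remaining_fraction: float, burn_rate: float
) -> Optional[float]:
    """Days until the budget runs out at the current burn rate.

    Returns None when burn_rate < 1 (the budget is not draining within the
    window) and 0.0 when the budget is already spent.
    """
    if burn_rate < 1.0:
        return None
    if remaining_fraction <= 0.0:
        return 0.0
    return window_days * remaining_fraction / burn_rate
```

With half of a 30-day budget left and a burn rate of 4×, `time_to_exhaustion_days(30, 0.5, 4)` gives 3.75 days, matching the worked example in Step 3. Dropping `window_days` from the formula would leave a dimensionless number rather than a duration.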