diff --git a/internal/core/skills/builtin/ai-inference.md b/internal/core/skills/builtin/ai-inference.md
index d5cbc78..cb9a1ba 100644
--- a/internal/core/skills/builtin/ai-inference.md
+++ b/internal/core/skills/builtin/ai-inference.md
@@ -69,7 +69,7 @@ You are an AI inference-serving analyst. LLM serving latency is NOT one number
 ### Step 3 — Find the decode bottleneck: cache, batch, or GPU
-1. **KV-cache**: `prom.query_range` on `gpu_cache_usage_perc` (or `kv_cache_usage_perc` on newer vLLM) — riding near 100% with rising `num_preemptions_total` is the smoking gun for throughput that won't scale and ITL spikes. The fix is more GPU memory (bigger cache), shorter `max_model_len`, quantisation, or fewer concurrent sequences.
+1. **KV-cache**: `prom.query_range` on `gpu_cache_usage_perc` (or `kv_cache_usage_perc` on newer vLLM) — note that despite the `_perc` suffix vLLM emits this as a **fraction in [0,1]**, so "saturated" is ~0.95–1.0, NOT 95–100; write thresholds against 1.0 (e.g. `> 0.9`), not 90. Riding near 1.0 with rising `num_preemptions_total` is the smoking gun for throughput that won't scale and ITL spikes. The fix is more GPU memory (bigger cache), shorter `max_model_len`, quantisation, or fewer concurrent sequences.
 2. **Batch**: `num_requests_running` / `tgi_batch_current_size` at the configured max while the queue grows ⇒ batch-saturated; more replicas or a larger batch (if GPU has headroom) is the lever.
 3. **GPU**: correlate with DCGM. `GPU_UTIL` ~100% ⇒ compute-bound (tensor-parallel / more GPUs / smaller model). `FB_USED` near total ⇒ memory-capacity-bound (OOM risk — hand context to `gpu-saturation`). Thermal throttling (`GPU_TEMP` high with clocks dropping) is a separate, infra-level cause.
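Reviewer note on the KV-cache hunk above: the fraction-versus-percent point can be sketched as a small check. This is a minimal sketch in Python and not part of the patch. The function name, its parameters, and the 0.9 default are illustrative assumptions; the inputs stand in for values already read from `gpu_cache_usage_perc` and `num_preemptions_total`.

```python
def kv_cache_saturated(
    usage_fraction: float,
    preemptions_start: float,
    preemptions_end: float,
    threshold: float = 0.9,  # assumed default, expressed against 1.0, not 100
) -> bool:
    """Return True when cache usage is near 1.0 AND preemptions rose in the window.

    usage_fraction is expected in [0, 1]; a value such as 95 means the caller
    treated the metric as a percentage, so it is rejected rather than compared.
    """
    if not 0.0 <= usage_fraction <= 1.0:
        raise ValueError("usage must be a fraction in [0, 1], not a percentage")
    preemptions_rising = preemptions_end > preemptions_start
    return usage_fraction > threshold and preemptions_rising
```

Rejecting out-of-range input instead of silently dividing by 100 keeps a threshold written as `> 90` from passing unnoticed.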
diff --git a/internal/core/skills/builtin/capacity-scheduling.md b/internal/core/skills/builtin/capacity-scheduling.md
index a42acaf..3c4f5cc 100644
--- a/internal/core/skills/builtin/capacity-scheduling.md
+++ b/internal/core/skills/builtin/capacity-scheduling.md
@@ -62,7 +62,7 @@ Every Pending pod resolves to one of: **(a) no node has room** (capacity — sca
 1. `k8s.top_nodes` and `k8s.list_nodes`: per node, compute allocatable vs. used. A node at 95% CPU-requested cannot take a 1-core request even if its real CPU usage is low — scheduling is on **requests**, not utilisation. Make this distinction explicit.
 2. Flag nodes with conditions `MemoryPressure`, `DiskPressure`, `PIDPressure`, or `Ready != True` — these are excluded from scheduling regardless of headroom.
-3. If `prom.query` is wired, cross-check `kube_node_status_allocatable` vs. `kube_pod_container_resource_requests` summed per node for a second opinion on the requested-capacity math.
+3. If `prom.query` is wired, cross-check per resource — these kube-state-metrics series carry a `resource` label, so you MUST filter or the sum mixes cores and bytes into a meaningless number. Compare `kube_node_status_allocatable{resource="cpu"}` vs. `sum by (node) (kube_pod_container_resource_requests{resource="cpu"})`, then again with `resource="memory"`, for a second opinion on the requested-capacity math.
 ### Step 3 — Check the downstream blockers
@@ -82,7 +82,7 @@ Autoscaler: HPA: / replicas, metric vs target
 PDB: disruptionsAllowed= (or n/a)
 Most likely:
-Recommend: | fix taint/affinity | raise maxReplicas | relax PDB>
+Recommend (operator-applied, read-only): | fix taint/affinity | raise maxReplicas | relax PDB>
 ```
 ## Operating Constraints
diff --git a/internal/core/skills/builtin/dotnet-runtime.md b/internal/core/skills/builtin/dotnet-runtime.md
index a48c013..9a748b2 100644
--- a/internal/core/skills/builtin/dotnet-runtime.md
+++ b/internal/core/skills/builtin/dotnet-runtime.md
@@ -38,7 +38,7 @@ You are a .NET / CLR runtime analyst. The CLR exposes rich runtime counters (Eve
 ## The mental model
-- **Generational GC.** Gen0/Gen1 collections are frequent and cheap; **Gen2** is a full collection and pauses longest. The **Large Object Heap (LOH)** holds allocations ≥ 85 KB, is collected with Gen2, and is *not compacted by default* — so LOH fragmentation drives both long Gen2 pauses and rising committed memory.
+- **Generational GC.** Gen0/Gen1 collections are frequent and cheap; **Gen2** is a full collection and pauses longest. The **Large Object Heap (LOH)** holds allocations ≥ 85,000 bytes (~83 KB, not 85 KB — an 84 KB array is already on the LOH), is collected with Gen2, and is *not compacted by default* — so LOH fragmentation drives both long Gen2 pauses and rising committed memory.
 - **Server GC vs. Workstation GC.** Server GC (default for ASP.NET, one heap+GC thread per core) maximises throughput but reserves more memory and assumes the box is dedicated. In a CPU-limited container with `` mis-set, Server GC can oversubscribe; Workstation GC may behave better in tight single-core limits. `DOTNET_gcServer` / `DOTNET_GCHeapCount` are the levers.
 - **ThreadPool starvation** is the signature .NET latency cliff: blocking synchronous calls (`.Result` / `.Wait()` on async, sync-over-async) consume pool threads faster than the pool's slow injection — a hill-climbing controller, not a fixed cadence, but rule-of-thumb ≈1 thread / 500 ms past the min — can replace them. Symptom: latency degrades sharply under load, queue length climbs, then recovers when load drops — without CPU saturation.
 - **Tiered JIT.** .NET JITs to a quick Tier-0 first, then re-JITs hot methods to optimised Tier-1. Cold-start / post-deploy latency that settles after warmup is tiering, not a leak; ReadyToRun/`DOTNET_TieredPGO` affect it.
diff --git a/internal/core/skills/builtin/go-runtime.md b/internal/core/skills/builtin/go-runtime.md
index 3b0c184..8823475 100644
--- a/internal/core/skills/builtin/go-runtime.md
+++ b/internal/core/skills/builtin/go-runtime.md
@@ -39,8 +39,8 @@ You are a Go runtime analyst. Go's runtime exposes itself well: the `go_*` Prome
 ## The mental model
 - **Goroutines are cheap but not free.** A monotonically rising `go_goroutines` is the #1 Go leak: a goroutine blocked forever on a channel/lock/network read that never returns. Memory and scheduler overhead grow with it until OOM. A leak is *count never comes back down*, not *count is high*.
-- **GC is concurrent but allocation-paced.** Go's GC triggers on heap growth governed by **GOGC** (default 100 = next collection when the heap grows to 2× the live set retained after the last mark — relative to the live heap, not the total heap at the previous cycle's end). High allocation rate ⇒ frequent GC ⇒ CPU burned in `gc` and STW assist pauses. The lever is usually *allocate less* (reduce garbage), and only sometimes *raise GOGC* or set a `GOMEMLIMIT`.
-- **STW is short but real.** Modern Go STW pauses are sub-ms, but **mark-assist** (mutators forced to help GC when allocation outruns the background collector) shows up as latency on allocation-heavy paths. `go_gc_duration_seconds` quantiles capture the pause distribution.
+- **GC is concurrent but allocation-paced.** Go's GC triggers on heap growth governed by **GOGC** (default 100 = next collection when the heap grows to 2× the live set retained after the last mark — relative to the live heap, not the total heap at the previous cycle's end). High allocation rate ⇒ frequent GC ⇒ CPU burned in background mark workers and in mark-assist. The lever is usually *allocate less* (reduce garbage), and only sometimes *raise GOGC* or set a `GOMEMLIMIT`.
+- **STW is short; mark-assist is not STW.** Go's actual stop-the-world phases (sweep-termination, mark-termination) are sub-ms and are what `go_gc_duration_seconds` quantiles capture. **Mark-assist** is different: it is *concurrent* work the allocating goroutine is forced to do inline (paying down assist debt) when allocation outruns the background collector — the rest of the program keeps running. So assist cost does NOT show up in the STW pause quantiles; it surfaces as elevated per-request latency on allocation-heavy paths and as GC CPU. Don't clear GC just because `go_gc_duration_seconds` is tiny.
 - **Scheduler latency** (`/sched/latencies` in runtime/metrics, if exported) rises when GOMAXPROCS is throttled by the cgroup CPU limit — a container with a 1-core limit but GOMAXPROCS=many will oversubscribe and add scheduling delay. Set GOMAXPROCS to the limit (or use automaxprocs).
 ## Investigation Playbook
@@ -59,13 +59,13 @@ You are a Go runtime analyst. Go's runtime exposes itself well: the `go_*` Prome
 1. `prom.query_range` on `go_goroutines` over hours. **Monotonic rise that never recovers across GC cycles = leak.** Correlate with `go_threads` and heap — a goroutine leak usually drags memory up with it. This alone often closes the case; the fix is a missing `context` cancellation / unbounded channel send.
 2. GC pressure: `prom.query_range` on `rate(go_memstats_alloc_bytes_total[5m])` (bytes allocated/sec) and the GC pause quantiles. High alloc rate + rising GC CPU + latency on hot paths ⇒ allocation churn. Estimate GC CPU fraction; if it's a large share of the limit, the app is paying for garbage.
-3. CPU-bound: container CPU pinned at the limit with `rate(container_cpu_throttled_seconds_total[5m])` > 0.25 ⇒ throttled; verify GOMAXPROCS vs. the CPU limit (oversubscription adds scheduler latency and wasted context switches).
+3. CPU-bound: container CPU pinned at the limit with a throttle **ratio** `rate(container_cpu_cfs_throttled_periods_total[5m]) / rate(container_cpu_cfs_periods_total[5m])` > 0.25 (i.e. >25% of CFS periods throttled) ⇒ throttled; verify GOMAXPROCS vs. the CPU limit (oversubscription adds scheduler latency and wasted context switches).
 ### Step 3 — Name the hot path (pprof)
 When metrics point at CPU or allocation but not a function, capture a profile. **`perf.go_pprof_cpu` is RiskHigh — operator must approve, and the binary must serve `net/http/pprof` (the `/debug/pprof/` endpoint), port-forwarded.**
-1. `perf.go_pprof_cpu url= seconds=15 top_n=20` (it hits `/debug/pprof/profile?seconds=N`).
+1. `perf.go_pprof_cpu name= duration_seconds=15 top_n=20` (the endpoint is a named entry in config — omit `name` if exactly one is configured; the tool hits `/debug/pprof/profile?seconds=N` for you).
 2. Read top functions by flat and cumulative samples:
 - `runtime.mallocgc` / `runtime.gcBgMarkWorker` / `runtime.scanobject` high ⇒ confirms the allocation/GC story; the *caller* allocating is the target.
 - `runtime.gcAssistAlloc` high ⇒ mutators are mark-assisting — allocation is outrunning the collector; reduce allocs or raise GOGC/GOMEMLIMIT.
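Reviewer note on the go-runtime hunks above: the two pieces of arithmetic they describe (the GOGC heap target and the CFS throttle ratio) can be sketched as follows. This is a minimal sketch in Python and not part of the patch. Both function names are hypothetical, the GOGC formula is the simplified "live heap only" reading given in the text (it ignores GC roots such as stacks and globals, and any `GOMEMLIMIT`), and the inputs stand in for values already produced by the `rate(...)` queries.

```python
from typing import Optional


def gogc_heap_target(live_heap_bytes: int, gogc: int = 100) -> int:
    """Approximate heap size at which the next GC cycle starts.

    Simplified reading of the text: target = live heap * (1 + GOGC/100),
    so GOGC=100 gives 2x the live set retained after the last mark.
    """
    return live_heap_bytes * (100 + gogc) // 100


def cfs_throttle_ratio(
    throttled_periods_rate: float, periods_rate: float
) -> Optional[float]:
    """Fraction of CFS periods that were throttled, or None with no periods.

    Both arguments are per-second rates over the same window, so the
    result is dimensionless and comparable to a threshold such as 0.25.
    """
    if periods_rate <= 0:
        return None  # no CFS periods observed (e.g. no CPU limit set)
    return throttled_periods_rate / periods_rate


# Usage: 3 of every 10 periods throttled is above the 0.25 threshold.
ratio = cfs_throttle_ratio(3.0, 10.0)
is_throttled = ratio is not None and ratio > 0.25
```

Dividing one periods rate by the other is what makes the `> 0.25` comparison meaningful; a raw throttled-seconds rate has no natural upper bound of 1.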
diff --git a/internal/core/skills/builtin/native-perf.md b/internal/core/skills/builtin/native-perf.md
index 2486842..27b4923 100644
--- a/internal/core/skills/builtin/native-perf.md
+++ b/internal/core/skills/builtin/native-perf.md
@@ -55,7 +55,7 @@ A CPU-bound native hotspot is rarely "the function is just slow." It's usually o
 **`perf.linux_perf_record` is RiskHigh — operator must approve; it runs `perf record -g` on a PID for a duration, then renders the call graph via `perf report --stdio`. The node needs `perf` and adequate `perf_event_paranoid`.**
-1. `perf.linux_perf_record pid= duration=15` during the symptom.
+1. `perf.linux_perf_record pid= duration_seconds=15` during the symptom (optional `frequency_hz`, default 99).
 2. Read the call-graph report:
 - The function with the highest self (flat) percentage is the hotspot; its callers give the context.
 - Time in `memcpy`/`memmove`/allocator (`malloc`/`free`/`tcmalloc`) ⇒ allocation or copy churn — reduce copies, reuse buffers.
diff --git a/internal/core/skills/builtin/node-runtime.md b/internal/core/skills/builtin/node-runtime.md
index fefde38..b8ba5ff 100644
--- a/internal/core/skills/builtin/node-runtime.md
+++ b/internal/core/skills/builtin/node-runtime.md
@@ -2,9 +2,9 @@
 name: node-runtime
 description: Diagnose Node.js / V8 performance — event-loop lag, garbage-collection pauses (scavenge vs. mark-sweep-compact), TurboFan deoptimisation, and CPU-bound handlers — using Prometheus runtime metrics and on-demand V8 Inspector CPU profiles. Read-only.
 triggers:
- - node
 - nodejs
 - node.js
+ - node runtime
 - event loop
 - event loop lag
 - libuv
@@ -54,21 +54,21 @@ You are a Node.js / V8 runtime analyst. Node is single-threaded for JavaScript:
 - `nodejs_eventloop_lag_seconds` / `nodejs_eventloop_lag_p99_seconds` — the headline signal.
 - `nodejs_gc_duration_seconds` (labelled by `kind`: `scavenge` / `markSweepCompact` / `incremental`).
 - `nodejs_heap_size_used_bytes`, `nodejs_heap_size_total_bytes`, `nodejs_external_memory_bytes`.
- - `nodejs_active_handles_total`, `nodejs_active_requests_total` (a monotonic climb = handle/request leak).
+ - `nodejs_active_handles` / `nodejs_active_handles_total`, `nodejs_active_requests` / `nodejs_active_requests_total` — these are **gauges** (live count now, despite the `_total` suffix), so judge a *non-recovering* rise net of traffic, not raw growth.
 ### Step 2 — Event loop vs. GC vs. CPU
 1. `prom.query_range` on `nodejs_eventloop_lag_p99_seconds`. Sustained lag **> 100 ms** with container CPU **below** its limit ⇒ the loop is being blocked synchronously (not a capacity problem). This is the most common "fast but spiky" Node incident.
 2. `prom.query_range` on `rate(nodejs_gc_duration_seconds_sum{kind="markSweepCompact"}[5m])`. If mark-sweep time is a meaningful fraction of wall-clock, the latency IS the GC — correlate the pause timestamps with the lag spikes.
-3. Heap trend: `nodejs_heap_size_used_bytes` climbing toward `--max-old-space-size` (default ~1.5 GB on 64-bit unless set explicitly; Node does not reliably read the container memory limit on its own, so an unset flag in a large-limit container can still cap the old space at the V8 default) ⇒ leak or unbounded cache; expect lengthening mark-sweep pauses then OOM. Hand off to `oom-killed-triage` if it's already being killed.
-4. `nodejs_active_handles_total` / `nodejs_active_requests_total` rising without bound ⇒ unclosed sockets/timers — a leak that also degrades the loop.
+3. Heap trend: `nodejs_heap_size_used_bytes` climbing toward the old-space cap ⇒ leak or unbounded cache; expect lengthening mark-sweep pauses then OOM. The cap is whatever `--max-old-space-size` is set to; if unset, modern Node (≥ 12) derives a default from available/cgroup-visible memory (often ~2 GB+ on a sizeable container, not a fixed 1.4–1.5 GB), so confirm the actual `--max-old-space-size` flag and the container memory limit rather than assuming a number. Hand off to `oom-killed-triage` if it's already being killed.
+4. The `nodejs_active_handles*` / `nodejs_active_requests*` gauges climbing and *not receding* as traffic falls ⇒ unclosed sockets/timers — a leak that also degrades the loop.
 ### Step 3 — Name the hot function (V8 CPU profile)
 When metrics localise the problem to a process but not a call site, profile it. **`perf.v8_inspector_*` is RiskHigh — the operator must approve, and the process must expose the inspector (`node --inspect=0.0.0.0:9229`, port-forwarded).**
-1. `perf.v8_inspector_targets url=` to enumerate debug targets and get the WebSocket debugger URL.
-2. `perf.v8_inspector_cpu_profile url= top_n=20` to capture a CPU profile. Read the top functions by `hitCount`:
+1. `perf.v8_inspector_targets name=` (the endpoint is a named entry in `node_inspectors` config — omit `name` if exactly one is configured) to enumerate debug targets.
+2. `perf.v8_inspector_cpu_profile name= target_index= duration_seconds=15 top_n=20` to capture a CPU profile (`target_index` picks the target from the enumerated list, default 0). Read the top functions by `hitCount`:
 - A user function dominating ⇒ a CPU-bound handler on the loop; that's the block.
 - `(garbage collector)` high ⇒ confirms the GC story from Step 2.
 - A function you'd expect to be fast sitting hot ⇒ suspect a TurboFan deopt; recommend stabilising its argument shapes / avoiding polymorphism on the hot path.
diff --git a/internal/core/skills/builtin/slo-burn.md b/internal/core/skills/builtin/slo-burn.md
index f14c339..79eae40 100644
--- a/internal/core/skills/builtin/slo-burn.md
+++ b/internal/core/skills/builtin/slo-burn.md
@@ -62,10 +62,12 @@ For an SLO with target `T` (e.g. 99.9%) over a window `W` (e.g. 30 days):
 1. `prom.query_range` the error ratio over the full SLO window to integrate consumed budget: `consumed = Σ(bad) / Σ(total)` against the allowed `1 - T`.
 2. `budget_remaining_pct = (1 - consumed/(1-T)) * 100`.
-3. `time_to_exhaustion = remaining_budget / current_burn_rate`, expressed in hours/days at the *current* fast-window rate. If burn rate < 1, the budget is not draining within the window — say "not on track to breach".
+3. `time_to_exhaustion = W × remaining_budget_fraction / current_burn_rate` — the window `W` carries the time dimension, so it MUST be in the formula (burn rate is dimensionless). E.g. remaining 50% of a 30d budget at burn 4× → 30d × 0.5 / 4 = 3.75 days. If burn rate < 1, the budget is not draining within the window — say "not on track to breach".
 ### Step 4 — Verdict (fixed output shape)
+Emit this only once `T`, `W`, and the SLI type are known — from the operator or a recording rule. If they're still unknown after Step 1, do NOT fill the ``/`` slots with a guess; stop and ask instead (see the constraint below).
+
 ```
 SLO: target % over
 Burn rate: 5m=× 1h=× | 30m=× 6h=×
@@ -81,4 +83,4 @@ Watch:
 - **Burn rate without a window pair is meaningless.** Never declare "page now" from a single short window — that's how you train alert fatigue. Require both halves of a pair.
 - **Don't invent the target.** If `T` and `W` aren't given and no recording rule encodes them, ask — a 99.9%/30d budget and a 99.99%/7d budget yield opposite verdicts on the same error ratio.
 - **Reuse the team's recording rules** when `alert.list_rules` exposes them; your ad-hoc PromQL must not contradict the alerts that actually page.
-- Read-only: you report the burn, you do not silence alerts or edit rules. Pivot to `incident-context` when a fast-burn coincides with firing alerts to find the proximate cause.
+- **Never silence or edit (read-only).** Do not recommend `amtool silence add`, acking/silencing the burn alert, or editing the SLO recording/alert rules to make the page stop — that hides budget loss instead of addressing it. You report the burn; a human decides what to mute. Pivot to `incident-context` when a fast-burn coincides with firing alerts to find the proximate cause.
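Reviewer note on the slo-burn Step 3 hunk above: the budget and time-to-exhaustion arithmetic can be sketched as follows. This is a minimal sketch in Python and not part of the patch. The function names are hypothetical, the window is taken in days purely for illustration, and returning `None` for a burn rate below 1 is one way to represent the "not on track to breach" wording, not something the skill prescribes.

```python
from typing import Optional


def budget_remaining_fraction(bad: float, total: float, target: float) -> float:
    """Fraction of the error budget still unspent: 1 - consumed / (1 - T)."""
    consumed = bad / total       # observed error ratio over the SLO window
    allowed = 1.0 - target       # error ratio the SLO permits
    return 1.0 - consumed / allowed


def time_to_exhaustion_days(
    window_days: float, remaining_fraction: float, burn_rate: float
) -> Optional[float]:
    """W * remaining_budget_fraction / burn_rate, in the same unit as W."""
    if burn_rate < 1.0:
        return None              # budget is not draining within the window
    if remaining_fraction <= 0.0:
        return 0.0               # budget already spent
    return window_days * remaining_fraction / burn_rate


# Worked example from the hunk: 50% of a 30d budget left at burn 4x.
days_left = time_to_exhaustion_days(30.0, 0.5, 4.0)  # 3.75
```

Keeping `window_days` as an explicit argument mirrors the point of the change: the burn rate is dimensionless, so the result only has a time unit because `W` supplies one.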