Merged
2 changes: 1 addition & 1 deletion internal/core/skills/builtin/ai-inference.md
@@ -69,7 +69,7 @@ You are an AI inference-serving analyst. LLM serving latency is NOT one number

### Step 3 — Find the decode bottleneck: cache, batch, or GPU

-1. **KV-cache**: `prom.query_range` on `gpu_cache_usage_perc` (or `kv_cache_usage_perc` on newer vLLM) — riding near 100% with rising `num_preemptions_total` is the smoking gun for throughput that won't scale and ITL spikes. The fix is more GPU memory (bigger cache), shorter `max_model_len`, quantisation, or fewer concurrent sequences.
+1. **KV-cache**: `prom.query_range` on `gpu_cache_usage_perc` (or `kv_cache_usage_perc` on newer vLLM) — note that despite the `_perc` suffix vLLM emits this as a **fraction in [0,1]**, so "saturated" is ~0.95–1.0, NOT 95–100; write thresholds against 1.0 (e.g. `> 0.9`), not 90. Riding near 1.0 with rising `num_preemptions_total` is the smoking gun for throughput that won't scale and ITL spikes. The fix is more GPU memory (bigger cache), shorter `max_model_len`, quantisation, or fewer concurrent sequences.
2. **Batch**: `num_requests_running` / `tgi_batch_current_size` at the configured max while the queue grows ⇒ batch-saturated; more replicas or a larger batch (if GPU has headroom) is the lever.
3. **GPU**: correlate with DCGM. `GPU_UTIL` ~100% ⇒ compute-bound (tensor-parallel / more GPUs / smaller model). `FB_USED` near total ⇒ memory-capacity-bound (OOM risk — hand context to `gpu-saturation`). Thermal throttling (`GPU_TEMP` high with clocks dropping) is a separate, infra-level cause.
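
The fraction-versus-percent point in item 1 is easy to get wrong in a threshold check. Below is a minimal sketch, assuming the cache-usage samples and the preemption counter values have already been pulled from Prometheus. The function name, the `0.9` threshold, and the "80% of samples" cut-off are illustrative assumptions, not part of the skill or of vLLM.

```python
def kv_cache_saturated(usage_samples, preemptions, threshold=0.9):
    """Return True when KV-cache usage sits above `threshold` for most of
    the window while the preemption counter is rising.

    usage_samples: gpu_cache_usage_perc values, already fractions in [0, 1].
    preemptions:   num_preemptions_total counter values over the same window.
    """
    # Reject a percent-scale threshold such as 90 before doing anything else.
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be a fraction in [0, 1], e.g. 0.9")
    if not usage_samples or len(preemptions) < 2:
        return False
    high = sum(1 for u in usage_samples if u > threshold)
    mostly_high = high / len(usage_samples) >= 0.8  # illustrative cut-off
    preempting = preemptions[-1] > preemptions[0]   # counter went up (resets ignored)
    return mostly_high and preempting
```

Passing `threshold=90` raises immediately, which is the failure mode the note above warns about.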

4 changes: 2 additions & 2 deletions internal/core/skills/builtin/capacity-scheduling.md
@@ -62,7 +62,7 @@ Every Pending pod resolves to one of: **(a) no node has room** (capacity — sca

1. `k8s.top_nodes` and `k8s.list_nodes`: per node, compute allocatable vs. used. A node at 95% CPU-requested cannot take a 1-core request even if its real CPU usage is low — scheduling is on **requests**, not utilisation. Make this distinction explicit.
2. Flag nodes with conditions `MemoryPressure`, `DiskPressure`, `PIDPressure`, or `Ready != True` — these are excluded from scheduling regardless of headroom.
-3. If `prom.query` is wired, cross-check `kube_node_status_allocatable` vs. `kube_pod_container_resource_requests` summed per node for a second opinion on the requested-capacity math.
+3. If `prom.query` is wired, cross-check per resource — these kube-state-metrics series carry a `resource` label, so you MUST filter or the sum mixes cores and bytes into a meaningless number. Compare `kube_node_status_allocatable{resource="cpu"}` vs. `sum by (node) (kube_pod_container_resource_requests{resource="cpu"})`, then again with `resource="memory"`, for a second opinion on the requested-capacity math.
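
The per-resource cross-check in item 3 amounts to a grouped sum followed by a division. Here is a minimal sketch, assuming the query results have already been flattened into plain Python values. The function name and the tuple layout are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def requested_fraction(allocatable, requests):
    """Compute requested / allocatable per (node, resource).

    allocatable: {(node, resource): value}, e.g. {("n1", "cpu"): 4.0}
    requests:    iterable of (node, resource, value), one entry per container
    """
    summed = defaultdict(float)
    for node, resource, value in requests:
        # Key on the resource as well as the node, so cores and bytes
        # are never added together.
        summed[(node, resource)] += value
    return {
        key: summed.get(key, 0.0) / capacity
        for key, capacity in allocatable.items()
        if capacity > 0
    }
```

A node whose `cpu` entry comes out at 0.95 has only 5% of its allocatable CPU left to request, whatever its live utilisation looks like.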

### Step 3 — Check the downstream blockers

@@ -82,7 +82,7 @@ Autoscaler: <TriggeredScaleUp pending | NotTriggerScaleUp: reason | n/a>
HPA: <name> <cur>/<max> replicas, metric <value> vs target
PDB: <name> disruptionsAllowed=<n> (or n/a)
Most likely: <one-sentence cause>
-Recommend: <add nodes | lower requests to <value> | fix taint/affinity <detail> | raise maxReplicas | relax PDB>
+Recommend (operator-applied, read-only): <add nodes | lower requests to <value> | fix taint/affinity <detail> | raise maxReplicas | relax PDB>
```

## Operating Constraints
2 changes: 1 addition & 1 deletion internal/core/skills/builtin/dotnet-runtime.md
@@ -38,7 +38,7 @@ You are a .NET / CLR runtime analyst. The CLR exposes rich runtime counters (Eve

## The mental model

-- **Generational GC.** Gen0/Gen1 collections are frequent and cheap; **Gen2** is a full collection and pauses longest. The **Large Object Heap (LOH)** holds allocations ≥ 85 KB, is collected with Gen2, and is *not compacted by default* — so LOH fragmentation drives both long Gen2 pauses and rising committed memory.
+- **Generational GC.** Gen0/Gen1 collections are frequent and cheap; **Gen2** is a full collection and pauses longest. The **Large Object Heap (LOH)** holds allocations ≥ 85,000 bytes (~83 KB, not 85 KB — an 84 KB array is already on the LOH), is collected with Gen2, and is *not compacted by default* — so LOH fragmentation drives both long Gen2 pauses and rising committed memory.
- **Server GC vs. Workstation GC.** Server GC (default for ASP.NET, one heap+GC thread per core) maximises throughput but reserves more memory and assumes the box is dedicated. In a CPU-limited container with `<GCHeapCount>` mis-set, Server GC can oversubscribe; Workstation GC may behave better in tight single-core limits. `DOTNET_gcServer` / `DOTNET_GCHeapCount` are the levers.
- **ThreadPool starvation** is the signature .NET latency cliff: blocking synchronous calls (`.Result` / `.Wait()` on async, sync-over-async) consume pool threads faster than the pool's slow injection — a hill-climbing controller, not a fixed cadence, but rule-of-thumb ≈1 thread / 500 ms past the min — can replace them. Symptom: latency degrades sharply under load, queue length climbs, then recovers when load drops — without CPU saturation.
- **Tiered JIT.** .NET JITs to a quick Tier-0 first, then re-JITs hot methods to optimised Tier-1. Cold-start / post-deploy latency that settles after warmup is tiering, not a leak; ReadyToRun/`DOTNET_TieredPGO` affect it.
8 changes: 4 additions & 4 deletions internal/core/skills/builtin/go-runtime.md
@@ -39,8 +39,8 @@ You are a Go runtime analyst. Go's runtime exposes itself well: the `go_*` Prome
## The mental model

- **Goroutines are cheap but not free.** A monotonically rising `go_goroutines` is the #1 Go leak: a goroutine blocked forever on a channel/lock/network read that never returns. Memory and scheduler overhead grow with it until OOM. A leak is *count never comes back down*, not *count is high*.
-- **GC is concurrent but allocation-paced.** Go's GC triggers on heap growth governed by **GOGC** (default 100 = next collection when the heap grows to 2× the live set retained after the last mark — relative to the live heap, not the total heap at the previous cycle's end). High allocation rate ⇒ frequent GC ⇒ CPU burned in `gc` and STW assist pauses. The lever is usually *allocate less* (reduce garbage), and only sometimes *raise GOGC* or set a `GOMEMLIMIT`.
-- **STW is short but real.** Modern Go STW pauses are sub-ms, but **mark-assist** (mutators forced to help GC when allocation outruns the background collector) shows up as latency on allocation-heavy paths. `go_gc_duration_seconds` quantiles capture the pause distribution.
+- **GC is concurrent but allocation-paced.** Go's GC triggers on heap growth governed by **GOGC** (default 100 = next collection when the heap grows to 2× the live set retained after the last mark — relative to the live heap, not the total heap at the previous cycle's end). High allocation rate ⇒ frequent GC ⇒ CPU burned in background mark workers and in mark-assist. The lever is usually *allocate less* (reduce garbage), and only sometimes *raise GOGC* or set a `GOMEMLIMIT`.
+- **STW is short; mark-assist is not STW.** Go's actual stop-the-world phases (sweep-termination, mark-termination) are sub-ms and are what `go_gc_duration_seconds` quantiles capture. **Mark-assist** is different: it is *concurrent* work the allocating goroutine is forced to do inline (paying down assist debt) when allocation outruns the background collector — the rest of the program keeps running. So assist cost does NOT show up in the STW pause quantiles; it surfaces as elevated per-request latency on allocation-heavy paths and as GC CPU. Don't clear GC just because `go_gc_duration_seconds` is tiny.
- **Scheduler latency** (`/sched/latencies` in runtime/metrics, if exported) rises when GOMAXPROCS is throttled by the cgroup CPU limit — a container with a 1-core limit but GOMAXPROCS=many will oversubscribe and add scheduling delay. Set GOMAXPROCS to the limit (or use automaxprocs).
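
To make the GOGC arithmetic in the mental model concrete, here is a minimal sketch of the heap size at which the next cycle would start. It is a simplification: it ignores GC roots (goroutine stacks and globals), which Go 1.18+ also factors into the target, and it treats `GOMEMLIMIT` as a plain cap. The function name and inputs are illustrative, not a runtime API.

```python
def next_gc_target_bytes(live_heap_bytes, gogc=100, memory_limit_bytes=None):
    """Approximate heap size at which the next GC cycle starts.

    Simplified model: target = live heap * (1 + GOGC / 100), where the
    live heap is what survived the last mark. GC roots are ignored.
    """
    if gogc < 0:
        raise ValueError("gogc must be >= 0 (GOGC=off is not modelled)")
    target = live_heap_bytes * (1 + gogc / 100)
    # Assumption: a memory limit simply caps the target; the real pacer
    # is more nuanced than this.
    if memory_limit_bytes is not None:
        target = min(target, memory_limit_bytes)
    return target
```

With a 200 MiB live heap, the default `GOGC=100` gives a target of about 400 MiB, while `GOGC=50` lowers it to about 300 MiB at the cost of more frequent collections.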

## Investigation Playbook
@@ -59,13 +59,13 @@ You are a Go runtime analyst. Go's runtime exposes itself well: the `go_*` Prome

1. `prom.query_range` on `go_goroutines` over hours. **Monotonic rise that never recovers across GC cycles = leak.** Correlate with `go_threads` and heap — a goroutine leak usually drags memory up with it. This alone often closes the case; the fix is a missing `context` cancellation / unbounded channel send.
2. GC pressure: `prom.query_range` on `rate(go_memstats_alloc_bytes_total[5m])` (bytes allocated/sec) and the GC pause quantiles. High alloc rate + rising GC CPU + latency on hot paths ⇒ allocation churn. Estimate GC CPU fraction; if it's a large share of the limit, the app is paying for garbage.
-3. CPU-bound: container CPU pinned at the limit with `rate(container_cpu_throttled_seconds_total[5m])` > 0.25 ⇒ throttled; verify GOMAXPROCS vs. the CPU limit (oversubscription adds scheduler latency and wasted context switches).
+3. CPU-bound: container CPU pinned at the limit with a throttle **ratio** `rate(container_cpu_cfs_throttled_periods_total[5m]) / rate(container_cpu_cfs_periods_total[5m])` > 0.25 (i.e. >25% of CFS periods throttled) ⇒ throttled; verify GOMAXPROCS vs. the CPU limit (oversubscription adds scheduler latency and wasted context switches).
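
The throttle ratio in item 3 is a quotient of two counter increases over the same window, which is why it is dimensionless and comparable to 0.25. A minimal sketch, assuming two scrapes of each cAdvisor counter; the function name and the sample values are illustrative:

```python
def throttle_ratio(throttled_start, throttled_end, periods_start, periods_end):
    """Fraction of CFS periods in the window during which the container
    was throttled.

    Inputs are raw counter values of container_cpu_cfs_throttled_periods_total
    and container_cpu_cfs_periods_total at the start and end of the window.
    Returns None when no periods elapsed or a counter reset is detected.
    """
    periods = periods_end - periods_start
    throttled = throttled_end - throttled_start
    if periods <= 0 or throttled < 0:
        return None
    return throttled / periods


# 300 of 1000 periods throttled is above the 0.25 rule of thumb.
ratio = throttle_ratio(100, 400, 1000, 2000)
is_throttled = ratio is not None and ratio > 0.25
```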

### Step 3 — Name the hot path (pprof)

When metrics point at CPU or allocation but not a function, capture a profile. **`perf.go_pprof_cpu` is RiskHigh — operator must approve, and the binary must serve `net/http/pprof` (the `/debug/pprof/` endpoint), port-forwarded.**

-1. `perf.go_pprof_cpu url=<pprof-host:port> seconds=15 top_n=20` (it hits `/debug/pprof/profile?seconds=N`).
+1. `perf.go_pprof_cpu name=<configured-pprof-endpoint> duration_seconds=15 top_n=20` (the endpoint is a named entry in config — omit `name` if exactly one is configured; the tool hits `/debug/pprof/profile?seconds=N` for you).
2. Read top functions by flat and cumulative samples:
- `runtime.mallocgc` / `runtime.gcBgMarkWorker` / `runtime.scanobject` high ⇒ confirms the allocation/GC story; the *caller* allocating is the target.
- `runtime.gcAssistAlloc` high ⇒ mutators are mark-assisting — allocation is outrunning the collector; reduce allocs or raise GOGC/GOMEMLIMIT.
2 changes: 1 addition & 1 deletion internal/core/skills/builtin/native-perf.md
@@ -55,7 +55,7 @@ A CPU-bound native hotspot is rarely "the function is just slow." It's usually o

**`perf.linux_perf_record` is RiskHigh — operator must approve; it runs `perf record -g` on a PID for a duration, then renders the call graph via `perf report --stdio`. The node needs `perf` and adequate `perf_event_paranoid`.**

-1. `perf.linux_perf_record pid=<target-pid> duration=15` during the symptom.
+1. `perf.linux_perf_record pid=<target-pid> duration_seconds=15` during the symptom (optional `frequency_hz`, default 99).
2. Read the call-graph report:
- The function with the highest self (flat) percentage is the hotspot; its callers give the context.
- Time in `memcpy`/`memmove`/allocator (`malloc`/`free`/`tcmalloc`) ⇒ allocation or copy churn — reduce copies, reuse buffers.
12 changes: 6 additions & 6 deletions internal/core/skills/builtin/node-runtime.md
@@ -2,9 +2,9 @@
name: node-runtime
description: Diagnose Node.js / V8 performance — event-loop lag, garbage-collection pauses (scavenge vs. mark-sweep-compact), TurboFan deoptimisation, and CPU-bound handlers — using Prometheus runtime metrics and on-demand V8 Inspector CPU profiles. Read-only.
triggers:
- node
- nodejs
- node.js
- node runtime
- event loop
- event loop lag
- libuv
@@ -54,21 +54,21 @@ You are a Node.js / V8 runtime analyst. Node is single-threaded for JavaScript:
- `nodejs_eventloop_lag_seconds` / `nodejs_eventloop_lag_p99_seconds` — the headline signal.
- `nodejs_gc_duration_seconds` (labelled by `kind`: `scavenge` / `markSweepCompact` / `incremental`).
- `nodejs_heap_size_used_bytes`, `nodejs_heap_size_total_bytes`, `nodejs_external_memory_bytes`.
-- `nodejs_active_handles_total`, `nodejs_active_requests_total` (a monotonic climb = handle/request leak).
+- `nodejs_active_handles` / `nodejs_active_handles_total`, `nodejs_active_requests` / `nodejs_active_requests_total` — these are **gauges** (live count now, despite the `_total` suffix), so judge a *non-recovering* rise net of traffic, not raw growth.

### Step 2 — Event loop vs. GC vs. CPU

1. `prom.query_range` on `nodejs_eventloop_lag_p99_seconds`. Sustained lag **> 100 ms** with container CPU **below** its limit ⇒ the loop is being blocked synchronously (not a capacity problem). This is the most common "fast but spiky" Node incident.
2. `prom.query_range` on `rate(nodejs_gc_duration_seconds_sum{kind="markSweepCompact"}[5m])`. If mark-sweep time is a meaningful fraction of wall-clock, the latency IS the GC — correlate the pause timestamps with the lag spikes.
-3. Heap trend: `nodejs_heap_size_used_bytes` climbing toward `--max-old-space-size` (default ~1.5 GB on 64-bit unless set explicitly; Node does not reliably read the container memory limit on its own, so an unset flag in a large-limit container can still cap the old space at the V8 default) ⇒ leak or unbounded cache; expect lengthening mark-sweep pauses then OOM. Hand off to `oom-killed-triage` if it's already being killed.
-4. `nodejs_active_handles_total` / `nodejs_active_requests_total` rising without bound ⇒ unclosed sockets/timers — a leak that also degrades the loop.
+3. Heap trend: `nodejs_heap_size_used_bytes` climbing toward the old-space cap ⇒ leak or unbounded cache; expect lengthening mark-sweep pauses then OOM. The cap is whatever `--max-old-space-size` is set to; if unset, modern Node (≥ 12) derives a default from available/cgroup-visible memory (often ~2 GB+ on a sizeable container, not a fixed 1.4–1.5 GB), so confirm the actual `--max-old-space-size` flag and the container memory limit rather than assuming a number. Hand off to `oom-killed-triage` if it's already being killed.
+4. The `nodejs_active_handles*` / `nodejs_active_requests*` gauges climbing and *not receding* as traffic falls ⇒ unclosed sockets/timers — a leak that also degrades the loop.
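
The "meaningful fraction of wall-clock" test in item 2 is a rate over a counter of seconds: GC seconds accumulated per elapsed second. A minimal sketch, assuming two scrapes of `nodejs_gc_duration_seconds_sum{kind="markSweepCompact"}` and their timestamps. The function name and sample numbers are illustrative, and what counts as "meaningful" is left to the analyst.

```python
def gc_wall_clock_fraction(sum_start, sum_end, t_start, t_end):
    """Share of wall-clock time spent in GC between two scrapes.

    sum_start / sum_end: cumulative GC seconds at each scrape.
    t_start / t_end:     scrape timestamps in seconds.
    Returns None on a counter reset (for example a process restart).
    """
    elapsed = t_end - t_start
    if elapsed <= 0:
        raise ValueError("t_end must be later than t_start")
    delta = sum_end - sum_start
    if delta < 0:
        return None
    return delta / elapsed


# 15 s of mark-sweep over a 300 s window is 5% of wall-clock time.
fraction = gc_wall_clock_fraction(12.0, 27.0, 0.0, 300.0)
```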

### Step 3 — Name the hot function (V8 CPU profile)

When metrics localise the problem to a process but not a call site, profile it. **`perf.v8_inspector_*` is RiskHigh — the operator must approve, and the process must expose the inspector (`node --inspect=0.0.0.0:9229`, port-forwarded).**

-1. `perf.v8_inspector_targets url=<inspector-host:port>` to enumerate debug targets and get the WebSocket debugger URL.
-2. `perf.v8_inspector_cpu_profile url=<ws-url> top_n=20` to capture a CPU profile. Read the top functions by `hitCount`:
+1. `perf.v8_inspector_targets name=<configured-inspector-endpoint>` (the endpoint is a named entry in `node_inspectors` config — omit `name` if exactly one is configured) to enumerate debug targets.
+2. `perf.v8_inspector_cpu_profile name=<endpoint> target_index=<n> duration_seconds=15 top_n=20` to capture a CPU profile (`target_index` picks the target from the enumerated list, default 0). Read the top functions by `hitCount`:
- A user function dominating ⇒ a CPU-bound handler on the loop; that's the block.
- `(garbage collector)` high ⇒ confirms the GC story from Step 2.
- A function you'd expect to be fast sitting hot ⇒ suspect a TurboFan deopt; recommend stabilising its argument shapes / avoiding polymorphism on the hot path.
6 changes: 4 additions & 2 deletions internal/core/skills/builtin/slo-burn.md
@@ -62,10 +62,12 @@ For an SLO with target `T` (e.g. 99.9%) over a window `W` (e.g. 30 days):

1. `prom.query_range` the error ratio over the full SLO window to integrate consumed budget: `consumed = Σ(bad) / Σ(total)` against the allowed `1 - T`.
2. `budget_remaining_pct = (1 - consumed/(1-T)) * 100`.
-3. `time_to_exhaustion = remaining_budget / current_burn_rate`, expressed in hours/days at the *current* fast-window rate. If burn rate < 1, the budget is not draining within the window — say "not on track to breach".
+3. `time_to_exhaustion = W × remaining_budget_fraction / current_burn_rate` — the window `W` carries the time dimension, so it MUST be in the formula (burn rate is dimensionless). E.g. remaining 50% of a 30d budget at burn 4× → 30d × 0.5 / 4 = 3.75 days. If burn rate < 1, the budget is not draining within the window — say "not on track to breach".
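
The three budget formulas above chain together, and a small worked sketch makes the units explicit. It assumes the bad and total event counts over the SLO window and the current burn rate are already known. The function name and return shape are hypothetical, chosen only for illustration.

```python
def budget_status(bad, total, target, window_days, burn_rate):
    """Return (budget_remaining_pct, time_to_exhaustion_days).

    target:      SLO target as a fraction, e.g. 0.999 for 99.9%.
    window_days: SLO window W in days; it supplies the time dimension.
    burn_rate:   current dimensionless burn rate (1.0 means exactly on budget).
    time_to_exhaustion_days is None when burn_rate < 1 (not on track to
    breach) and 0.0 when the budget is already spent.
    """
    allowed = 1.0 - target                    # allowed error ratio, 1 - T
    consumed = bad / total                    # observed error ratio over W
    remaining_fraction = 1.0 - consumed / allowed
    remaining_pct = remaining_fraction * 100.0
    if remaining_fraction <= 0:
        return remaining_pct, 0.0
    if burn_rate < 1:
        return remaining_pct, None
    return remaining_pct, window_days * remaining_fraction / burn_rate


# 500 bad out of 1,000,000 against 99.9% over 30 days, burning at 4x:
# half the budget is left, which lasts 30 * 0.5 / 4 = 3.75 days.
pct, days = budget_status(500, 1_000_000, 0.999, 30, 4.0)
```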

### Step 4 — Verdict (fixed output shape)

+Emit this only once `T`, `W`, and the SLI type are known — from the operator or a recording rule. If they're still unknown after Step 1, do NOT fill the `<T>`/`<W>` slots with a guess; stop and ask instead (see the constraint below).

```
SLO: <service> <SLI type> target <T>% over <W>
Burn rate: 5m=<x>× 1h=<y>× | 30m=<a>× 6h=<b>×
@@ -81,4 +83,4 @@ Watch: <one prom.query_range the on-call should keep open>
- **Burn rate without a window pair is meaningless.** Never declare "page now" from a single short window — that's how you train alert fatigue. Require both halves of a pair.
- **Don't invent the target.** If `T` and `W` aren't given and no recording rule encodes them, ask — a 99.9%/30d budget and a 99.99%/7d budget yield opposite verdicts on the same error ratio.
- **Reuse the team's recording rules** when `alert.list_rules` exposes them; your ad-hoc PromQL must not contradict the alerts that actually page.
-- Read-only: you report the burn, you do not silence alerts or edit rules. Pivot to `incident-context` when a fast-burn coincides with firing alerts to find the proximate cause.
+- **Never silence or edit (read-only).** Do not recommend `amtool silence add`, acking/silencing the burn alert, or editing the SLO recording/alert rules to make the page stop — that hides budget loss instead of addressing it. You report the burn; a human decides what to mute. Pivot to `incident-context` when a fast-burn coincides with firing alerts to find the proximate cause.