feat(tools): correlate metric/log/trace symptoms onto the timeline (v2) #108

Merged
rlaope merged 3 commits into master from feat/correlate-symptoms on May 28, 2026

rlaope (Owner) commented on May 28, 2026

Summary

  • correlate.workload v2: folds symptom signals onto the change timeline and aligns them with the change that most likely caused them.
    • metric: a PromQL breach (metric_query + metric_threshold args) → metric_breach event at the first sample over threshold (Prometheus range query).
    • log: Loki error-line bursts → log_error event at the first error line (with in-window count).
    • trace: Jaeger error/slow spans → trace_error/trace_slow at span start (adds a read-only JaegerClient.SearchErrorSpans; Tempo deferred to v3).
    • candidate-cause v2: the most recent change strictly before the earliest symptom, falling back to the newest change when there are no symptoms.
  • Symptoms reuse change.ChangeEvent (symptom Kinds + Source) so everything merges on one newest-first timeline. Read-only, RiskLow. Tool name unchanged.

Follow-up to the Phase 4 correlate tool; closes the v2 scope I deferred there.

What's new

  • internal/core/tools/correlate/{metric_source,log_source,trace_source,cause}.go (+ tests); tool.go/register.go extended for symptom sources + metric_query/metric_threshold args.
  • internal/core/tools/trace/jaeger.go — read-only SearchErrorSpans (+ JaegerSpan) exposing span start time + error status.
  • internal/wiring/tools.go — threads prom/log/trace clients into correlate.RegisterAll; correlate registers when any change OR symptom backend exists.

Test plan

  • go build ./...
  • go test ./... (metric breach/threshold; log error-burst onset; jaeger span→event; candidate-cause symptom-alignment incl. earliest-symptom + no-prior-change + fallback; Run renders symptom lines + partial-failure tolerance)
  • golangci-lint v2.12 run ./... → 0 issues; gofmt clean
  • ralph architect verification → APPROVE (6/6 stories); deslop + post-deslop regression green
  • Live QA against real Prometheus/Loki/Jaeger + cluster: not run (not available in dev env)

Deferred to v3 (noted): Tempo trace symptoms; ES log symptoms; consolidating the small isErrorLine/matchesWorkload helpers shared across docker-backed tools.

🤖 Generated with Claude Code

rlaope added 3 commits May 28, 2026 17:41
correlate.workload now folds symptom signals onto the change timeline and
aligns them with the change that most likely caused them:
- metric: a PromQL breach (metric_query + metric_threshold args) emits a
  metric_breach event at the first sample over threshold (Prometheus range).
- log: Loki error-line bursts emit a log_error event at the first error line.
- trace: Jaeger error/slow spans emit trace_error/trace_slow at span start
  (adds a read-only JaegerClient.SearchErrorSpans; Tempo deferred to v3).
- candidate-cause v2: the most recent change strictly before the earliest
  symptom, falling back to the newest change when there are no symptoms.

Symptoms reuse change.ChangeEvent (symptom Kinds + Source) so everything
merges on one newest-first timeline. Read-only, RiskLow.

Signed-off-by: rlaope <piyrw9754@gmail.com>
Seed peakValue from the first observed sample (havePeak flag) so a metric
whose values are all negative reports a real peak instead of a spurious 0,
and drop the stale comment describing a re-scan that never happened. Found in
architect review; Summary-string only, breach detection unchanged.

Signed-off-by: rlaope <piyrw9754@gmail.com>
Fix the log_source LogQL comment that promised an unimplemented line-filter
fallback, and clarify in the namespace arg description that the metric/log/
trace symptom sources are namespace-agnostic in v2. Comment/schema text only;
no behavior change. Found in code review.

Signed-off-by: rlaope <piyrw9754@gmail.com>
rlaope merged commit cf570b8 into master on May 28, 2026 (2 checks passed)
rlaope deleted the feat/correlate-symptoms branch on May 28, 2026 08:54