102 changes: 102 additions & 0 deletions canary-checker/docs/concepts/distributed-canaries.md
@@ -0,0 +1,102 @@
---
title: Distributed Canaries
sidebar_custom_props:
icon: network
sidebar_position: 6
---

**Member Author:**

Add that this will only work for Mission Control.

Distributed canaries allow you to define a check once and have it automatically run on multiple agents. This is useful for monitoring services from different locations, clusters, or network segments.

## How It Works

When you specify an `agentSelector` on a canary:

1. The canary does **not** run locally on the server
2. A copy of the canary is created for each matched agent
3. Each agent runs the check independently and reports results back
4. The copies are kept in sync with the parent canary

A background job syncs agent selector canaries every 5 minutes. When agents are added or removed, the derived canaries are automatically created or cleaned up.
Comment on lines +12 to +19
**coderabbitai bot** (Jan 29, 2026):

⚠️ Potential issue | 🟡 Minor

Rewrite passive voice in the “How It Works” steps.

Lines 14–17 use passive voice. Please switch to active voice to match the documentation style guide.

Suggested rewrite:

```diff
-1. The canary does **not** run locally on the server
-2. A copy of the canary is created for each matched agent
-3. Each agent runs the check independently and reports results back
-4. The copies are kept in sync with the parent canary
+1. The server does **not** run the canary locally.
+2. The system creates a copy of the canary for each matched agent.
+3. Each agent runs the check independently and reports results back.
+4. The system keeps the copies in sync with the parent canary.
```

As per coding guidelines, use active voice; never use passive voice.


**moshloop** (Member):

Use Mission Control, not server and system

**coderabbitai bot:**

@moshloop, thank you for the clarification! I'll update the suggestion to use "Mission Control" consistently.

Revised suggestion:

When you specify an `agentSelector` on a canary:

1. Mission Control does **not** run the canary locally.
2. Mission Control creates a copy of the canary for each matched agent.
3. Each agent runs the check independently and reports results back.
4. Mission Control keeps the copies in sync with the parent canary.

A background job syncs agent selector canaries every 5 minutes. When agents are added or removed, the derived canaries are automatically created or cleaned up.

✏️ Learnings added
Learnt from: moshloop
Repo: flanksource/docs PR: 469
File: canary-checker/docs/concepts/distributed-canaries.md:12-19
Timestamp: 2026-01-29T13:01:15.936Z
Learning: In flanksource/docs documentation, use "Mission Control" to refer to the central server/system component, not generic terms like "server" or "system".



🧠 Learnings used
Learnt from: CR
Repo: flanksource/docs PR: 0
File: .cursor/rules/global.mdc:0-0
Timestamp: 2025-12-05T13:02:08.910Z
Learning: Applies to **/*.{md,markdown,txt} : Use active voice; never use passive voice


## Agent Selector Patterns

The `agentSelector` field accepts a list of patterns to match agent names:

| Pattern | Description |
| ------------------- | ------------------------------------ |
| `agent-1` | Exact match |
| `eu-west-*` | Prefix match (glob) |
| `*-prod` | Suffix match (glob) |
| `!staging` | Exclude agents matching this pattern |
| `team-*`, `!team-b` | Match all `team-*` except `team-b` |
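
For example, combining the last two rows of the table, a selector that targets every team agent except `team-b` looks like this:

```yaml
agentSelector:
  - 'team-*' # all team agents...
  - '!team-b' # ...except team-b
```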

## Example: HTTP Check on All Agents

This example creates an HTTP check against a Kubernetes service; the check runs on every agent that matches the pattern:

```yaml title="distributed-http-check.yaml"
apiVersion: canaries.flanksource.com/v1
kind: Canary
metadata:
  name: api-health
  namespace: monitoring
spec:
  schedule: '@every 1m'
  http:
    - name: api-endpoint
      url: http://api-service.default.svc.cluster.local:8080/health
      responseCodes: [200]
      test:
        expr: json.status == 'healthy'
  agentSelector:
    - '*' # Run on all agents
```

When this canary is created:

1. The check is executed locally only when `local` agent is provided in selector
2. A derived canary is created for each registered agent
3. Each agent executes the HTTP check against `api-service.default.svc.cluster.local:8080/health` in its own cluster
4. Results from all agents are aggregated and visible in the UI
Comment on lines +55 to +60
**coderabbitai bot:**

⚠️ Potential issue | 🟡 Minor

Use active voice in the execution steps.

Line 57 uses passive voice. Please switch to active voice.

Suggested rewrite:

```diff
-1. The check is executed locally only when `local` agent is provided in selector
+1. The system executes the check locally only when you include the `local` agent in the selector.
```

As per coding guidelines, use active voice; never use passive voice.



## Example: Regional Monitoring

Monitor an external API from specific regions:

```yaml title="regional-monitoring.yaml"
apiVersion: canaries.flanksource.com/v1
kind: Canary
metadata:
  name: external-api-latency
spec:
  schedule: '@every 5m'
  http:
    - name: payment-gateway
      url: https://api.payment-provider.com/health
      responseCodes: [200]
      maxResponseTime: 500
  agentSelector:
    - 'eu-*' # All EU agents
    - 'us-*' # All US agents
    - '!us-test' # Exclude test agent
    - 'local' # Run on local instance as well
```

## Example: Exclude Specific Agents

Run checks on all agents except those in a specific environment:

```yaml title="production-only.yaml"
apiVersion: canaries.flanksource.com/v1
kind: Canary
metadata:
  name: production-checks
spec:
  schedule: '@every 2m'
  http:
    - name: internal-service
      url: http://internal.example.com/status
  agentSelector:
    - '!*-dev' # Exclude all dev agents
    - '!*-staging' # Exclude all staging agents
```
293 changes: 293 additions & 0 deletions mission-control/blog/distributed-canaries/index.mdx
@@ -0,0 +1,293 @@
---
title: "Monitoring From Every Angle: A Guide to Distributed Canaries"
description: Learn how to run the same health check across multiple clusters and regions with a single canary definition
slug: distributed-canaries-tutorial
authors: [yash]
tags: [canary-checker, distributed, multi-cluster, agents]
hide_table_of_contents: false
---

# Monitoring From Every Angle: A Guide to Distributed Canaries

If you've ever managed services across multiple Kubernetes clusters, you know the pain. You write the same health check for cluster A, copy-paste it for cluster B, tweak it for cluster C, and before you know it, you're maintaining a dozen nearly identical YAML files. When something changes, you're updating them all. It's tedious, error-prone, and frankly, a waste of time.

What if you could define a check once and have it automatically run everywhere you need it?

That's exactly what distributed canaries do.

<!--truncate -->

## The Problem With Multi-Cluster Monitoring

Let's say you're running an API service that's deployed across three clusters: one in `eu-west`, one in `us-east`, and one in `ap-south`. You want to monitor the `/health` endpoint from each cluster to ensure the service is responding correctly in all regions.

The naive approach looks something like this:

```yaml title="eu-west-cluster/api-health.yaml"
apiVersion: canaries.flanksource.com/v1
kind: Canary
metadata:
  name: api-health
spec:
  schedule: "@every 5m"
  http:
    - name: api-endpoint
      url: http://api-service.default.svc:8080/health
      responseCodes: [200]
```

Now multiply that by three clusters. And then by every service you want to monitor. You see where this is going.

## Enter Agent Selector

Canary Checker has a feature called `agentSelector` that solves this problem elegantly. Instead of deploying canaries to each cluster individually, you deploy agents to your clusters and define your canaries centrally with an `agentSelector` that specifies where they should run.

Here's the same check, but now it runs on all your agents:

```yaml title="api-health.yaml"
apiVersion: canaries.flanksource.com/v1
kind: Canary
metadata:
  name: api-health
spec:
  schedule: "@every 5m"
  http:
    - name: api-endpoint
      url: http://api-service.default.svc:8080/health
      responseCodes: [200]
  agentSelector:
    - "*" # Run on all agents
```

That's it. One file, all clusters.

## How It Actually Works

When you create a canary with an `agentSelector`, something interesting happens: the canary doesn't run on the central server at all. Instead, the system:

1. Looks at all registered agents
2. Matches agent names against your selector patterns
3. Creates a copy of the canary for each matched agent
4. Each agent runs the check independently and reports results back

The copies are kept in sync automatically. If you update the parent canary, all the derived canaries update too. If you add a new agent that matches the pattern, it gets the canary within a few minutes. If you remove an agent, its canary is cleaned up.

Comment on lines +64 to +74
**coderabbitai bot:**

⚠️ Potential issue | 🟡 Minor

Use active voice in the sync description.

Line 73 uses passive voice, which conflicts with the style guide. Please rewrite in active voice.

Suggested rewrite:

```diff
- The copies are kept in sync automatically. If you update the parent canary, all the derived canaries update too. If you add a new agent that matches the pattern, it gets the canary within a few minutes. If you remove an agent, its canary is cleaned up.
+ The system keeps the copies in sync automatically. If you update the parent canary, all the derived canaries update too. If you add a new agent that matches the pattern, the system assigns the canary within a few minutes. If you remove an agent, the system cleans up its canary.
```

As per coding guidelines, use active voice; never use passive voice.


## Tutorial: Setting Up Distributed Monitoring

Let's walk through a practical example. We'll set up monitoring for an internal service that needs to be checked from multiple clusters.

### Prerequisites

You'll need:
- A central Mission Control instance
- At least two Kubernetes clusters with agents installed

### Step 1: Register Your Agents

First, make sure your agents are registered with meaningful names. When you [install the agent helm chart](/docs/installation/saas/agent), you specify the agent name:

```bash
helm install mission-control-agent flanksource/mission-control-agent \
  --set clusterName=<Unique name for this agent> \
  --set upstream.agent=YOUR_LOCAL_NAME \
  --set upstream.username=token \
  --set upstream.password= \
  --set upstream.host= \
  -n mission-control --create-namespace \
  --wait
```

Do this for each cluster with descriptive names like `eu-west-prod`, `us-east-prod`, `ap-south-prod`.
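
For example, registering the EU production cluster could look like the following sketch. The name values are illustrative, and both name flags are set to the same value here for simplicity; `<AGENT_TOKEN>` and `<MISSION_CONTROL_HOST>` are placeholders for values from your Mission Control instance:

```bash
# Sketch: register this cluster's agent as eu-west-prod.
# <AGENT_TOKEN> and <MISSION_CONTROL_HOST> are placeholders, not real values.
helm install mission-control-agent flanksource/mission-control-agent \
  --set clusterName=eu-west-prod \
  --set upstream.agent=eu-west-prod \
  --set upstream.username=token \
  --set upstream.password=<AGENT_TOKEN> \
  --set upstream.host=<MISSION_CONTROL_HOST> \
  -n mission-control --create-namespace \
  --wait
```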

### Step 2: Create Your Distributed Canary

Now create a canary that targets all production agents:

```yaml title="distributed-service-check.yaml"
apiVersion: canaries.flanksource.com/v1
kind: Canary
metadata:
  name: payment-service-health
  namespace: monitoring
spec:
  schedule: "@every 30s"
  http:
    - name: payment-api
      url: http://payment-service.payments.svc.cluster.local:8080/health
      responseCodes: [200]
      maxResponseTime: 500
      test:
        expr: json.status == 'healthy' && json.database == 'connected'
  agentSelector:
    - "*-prod" # All agents ending with -prod
```

Apply this to your central Mission Control instance:

```bash
kubectl apply -f distributed-service-check.yaml
```

### Step 3: Verify It's Working

Within a few minutes, you should see derived canaries created for each agent. You can verify this in the Mission Control UI, or by checking the canaries list:

```bash
kubectl get canaries -A
```

You'll see the original canary plus one derived canary per matched agent.
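
The exact columns depend on the printer columns defined for the Canary CRD, and the naming scheme for derived canaries may differ; this output is only an illustration, assuming three matched `*-prod` agents:

```text
NAMESPACE    NAME                                   AGE
monitoring   payment-service-health                 10m
monitoring   payment-service-health-eu-west-prod    5m
monitoring   payment-service-health-us-east-prod    5m
monitoring   payment-service-health-ap-south-prod   5m
```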

## Pattern Matching Deep Dive

The `agentSelector` field is quite flexible. Here are some patterns you'll find useful:

### Select All Agents

```yaml
agentSelector:
  - "*"
```

### Select by Prefix (Regional)

```yaml
agentSelector:
  - "eu-*" # All European agents
  - "us-*" # All US agents
```

### Select by Suffix (Environment)

```yaml
agentSelector:
  - "*-prod" # All production agents
  - "*-staging" # All staging agents
```

### Exclude Specific Agents

```yaml
agentSelector:
  - "*-prod" # All production agents
  - "!us-east-prod" # Except US East (maybe it's being decommissioned)
```

### Exclusion-Only Patterns

You can also just exclude, which means "all agents except these":

```yaml
agentSelector:
  - "!*-dev" # All agents except dev
  - "!*-test" # And except test
```

## Real-World Use Cases

### Geographic Latency Monitoring

Monitor an external API from all your regions to compare latency:

```yaml
apiVersion: canaries.flanksource.com/v1
kind: Canary
metadata:
  name: stripe-api-latency
spec:
  schedule: "@every 5m"
  http:
    - name: stripe-health
      url: https://api.stripe.com/v1/health
      responseCodes: [200]
      maxResponseTime: 1000
  agentSelector:
    - "*"
```

Now you can see if Stripe is slower from one region than another.

### Internal Service Mesh Validation

Verify that internal services are reachable from all clusters:

```yaml
apiVersion: canaries.flanksource.com/v1
kind: Canary
metadata:
  name: mesh-connectivity
spec:
  schedule: "@every 1m"
  http:
    - name: auth-service
      url: http://auth.internal.example.com/health
    - name: user-service
      url: http://users.internal.example.com/health
    - name: orders-service
      url: http://orders.internal.example.com/health
  agentSelector:
    - "*-prod"
```

### Gradual Rollout Monitoring

When rolling out a new service version, monitor it from a subset of clusters first:

```yaml
agentSelector:
  - "us-east-prod" # Canary region first
```

Then expand:

```yaml
agentSelector:
  - "us-*-prod" # All US production
```

And finally:

```yaml
agentSelector:
  - "*-prod" # All production
```

## What Happens Under the Hood

The system runs a background sync job every 5 minutes that:

1. Finds all canaries with `agentSelector` set
2. For each canary, matches agent names against the patterns (see the sketch below)
3. Creates or updates derived canaries for matched agents
4. Deletes derived canaries for agents that no longer match

There's also an hourly cleanup job that removes orphaned derived canaries (when the parent canary is deleted).

This means:
- Changes propagate within 5 minutes
- You don't need to restart anything when adding agents
- The system is self-healing
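
The include/exclude matching can be pictured with a short sketch. This is not the actual canary-checker implementation, just an illustration of the selector behavior described in this post, using Go's standard `path.Match` for globbing:

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// matches reports whether an agent name satisfies an agentSelector.
// Sketch of the documented behavior: an exclude pattern ("!...") always
// wins; otherwise the agent must match at least one include pattern,
// unless the selector contains only exclusions ("all agents except these").
func matches(agent string, selector []string) bool {
	included, hasInclude := false, false
	for _, p := range selector {
		if strings.HasPrefix(p, "!") {
			if ok, _ := path.Match(strings.TrimPrefix(p, "!"), agent); ok {
				return false // excluded
			}
			continue
		}
		hasInclude = true
		if ok, _ := path.Match(p, agent); ok {
			included = true
		}
	}
	return included || !hasInclude
}

func main() {
	selector := []string{"*-prod", "!us-east-prod"}
	for _, agent := range []string{"eu-west-prod", "us-east-prod", "eu-west-dev"} {
		fmt.Printf("%-14s -> %v\n", agent, matches(agent, selector))
	}
	// eu-west-prod   -> true
	// us-east-prod   -> false (explicitly excluded)
	// eu-west-dev    -> false (no include pattern matches)
}
```

For the selector `["*-prod", "!us-east-prod"]`, the sketch keeps `eu-west-prod`, drops `us-east-prod` because the exclusion wins, and drops `eu-west-dev` because no include pattern matches.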

## Tips and Gotchas

**Agent names matter.** Pick a naming convention early and stick to it. Something like `{region}-{environment}` works well.

**The parent canary doesn't run locally.** If you set an `agentSelector`, the canary runs only on the matched agents, not on the server where you applied it, unless `local` is specified.
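
If you do want the check to run centrally as well, add `local` to the selector alongside your patterns:

```yaml
agentSelector:
  - "*-prod"
  - "local" # also run on the central instance
```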

**Results are aggregated.** In the UI, you'll see results from all agents. This gives you a single view of service health across all locations.

**Start specific, then broaden.** When testing a new canary, start with a specific agent name, verify it works, then expand to patterns.

## Conclusion

Distributed canaries turn a maintenance headache into a one-liner. Instead of managing N copies of the same check across N clusters, you define it once and let the system handle the distribution.

The pattern matching is powerful enough to handle complex scenarios (regional rollouts, environment separation, gradual expansion) while staying simple for common cases.

If you're running services across multiple clusters and haven't tried this yet, give it a shot. Your future self will thank you.

## References

- [Distributed Canaries Concept](/docs/guide/canary-checker/concepts/distributed-canaries)
- [Canary Spec Reference](/docs/guide/canary-checker/reference/canary-spec)
- [Agent Installation Guide](/docs/docs/installation/saas/agent)
Comment on lines +289 to +293
**coderabbitai bot:**

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain: the bot ran a series of repository searches (`fd`, `rg`, and directory listings under `mission-control/docs/installation/saas/` and related paths) to locate the agent installation docs and confirm which link format the rest of the documentation uses.


Remove the duplicate "docs" segment in the agent installation link.

Line 293 uses /docs/docs/installation/saas/agent, which contains a duplicated "docs" segment. The correct path is /docs/installation/saas/agent, which matches the format used elsewhere in this file (line 199) and in other documentation files.
