From 447024ba1babe52953cd982eb050349b6e9e1167 Mon Sep 17 00:00:00 2001
From: Yash Mehrotra
Date: Thu, 29 Jan 2026 17:05:16 +0530
Subject: [PATCH 1/2] chore: add distributed canaries docs + blog

---
 .../docs/concepts/distributed-canaries.md    | 102 ++++++
 .../blog/distributed-canaries/index.mdx      | 293 ++++++++++++++++++
 2 files changed, 395 insertions(+)
 create mode 100644 canary-checker/docs/concepts/distributed-canaries.md
 create mode 100644 mission-control/blog/distributed-canaries/index.mdx

diff --git a/canary-checker/docs/concepts/distributed-canaries.md b/canary-checker/docs/concepts/distributed-canaries.md
new file mode 100644
index 00000000..8ba27a8a
--- /dev/null
+++ b/canary-checker/docs/concepts/distributed-canaries.md
@@ -0,0 +1,102 @@
---
title: Distributed Canaries
sidebar_custom_props:
  icon: network
sidebar_position: 6
---

Distributed canaries allow you to define a check once and have it automatically run on multiple agents. This is useful for monitoring services from different locations, clusters, or network segments.

## How It Works

When you specify an `agentSelector` on a canary:

1. The canary does **not** run locally on the server (unless `local` is included in the selector)
2. A copy of the canary is created for each matched agent
3. Each agent runs the check independently and reports results back
4. The copies are kept in sync with the parent canary

A background job syncs agent selector canaries every 5 minutes. When agents are added or removed, the derived canaries are automatically created or cleaned up.

## Agent Selector Patterns

The `agentSelector` field accepts a list of patterns to match agent names:

| Pattern | Description |
|---------|-------------|
| `agent-1` | Exact match |
| `eu-west-*` | Prefix match (glob) |
| `*-prod` | Suffix match (glob) |
| `!staging` | Exclude agents matching this pattern |
| `team-*`, `!team-b` | Match all `team-*` except `team-b` |

## Example: HTTP Check on All Agents

This example creates an HTTP check for a Kubernetes service that runs on every agent matching the pattern:

```yaml title="distributed-http-check.yaml"
apiVersion: canaries.flanksource.com/v1
kind: Canary
metadata:
  name: api-health
  namespace: monitoring
spec:
  schedule: "@every 1m"
  http:
    - name: api-endpoint
      url: http://api-service.default.svc.cluster.local:8080/health
      responseCodes: [200]
      test:
        expr: json.status == 'healthy'
  agentSelector:
    - "*" # Run on all agents
```

When this canary is created:

1. The check is executed locally only when the `local` agent is included in the selector
2. A derived canary is created for each registered agent
3. Each agent executes the HTTP check against `api-service.default.svc.cluster.local:8080/health` in its own cluster
4. Results from all agents are aggregated and visible in the UI

## Example: Regional Monitoring

Monitor an external API from specific regions:

```yaml title="regional-monitoring.yaml"
apiVersion: canaries.flanksource.com/v1
kind: Canary
metadata:
  name: external-api-latency
spec:
  schedule: "@every 5m"
  http:
    - name: payment-gateway
      url: https://api.payment-provider.com/health
      responseCodes: [200]
      maxResponseTime: 500
  agentSelector:
    - "eu-*" # All EU agents
    - "us-*" # All US agents
    - "!us-test" # Exclude test agent
    - "local" # Run on local instance as well
```

## Example: Exclude Specific Agents

Run checks on all agents except those in specific environments:

```yaml title="production-only.yaml"
apiVersion: canaries.flanksource.com/v1
kind: Canary
metadata:
  name: production-checks
spec:
  schedule: "@every 2m"
  http:
    - name: internal-service
      url: http://internal.example.com/status
  agentSelector:
    - "!*-dev" # Exclude all dev agents
    - "!*-staging" # Exclude all staging agents
```

diff --git a/mission-control/blog/distributed-canaries/index.mdx b/mission-control/blog/distributed-canaries/index.mdx
new file mode 100644
index 00000000..ad90b1dd
--- /dev/null
+++ b/mission-control/blog/distributed-canaries/index.mdx
@@ -0,0 +1,293 @@
---
title: "Monitoring From Every Angle: A Guide to Distributed Canaries"
description: Learn how to run the same health check across multiple clusters and regions with a single canary definition
slug: distributed-canaries-tutorial
authors: [yash]
tags: [canary-checker, distributed, multi-cluster, agents]
hide_table_of_contents: false
---

# Monitoring From Every Angle: A Guide to Distributed Canaries

If you've ever managed services across multiple Kubernetes clusters, you know the pain. You write the same health check for cluster A, copy-paste it for cluster B, tweak it for cluster C, and before you know it, you're maintaining a dozen nearly identical YAML files. When something changes, you're updating them all. It's tedious, error-prone, and frankly, a waste of time.

What if you could define a check once and have it automatically run everywhere you need it?

That's exactly what distributed canaries do.

## The Problem With Multi-Cluster Monitoring

Let's say you're running an API service that's deployed across three clusters: one in `eu-west`, one in `us-east`, and one in `ap-south`. You want to monitor the `/health` endpoint from each cluster to ensure the service is responding correctly in all regions.

The naive approach looks something like this:

```yaml title="eu-west-cluster/api-health.yaml"
apiVersion: canaries.flanksource.com/v1
kind: Canary
metadata:
  name: api-health
spec:
  schedule: "@every 5m"
  http:
    - name: api-endpoint
      url: http://api-service.default.svc:8080/health
      responseCodes: [200]
```

Now multiply that by three clusters. And then by every service you want to monitor. You see where this is going.

## Enter Agent Selector

Canary Checker has a feature called `agentSelector` that solves this problem elegantly. Instead of deploying canaries to each cluster individually, you deploy agents to your clusters and define your canaries centrally with an `agentSelector` that specifies where they should run.

Here's the same check, but now it runs on all your agents:

```yaml title="api-health.yaml"
apiVersion: canaries.flanksource.com/v1
kind: Canary
metadata:
  name: api-health
spec:
  schedule: "@every 5m"
  http:
    - name: api-endpoint
      url: http://api-service.default.svc:8080/health
      responseCodes: [200]
  agentSelector:
    - "*" # Run on all agents
```

That's it. One file, all clusters.

## How It Actually Works

When you create a canary with an `agentSelector`, something interesting happens: the canary doesn't run on the central server at all (unless you include `local` in the selector). Instead, the system:

1. Looks at all registered agents
2. Matches agent names against your selector patterns
3. Creates a copy of the canary for each matched agent
4. Each agent runs the check independently and reports results back

The copies are kept in sync automatically. If you update the parent canary, all the derived canaries update too. If you add a new agent that matches the pattern, it gets the canary within a few minutes. If you remove an agent, its canary is cleaned up.

## Tutorial: Setting Up Distributed Monitoring

Let's walk through a practical example. We'll set up monitoring for an internal service that needs to be checked from multiple clusters.

### Prerequisites

You'll need:

- A central Mission Control instance
- At least two Kubernetes clusters with agents installed

### Step 1: Register Your Agents

First, make sure your agents are registered with meaningful names. When you [install the agent helm chart](/docs/installation/saas/agent), you specify the agent name:

```bash
helm install mission-control-agent flanksource/mission-control-agent \
  --set clusterName=YOUR_CLUSTER_NAME \
  --set upstream.agent=YOUR_LOCAL_NAME \
  --set upstream.username=token \
  --set upstream.password=YOUR_AGENT_TOKEN \
  --set upstream.host=YOUR_MISSION_CONTROL_URL \
  -n mission-control --create-namespace \
  --wait
```

Do this for each cluster with descriptive names like `eu-west-prod`, `us-east-prod`, `ap-south-prod`.

### Step 2: Create Your Distributed Canary

Now create a canary that targets all production agents:

```yaml title="distributed-service-check.yaml"
apiVersion: canaries.flanksource.com/v1
kind: Canary
metadata:
  name: payment-service-health
  namespace: monitoring
spec:
  schedule: "@every 30s"
  http:
    - name: payment-api
      url: http://payment-service.payments.svc.cluster.local:8080/health
      responseCodes: [200]
      maxResponseTime: 500
      test:
        expr: json.status == 'healthy' && json.database == 'connected'
  agentSelector:
    - "*-prod" # All agents ending with -prod
```

Apply this to your central Mission Control instance:

```bash
kubectl apply -f distributed-service-check.yaml
```

### Step 3: Verify It's Working

Within a few minutes, you should see derived canaries created for each agent. You can verify this in the Mission Control UI, or by checking the canaries list:

```bash
kubectl get canaries -A
```

You'll see the original canary plus one derived canary per matched agent.

## Pattern Matching Deep Dive

The `agentSelector` field is quite flexible. Here are some patterns you'll find useful:

### Select All Agents

```yaml
agentSelector:
  - "*"
```

### Select by Prefix (Regional)

```yaml
agentSelector:
  - "eu-*" # All European agents
  - "us-*" # All US agents
```

### Select by Suffix (Environment)

```yaml
agentSelector:
  - "*-prod" # All production agents
  - "*-staging" # All staging agents
```

### Exclude Specific Agents

```yaml
agentSelector:
  - "*-prod" # All production agents
  - "!us-east-prod" # Except US East (maybe it's being decommissioned)
```

### Exclusion-Only Patterns

You can also just exclude, which means "all agents except these":

```yaml
agentSelector:
  - "!*-dev" # All agents except dev
  - "!*-test" # And except test
```
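
To make the precedence rules concrete, here is a minimal, self-contained Go sketch of the matching semantics described above. It is illustrative, not the actual implementation: it assumes glob-style matching (as in Go's `path.Match`) and that an exclusion always beats an inclusion, which is what the `team-*`, `!team-b` example implies.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// matchesSelector reports whether an agent name is selected by a list of
// glob patterns. A leading "!" excludes matching agents, exclusions always
// win, and a selector containing only exclusions implicitly includes every
// agent that is not excluded.
func matchesSelector(agentName string, selector []string) bool {
	included, hasInclude := false, false
	for _, pattern := range selector {
		if strings.HasPrefix(pattern, "!") {
			if ok, _ := path.Match(strings.TrimPrefix(pattern, "!"), agentName); ok {
				return false // an exclusion match always wins
			}
			continue
		}
		hasInclude = true
		if ok, _ := path.Match(pattern, agentName); ok {
			included = true
		}
	}
	// Exclusion-only selectors (e.g. ["!*-dev"]) include everything else.
	return included || !hasInclude
}

func main() {
	selector := []string{"team-*", "!team-b"}
	for _, agent := range []string{"team-a", "team-b", "ops-1"} {
		fmt.Printf("%-6s -> %v\n", agent, matchesSelector(agent, selector))
	}
	// team-a -> true, team-b -> false, ops-1 -> false
}
```

The useful mental model: inclusions widen the set, exclusions veto, and a selector with no inclusions defaults to "everyone not excluded".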

## Real-World Use Cases

### Geographic Latency Monitoring

Monitor an external API from all your regions to compare latency:

```yaml
apiVersion: canaries.flanksource.com/v1
kind: Canary
metadata:
  name: stripe-api-latency
spec:
  schedule: "@every 5m"
  http:
    - name: stripe-health
      url: https://api.stripe.com/v1/health
      responseCodes: [200]
      maxResponseTime: 1000
  agentSelector:
    - "*"
```

Now you can see if Stripe is slower from one region than another.

### Internal Service Mesh Validation

Verify that internal services are reachable from all clusters:

```yaml
apiVersion: canaries.flanksource.com/v1
kind: Canary
metadata:
  name: mesh-connectivity
spec:
  schedule: "@every 1m"
  http:
    - name: auth-service
      url: http://auth.internal.example.com/health
    - name: user-service
      url: http://users.internal.example.com/health
    - name: orders-service
      url: http://orders.internal.example.com/health
  agentSelector:
    - "*-prod"
```

### Gradual Rollout Monitoring

When rolling out a new service version, monitor it from a subset of clusters first:

```yaml
agentSelector:
  - "us-east-prod" # Canary region first
```

Then expand:

```yaml
agentSelector:
  - "us-*-prod" # All US production
```

And finally:

```yaml
agentSelector:
  - "*-prod" # All production
```

## What Happens Under the Hood

The system runs a background sync job every 5 minutes (sketched below) that:

1. Finds all canaries with `agentSelector` set
2. For each canary, matches agent names against the patterns
3. Creates or updates derived canaries for matched agents
4. Deletes derived canaries for agents that no longer match

There's also an hourly cleanup job that removes orphaned derived canaries (when the parent canary is deleted).

This means:

- Changes propagate within 5 minutes
- You don't need to restart anything when adding agents
- The system is self-healing
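
None of this is code you have to write, but if it helps to see the reconciliation as a loop, here is a self-contained Go sketch of one sync pass. Everything in it is hypothetical: the types, the `derived` map standing in for the database, and the toy matcher. The real job lives inside the server and persists derived canaries rather than holding them in memory.

```go
package main

import (
	"fmt"
	"strings"
)

type agent struct{ name string }

type canary struct {
	name     string
	selector []string // agentSelector patterns
}

// derived stands in for the stored per-agent copies, keyed by "canary/agent".
var derived = map[string]canary{}

// reconcile is one pass of the periodic sync: upsert a copy of each selector
// canary for every matched agent, then drop copies whose agent no longer
// matches. The matcher is passed in; a real implementation would use the
// glob/exclusion semantics from the pattern-matching section.
func reconcile(canaries []canary, agents []agent, matches func(string, []string) bool) {
	live := map[string]bool{}
	for _, c := range canaries {
		if len(c.selector) == 0 {
			continue // plain canaries are left alone
		}
		for _, a := range agents {
			if matches(a.name, c.selector) {
				key := c.name + "/" + a.name
				derived[key] = c // upsert keeps the copy in sync with the parent
				live[key] = true
			}
		}
	}
	for key := range derived {
		if !live[key] {
			delete(derived, key) // agent removed, renamed, or no longer matched
		}
	}
}

func main() {
	// A toy prefix-only matcher, good enough for the demo.
	prefixMatch := func(name string, selector []string) bool {
		for _, p := range selector {
			if strings.HasPrefix(name, strings.TrimSuffix(p, "*")) {
				return true
			}
		}
		return false
	}

	checks := []canary{{name: "api-health", selector: []string{"eu-*", "us-*"}}}
	reconcile(checks, []agent{{"eu-west-prod"}, {"us-east-prod"}, {"ap-south-prod"}}, prefixMatch)
	for key := range derived {
		fmt.Println("derived:", key) // eu-west-prod and us-east-prod copies only
	}
}
```

Because each pass recomputes the full set of matches, the job is idempotent: rerunning it converges to the same state, which is why adding or removing agents never requires a restart.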

## Tips and Gotchas

**Agent names matter.** Pick a naming convention early and stick to it. Something like `{region}-{environment}` works well.

**The parent canary doesn't run locally.** If you have an `agentSelector`, the canary runs only on the matched agents, not on the server where you applied it, unless `local` is included in the selector.

**Results are aggregated.** In the UI, you'll see results from all agents. This gives you a single view of service health across all locations.

**Start specific, then broaden.** When testing a new canary, start with a specific agent name, verify it works, then expand to patterns.

## Conclusion

Distributed canaries turn a maintenance headache into a one-liner. Instead of managing N copies of the same check across N clusters, you define it once and let the system handle the distribution.

The pattern matching is powerful enough to handle complex scenarios (regional rollouts, environment separation, gradual expansion) while staying simple for common cases.

If you're running services across multiple clusters and haven't tried this yet, give it a shot. Your future self will thank you.

## References

- [Distributed Canaries Concept](/docs/guide/canary-checker/concepts/distributed-canaries)
- [Canary Spec Reference](/docs/guide/canary-checker/reference/canary-spec)
- [Agent Installation Guide](/docs/installation/saas/agent)

From 165ca77cddba9211abc4629e039b6a0edc6ec244 Mon Sep 17 00:00:00 2001
From: Yash Mehrotra
Date: Thu, 29 Jan 2026 18:29:36 +0530
Subject: [PATCH 2/2] chore: run task fmt

---
 .../docs/concepts/distributed-canaries.md | 34 +++++++++----------
 1 file changed, 17 insertions(+), 17 deletions(-)

diff --git a/canary-checker/docs/concepts/distributed-canaries.md b/canary-checker/docs/concepts/distributed-canaries.md
index 8ba27a8a..9ad7ced4 100644
--- a/canary-checker/docs/concepts/distributed-canaries.md
+++ b/canary-checker/docs/concepts/distributed-canaries.md
@@ -22,13 +22,13 @@ A background job syncs agent selector canaries every 5 minutes. When agents are
 
 The `agentSelector` field accepts a list of patterns to match agent names:
 
-| Pattern | Description |
-|---------|-------------|
-| `agent-1` | Exact match |
-| `eu-west-*` | Prefix match (glob) |
-| `*-prod` | Suffix match (glob) |
-| `!staging` | Exclude agents matching this pattern |
-| `team-*`, `!team-b` | Match all `team-*` except `team-b` |
+| Pattern             | Description                          |
+| ------------------- | ------------------------------------ |
+| `agent-1`           | Exact match                          |
+| `eu-west-*`         | Prefix match (glob)                  |
+| `*-prod`            | Suffix match (glob)                  |
+| `!staging`          | Exclude agents matching this pattern |
+| `team-*`, `!team-b` | Match all `team-*` except `team-b`   |
 
 ## Example: HTTP Check on All Agents
 
@@ -41,7 +41,7 @@ metadata:
   name: api-health
   namespace: monitoring
 spec:
-  schedule: "@every 1m"
+  schedule: '@every 1m'
   http:
     - name: api-endpoint
       url: http://api-service.default.svc.cluster.local:8080/health
@@ -49,7 +49,7 @@
   test:
     expr: json.status == 'healthy'
   agentSelector:
-    - "*" # Run on all agents
+    - '*' # Run on all agents
 ```
 
 When this canary is created:
 
@@ -69,17 +69,17 @@ kind: Canary
 metadata:
   name: external-api-latency
 spec:
-  schedule: "@every 5m"
+  schedule: '@every 5m'
   http:
     - name: payment-gateway
       url: https://api.payment-provider.com/health
       responseCodes: [200]
       maxResponseTime: 500
   agentSelector:
-    - "eu-*" # All EU agents
-    - "us-*" # All US agents
-    - "!us-test" # Exclude test agent
-    - "local" # Run on local instance as well
+    - 'eu-*' # All EU agents
+    - 'us-*' # All US agents
+    - '!us-test' # Exclude test agent
+    - 'local' # Run on local instance as well
 ```
 
 ## Example: Exclude Specific Agents
 
@@ -92,11 +92,11 @@ kind: Canary
 metadata:
   name: production-checks
 spec:
-  schedule: "@every 2m"
+  schedule: '@every 2m'
   http:
     - name: internal-service
       url: http://internal.example.com/status
   agentSelector:
-    - "!*-dev" # Exclude all dev agents
-    - "!*-staging" # Exclude all staging agents
+    - '!*-dev' # Exclude all dev agents
+    - '!*-staging' # Exclude all staging agents
 ```