Add health monitoring for MoQ CDN relay nodes #967

Merged
kixelated merged 4 commits into main from claude/cdn-node-health-checks-abLA2 on Feb 17, 2026

Conversation

@kixelated
Collaborator

Summary

This PR adds comprehensive health monitoring for the MoQ CDN relay nodes using Google Cloud Platform's monitoring and alerting services, along with a local health check script.

Key Changes

  • New Terraform configuration (cdn/health.tf): Sets up GCP uptime checks for each relay node (usc, euc, sea) that verify the /fetch/demo/bbb/catalog.json endpoint every 5 minutes. Includes an optional email notification channel and alert policy that triggers when any node fails its health check for 5+ minutes.
  • New health check script (cdn/health.sh): Bash script that manually tests all relay nodes and reports their status. Supports optional webhook notifications (Slack, Discord, etc.) on failure.
  • Updated justfile: Added health and health-notify recipes to easily run health checks locally.
  • Updated README: Documents how to generate the demo-sub.jwt token needed for health checks and subscribing to demo content.
  • Updated main.tf: Enabled the monitoring.googleapis.com GCP service required for uptime checks and alerting.

Implementation Details

  • Health checks use the same JWT token (demo-sub.jwt) for authentication as the website and other subscribers
  • GCP uptime checks run every 5 minutes with a 10-second timeout
  • Alerts only fire after 5 minutes of continuous failure to avoid false positives
  • Email notifications are optional (only created if health_email variable is provided)
  • The local health check script can be run manually or integrated with external monitoring systems via webhook
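A minimal sketch of the probing loop described above, not the contents of cdn/health.sh itself. The node names and endpoint path come from this PR; the domain (`cdn.example.com`), the `jwt` query parameter name, and every function name are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch only. HEALTH_DOMAIN and the "jwt" query parameter are assumptions;
# the node names and endpoint path are taken from the PR description.

HEALTH_DOMAIN="${HEALTH_DOMAIN:-cdn.example.com}"  # hypothetical domain
HEALTH_PATH="/fetch/demo/bbb/catalog.json"

# Build the probe URL for one relay node.
health_url() {
  local node="$1" jwt="$2"
  printf 'https://%s.%s%s?jwt=%s' "$node" "$HEALTH_DOMAIN" "$HEALTH_PATH" "$jwt"
}

# Print the HTTP status code for a URL, using the same 10-second timeout
# as the GCP uptime checks. curl prints "000" when the connection fails.
probe() {
  curl --silent --output /dev/null --max-time 10 \
    --write-out '%{http_code}' "$1" || true
}

# Probe every node, print one line per node, and return non-zero
# if any node did not answer with HTTP 200.
check_nodes() {
  local jwt="$1"
  shift
  local failed=0 node code
  for node in "$@"; do
    code="$(probe "$(health_url "$node" "$jwt")")"
    if [ "$code" = "200" ]; then
      echo "OK   $node"
    else
      echo "FAIL $node (HTTP $code)"
      failed=1
    fi
  done
  return "$failed"
}

# Example invocation (performs real network requests):
#   check_nodes "$(cat secrets/demo-sub.jwt)" usc euc sea
```

Keeping `probe` as a separate function makes the loop easy to exercise without network access, since a caller can redefine it with a stub.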

https://claude.ai/code/session_01RVJ1D3YsbHUyGvBFxMXCdV

- Add per-node GCP uptime checks that probe each relay (usc, euc, sea)
  via the /fetch/demo/bbb/catalog.json endpoint every 5 minutes
- Add alert policy with email notifications on sustained failures
- Add standalone health.sh script for manual/cron checks with optional
  webhook (Slack/Discord) notification on failure
- Add `health` and `health-notify` just commands
- Enable monitoring.googleapis.com API

https://claude.ai/code/session_01RVJ1D3YsbHUyGvBFxMXCdV
The demo GET token is generated alongside the other secrets (demo-pub.jwt,
cluster.jwt) using moq-token. Both the Terraform uptime checks and the
health.sh script now read from secrets/demo-get.jwt automatically.

https://claude.ai/code/session_01RVJ1D3YsbHUyGvBFxMXCdV
@coderabbitai
Contributor

coderabbitai bot commented Feb 17, 2026

No actionable comments were generated in the recent review. 🎉


Walkthrough

Adds CDN health monitoring:

  • README: expanded with a demo subscription creation step.
  • Bash health check: reads a demo JWT, probes multiple relay nodes for a demo catalog endpoint, reports per-node status, and can post failure summaries to an optional webhook.
  • Terraform: introduces a health_email input, creates uptime checks per relay, and conditionally provisions an email notification channel and alert policy.
  • justfile: new targets to run health checks.
  • GCP services: adds monitoring.googleapis.com to the default service list.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name | Status | Explanation
Title check | ✅ Passed | The title accurately summarizes the main purpose of the PR: adding health monitoring for MoQ CDN relay nodes, which is clearly the primary objective reflected throughout all changeset files.
Description check | ✅ Passed | The description is comprehensive and well-related to the changeset, detailing all major changes including Terraform health configuration, health check script, justfile updates, README documentation, and GCP service enablement.
Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.


Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (2)
cdn/health.tf (1)

21-22: JWT token exposed in URL query parameter.

The JWT is embedded in the uptime check URL, which means it will be visible in GCP Monitoring console, logs, and API responses to anyone with project access. Since this is a read-only token for demo content, the risk is limited, but worth being aware of — especially if the token's scope ever broadens.

GCP uptime checks support custom headers via http_check.headers, which would keep the token out of the URL. Worth considering if the JWT grants more than demo read access.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cdn/health.tf` around lines 21 - 22, The uptime check embeds local.health_jwt
in the URL (http_check.path), exposing the JWT in monitoring/logs; change the
check to remove the JWT from the path and instead set it in http_check.headers
(e.g., add an Authorization or X-Health-JWT header) so the token is sent as a
header; update the http_check block to reference local.health_jwt in
http_check.headers rather than in the path.
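
On the script side, a related low-cost mitigation is to redact the token before a probe URL is ever printed or forwarded to a webhook. The sketch below assumes the token travels in a query parameter named `jwt`; both that name and the `redact_url` helper are hypothetical and not taken from cdn/health.sh.

```shell
# Replace the value of the (assumed) jwt query parameter with a placeholder
# so that URLs can be logged without leaking the token.
redact_url() {
  printf '%s\n' "$1" | sed -E 's/([?&]jwt=)[^&#]*/\1REDACTED/'
}

# Example:
#   echo "probing $(redact_url "$url")"
```

This does not change what GCP Monitoring or Terraform state store; it only keeps the token out of local output.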
cdn/health.sh (1)

56-60: Webhook JSON payload is not escaped.

The $msg variable is interpolated directly into the JSON string. Currently safe since node names are simple alphanumeric strings, but if any node name or domain ever contained quotes or backslashes, the JSON would be malformed. A minor hardening would be to use jq for JSON construction, if available.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cdn/health.sh` around lines 56 - 60, The webhook JSON currently interpolates
$msg directly into the curl -d string, which can break JSON if $msg contains
quotes or backslashes; change the block that checks WEBHOOK_URL to build a safe
JSON payload instead of inline interpolation: detect or require jq and generate
the payload with jq -n --arg text "$msg" '{text:$text}' (send it to curl via
--data `@-`), or if jq is not available use a robust escaping fallback (e.g.,
escape backslashes and quotes in $msg before embedding) so the curl invocation
always receives valid JSON; update references to WEBHOOK_URL, msg and the curl
POST logic accordingly.
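
For the case where jq is not available, the escaping fallback mentioned above could look roughly like the following. This is a sketch under assumptions: the function names are hypothetical, the `text` key mirrors the jq example in the comment, and only backslashes, double quotes, and common whitespace control characters are handled.

```shell
# Escape a string for embedding inside a JSON string literal.
# Backslashes must be escaped first so later escapes are not doubled.
json_escape() {
  local s="$1"
  s=${s//\\/\\\\}
  s=${s//\"/\\\"}
  s=${s//$'\n'/\\n}
  s=${s//$'\r'/\\r}
  s=${s//$'\t'/\\t}
  printf '%s' "$s"
}

# Build a {"text": "..."} payload from an arbitrary message.
webhook_payload() {
  printf '{"text":"%s"}' "$(json_escape "$1")"
}

# Example (WEBHOOK_URL and msg as in the review comment):
#   webhook_payload "$msg" | curl --silent -X POST \
#     -H 'Content-Type: application/json' --data @- "$WEBHOOK_URL"
```

Where jq is installed, `jq -n --arg text "$msg" '{text:$text}'` remains the more robust choice because it covers every control character.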
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cdn/health.sh`:
- Line 8: Update the stale comment in cdn/health.sh that currently reads "Reads
the JWT from secrets/demo-get.jwt" to reference the renamed file
"secrets/demo-sub.jwt" (i.e., change the comment string to mention demo-sub.jwt)
so the comment matches the actual token filename used by the script.

In `@cdn/health.tf`:
- Around line 9-11: The locals block currently uses
file("${path.module}/secrets/demo-sub.jwt") to populate health_jwt which causes
Terraform to fail during plan; change it to conditionally read the file only if
it exists by using fileexists("${path.module}/secrets/demo-sub.jwt") as a guard
and otherwise defaulting to an empty string (keep the symbol health_jwt and the
locals block unchanged), so replace the direct file(...) call with a conditional
expression using fileexists(...) ? file(...) : "" to allow plan/apply to succeed
before the secrets directory is created.
- Around line 57-76: The alert condition in the condition_threshold block
currently uses COMPARISON_GT with threshold_value = 1 and reducer
REDUCE_COUNT_FALSE, which requires >1 failures to trigger; update
threshold_value to 0 so that any reduced count of 1 (a single uptime check
failure per alignment_period) will satisfy COMPARISON_GT and fire the
alert—modify the threshold_value in the condition_threshold block (the same
block that defines aggregations with alignment_period, per_series_aligner,
cross_series_reducer = "REDUCE_COUNT_FALSE", and trigger) from 1 to 0.


Comment on lines +9 to +11
locals {
health_jwt = trimspace(file("${path.module}/secrets/demo-sub.jwt"))
}
Contributor

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, check if cdn/health.tf exists and examine the context around line 10
head -20 cdn/health.tf

Repository: moq-dev/moq

Length of output: 601


🏁 Script executed:

# Check if the secrets file exists
ls -la cdn/secrets/ 2>/dev/null || echo "secrets directory does not exist"

Repository: moq-dev/moq

Length of output: 87


🏁 Script executed:

# Search for other file() calls in terraform files to understand the pattern
rg -n 'file\s*\(' --type=tf -B2 -A2

Repository: moq-dev/moq

Length of output: 1072


🏁 Script executed:

# Look for any bootstrap or setup documentation
find . -type f -name "*.md" -o -name "*.txt" -o -name "setup*" -o -name "bootstrap*" | head -20

Repository: moq-dev/moq

Length of output: 487


🏁 Script executed:

# Check cdn/README.md for deployment instructions
cat cdn/README.md

Repository: moq-dev/moq

Length of output: 2290


🏁 Script executed:

# Check if bootstrap.sh mentions generating the JWT
head -50 cdn/bootstrap.sh

Repository: moq-dev/moq

Length of output: 1135


🏁 Script executed:

# Check cdn/bootstrap.tf to see the full bootstrap context
cat cdn/bootstrap.tf

Repository: moq-dev/moq

Length of output: 653


The file() call on line 10 will block the documented bootstrap workflow.

The README's Setup section explicitly requires running tofu apply (Step 3) before creating the secrets/ directory with the JWT (Step 4). However, file() is evaluated during tofu plan (part of tofu apply), so the plan will fail before the user can generate the JWT, blocking the initial deployment.

The suggested fix using fileexists() guard is appropriate:

Option: guard with fileexists
 locals {
-  health_jwt = trimspace(file("${path.module}/secrets/demo-sub.jwt"))
+  health_jwt = fileexists("${path.module}/secrets/demo-sub.jwt") ? trimspace(file("${path.module}/secrets/demo-sub.jwt")) : ""
 }

This allows the initial tofu plan/apply to succeed (with an empty JWT) before secrets are generated, unblocking the bootstrap sequence.
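
The same existence check is useful on the script side, so that a missing token produces a clear message instead of an empty probe. The sketch below is an illustration only: `read_jwt` is a hypothetical helper, not code from cdn/health.sh, and it strips all whitespace, which matches `trimspace` for a well-formed JWT because tokens contain no internal whitespace.

```shell
# Print the JWT stored at the given path with whitespace removed.
# Returns non-zero with a message on stderr when the file is missing.
read_jwt() {
  local path="$1" jwt
  if [ ! -f "$path" ]; then
    echo "missing $path; generate the demo token first" >&2
    return 1
  fi
  jwt="$(tr -d '[:space:]' < "$path")"
  printf '%s' "$jwt"
}

# Example:
#   jwt="$(read_jwt secrets/demo-sub.jwt)" || exit 1
```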


Comment on lines +57 to +76
conditions {
display_name = "Uptime check failing"

condition_threshold {
filter = "resource.type = \"uptime_url\" AND metric.type = \"monitoring.googleapis.com/uptime_check/check_passed\""
duration = "300s" # must fail for 5 minutes before alerting
comparison = "COMPARISON_GT"
threshold_value = 1

aggregations {
alignment_period = "300s"
per_series_aligner = "ALIGN_NEXT_OLDER"
cross_series_reducer = "REDUCE_COUNT_FALSE"
group_by_fields = ["resource.label.host"]
}

trigger {
count = 1
}
}
Contributor

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

GCP monitoring REDUCE_COUNT_FALSE COMPARISON_GT threshold condition

💡 Result:

In Cloud Monitoring metric-threshold conditions, REDUCE_COUNT_FALSE is a cross-series reducer that converts Boolean time series into an INT64 count of “false” points per alignment period (across the grouped time series you’re reducing). (docs.cloud.google.com)

How REDUCE_COUNT_FALSE + COMPARISON_GT + threshold evaluates

  1. Align first (required): If you set a crossSeriesReducer (like REDUCE_COUNT_FALSE), you must also set:
  2. Reduce across series: For each alignment window, Monitoring computes how many “false” data points occurred across the time series in each group (per groupByFields). Output is INT64. (docs.cloud.google.com)
  3. Compare to threshold: COMPARISON_GT triggers when the reduced value is greater than thresholdValue (left side is the time series value, right side is the threshold). Cloud Monitoring metric-threshold conditions support only COMPARISON_GT and COMPARISON_LT. (cloud.google.com)
  4. Hold for duration: The comparison must remain true for duration to be considered failing. (cloud.google.com)

Example (API-style / Terraform-style shape)

This triggers if the count of false points in each 60s window is > 0 (i.e., at least one “false”) for 5 minutes:

{
  "conditionThreshold": {
    "filter": "metric.type=\"custom.googleapis.com/my/boolean_metric\" AND resource.type=\"gce_instance\"",
    "aggregations": [{
      "alignmentPeriod": "60s",
      "perSeriesAligner": "ALIGN_NEXT_OLDER",
      "crossSeriesReducer": "REDUCE_COUNT_FALSE",
      "groupByFields": ["resource.label.instance_id"]
    }],
    "comparison": "COMPARISON_GT",
    "thresholdValue": 0,
    "duration": "300s"
  }
}

If you share your metric type (and whether you want to count false across instances or per instance), I can suggest the right groupByFields and aligner for your exact intent.


Fix alert threshold to trigger on any uptime check failure.

The condition uses REDUCE_COUNT_FALSE with COMPARISON_GT and threshold_value = 1. According to GCP's metric-threshold evaluation, COMPARISON_GT triggers when the reduced count is greater than the threshold value. With threshold_value = 1, the alert fires only when count > 1 (requiring 2+ failures per alignment period).

Given that each host receives one uptime check per 300s and the alignment period is also 300s, you'll have at most one data point per host per window. A single failure will produce a count of 1, which is not > 1, so the alert will never trigger. Change the threshold to 0 to fire when any check fails:

diff
-      threshold_value = 1
+      threshold_value = 0
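
The comparison at stake can be checked with a few lines of shell arithmetic. This is only a model of the `COMPARISON_GT` rule as described in the comment, with a hypothetical function name. How many false points actually land in one alignment window depends on how many time series are grouped under each host; if a host is probed from several checker locations, the count can exceed 1.

```shell
# Model of COMPARISON_GT applied to a REDUCE_COUNT_FALSE result:
# succeeds (exit 0) only when the count of failed checks in the
# window is strictly greater than the threshold.
alert_fires() {
  local false_count="$1" threshold="$2"
  [ "$false_count" -gt "$threshold" ]
}

# Example: a single failure in the window.
#   alert_fires 1 1   # does not fire, since 1 > 1 is false
#   alert_fires 1 0   # fires, since 1 > 0 is true
```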

kixelated force-pushed the claude/cdn-node-health-checks-abLA2 branch from 53f4e69 to 211d2be on February 17, 2026 17:51
Contributor

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
cdn/health.tf (1)

21-26: JWT token exposed in uptime check URL and Terraform state.

The JWT is embedded as a query parameter in the path, which means it's stored in plaintext in both the Terraform state file and the GCP uptime check configuration (visible in the Cloud Console). If this is a long-lived token, consider:

  1. Rotating the token periodically.
  2. Ensuring the Terraform state backend is encrypted and access-controlled.
  3. Marking health_jwt as sensitive in the locals block to at least prevent it from appearing in CLI output (though it will still be in state).

If this is a low-privilege, demo-only subscription token, this may be acceptable — but worth documenting the trade-off.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cdn/health.tf` around lines 21 - 26, The uptime check currently embeds
local.health_jwt in http_check.path which stores the token in Terraform state
and GCP config; remove the JWT from the URL and instead fetch it from a secret
backend (e.g., google_secret_manager_secret_version or another secure data
source) and inject it into the uptime check as an HTTP header (e.g.,
Authorization: Bearer <token>) using the http_check headers/request_headers
setting on the uptime check resource (the http_check block in this file), and
mark the local health_jwt value as sensitive in your locals or avoid a local
entirely so it is not emitted in CLI output; also ensure your Terraform backend
is encrypted and access-controlled.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@cdn/health.tf`:
- Around line 60-64: The alert condition in the condition_threshold block uses
comparison = "COMPARISON_GT" with threshold_value = 1 which prevents firing on a
single uptime check failure; update the condition_threshold (the block
containing filter = "resource.type = \"uptime_url\" AND metric.type =
\"monitoring.googleapis.com/uptime_check/check_passed\"") to use threshold_value
= 0 so that COMPARISON_GT > 0 will trigger on a single failure.
- Around line 9-11: The local `health_jwt` uses
file("${path.module}/secrets/demo-sub.jwt") which causes a plan-time failure if
the file doesn't exist; guard that call with
fileexists("${path.module}/secrets/demo-sub.jwt") in the `locals` block so
`health_jwt` is set to a safe default (e.g., empty string or null) when the file
is missing, otherwise use trimspace(file(...)); update the `locals` entry named
`health_jwt` to perform this existence check using fileexists before calling
file.


… at plan time

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
kixelated enabled auto-merge (squash) on February 17, 2026 18:00
kixelated disabled auto-merge on February 17, 2026 18:00
kixelated enabled auto-merge (squash) on February 17, 2026 18:01
kixelated merged commit af048ae into main on Feb 17, 2026
1 check passed
kixelated deleted the claude/cdn-node-health-checks-abLA2 branch on February 17, 2026 18:08