Add health monitoring for MoQ CDN relay nodes by kixelated · Pull Request #967 · moq-dev/moq

kixelated · 2026-02-17T17:16:55Z

Summary

This PR adds comprehensive health monitoring for the MoQ CDN relay nodes using Google Cloud Platform's monitoring and alerting services, along with a local health check script.

Key Changes

New Terraform configuration (cdn/health.tf): Sets up GCP uptime checks for each relay node (usc, euc, sea) that verify the /fetch/demo/bbb/catalog.json endpoint every 5 minutes. Includes an optional email notification channel and alert policy that triggers when any node fails its health check for 5+ minutes.
New health check script (cdn/health.sh): Bash script that manually tests all relay nodes and reports their status. Supports optional webhook notifications (Slack, Discord, etc.) on failure.
Updated justfile: Added health and health-notify recipes to easily run health checks locally.
Updated README: Documents how to generate the demo-sub.jwt token needed for health checks and subscribing to demo content.
Updated main.tf: Enabled the monitoring.googleapis.com GCP service required for uptime checks and alerting.

Implementation Details

Health checks use the same JWT token (demo-sub.jwt) for authentication as the website and other subscribers
GCP uptime checks run every 5 minutes with a 10-second timeout
Alerts only fire after 5 minutes of continuous failure to avoid false positives
Email notifications are optional (only created if health_email variable is provided)
The local health check script can be run manually or integrated with external monitoring systems via webhook
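
The behaviour listed above can be approximated with a short shell sketch. This is not the contents of cdn/health.sh: the `cdn.example.com` domain, the `jwt` query parameter name, and the function names `health_url`, `check_node` and `check_all` are assumptions made for illustration. Only the node names, the endpoint path and the 10-second timeout come from the description.

```shell
# Hypothetical sketch; not the actual cdn/health.sh.
NODES="usc euc sea"            # node names from the PR description
DOMAIN="cdn.example.com"       # assumption: placeholder domain

# Build the URL probed for one node.
# The "jwt" query parameter name is an assumption.
health_url() {
  printf 'https://%s.%s/fetch/demo/bbb/catalog.json?jwt=%s' "$1" "$DOMAIN" "$2"
}

# Return 0 if the URL answers with an HTTP 2xx status within 10 seconds.
check_node() {
  local code
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1") || return 1
  case "$code" in
    2*) return 0 ;;
    *)  return 1 ;;
  esac
}

# Probe every node, print a per-node status line,
# and return non-zero if any node failed.
check_all() {
  local node failed=0
  for node in $NODES; do
    if check_node "$(health_url "$node" "$1")"; then
      echo "OK   $node"
    else
      echo "FAIL $node"
      failed=1
    fi
  done
  return "$failed"
}
```

A caller might run `check_all "$(tr -d '[:space:]' < secrets/demo-sub.jwt)"` and use the exit status to decide whether to send a notification.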

https://claude.ai/code/session_01RVJ1D3YsbHUyGvBFxMXCdV

- Add per-node GCP uptime checks that probe each relay (usc, euc, sea) via the /fetch/demo/bbb/catalog.json endpoint every 5 minutes
- Add alert policy with email notifications on sustained failures
- Add standalone health.sh script for manual/cron checks with optional webhook (Slack/Discord) notification on failure
- Add `health` and `health-notify` just commands
- Enable monitoring.googleapis.com API

https://claude.ai/code/session_01RVJ1D3YsbHUyGvBFxMXCdV

The demo GET token is generated alongside the other secrets (demo-pub.jwt, cluster.jwt) using moq-token. Both the Terraform uptime checks and the health.sh script now read from secrets/demo-get.jwt automatically. https://claude.ai/code/session_01RVJ1D3YsbHUyGvBFxMXCdV

coderabbitai · 2026-02-17T17:21:25Z

No actionable comments were generated in the recent review. 🎉

Walkthrough

Adds CDN health monitoring. The README gains a step for creating the demo subscription token. A Bash health-check script reads a demo JWT, probes each relay node's demo catalog endpoint, reports per-node status, and can post failure summaries to an optional webhook. The Terraform additions introduce a health_email input, create an uptime check per relay, and conditionally provision an email notification channel and alert policy. New justfile targets run the health checks, and monitoring.googleapis.com is added to the default GCP service list.

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately summarizes the main purpose of the PR: adding health monitoring for MoQ CDN relay nodes, which is clearly the primary objective reflected throughout all changeset files.
Description check	✅ Passed	The description is comprehensive and well-related to the changeset, detailing all major changes including Terraform health configuration, health check script, justfile updates, README documentation, and GCP service enablement.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.


coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (2)

cdn/health.tf (1)
21-22: JWT token exposed in URL query parameter.

The JWT is embedded in the uptime check URL, which means it will be visible in GCP Monitoring console, logs, and API responses to anyone with project access. Since this is a read-only token for demo content, the risk is limited, but worth being aware of — especially if the token's scope ever broadens.

GCP uptime checks support custom headers via http_check.headers, which would keep the token out of the URL. Worth considering if the JWT grants more than demo read access.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cdn/health.tf` around lines 21 - 22, The uptime check embeds local.health_jwt
in the URL (http_check.path), exposing the JWT in monitoring/logs; change the
check to remove the JWT from the path and instead set it in http_check.headers
(e.g., add an Authorization or X-Health-JWT header) so the token is sent as a
header; update the http_check block to reference local.health_jwt in
http_check.headers rather than in the path.
cdn/health.sh (1)
56-60: Webhook JSON payload is not escaped.

The $msg variable is interpolated directly into the JSON string. Currently safe since node names are simple alphanumeric strings, but if any node name or domain ever contained quotes or backslashes, the JSON would be malformed. A minor hardening would be to use jq for JSON construction, if available.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cdn/health.sh` around lines 56 - 60, The webhook JSON currently interpolates
$msg directly into the curl -d string, which can break JSON if $msg contains
quotes or backslashes; change the block that checks WEBHOOK_URL to build a safe
JSON payload instead of inline interpolation: detect or require jq and generate
the payload with jq -n --arg text "$msg" '{text:$text}' (send it to curl via
--data `@-`), or if jq is not available use a robust escaping fallback (e.g.,
escape backslashes and quotes in $msg before embedding) so the curl invocation
always receives valid JSON; update references to WEBHOOK_URL, msg and the curl
POST logic accordingly.
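
The hardening suggested for the webhook payload could look roughly like the sketch below. It is not taken from cdn/health.sh; the function names `json_escape` and `webhook_payload` are hypothetical, and the `{"text": ...}` shape follows the jq expression quoted in the comment above.

```shell
# Escape backslashes and double quotes so the text can sit inside a JSON string.
# Control characters such as newlines are NOT handled by this fallback.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Build a {"text":"..."} payload, preferring jq when it is installed
# and falling back to manual escaping otherwise.
webhook_payload() {
  if command -v jq >/dev/null 2>&1; then
    jq -cn --arg text "$1" '{text: $text}'
  else
    printf '{"text":"%s"}' "$(json_escape "$1")"
  fi
}
```

The payload could then be piped to curl, for example `webhook_payload "$msg" | curl -s -X POST -H 'Content-Type: application/json' --data @- "$WEBHOOK_URL"`, so the message never passes through inline string interpolation.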

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cdn/health.sh`:
- Line 8: Update the stale comment in cdn/health.sh that currently reads "Reads
the JWT from secrets/demo-get.jwt" to reference the renamed file
"secrets/demo-sub.jwt" (i.e., change the comment string to mention demo-sub.jwt)
so the comment matches the actual token filename used by the script.

In `@cdn/health.tf`:
- Around line 9-11: The locals block currently uses
file("${path.module}/secrets/demo-sub.jwt") to populate health_jwt which causes
Terraform to fail during plan; change it to conditionally read the file only if
it exists by using fileexists("${path.module}/secrets/demo-sub.jwt") as a guard
and otherwise defaulting to an empty string (keep the symbol health_jwt and the
locals block unchanged), so replace the direct file(...) call with a conditional
expression using fileexists(...) ? file(...) : "" to allow plan/apply to succeed
before the secrets directory is created.
- Around line 57-76: The alert condition in the condition_threshold block
currently uses COMPARISON_GT with threshold_value = 1 and reducer
REDUCE_COUNT_FALSE, which requires >1 failures to trigger; update
threshold_value to 0 so that any reduced count of 1 (a single uptime check
failure per alignment_period) will satisfy COMPARISON_GT and fire the
alert—modify the threshold_value in the condition_threshold block (the same
block that defines aggregations with alignment_period, per_series_aligner,
cross_series_reducer = "REDUCE_COUNT_FALSE", and trigger) from 1 to 0.


cdn/health.sh

coderabbitai · 2026-02-17T17:21:28Z

cdn/health.tf

+locals {
+  health_jwt = trimspace(file("${path.module}/secrets/demo-sub.jwt"))
+}


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, check if cdn/health.tf exists and examine the context around line 10
head -20 cdn/health.tf

Repository: moq-dev/moq

Length of output: 601

🏁 Script executed:

# Check if the secrets file exists
ls -la cdn/secrets/ 2>/dev/null || echo "secrets directory does not exist"

Repository: moq-dev/moq

Length of output: 87

🏁 Script executed:

# Search for other file() calls in terraform files to understand the pattern
rg -n 'file\s*\(' --type=tf -B2 -A2

Repository: moq-dev/moq

Length of output: 1072

🏁 Script executed:

# Look for any bootstrap or setup documentation
find . -type f -name "*.md" -o -name "*.txt" -o -name "setup*" -o -name "bootstrap*" | head -20

Repository: moq-dev/moq

Length of output: 487

🏁 Script executed:

# Check cdn/README.md for deployment instructions
cat cdn/README.md

Repository: moq-dev/moq

Length of output: 2290

🏁 Script executed:

# Check if bootstrap.sh mentions generating the JWT
head -50 cdn/bootstrap.sh

Repository: moq-dev/moq

Length of output: 1135

🏁 Script executed:

# Check cdn/bootstrap.tf to see the full bootstrap context
cat cdn/bootstrap.tf

Repository: moq-dev/moq

Length of output: 653

The file() call on line 10 will block the documented bootstrap workflow.

The README's Setup section explicitly requires running tofu apply (Step 3) before creating the secrets/ directory with the JWT (Step 4). However, file() is evaluated during tofu plan (part of tofu apply), so the plan will fail before the user can generate the JWT, blocking the initial deployment.

The suggested fix using fileexists() guard is appropriate:

Option: guard with fileexists

 locals {
-  health_jwt = trimspace(file("${path.module}/secrets/demo-sub.jwt"))
+  health_jwt = fileexists("${path.module}/secrets/demo-sub.jwt") ? trimspace(file("${path.module}/secrets/demo-sub.jwt")) : ""
 }

This allows the initial tofu plan/apply to succeed (with an empty JWT) before secrets are generated, unblocking the bootstrap sequence.
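
Another way to avoid the plan-time failure is to check for the token in the shell before invoking tofu, in line with generating secrets first. The sketch below is only an illustration under that assumption; `require_token` is a hypothetical helper, not something present in the repository.

```shell
# Return 0 only if the token file exists and is non-empty;
# otherwise print a message to stderr and return 1.
require_token() {
  if [ ! -s "$1" ]; then
    echo "missing or empty token file: $1" >&2
    return 1
  fi
}
```

It could be used as `require_token secrets/demo-sub.jwt && tofu plan`, which stops early with a readable message instead of a file() error from the plan.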

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

locals {
  health_jwt = trimspace(file("${path.module}/secrets/demo-sub.jwt"))
}

locals {
  health_jwt = fileexists("${path.module}/secrets/demo-sub.jwt") ? trimspace(file("${path.module}/secrets/demo-sub.jwt")) : ""
}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@cdn/health.tf` around lines 9 - 11, The locals block currently uses file("${path.module}/secrets/demo-sub.jwt") to populate health_jwt which causes Terraform to fail during plan; change it to conditionally read the file only if it exists by using fileexists("${path.module}/secrets/demo-sub.jwt") as a guard and otherwise defaulting to an empty string (keep the symbol health_jwt and the locals block unchanged), so replace the direct file(...) call with a conditional expression using fileexists(...) ? file(...) : "" to allow plan/apply to succeed before the secrets directory is created.

coderabbitai · 2026-02-17T17:21:28Z

cdn/health.tf

+  conditions {
+    display_name = "Uptime check failing"
+
+    condition_threshold {
+      filter          = "resource.type = \"uptime_url\" AND metric.type = \"monitoring.googleapis.com/uptime_check/check_passed\""
+      duration        = "300s" # must fail for 5 minutes before alerting
+      comparison      = "COMPARISON_GT"
+      threshold_value = 1
+
+      aggregations {
+        alignment_period     = "300s"
+        per_series_aligner   = "ALIGN_NEXT_OLDER"
+        cross_series_reducer = "REDUCE_COUNT_FALSE"
+        group_by_fields      = ["resource.label.host"]
+      }
+
+      trigger {
+        count = 1
+      }
+    }


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

GCP monitoring REDUCE_COUNT_FALSE COMPARISON_GT threshold condition

💡 Result:

In Cloud Monitoring metric-threshold conditions, REDUCE_COUNT_FALSE is a cross-series reducer that converts Boolean time series into an INT64 count of “false” points per alignment period (across the grouped time series you’re reducing). (docs.cloud.google.com)

How REDUCE_COUNT_FALSE + COMPARISON_GT + threshold evaluates

Align first (required): If you set a crossSeriesReducer (like REDUCE_COUNT_FALSE), you must also set:

alignmentPeriod

perSeriesAligner (and it must not be ALIGN_NONE) (docs.cloud.google.com)

Reduce across series: For each alignment window, Monitoring computes how many “false” data points occurred across the time series in each group (per groupByFields). Output is INT64. (docs.cloud.google.com)

Compare to threshold: COMPARISON_GT triggers when the reduced value is greater than thresholdValue (left side is the time series value, right side is the threshold). Cloud Monitoring metric-threshold conditions support only COMPARISON_GT and COMPARISON_LT. (cloud.google.com)

Hold for duration: The comparison must remain true for duration to be considered failing. (cloud.google.com)

Example (API-style / Terraform-style shape)

This triggers if the count of false points in each 60s window is > 0 (i.e., at least one “false”) for 5 minutes:

{
  "conditionThreshold": {
    "filter": "metric.type=\"custom.googleapis.com/my/boolean_metric\" AND resource.type=\"gce_instance\"",
    "aggregations": [{
      "alignmentPeriod": "60s",
      "perSeriesAligner": "ALIGN_NEXT_OLDER",
      "crossSeriesReducer": "REDUCE_COUNT_FALSE",
      "groupByFields": ["resource.label.instance_id"]
    }],
    "comparison": "COMPARISON_GT",
    "thresholdValue": 0,
    "duration": "300s"
  }
}

If you share your metric type (and whether you want to count false across instances or per instance), I can suggest the right groupByFields and aligner for your exact intent.

Citations:

1: https://docs.cloud.google.com/java/docs/reference/google-cloud-monitoring/latest/com.google.monitoring.v3.Aggregation.Reducer?utm_source=openai

2: https://docs.cloud.google.com/config-connector/docs/reference/resource-docs/monitoring/monitoringalertpolicy?utm_source=openai

3: https://docs.cloud.google.com/java/docs/reference/google-cloud-monitoring/latest/com.google.monitoring.v3.Aggregation.Reducer?utm_source=openai

4: https://cloud.google.com/monitoring/api/ref_v3/rpc/google.monitoring.v3?utm_source=openai

5: https://cloud.google.com/monitoring/api/ref_v3/rpc/google.monitoring.v3?utm_source=openai

Fix alert threshold to trigger on any uptime check failure.

The condition uses REDUCE_COUNT_FALSE with COMPARISON_GT and threshold_value = 1. According to GCP's metric-threshold evaluation, COMPARISON_GT triggers when the reduced count is greater than the threshold value. With threshold_value = 1, the alert fires only when count > 1 (requiring 2+ failures per alignment period).

Given that each host receives one uptime check per 300s and the alignment period is also 300s, you'll have at most one data point per host per window. A single failure will produce a count of 1, which is not > 1, so the alert will never trigger. Change the threshold to 0 to fire when any check fails:

- threshold_value = 1
+ threshold_value = 0
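
The comparison described in this comment can be illustrated with a small shell sketch. It only models the arithmetic (count the false points in one window, then apply a strict greater-than); it is not how Cloud Monitoring is implemented, and `count_false` and `condition_met` are hypothetical names. It takes the comment's premise of one data point per host per window; if several data points for the same host land in one window, the count would be correspondingly higher.

```shell
# Count the "false" (failed) points in one alignment window.
# Each argument is one data point: "true" (passed) or "false" (failed).
count_false() {
  local point n=0
  for point in "$@"; do
    if [ "$point" = "false" ]; then
      n=$((n + 1))
    fi
  done
  echo "$n"
}

# COMPARISON_GT: the condition holds only when count is strictly
# greater than the threshold.
condition_met() {
  [ "$1" -gt "$2" ]
}
```

With a single failed point, `condition_met "$(count_false false)" 1` is false while `condition_met "$(count_false false)" 0` is true, which is the difference between the two threshold values in the diff.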

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

  conditions {
    display_name = "Uptime check failing"

    condition_threshold {
      filter          = "resource.type = \"uptime_url\" AND metric.type = \"monitoring.googleapis.com/uptime_check/check_passed\""
      duration        = "300s" # must fail for 5 minutes before alerting
      comparison      = "COMPARISON_GT"
      threshold_value = 1

      aggregations {
        alignment_period     = "300s"
        per_series_aligner   = "ALIGN_NEXT_OLDER"
        cross_series_reducer = "REDUCE_COUNT_FALSE"
        group_by_fields      = ["resource.label.host"]
      }

      trigger {
        count = 1
      }
    }

  conditions {
    display_name = "Uptime check failing"

    condition_threshold {
      filter          = "resource.type = \"uptime_url\" AND metric.type = \"monitoring.googleapis.com/uptime_check/check_passed\""
      duration        = "300s" # must fail for 5 minutes before alerting
      comparison      = "COMPARISON_GT"
      threshold_value = 0

      aggregations {
        alignment_period     = "300s"
        per_series_aligner   = "ALIGN_NEXT_OLDER"
        cross_series_reducer = "REDUCE_COUNT_FALSE"
        group_by_fields      = ["resource.label.host"]
      }

      trigger {
        count = 1
      }
    }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@cdn/health.tf` around lines 57 - 76, The alert condition in the condition_threshold block currently uses COMPARISON_GT with threshold_value = 1 and reducer REDUCE_COUNT_FALSE, which requires >1 failures to trigger; update threshold_value to 0 so that any reduced count of 1 (a single uptime check failure per alignment_period) will satisfy COMPARISON_GT and fire the alert—modify the threshold_value in the condition_threshold block (the same block that defines aggregations with alignment_period, per_series_aligner, cross_series_reducer = "REDUCE_COUNT_FALSE", and trigger) from 1 to 0.

https://claude.ai/code/session_01RVJ1D3YsbHUyGvBFxMXCdV

coderabbitai

🧹 Nitpick comments (1)

cdn/health.tf (1)
21-26: JWT token exposed in uptime check URL and Terraform state.

The JWT is embedded as a query parameter in the path, which means it's stored in plaintext in both the Terraform state file and the GCP uptime check configuration (visible in the Cloud Console). If this is a long-lived token, consider:

Rotating the token periodically.

Ensuring the Terraform state backend is encrypted and access-controlled.

Marking health_jwt as sensitive in the locals block to at least prevent it from appearing in CLI output (though it will still be in state).

If this is a low-privilege, demo-only subscription token, this may be acceptable — but worth documenting the trade-off.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cdn/health.tf` around lines 21 - 26, The uptime check currently embeds
local.health_jwt in http_check.path which stores the token in Terraform state
and GCP config; remove the JWT from the URL and instead fetch it from a secret
backend (e.g., google_secret_manager_secret_version or another secure data
source) and inject it into the uptime check as an HTTP header (e.g.,
Authorization: Bearer <token>) using the http_check headers/request_headers
setting on the uptime check resource (the http_check block in this file), and
mark the local health_jwt value as sensitive in your locals or avoid a local
entirely so it is not emitted in CLI output; also ensure your Terraform backend
is encrypted and access-controlled.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@cdn/health.tf`:
- Around line 60-64: The alert condition in the condition_threshold block uses
comparison = "COMPARISON_GT" with threshold_value = 1 which prevents firing on a
single uptime check failure; update the condition_threshold (the block
containing filter = "resource.type = \"uptime_url\" AND metric.type =
\"monitoring.googleapis.com/uptime_check/check_passed\"") to use threshold_value
= 0 so that COMPARISON_GT > 0 will trigger on a single failure.
- Around line 9-11: The local `health_jwt` uses
file("${path.module}/secrets/demo-sub.jwt") which causes a plan-time failure if
the file doesn't exist; guard that call with
fileexists("${path.module}/secrets/demo-sub.jwt") in the `locals` block so
`health_jwt` is set to a safe default (e.g., empty string or null) when the file
is missing, otherwise use trimspace(file(...)); update the `locals` entry named
`health_jwt` to perform this existence check using fileexists before calling
file.


claude added 2 commits February 14, 2026 14:37

coderabbitai bot reviewed Feb 17, 2026

Rename demo-get.jwt to demo-sub.jwt

211d2be

https://claude.ai/code/session_01RVJ1D3YsbHUyGvBFxMXCdV

kixelated force-pushed the claude/cdn-node-health-checks-abLA2 branch from 53f4e69 to 211d2be Compare February 17, 2026 17:51

coderabbitai bot reviewed Feb 17, 2026

Generate secrets before tofu apply since health.tf reads demo-sub.jwt…

24ff160

… at plan time Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

kixelated enabled auto-merge (squash) February 17, 2026 18:00

kixelated disabled auto-merge February 17, 2026 18:00

kixelated enabled auto-merge (squash) February 17, 2026 18:01

kixelated merged commit af048ae into main Feb 17, 2026
1 check passed

kixelated deleted the claude/cdn-node-health-checks-abLA2 branch February 17, 2026 18:08

kixelated merged 4 commits into main from claude/cdn-node-health-checks-abLA2
