Add health monitoring for MoQ CDN relay nodes #967
- Add per-node GCP uptime checks that probe each relay (usc, euc, sea) via the /fetch/demo/bbb/catalog.json endpoint every 5 minutes
- Add alert policy with email notifications on sustained failures
- Add standalone health.sh script for manual/cron checks with optional webhook (Slack/Discord) notification on failure
- Add `health` and `health-notify` just commands
- Enable monitoring.googleapis.com API

https://claude.ai/code/session_01RVJ1D3YsbHUyGvBFxMXCdV
The demo GET token is generated alongside the other secrets (demo-pub.jwt, cluster.jwt) using moq-token. Both the Terraform uptime checks and the health.sh script now read from secrets/demo-get.jwt automatically. https://claude.ai/code/session_01RVJ1D3YsbHUyGvBFxMXCdV
No actionable comments were generated in the recent review. 🎉

Walkthrough

Adds CDN health monitoring:

- README expanded with a demo subscription creation step
- a Bash health-check that reads a demo JWT, probes multiple relay nodes for a demo catalog endpoint, reports per-node status, and can post failure summaries to an optional webhook
- Terraform additions that introduce a health_email input, create uptime checks per relay, and conditionally provision an email notification channel and alert policy
- new justfile targets to run health checks
- monitoring.googleapis.com included in the default GCP service list

🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 3
🧹 Nitpick comments (2)
cdn/health.tf (1)
21-22: JWT token exposed in URL query parameter.

The JWT is embedded in the uptime check URL, which means it will be visible in GCP Monitoring console, logs, and API responses to anyone with project access. Since this is a read-only token for demo content, the risk is limited, but worth being aware of, especially if the token's scope ever broadens.

GCP uptime checks support custom headers via `http_check.headers`, which would keep the token out of the URL. Worth considering if the JWT grants more than demo read access.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@cdn/health.tf` around lines 21 - 22, The uptime check embeds local.health_jwt in the URL (http_check.path), exposing the JWT in monitoring/logs; change the check to remove the JWT from the path and instead set it in http_check.headers (e.g., add an Authorization or X-Health-JWT header) so the token is sent as a header; update the http_check block to reference local.health_jwt in http_check.headers rather than in the path.

cdn/health.sh (1)
56-60: Webhook JSON payload is not escaped.

The `$msg` variable is interpolated directly into the JSON string. Currently safe since node names are simple alphanumeric strings, but if any node name or domain ever contained quotes or backslashes, the JSON would be malformed. A minor hardening would be to use `jq` for JSON construction, if available.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@cdn/health.sh` around lines 56 - 60, The webhook JSON currently interpolates $msg directly into the curl -d string, which can break JSON if $msg contains quotes or backslashes; change the block that checks WEBHOOK_URL to build a safe JSON payload instead of inline interpolation: detect or require jq and generate the payload with jq -n --arg text "$msg" '{text:$text}' (send it to curl via --data `@-`), or if jq is not available use a robust escaping fallback (e.g., escape backslashes and quotes in $msg before embedding) so the curl invocation always receives valid JSON; update references to WEBHOOK_URL, msg and the curl POST logic accordingly.
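
The hardening suggested above could look roughly like the following. This is a minimal sketch rather than the script's actual code: the helper names `json_escape` and `build_payload` are hypothetical, and the `text` key simply follows the reviewer's `jq` example (other webhook services may expect a different key).

```shell
#!/usr/bin/env bash
# Sketch only: json_escape and build_payload are hypothetical helpers,
# not functions taken from cdn/health.sh.

# Fallback escaping for when jq is unavailable. Handles backslashes,
# double quotes, and newlines; other control characters are not covered.
json_escape() {
  local s=$1
  s=${s//\\/\\\\}      # \  ->  \\
  s=${s//\"/\\\"}      # "  ->  \"
  s=${s//$'\n'/\\n}    # newline -> \n
  printf '%s' "$s"
}

# Print a one-line JSON object {"text": <msg>} on stdout.
build_payload() {
  local msg=$1
  if command -v jq >/dev/null 2>&1; then
    # jq performs the escaping; -c keeps the output on one line.
    jq -cn --arg text "$msg" '{text:$text}'
  else
    printf '{"text":"%s"}\n' "$(json_escape "$msg")"
  fi
}

# Possible use (WEBHOOK_URL as in the review comment):
#   build_payload "$msg" | curl -s -X POST \
#     -H 'Content-Type: application/json' --data @- "$WEBHOOK_URL"
```

Piping the payload to `curl --data @-` keeps the message out of the command line, so shell quoting never touches the JSON.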
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cdn/health.sh`:
- Line 8: Update the stale comment in cdn/health.sh that currently reads "Reads
the JWT from secrets/demo-get.jwt" to reference the renamed file
"secrets/demo-sub.jwt" (i.e., change the comment string to mention demo-sub.jwt)
so the comment matches the actual token filename used by the script.
In `@cdn/health.tf`:
- Around line 9-11: The locals block currently uses
file("${path.module}/secrets/demo-sub.jwt") to populate health_jwt which causes
Terraform to fail during plan; change it to conditionally read the file only if
it exists by using fileexists("${path.module}/secrets/demo-sub.jwt") as a guard
and otherwise defaulting to an empty string (keep the symbol health_jwt and the
locals block unchanged), so replace the direct file(...) call with a conditional
expression using fileexists(...) ? file(...) : "" to allow plan/apply to succeed
before the secrets directory is created.
- Around line 57-76: The alert condition in the condition_threshold block
currently uses COMPARISON_GT with threshold_value = 1 and reducer
REDUCE_COUNT_FALSE, which requires >1 failures to trigger; update
threshold_value to 0 so that any reduced count of 1 (a single uptime check
failure per alignment_period) will satisfy COMPARISON_GT and fire the
alert—modify the threshold_value in the condition_threshold block (the same
block that defines aggregations with alignment_period, per_series_aligner,
cross_series_reducer = "REDUCE_COUNT_FALSE", and trigger) from 1 to 0.
locals {
  health_jwt = trimspace(file("${path.module}/secrets/demo-sub.jwt"))
}
🧩 Analysis chain

🏁 Script executed:
# First, check if cdn/health.tf exists and examine the context around line 10
head -20 cdn/health.tf
Repository: moq-dev/moq
Length of output: 601

🏁 Script executed:
# Check if the secrets file exists
ls -la cdn/secrets/ 2>/dev/null || echo "secrets directory does not exist"
Repository: moq-dev/moq
Length of output: 87

🏁 Script executed:
# Search for other file() calls in terraform files to understand the pattern
rg -n 'file\s*\(' --type=tf -B2 -A2
Repository: moq-dev/moq
Length of output: 1072

🏁 Script executed:
# Look for any bootstrap or setup documentation
find . -type f -name "*.md" -o -name "*.txt" -o -name "setup*" -o -name "bootstrap*" | head -20
Repository: moq-dev/moq
Length of output: 487

🏁 Script executed:
# Check cdn/README.md for deployment instructions
cat cdn/README.md
Repository: moq-dev/moq
Length of output: 2290

🏁 Script executed:
# Check if bootstrap.sh mentions generating the JWT
head -50 cdn/bootstrap.sh
Repository: moq-dev/moq
Length of output: 1135

🏁 Script executed:
# Check cdn/bootstrap.tf to see the full bootstrap context
cat cdn/bootstrap.tf
Repository: moq-dev/moq
Length of output: 653
The file() call on line 10 will block the documented bootstrap workflow.
The README's Setup section explicitly requires running tofu apply (Step 3) before creating the secrets/ directory with the JWT (Step 4). However, file() is evaluated during tofu plan (part of tofu apply), so the plan will fail before the user can generate the JWT, blocking the initial deployment.
The suggested fix using fileexists() guard is appropriate:
Option: guard with fileexists

 locals {
-  health_jwt = trimspace(file("${path.module}/secrets/demo-sub.jwt"))
+  health_jwt = fileexists("${path.module}/secrets/demo-sub.jwt") ? trimspace(file("${path.module}/secrets/demo-sub.jwt")) : ""
 }

This allows the initial tofu plan/apply to succeed (with an empty JWT) before secrets are generated, unblocking the bootstrap sequence.
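
The same guard can be expressed in shell, for example if the health script should also tolerate a missing token file. This is only a sketch under that assumption: `read_health_jwt` is a hypothetical helper, and nothing in the PR shows the script doing this.

```shell
#!/usr/bin/env bash
# Sketch only: read_health_jwt is a hypothetical helper that mirrors
#   fileexists(path) ? trimspace(file(path)) : ""
# from the Terraform suggestion above.

read_health_jwt() {
  local path=$1 token
  # Missing file: print nothing and succeed, like the "" default.
  [[ -f $path ]] || return 0
  token=$(<"$path")
  # Strip leading, then trailing, whitespace (the trimspace equivalent).
  token="${token#"${token%%[![:space:]]*}"}"
  token="${token%"${token##*[![:space:]]}"}"
  printf '%s' "$token"
}

# Possible use:
#   jwt=$(read_health_jwt secrets/demo-sub.jwt)
#   [[ -n $jwt ]] || echo "warning: demo-sub.jwt not generated yet" >&2
```

An empty result makes the "secrets not generated yet" case explicit, so the caller can decide whether to warn or abort.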
conditions {
  display_name = "Uptime check failing"

  condition_threshold {
    filter          = "resource.type = \"uptime_url\" AND metric.type = \"monitoring.googleapis.com/uptime_check/check_passed\""
    duration        = "300s" # must fail for 5 minutes before alerting
    comparison      = "COMPARISON_GT"
    threshold_value = 1

    aggregations {
      alignment_period     = "300s"
      per_series_aligner   = "ALIGN_NEXT_OLDER"
      cross_series_reducer = "REDUCE_COUNT_FALSE"
      group_by_fields      = ["resource.label.host"]
    }

    trigger {
      count = 1
    }
  }
🧩 Analysis chain
🌐 Web query:
GCP monitoring REDUCE_COUNT_FALSE COMPARISON_GT threshold condition
💡 Result:
In Cloud Monitoring metric-threshold conditions, REDUCE_COUNT_FALSE is a cross-series reducer that converts Boolean time series into an INT64 count of "false" points per alignment period (across the grouped time series you're reducing). (docs.cloud.google.com)

How REDUCE_COUNT_FALSE + COMPARISON_GT + threshold evaluates

- Align first (required): If you set a `crossSeriesReducer` (like `REDUCE_COUNT_FALSE`), you must also set:
  - `alignmentPeriod`
  - `perSeriesAligner` (and it must not be `ALIGN_NONE`) (docs.cloud.google.com)
- Reduce across series: For each alignment window, Monitoring computes how many "false" data points occurred across the time series in each group (per `groupByFields`). Output is INT64. (docs.cloud.google.com)
- Compare to threshold: `COMPARISON_GT` triggers when the reduced value is greater than `thresholdValue` (left side is the time series value, right side is the threshold). Cloud Monitoring metric-threshold conditions support only `COMPARISON_GT` and `COMPARISON_LT`. (cloud.google.com)
- Hold for duration: The comparison must remain true for `duration` to be considered failing. (cloud.google.com)
Example (API-style / Terraform-style shape)
This triggers if the count of false points in each 60s window is > 0 (i.e., at least one “false”) for 5 minutes:
{
  "conditionThreshold": {
    "filter": "metric.type=\"custom.googleapis.com/my/boolean_metric\" AND resource.type=\"gce_instance\"",
    "aggregations": [{
      "alignmentPeriod": "60s",
      "perSeriesAligner": "ALIGN_NEXT_OLDER",
      "crossSeriesReducer": "REDUCE_COUNT_FALSE",
      "groupByFields": ["resource.label.instance_id"]
    }],
    "comparison": "COMPARISON_GT",
    "thresholdValue": 0,
    "duration": "300s"
  }
}

If you share your metric type (and whether you want to count false across instances or per instance), I can suggest the right groupByFields and aligner for your exact intent.
Citations:
- 1: https://docs.cloud.google.com/java/docs/reference/google-cloud-monitoring/latest/com.google.monitoring.v3.Aggregation.Reducer?utm_source=openai
- 2: https://docs.cloud.google.com/config-connector/docs/reference/resource-docs/monitoring/monitoringalertpolicy?utm_source=openai
- 3: https://docs.cloud.google.com/java/docs/reference/google-cloud-monitoring/latest/com.google.monitoring.v3.Aggregation.Reducer?utm_source=openai
- 4: https://cloud.google.com/monitoring/api/ref_v3/rpc/google.monitoring.v3?utm_source=openai
- 5: https://cloud.google.com/monitoring/api/ref_v3/rpc/google.monitoring.v3?utm_source=openai
Fix alert threshold to trigger on any uptime check failure.
The condition uses REDUCE_COUNT_FALSE with COMPARISON_GT and threshold_value = 1. According to GCP's metric-threshold evaluation, COMPARISON_GT triggers when the reduced count is greater than the threshold value. With threshold_value = 1, the alert fires only when count > 1 (requiring 2+ failures per alignment period).
Given that each host receives one uptime check per 300s and the alignment period is also 300s, you'll have at most one data point per host per window. A single failure will produce a count of 1, which is not > 1, so the alert will never trigger. Change the threshold to 0 to fire when any check fails:
-    threshold_value = 1
+    threshold_value = 0
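
The comparison the reviewer describes can be checked with a few lines of shell arithmetic. This is a toy illustration of the strict greater-than test only, not Cloud Monitoring's evaluation logic; `alert_fires` is a hypothetical helper, and the count of 1 follows the reviewer's stated assumption of one failed data point per window.

```shell
#!/usr/bin/env bash
# Sketch only: models COMPARISON_GT as a strict ">" on the reduced count.

# alert_fires COUNT THRESHOLD -> succeeds when COUNT > THRESHOLD.
alert_fires() {
  (( $1 > $2 ))
}

# One failed check in the window, evaluated against both thresholds.
for threshold in 1 0; do
  if alert_fires 1 "$threshold"; then
    result="fires"
  else
    result="does not fire"
  fi
  echo "count=1 threshold=${threshold}: ${result}"
done
# count=1 threshold=1: does not fire
# count=1 threshold=0: fires
```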
53f4e69 to 211d2be
🧹 Nitpick comments (1)
cdn/health.tf (1)
21-26: JWT token exposed in uptime check URL and Terraform state.

The JWT is embedded as a query parameter in the `path`, which means it's stored in plaintext in both the Terraform state file and the GCP uptime check configuration (visible in the Cloud Console). If this is a long-lived token, consider:

- Rotating the token periodically.
- Ensuring the Terraform state backend is encrypted and access-controlled.
- Marking `health_jwt` as `sensitive` in the locals block to at least prevent it from appearing in CLI output (though it will still be in state).

If this is a low-privilege, demo-only subscription token, this may be acceptable, but worth documenting the trade-off.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@cdn/health.tf` around lines 21 - 26, The uptime check currently embeds local.health_jwt in http_check.path which stores the token in Terraform state and GCP config; remove the JWT from the URL and instead fetch it from a secret backend (e.g., google_secret_manager_secret_version or another secure data source) and inject it into the uptime check as an HTTP header (e.g., Authorization: Bearer <token>) using the http_check headers/request_headers setting on the uptime check resource (the http_check block in this file), and mark the local health_jwt value as sensitive in your locals or avoid a local entirely so it is not emitted in CLI output; also ensure your Terraform backend is encrypted and access-controlled.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@cdn/health.tf`:
- Around line 60-64: The alert condition in the condition_threshold block uses
comparison = "COMPARISON_GT" with threshold_value = 1 which prevents firing on a
single uptime check failure; update the condition_threshold (the block
containing filter = "resource.type = \"uptime_url\" AND metric.type =
\"monitoring.googleapis.com/uptime_check/check_passed\"") to use threshold_value
= 0 so that COMPARISON_GT > 0 will trigger on a single failure.
- Around line 9-11: The local `health_jwt` uses
file("${path.module}/secrets/demo-sub.jwt") which causes a plan-time failure if
the file doesn't exist; guard that call with
fileexists("${path.module}/secrets/demo-sub.jwt") in the `locals` block so
`health_jwt` is set to a safe default (e.g., empty string or null) when the file
is missing, otherwise use trimspace(file(...)); update the `locals` entry named
`health_jwt` to perform this existence check using fileexists before calling
file.
… at plan time Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary

This PR adds comprehensive health monitoring for the MoQ CDN relay nodes using Google Cloud Platform's monitoring and alerting services, along with a local health check script.

Key Changes

- `cdn/health.tf`: Sets up GCP uptime checks for each relay node (usc, euc, sea) that verify the `/fetch/demo/bbb/catalog.json` endpoint every 5 minutes. Includes an optional email notification channel and alert policy that triggers when any node fails its health check for 5+ minutes.
- `cdn/health.sh`: Bash script that manually tests all relay nodes and reports their status. Supports optional webhook notifications (Slack, Discord, etc.) on failure.
- justfile: `health` and `health-notify` recipes to easily run health checks locally.
- README: step for creating the `demo-sub.jwt` token needed for health checks and subscribing to demo content.
- Enables the `monitoring.googleapis.com` GCP service required for uptime checks and alerting.

Implementation Details

- Health checks use the same token (`demo-sub.jwt`) for authentication as the website and other subscribers
- The email notification channel and alert policy are created only if the `health_email` variable is provided

https://claude.ai/code/session_01RVJ1D3YsbHUyGvBFxMXCdV
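
The per-node probing described for `cdn/health.sh` might be structured along these lines. This is a rough sketch, not the script from the PR: the node names and catalog path come from the description, while `DOMAIN`, `probe_http`, and `check_nodes` are hypothetical. Token handling is left out because the exact query format is not shown in this conversation.

```shell
#!/usr/bin/env bash
# Sketch only: DOMAIN, probe_http and check_nodes are hypothetical names.

NODES=(usc euc sea)
DOMAIN="${DOMAIN:-cdn.example.com}"        # placeholder domain
CATALOG_PATH="/fetch/demo/bbb/catalog.json"

# probe_http URL -> prints the HTTP status code (000 if the request fails).
probe_http() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1" || true
}

# check_nodes PROBE_FN -> prints one status line per node and returns
# non-zero if any node did not answer with HTTP 200.
check_nodes() {
  local probe=$1 node code failures=0
  for node in "${NODES[@]}"; do
    code=$("$probe" "https://${node}.${DOMAIN}${CATALOG_PATH}")
    if [[ $code == 200 ]]; then
      echo "OK   ${node} (${code})"
    else
      echo "FAIL ${node} (${code})"
      failures=$((failures + 1))
    fi
  done
  (( failures == 0 ))
}

# Possible use:
#   check_nodes probe_http || echo "one or more relays are unhealthy" >&2
```

Passing the probe as a function name keeps the loop testable without network access, since a stub can stand in for `probe_http`.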