
Upgrade deploy example with production-ready Helm chart #396

Open

aliev wants to merge 1 commit into main from feat/deploy-example-helm-chart

Conversation


@aliev aliev commented Mar 5, 2026

Summary

Turns the 07_deploy_example into a production-ready Helm chart that works on any Kubernetes cluster — GKE, EKS, AKS, Minikube, Orbstack, Docker Desktop.

Helm chart additions

  • Service (ClusterIP) and Ingress templates
  • Bundled Redis via Bitnami subchart — auto-wires REDIS_URL env var
  • Health probes — liveness on /health, readiness on /ready
  • ServiceMonitor — Prometheus scraping with configurable additionalLabels
  • Grafana dashboard — auto-provisioned ConfigMap with 8 panels (sessions, latency, tokens, CPU/memory, TTFT, turn duration)
  • Removed hardcoded Nebius GPU nodeSelector — now cloud-agnostic

Application changes

  • Redis SessionRegistry support (conditional on REDIS_URL env var)
  • Prometheus /metrics endpoint with 9 custom gauges
  • Pin getstream>=2.0.0,<3.0.0 (v3 has breaking ChannelMember import)
  • Add prometheus-client and vision-agents[redis] dependencies
  • Fix Dockerfile: serve --host 0.0.0.0 (required for k8s health probes)
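The gauge update described above (per-session metrics rolled up into fleet-wide averages and sums) can be sketched in plain Python. The session dicts below are hypothetical stand-ins for what the registry returns; the `/ 1000` mirrors the ms-to-seconds conversion visible in the review diff later in this thread.

```python
# Sketch of the per-session metric aggregation. The session/metric
# shapes are hypothetical stand-ins; the real code reads them from
# the agent's SessionRegistry.

def aggregate_avg(sessions: list[dict], key: str) -> float:
    """Average a per-session metric (in ms) across sessions, in seconds."""
    values = []
    for metrics in sessions:
        v = metrics.get(key)
        if v is not None:
            values.append(v)
    # Mirrors: gauge.set(sum(values) / len(values) / 1000 if values else 0)
    return sum(values) / len(values) / 1000 if values else 0.0

def aggregate_sum(sessions: list[dict], key: str) -> int:
    """Sum a per-session counter metric (e.g. token counts)."""
    return sum(int(m[key]) for m in sessions if m.get(key) is not None)

sessions = [
    {"llm_latency_ms": 120, "tokens": 300},
    {"llm_latency_ms": 80, "tokens": 500},
    {"tokens": 200},  # session with no latency sample yet
]
print(aggregate_avg(sessions, "llm_latency_ms"))  # 0.1 (100 ms -> seconds)
print(aggregate_sum(sessions, "tokens"))          # 1000
```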

Configuration highlights (values.yaml)

| Setting | Default | Description |
|---|---|---|
| `redis.deploy.enabled` | `true` | Bundle Redis for session storage |
| `redis.url` | `""` | External Redis URL (when bundled is off) |
| `metrics.enabled` | `true` | Create ServiceMonitor for Prometheus |
| `grafana.enabled` | `true` | Deploy Grafana dashboard ConfigMap |
| `ingress.enabled` | `true` | Create Ingress resource |
| `gpu.enabled` | `false` | GPU resources and tolerations |

Documentation

Step-by-step deployment guide: GetStream/Agents-Docs#71

Test plan

  • Built and deployed on local Orbstack k8s cluster
  • Verified health probes, session creation, Redis storage
  • Verified Prometheus scraping and Grafana dashboard with live data
  • Tested full cleanup and re-deployment from scratch
  • Test on GKE/EKS cluster

Summary by CodeRabbit

  • New Features

    • Added Prometheus metrics collection for monitoring application performance (sessions, latencies, tokens).
    • Added Grafana dashboard for visualizing agent metrics and system performance.
    • Added Kubernetes Ingress support for external traffic routing.
    • Added optional Redis session registry support.
  • Chores

    • Updated project dependencies for Redis and Prometheus monitoring integration.
    • Enhanced Kubernetes Helm deployment templates and configurations.


coderabbitai bot commented Mar 5, 2026

📝 Walkthrough

The deployment example is enhanced with observability infrastructure (Prometheus metrics, Grafana dashboard), Kubernetes resource templates (Service, Ingress, ServiceMonitor), Redis-based session management, and application health checks. The application now collects and exposes metrics, with corresponding Helm configurations and updated dependencies.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **Application Core** `deploy_example.py`, `Dockerfile`, `pyproject.toml` | Added Prometheus metrics gauges (session count, latencies, token counts), metrics collection aggregation logic, Redis-based session registry support, and a metrics HTTP endpoint. Updated CMD to serve instead of verbose execution. Updated dependencies to include redis, getstream, and prometheus-client. |
| **Helm Chart Configuration** `Chart.yaml`, `values.yaml` | Added Redis as a Helm dependency (~25.3.0). Introduced comprehensive Helm values for containerPort, service, ingress, redis, metrics, grafana, and API secrets. |
| **Helm Deployment Template** `deployment.yaml` | Added container ports specification, conditional Redis environment variable, liveness and readiness probes for HTTP health checks, GPU-aware resource allocation, and streamlined nodeSelector and tolerations handling. |
| **Helm Kubernetes Resources** `service.yaml`, `ingress.yaml`, `servicemonitor.yaml`, `grafana-dashboard.yaml` | Added four new templates: Service routing traffic to the deployment, Ingress for external HTTP access, ServiceMonitor for Prometheus scraping at the /metrics endpoint, and a Grafana dashboard ConfigMap displaying nine metric panels (sessions, latencies, tokens, CPU/memory usage). |

Sequence Diagram

```mermaid
sequenceDiagram
    participant App as deploy_example.py
    participant Metrics as Prometheus Client
    participant Redis as Redis Registry
    participant Endpoint as /metrics Endpoint
    participant Prometheus as Prometheus
    participant Grafana as Grafana

    App->>Redis: Register session (if REDIS_URL set)
    App->>App: Collect session metrics from registry
    App->>Metrics: Update Gauge metrics (latencies, tokens, etc.)
    Prometheus->>Endpoint: Scrape /metrics (via ServiceMonitor)
    Endpoint->>Metrics: generate_latest()
    Endpoint->>Prometheus: Return metric exposition format
    Prometheus->>Prometheus: Store time-series data
    Grafana->>Prometheus: Query metrics (PromQL)
    Grafana->>Grafana: Render dashboard panels
```

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

Bell jar descends on metrics bright—
each gauge a breath, a measured blight,
Redis pools the sessions deep,
while Prometheus counts what agents reap.
Grafana's dashboard pierces through,
observing all we dare not do. 📊

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title Check | ✅ Passed | The title accurately summarizes the main objective of the PR: converting the deploy example into a production-ready Helm chart with infrastructure enhancements. |


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 7

🧹 Nitpick comments (2)
examples/07_deploy_example/deploy_example.py (2)

115-119: Move Redis session-registry imports to module scope.

Line 116 introduces local imports in runtime flow; keep imports at the top of the file.

🔧 Proposed fix

```diff
@@
 from fastapi import Response
 from prometheus_client import Gauge, generate_latest, CONTENT_TYPE_LATEST
 from vision_agents.core import Agent, AgentLauncher, Runner, User
+from vision_agents.core.agents.session_registry import RedisSessionKVStore, SessionRegistry
@@
 if redis_url:
-    from vision_agents.core.agents.session_registry import (
-        RedisSessionKVStore,
-        SessionRegistry,
-    )
-
     store = RedisSessionKVStore(url=redis_url)
     registry = SessionRegistry(store=store)
```

As per coding guidelines, "Do not use local imports; import all modules at the top of the file."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/07_deploy_example/deploy_example.py` around lines 115 - 119, The
conditional imports for RedisSessionKVStore and SessionRegistry are done inside
runtime flow (if redis_url:) which violates the guideline to keep imports at
module scope; move the import statements for RedisSessionKVStore and
SessionRegistry to the top-level imports of the module (module scope) so they
are imported on module load rather than inside the conditional, and keep the
existing runtime conditional logic that instantiates or uses
RedisSessionKVStore/SessionRegistry unchanged.
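A module-scope import can still tolerate the dependency being optional by guarding it with `try`/`ImportError`, which satisfies the guideline without forcing the package to be installed. A minimal sketch, assuming the import path from the review diff; `HAS_REDIS_REGISTRY` and `build_registry` are illustrative names, not part of the codebase:

```python
# Sketch: keep an optional dependency's import at module scope while
# degrading gracefully when it is absent. Names are hypothetical.
import os

try:
    from vision_agents.core.agents.session_registry import (  # noqa: F401
        RedisSessionKVStore,
        SessionRegistry,
    )
    HAS_REDIS_REGISTRY = True
except ImportError:
    HAS_REDIS_REGISTRY = False

def build_registry():
    """Return a Redis-backed registry when configured, else None."""
    redis_url = os.environ.get("REDIS_URL", "")
    if redis_url and HAS_REDIS_REGISTRY:
        return SessionRegistry(store=RedisSessionKVStore(url=redis_url))
    return None  # fall back to the default in-memory registry
```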

26-27: Remove decorative section-divider comments to match repo Python style.

Line 26, Line 79, and Line 110 use divider-style section comments; please replace with plain comments or remove them.

As per coding guidelines, "Do not use section comments like # -- some section --."

Also applies to: 79-80, 110-111

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/07_deploy_example/deploy_example.py` around lines 26 - 27, Remove
the decorative section-divider comments (e.g., the lines like "# ── Prometheus
metrics ──────────────────────────────────────────────") to conform to repo
Python style; replace them with plain single-line comments or delete them where
found (the Prometheus metrics header and the two other divider comments later in
the file) so only normal comments remain and no section-divider art is used.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 9bc5d131-7d74-4466-9caa-9975a4010837

📥 Commits

Reviewing files that changed from the base of the PR and between 57c892b and 2ea816a.

⛔ Files ignored due to path filters (1)
  • examples/07_deploy_example/helm/Chart.lock is excluded by !**/*.lock
📒 Files selected for processing (10)
  • examples/07_deploy_example/Dockerfile
  • examples/07_deploy_example/deploy_example.py
  • examples/07_deploy_example/helm/Chart.yaml
  • examples/07_deploy_example/helm/templates/deployment.yaml
  • examples/07_deploy_example/helm/templates/grafana-dashboard.yaml
  • examples/07_deploy_example/helm/templates/ingress.yaml
  • examples/07_deploy_example/helm/templates/service.yaml
  • examples/07_deploy_example/helm/templates/servicemonitor.yaml
  • examples/07_deploy_example/helm/values.yaml
  • examples/07_deploy_example/pyproject.toml

Comment on lines +59 to +65
```python
            try:
                v = s.agent.metrics.to_dict().get(key)
                if v is not None:
                    values.append(v)
            except Exception:
                continue
        gauge.set(sum(values) / len(values) / 1000 if values else 0)
```

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

```shell
cat -n examples/07_deploy_example/deploy_example.py | head -100
```

Repository: GetStream/Vision-Agents

Length of output: 4262


Catch specific exceptions in metrics collection instead of blanket Exception handling.

Lines 63 and 74 catch all exceptions without logging, which hides real defects in the metrics collection path and violates coding guidelines. Replace except Exception: with specific exception types (AttributeError, TypeError, ValueError) and log the error using logger.exception() for observability.

Proposed fix

```diff
     for key, gauge in _AVG_METRICS:
         values = []
         for s in sessions.values():
             try:
                 v = s.agent.metrics.to_dict().get(key)
                 if v is not None:
                     values.append(v)
-            except Exception:
+            except (AttributeError, TypeError, ValueError):
+                logger.exception("Failed to collect average metric for key '%s'", key)
                 continue
         gauge.set(sum(values) / len(values) / 1000 if values else 0)

     for key, gauge in _SUM_METRICS:
         total = 0
         for s in sessions.values():
             try:
                 v = s.agent.metrics.to_dict().get(key)
                 if v is not None:
                     total += int(v)
-            except Exception:
+            except (AttributeError, TypeError, ValueError):
+                logger.exception("Failed to collect sum metric for key '%s'", key)
                 continue
         gauge.set(total)
```
Per coding guidelines: "Never write except Exception as e; catch specific exceptions instead" and "Use logger.exception() when logging an error with a traceback."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/07_deploy_example/deploy_example.py` around lines 59 - 65, Replace
the blanket "except Exception:" catch blocks around the metrics collection logic
(the try that calls s.agent.metrics.to_dict().get(key) and the similar block
later) with a narrow exception tuple (AttributeError, TypeError, ValueError) and
log the caught exception with logger.exception(...) so tracebacks are recorded;
specifically, change the catch for the s.agent.metrics.to_dict() access and the
other metrics parsing try/except to "except (AttributeError, TypeError,
ValueError) as e:" and call logger.exception("... contextual message ...",
exc_info=e) before continuing, leaving the gauge.set(...) logic unchanged.


```diff
 # Install deps at startup (faster push, slower cold start)
-CMD ["sh", "-c", "uv sync --frozen -v && uv run deploy_example.py run"]
+CMD ["sh", "-c", "uv sync --frozen && uv run deploy_example.py serve --host 0.0.0.0 --port 8080"]
```

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

```shell
cat -n examples/07_deploy_example/Dockerfile
```

Repository: GetStream/Vision-Agents

Length of output: 650


🌐 Web query:

Docker exec signal handling PID 1 graceful shutdown SIGTERM

💡 Result:

  • Only PID 1 in the container gets the “stop” signal. docker stop <ctr> sends SIGTERM to the container’s main process (PID 1), waits a grace period, then sends SIGKILL if it didn’t exit. You can change the first signal (STOPSIGNAL / --stop-signal) and the grace period (--timeout / --stop-timeout). [1]
  • docker exec is not a shutdown mechanism. It starts an additional process; signals you send to that exec’d process don’t automatically reach PID 1. For graceful shutdown, signal PID 1 (usually via docker stop, or docker kill --signal=SIGTERM <ctr>). [1]
  • Common reason SIGTERM “doesn’t work”: PID 1 is a shell wrapper. If you use the shell form of ENTRYPOINT/CMD, Docker runs it under /bin/sh -c, which does not pass signals, so your real app won’t receive SIGTERM from docker stop. Use exec form so your app is PID 1. [2]
  • Linux PID 1 is special: a process running as PID 1 won’t terminate on SIGINT/SIGTERM unless it’s coded to do so (i.e., it must handle the signal or you must have an init/signal-forwarder). [3]
  • Fix options:
    1. Use exec-form ENTRYPOINT/CMD (and in any entrypoint script, end with exec "$@") so the app becomes PID 1 and can receive SIGTERM. [2]
    2. Add an init to forward signals/reap zombies: docker run --init … (uses Docker’s docker-init, backed by tini). [3]
    3. Configure stop signal if your app expects something else: STOPSIGNAL <signal> (default is SIGTERM). [4]

Sources: Docker stop behavior and timeouts [1]; Dockerfile shell-form ENTRYPOINT doesn’t pass signals [2]; PID 1 signal caveat + --init/tini [3]; STOPSIGNAL default and meaning [4].


Use exec in CMD so SIGTERM reaches the app process.

Line 18 starts the server through a shell that remains PID 1. When the container receives SIGTERM (from docker stop or Kubernetes shutdown), the signal goes to the shell, not the application—preventing graceful shutdown.

🔧 Proposed fix

```diff
-CMD ["sh", "-c", "uv sync --frozen && uv run deploy_example.py serve --host 0.0.0.0 --port 8080"]
+CMD ["sh", "-c", "uv sync --frozen && exec uv run deploy_example.py serve --host 0.0.0.0 --port 8080"]
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/07_deploy_example/Dockerfile` at line 18, Change the Docker CMD so
the final server process replaces PID 1 (so SIGTERM reaches it): either add an
explicit exec before the server command in the existing shell invocation (so the
uv run process is exec'd), or replace the CMD with a small entrypoint script
that runs "uv sync --frozen" then execs "uv run deploy_example.py serve --host
0.0.0.0 --port 8080"; update the CMD line that currently wraps "uv sync" and "uv
run" in "sh -c" to use one of these approaches so the uv run process receives
signals directly.
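On the application side, graceful shutdown only works if the Python process actually receives the SIGTERM that `exec` routes to PID 1. A minimal sketch of what a handler looks like; the flag and handler names are illustrative, not taken from deploy_example.py:

```python
# Sketch: why SIGTERM delivery matters. When the server process is
# exec'd into PID 1, the kubelet's SIGTERM lands here and the handler
# can drain sessions before exit. Names are illustrative.
import os
import signal

shutting_down = False

def handle_sigterm(signum, frame):
    global shutting_down
    shutting_down = True  # real code would stop accepting new sessions

signal.signal(signal.SIGTERM, handle_sigterm)

# Simulate the signal `docker stop` / Kubernetes sends to PID 1:
os.kill(os.getpid(), signal.SIGTERM)
print(shutting_down)  # True
```

Without `exec`, the shell wrapper absorbs this signal and the server is killed ungracefully after the grace period.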

Comment on lines +82 to +97
```json
            "expr": "rate(container_cpu_usage_seconds_total{pod=~\".*vision-agent.*\", container!=\"POD\", container!=\"\"}[5m])",
            "legendFormat": "{{`{{`}} pod {{`}}`}}",
            "refId": "A"
          }]
        },
        {
          "title": "Pod Memory Usage",
          "type": "timeseries",
          "gridPos": { "h": 8, "w": 12, "x": 12, "y": 16 },
          "fieldConfig": {
            "defaults": { "color": { "mode": "palette-classic" }, "unit": "bytes" },
            "overrides": []
          },
          "targets": [{
            "expr": "container_memory_working_set_bytes{pod=~\".*vision-agent.*\", container!=\"POD\", container!=\"\"}",
            "legendFormat": "{{`{{`}} pod {{`}}`}}",
```

⚠️ Potential issue | 🟠 Major

Scope pod PromQL to the current release, not .*vision-agent.*.

Line 82 and Line 96 currently match pods by a hardcoded substring, which can pull metrics from other releases and can miss pods when naming is overridden.

🔧 Suggested fix

```diff
-            "expr": "rate(container_cpu_usage_seconds_total{pod=~\".*vision-agent.*\", container!=\"POD\", container!=\"\"}[5m])",
+            "expr": "rate(container_cpu_usage_seconds_total{namespace=\"{{ .Release.Namespace }}\",pod=~\".*{{ include "vision-agent.fullname" . }}.*\", container!=\"POD\", container!=\"\"}[5m])",
...
-            "expr": "container_memory_working_set_bytes{pod=~\".*vision-agent.*\", container!=\"POD\", container!=\"\"}",
+            "expr": "container_memory_working_set_bytes{namespace=\"{{ .Release.Namespace }}\",pod=~\".*{{ include "vision-agent.fullname" . }}.*\", container!=\"POD\", container!=\"\"}",
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/07_deploy_example/helm/templates/grafana-dashboard.yaml` around
lines 82 - 97, Replace the hardcoded pod regex ".*vision-agent.*" with a
Helm-scoped regex that includes the current release name; update the two PromQL
targets (the CPU target expression in the timeseries panel with refId "A" and
the "Pod Memory Usage" target expression) to use the release variable (e.g.
pod=~"{{ .Release.Name }}.*vision-agent.*" or similar templated pattern) so the
queries only match pods from the current release instead of all pods matching
"vision-agent".

```json
      "templating": { "list": [] },
      "time": { "from": "now-1h", "to": "now" },
      "title": "Vision Agent Overview",
      "uid": "vision-agent-overview"
```

⚠️ Potential issue | 🟠 Major

Avoid a globally fixed Grafana UID.

Line 135 uses a static UID, which can collide across namespaces/releases in shared Grafana environments.

🔧 Suggested fix

```diff
-      "uid": "vision-agent-overview"
+      "uid": "{{ printf "%s-%s" .Release.Namespace .Release.Name | trunc 40 }}"
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/07_deploy_example/helm/templates/grafana-dashboard.yaml` at line
135, The dashboard uses a fixed Grafana UID ("uid": "vision-agent-overview")
which can collide across releases; change it to a release-unique value (e.g.,
derive from .Release.Name and/or .Release.Namespace, or append a short random
suffix) or remove the UID so Grafana generates one. Update the template that
emits the "uid" field (the entry with "uid": "vision-agent-overview") to
interpolate a unique identifier using Helm template helpers like .Release.Name,
.Release.Namespace, or a generated token (or omit the field entirely).

```diff
@@ -0,0 +1,19 @@
+{{- if .Values.metrics.enabled }}
```

⚠️ Potential issue | 🟠 Major

Guard ServiceMonitor creation on CRD availability.

Line 1 only checks values; installs will fail on clusters that do not have monitoring.coreos.com/v1 CRDs.

🔧 Proposed fix

```diff
-{{- if .Values.metrics.enabled }}
+{{- if and .Values.metrics.enabled (.Capabilities.APIVersions.Has "monitoring.coreos.com/v1/ServiceMonitor") }}
```
🧰 Tools
🪛 YAMLlint (1.38.0)

[error] 1-1: syntax error: expected the node content, but found '-'

(syntax)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/07_deploy_example/helm/templates/servicemonitor.yaml` at line 1, The
template only checks .Values.metrics.enabled and should also verify the
ServiceMonitor CRD exists to avoid install failures on clusters without
monitoring.coreos.com/v1; update the conditional around the ServiceMonitor
manifest in servicemonitor.yaml to require both .Values.metrics.enabled and a
successful lookup call (e.g.
lookup("apiextensions.k8s.io/v1","CustomResourceDefinition","","servicemonitors.monitoring.coreos.com")
) so the resource is rendered only when the CRD is present.

Comment on lines +45 to +47
```yaml
  enabled: true
  className: ""   # nginx, traefik, etc.
  host: ""        # e.g. agent.example.com
```

⚠️ Potential issue | 🟠 Major

Ingress is enabled by default with an empty host (wildcard exposure).

With Line 45 enabled and Line 47 empty, this commonly creates a catch-all ingress rule. Safer default is to keep ingress off until host/TLS are explicitly set.

🔧 Proposed fix

```diff
 ingress:
-  enabled: true
+  enabled: false
   className: ""   # nginx, traefik, etc.
   host: ""        # e.g. agent.example.com
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/07_deploy_example/helm/values.yaml` around lines 45 - 47, The
Ingress is enabled by default with an empty host causing a catch-all rule;
update the Helm values to set enabled: false by default (change the values.yaml
key "enabled" to false) and add validation in the chart templates/notes to only
render/create ingress when .Values.ingress.enabled is true and
.Values.ingress.host is non-empty (check .Values.ingress.host in the template or
use a required() helper) so ingress is only created when host/TLS are explicitly
provided; reference the "enabled", "className", and "host" keys when making
these changes.

Comment on lines +58 to +66
```yaml
    enabled: true
  # External Redis URL (for production — use managed Redis)
  # When redis.deploy.enabled=true and url is empty, auto-connects to the bundled Redis
  # When redis.deploy.enabled=false, set this to your managed Redis URL
  url: ""   # e.g. redis://redis:6379/0 or rediss://:auth@host:6380/0
  # Bitnami Redis chart overrides (only when redis.deploy.enabled=true)
  auth:
    enabled: false
  architecture: standalone
```

⚠️ Potential issue | 🟠 Major

Bundled Redis is unauthenticated by default.

Line 58 enables in-cluster Redis by default, while Line 65 disables auth. That weakens session-store security in shared clusters.

🔧 Proposed fix

```diff
 redis:
   deploy:
     enabled: true
@@
   auth:
-    enabled: false
+    enabled: true
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/07_deploy_example/helm/values.yaml` around lines 58 - 66, The
bundled Redis is enabled by default and has auth disabled (redis.deploy.enabled
/ auth.enabled), which is insecure; update the Helm values to require
authentication by default by setting auth.enabled to true (and provide guidance
to set a secure password or use a managed URL via the url field when disabling
redis.deploy.enabled), ensuring the redis.auth.enabled and
redis.deploy.enabled/url settings are consistent so in-cluster Redis is not left
unauthenticated.
