From 23c53dbbd7d3aa7b977e5dc5539884ff17746730 Mon Sep 17 00:00:00 2001 From: sametd Date: Wed, 11 Feb 2026 15:07:09 +0100 Subject: [PATCH 1/2] Add ECMWF observability guidelines for logging and metrics --- Observability/observability-guidelines.md | 471 ++++++++++++++++++++++ README.md | 1 + 2 files changed, 472 insertions(+) create mode 100644 Observability/observability-guidelines.md diff --git a/Observability/observability-guidelines.md b/Observability/observability-guidelines.md new file mode 100644 index 0000000..b1b581c --- /dev/null +++ b/Observability/observability-guidelines.md @@ -0,0 +1,471 @@ +# ECMWF Observability Guidelines + +## Table of Contents + +- [1. Purpose and Scope](#1-purpose-and-scope) +- [2. Core Principles](#2-core-principles) + - [2.1 Normative Language](#21-normative-language) +- [3. Platform Context](#3-platform-context) +- [4. Logging Standard](#4-logging-standard) + - [4.1 Log Event Model](#41-log-event-model) + - [4.2 Required Fields (Minimum Contract)](#42-required-fields-minimum-contract) + - [4.3 Event Naming and Attribute Cardinality](#43-event-naming-and-attribute-cardinality) + - [4.4 Library vs Binary Application Logging](#44-library-vs-binary-application-logging) + - [4.5 Good and Bad Log Lines](#45-good-and-bad-log-lines) + - [4.6 Severity and Event Design](#46-severity-and-event-design) + - [4.7 Exception and Error Logging](#47-exception-and-error-logging) + - [4.8 Safety and Compliance Rules](#48-safety-and-compliance-rules) + - [4.9 Common Anti-Patterns](#49-common-anti-patterns) + - [4.10 Validation Checklist and Ownership](#410-validation-checklist-and-ownership) +- [5. 
Metrics Standard](#5-metrics-standard) + - [5.1 Scope and Standard](#51-scope-and-standard) + - [5.2 References](#52-references) + - [5.3 Metric Types and Usage](#53-metric-types-and-usage) + - [5.4 Naming Conventions](#54-naming-conventions) + - [5.5 Labels and Cardinality](#55-labels-and-cardinality) + - [5.6 Required Baseline Metrics](#56-required-baseline-metrics) + - [5.7 Histogram Guidance](#57-histogram-guidance) + - [5.8 Good and Bad Metric Examples](#58-good-and-bad-metric-examples) + - [5.9 Validation Checklist and Ownership](#59-validation-checklist-and-ownership) + +## 1. Purpose and Scope + +This document defines the ECMWF baseline for observability across software and services. + +Current scope: + +- Defines common expectations for observability signals. +- Defines logging and metrics standardisation. +- Covers all deployment contexts at a principle level: + - Kubernetes + - Virtual machines (VMs) + - HPC + +Out of scope in this version: + +- Detailed environment-specific collection pipelines and agent deployment patterns. +- Full tracing specification (to be defined in a later revision). + +## 2. Core Principles + +- Use consistent observability conventions across all ECMWF software. +- Prefer machine-parseable telemetry over free-form text. +- Keep telemetry actionable and low-noise. +- Correlate signals where possible (for example, include trace/span identifiers in logs when available). +- Protect sensitive data by design (no credentials, tokens, or personal data in logs/metrics/traces). + +### 2.1 Normative Language + +The keywords `MUST`, `SHOULD`, and `MAY` are used as requirement levels: + +- `MUST`: mandatory requirement for compliance. +- `SHOULD`: recommended default; deviations should be justified. +- `MAY`: optional behavior. + +## 3. 
Platform Context + +ECMWF software runs in multiple environments: + +- Kubernetes clusters +- Virtual machines +- HPC systems + +This document focuses on common logs and metrics structure plus application emission rules. Environment-specific collection design for Kubernetes, VMs, and HPC will be specified later. + +## 4. Logging Standard + +ECMWF software should emit structured logs aligned with the OpenTelemetry log data model. + +Useful references: + +- OpenTelemetry logs data model: +- OpenTelemetry semantic conventions: + +### 4.1 Log Event Model + +Each log record should contain: + +- A clear event message (`body` / message). +- Severity (`severity_text`, `severity_number`). +- Timestamp in UTC. +- Stable resource attributes (service and environment metadata). +- Context attributes for debugging and operations. + +Canonical structure (OpenTelemetry-aligned): + +```json +{ + "timestamp": "2026-02-11T12:20:43Z", + "severity_text": "INFO", + "severity_number": 9, + "body": "Operation completed", + "resource": { + "service.name": "example-service", + "service.version": "1.0.0", + "deployment.environment": "prod", + "k8s.namespace.name": "default", + "k8s.pod.name": "example-service-7f8b66f9f7-rj8vd" + }, + "attributes": { + "event.name": "operation.completed", + "request.id": "req-8f31c9", + "job.id": "job-42a7", + "trace_id": "7f3fbbf5b8f24f32a59ec8ef9b264f93", + "span_id": "f9c3a29d03ef154f" + } +} +``` + +### 4.2 Required Fields (Minimum Contract) + +All production logs MUST include the following fields. 
+ +Application-emitted fields: + +| Field | Requirement | Notes | +| --- | --- | --- | +| `timestamp` | MUST | UTC, RFC 3339 / ISO-8601 format | +| `severity_text` | MUST | `DEBUG`, `INFO`, `WARN`, `ERROR`, `FATAL` | +| `severity_number` | MUST | Numeric OTel-compatible severity | +| `body` | MUST | Human-readable message describing one event | +| `service.name` | MUST | Logical service/application name | +| `service.version` | MUST | Deployed version/build identifier | +| `deployment.environment` | MUST | e.g. `dev`, `test`, `staging`, `prod` | +| `trace_id` | MUST when available | Enables log-trace correlation | +| `span_id` | MUST when available | Enables log-trace correlation | + +Collector-enriched or infrastructure fields: + +| Field | Requirement | Notes | +| --- | --- | --- | +| `host.name` | MUST (VM/HPC context) | May be emitted by app or added by collector/resource detection | +| `k8s.namespace.name` | MUST (K8s context) | May be added at collection layer | +| `k8s.pod.name` | MUST (K8s context) | May be added at collection layer | + +Recommended additional fields: + +- `event.name` (stable event type) +- `event.domain` (component/domain group) +- `error.type` and `error.message` for failures +- Request/work item identifiers (for example `request.id`, `job.id`) + +### 4.3 Event Naming and Attribute Cardinality + +Event naming convention: + +- Use `event.name` in the form `domain.action.result`. +- Use lowercase with `.` separators. +- Keep names stable over time. +- If an event meaning changes materially, create a new event name. + +Examples: + +- `operation.completed` +- `operation.failed` + +Attribute cardinality guidance: + +- Low- to medium-cardinality fields are preferred for repeated events. +- Request/job identifiers are allowed for correlation. +- Do not create dynamic field names. +- Do not move arbitrary payloads into attributes. +- Large free-text content SHOULD be avoided; when it is necessary, keep it in `body` rather than in attributes. 
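
The naming and cardinality rules above could be checked mechanically, for example in a unit test or a CI lint step. The following is a minimal sketch, not part of the guideline itself: the regular expression, the allow-list, and both function names are illustrative assumptions, and the pattern accepts any name with two or more lowercase dot-separated segments.

```python
import re

# Assumed pattern: two or more lowercase segments separated by dots,
# e.g. "operation.completed" or "library.decode.failed".
EVENT_NAME_RE = re.compile(r"^[a-z][a-z0-9_]*(\.[a-z][a-z0-9_]*)+$")

# Hypothetical allow-list of attribute keys; a real service would define its own.
ALLOWED_ATTRIBUTE_KEYS = frozenset({
    "event.name", "event.domain", "request.id", "job.id",
    "trace_id", "span_id", "error.type", "error.message",
})


def is_valid_event_name(name: str) -> bool:
    """Return True if the name is lowercase and dot-separated."""
    return EVENT_NAME_RE.fullmatch(name) is not None


def unknown_attribute_keys(attributes: dict) -> list:
    """Return attribute keys outside the allow-list, sorted for stable output.

    A non-empty result usually points at a dynamic field name.
    """
    return sorted(key for key in attributes if key not in ALLOWED_ATTRIBUTE_KEYS)
```

An allow-list is used here instead of a deny-list because dynamic field names cannot be enumerated in advance, whereas a stable schema can.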
+ +### 4.4 Library vs Binary Application Logging + +#### Libraries + +- MUST NOT configure global logging policy. +- MUST use the application-provided logger interface. +- MUST emit structured fields, not only formatted strings. +- MUST NOT log secrets or large payloads. +- SHOULD avoid excessive `INFO`/`DEBUG` logs in hot code paths. +- SHOULD include stable event names for reusable log points: + - Example: `event.name="library.decode.failed"` + - Avoid changing field keys between library versions without migration notes. + +#### Binary Applications / Services + +- MUST own logger initialisation and runtime configuration. +- MUST enforce structured JSON output compatible with OTel pipelines. +- MUST add resource context at startup (`service.*`, environment, runtime metadata). +- MUST define log level policy by environment. +- SHOULD control repetitive low-value log volume. +- MUST implement redaction/masking filters before emission. +- SHOULD ensure resource attributes are complete: + - `service.name`, `service.version`, `deployment.environment` + - Runtime and infrastructure attributes when available + +### 4.5 Good and Bad Log Lines + +Good log line characteristics: + +- Structured key/value format. +- One clear event per line. +- Includes identifiers and outcome. +- Uses stable field names. +- Supports correlation: + - Include `trace_id` and `span_id` when context exists. + - Include request/job identifiers when available. + +Examples below use the same canonical structure as Section 4.1 (`resource` and `attributes`) for consistency. + +Bad log line characteristics: + +- Free-form text without structure. +- Missing context or identifiers. +- Ambiguous message content. +- Includes sensitive information. +- Breaks schema consistency: + - Changes field names for the same event type. + - Encodes structured data only inside a message string. 
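
A log line with the good characteristics listed above could be produced with Python's standard `logging` module. This is a minimal sketch under stated assumptions: `OtelJsonFormatter`, `configure_logging`, the `SEVERITY` mapping, and the convention of passing structured fields via `extra={"attributes": ...}` are illustrative choices, not an ECMWF or OpenTelemetry API. In line with Section 4.4, only the application installs the handler; library code would just call `logging.getLogger(__name__)`.

```python
import json
import logging
from datetime import datetime, timezone

# Assumed mapping from Python level names to OTel severity text and number
# (lowest number of each OTel severity range).
SEVERITY = {
    "DEBUG": ("DEBUG", 5),
    "INFO": ("INFO", 9),
    "WARNING": ("WARN", 13),
    "ERROR": ("ERROR", 17),
    "CRITICAL": ("FATAL", 21),
}


class OtelJsonFormatter(logging.Formatter):
    """Render each record as one JSON object with `resource` and `attributes`."""

    def __init__(self, resource: dict):
        super().__init__()
        self.resource = dict(resource)

    def format(self, record: logging.LogRecord) -> str:
        text, number = SEVERITY.get(record.levelname, ("INFO", 9))
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        return json.dumps({
            "timestamp": created.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "severity_text": text,
            "severity_number": number,
            "body": record.getMessage(),
            "resource": self.resource,
            # Structured fields arrive via `extra={"attributes": {...}}`.
            "attributes": getattr(record, "attributes", {}),
        })


def configure_logging(resource: dict) -> None:
    """Application-side setup: install one JSON handler on the root logger."""
    handler = logging.StreamHandler()
    handler.setFormatter(OtelJsonFormatter(resource))
    logging.basicConfig(level=logging.INFO, handlers=[handler], force=True)


if __name__ == "__main__":
    configure_logging({
        "service.name": "example-service",
        "service.version": "1.0.0",
        "deployment.environment": "prod",
    })
    logging.getLogger("example").info(
        "Operation completed",
        extra={"attributes": {"event.name": "operation.completed",
                              "request.id": "req-8f31c9"}},
    )
```

Keeping identifiers in `attributes` rather than interpolating them into the message keeps `body` stable for a given event type, which makes queries and dashboards simpler.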
+ +Good example: + +```json +{ + "timestamp": "2026-02-11T12:20:43Z", + "severity_text": "INFO", + "severity_number": 9, + "body": "Operation completed", + "resource": { + "service.name": "example-service", + "service.version": "1.0.0", + "deployment.environment": "prod", + "k8s.namespace.name": "default", + "k8s.pod.name": "example-service-7f8b66f9f7-rj8vd" + }, + "attributes": { + "event.name": "operation.completed", + "request.id": "req-8f31c9", + "job.id": "job-42a7", + "trace_id": "7f3fbbf5b8f24f32a59ec8ef9b264f93", + "span_id": "f9c3a29d03ef154f" + } +} +``` + +Bad example: + +```text +done request ok +``` + +Bad example (sensitive data leak): + +```text +Login failed for user alice password=PlainTextSecret token=eyJhbGci... +``` + +### 4.6 Severity and Event Design + +- `DEBUG`: development diagnostics and verbose internals. +- `INFO`: normal lifecycle and business-relevant state changes. +- `WARN`: unexpected but recoverable conditions. +- `ERROR`: failed operation requiring attention. +- `FATAL`: unrecoverable condition before shutdown. + +Use stable event names (`event.name`) where possible, and make messages explicit about outcome, target, and reason. + +For severity mapping guidance, follow OpenTelemetry severity concepts in the logs data model reference. + +### 4.7 Exception and Error Logging + +- Log an exception once at the handling boundary. +- Avoid duplicate logging of the same error in multiple layers. +- Include stack traces when they materially improve diagnosis. +- Sanitize stack traces and exception messages before emission. +- Include `error.type` and `error.message` for failed operations. + +### 4.8 Safety and Compliance Rules + +- MUST never log secrets, credentials, session tokens, private keys, or personal data. +- MUST redact sensitive substrings before writing log output. +- SHOULD avoid full object dumps unless explicitly sanitized. +- SHOULD include stack traces for errors only when useful and sanitized. 
+- SHOULD define deny-lists and redaction rules centrally: + - Authentication headers and bearer tokens + - Passwords, API keys, secrets + - User personal data fields + +### 4.9 Common Anti-Patterns + +| Anti-pattern | Why it is harmful | Preferred pattern | +| --- | --- | --- | +| Free-text logs only | Hard to parse, search, and alert | Structured JSON with stable keys | +| Dynamic field names | Breaks queries and dashboards | Stable schema and key names | +| Logging in tight loops at `INFO` | Noise and cost explosion | Reduce frequency and log only meaningful state changes | +| Duplicate exception logs across layers | Inflates incident noise | Log once at handling boundary | +| Logging secrets/tokens | Security and compliance risk | Redaction and explicit deny-lists | + +### 4.10 Validation Checklist and Ownership + +Before release, teams should verify: + +- Required fields are present in production logs. +- Log output is valid structured JSON. +- Secrets and sensitive data are redacted. +- Library and binary responsibilities are correctly separated. +- Severity levels are used consistently. +- Correlation fields (`trace_id`, `span_id`) are present when tracing context exists. + +Ownership split for compliance: + +| Control | App Team | Platform Team | +| --- | --- | --- | +| Structured JSON emitted by app | MUST | N/A | +| Required app fields (`service.name`, `service.version`, `deployment.environment`, `body`, severity) | MUST | Validate only | +| Secret redaction in app logs | MUST | SHOULD add defensive redaction in pipeline | +| `k8s.namespace.name`, `k8s.pod.name`, `host.name` enrichment | MAY | MUST where collector supports it | +| Log transport to backend (for example Splunk) | N/A | MUST | +| Parsing/schema validation in collector | N/A | SHOULD | +| Log noise and volume control | SHOULD at source | SHOULD as safety net | + +## 5. Metrics Standard + +Metrics MUST be exposed in Prometheus/OpenMetrics-compatible format. 
+ECMWF services MUST use Prometheus metric types and naming conventions, and MUST expose metrics in a Prometheus/OpenMetrics-compatible text format. +Metrics defined in this section are the source for alerting rules defined in the Alerting section. + +### 5.1 Scope and Standard + +- This section defines instrumentation expectations, metric schema, and quality requirements. +- Environment-specific scrape/discovery designs for Kubernetes, VMs, and HPC are specified separately. + +### 5.2 References + +- Prometheus metric types: +- Prometheus naming best practices: +- OpenMetrics specification: + +### 5.3 Metric Types and Usage + +- `Counter`: + - MUST be monotonic. + - MUST use `_total` suffix. + - Use for counts of events and outcomes. +- `Gauge`: + - Use for values that increase and decrease (for example in-flight operations). +- `Histogram`: + - SHOULD be used for latency and size distributions. + - MUST have stable bucket boundaries for the same metric across instances. +- `Summary`: + - SHOULD be avoided for cross-instance aggregation use cases. + - MAY be used only with clear justification. + +### 5.4 Naming Conventions + +- Metric names MUST be lowercase `snake_case`. +- Metric names MUST include base units where applicable: + - `_seconds` for duration + - `_bytes` for size + - `_total` for counters +- Metric names SHOULD be stable over time. +- If a name must change, introduce the new metric and deprecate the old one before removal. + +Good naming examples: + +- `http_server_requests_total` +- `http_server_request_duration_seconds` +- `job_execution_duration_seconds` +- `process_resident_memory_bytes` + +Bad naming examples: + +- `HttpRequests` +- `requestDurationMs` +- `errors` + +### 5.5 Labels and Cardinality + +Labels add dimensionality to metrics but increase cardinality. + +- Labels MUST use stable keys and bounded value sets. 
+- Labels SHOULD describe dimensions such as: + - `service` + - `environment` + - `operation` + - `status` +- Labels MUST NOT include unbounded identifiers such as: + - `request_id` + - `user_id` + - Raw URLs with path parameters + - UUIDs or timestamps +- Label values SHOULD be normalized: + - Prefer route templates (for example `/api/v1/items/{id}`) over raw paths. + - Prefer status classes (`2xx`, `4xx`, `5xx`) when detail is not required. + +### 5.6 Required Baseline Metrics + +Application and service metrics: + +- Request/operation throughput counter + - Example: `service_requests_total` +- Request/operation failure counter + - Example: `service_request_failures_total` +- Request/operation duration histogram + - Example: `service_request_duration_seconds` +- In-flight operation gauge (if applicable) + - Example: `service_requests_in_flight` + +Runtime/process metrics (where runtime supports them): + +- CPU usage +- Memory usage +- Uptime/start time +- Runtime-specific health metrics (for example GC metrics) + +Batch/HPC job metrics (where applicable): + +- Job execution count by outcome +- Job execution duration +- Queue/wait duration + +### 5.7 Histogram Guidance + +- Histogram bucket boundaries SHOULD align with SLO/SLA objectives. +- Bucket sets MUST remain consistent for the same metric across services and versions. +- Bucket count SHOULD be limited to a practical set to control cost and query complexity. 
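
To see why bucket boundaries need to stay consistent, it may help to recall that Prometheus histogram buckets are cumulative: each `le` series counts all observations less than or equal to its bound. The sketch below illustrates that semantics in plain Python without a client library; `cumulative_bucket_counts` and the small bucket set are hypothetical and chosen only for illustration.

```python
from bisect import bisect_left

# Hypothetical bucket upper bounds in seconds. They are fixed per metric so
# that series from different instances can be summed bucket by bucket.
EXAMPLE_BUCKETS = (0.1, 0.5, 1.0, 5.0)


def cumulative_bucket_counts(observations, buckets=EXAMPLE_BUCKETS):
    """Return {le: count}, where count is the number of observations <= le."""
    per_bucket = [0] * (len(buckets) + 1)  # last slot is the +Inf bucket
    for value in observations:
        # bisect_left finds the first bucket whose upper bound is >= value.
        per_bucket[bisect_left(buckets, value)] += 1

    counts = {}
    running = 0
    for bound, count in zip(buckets, per_bucket):
        running += count
        counts[str(bound)] = running
    counts["+Inf"] = running + per_bucket[-1]
    return counts
```

If two instances used different boundaries for the same metric, their `le` series would not line up, and quantile estimates computed over the aggregated buckets would no longer be meaningful.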
+ +Example bucket set for service latency metric: + +- `0.005`, `0.01`, `0.025`, `0.05`, `0.1`, `0.25`, `0.5`, `1`, `2.5`, `5`, `10` seconds + +### 5.8 Good and Bad Metric Examples + +Good examples: + +```text +service_requests_total{service="example-service",environment="prod",operation="create",status="2xx"} 12842 +service_request_duration_seconds_bucket{service="example-service",environment="prod",operation="create",le="0.5"} 12011 +service_request_duration_seconds_sum{service="example-service",environment="prod",operation="create"} 3184.22 +service_request_duration_seconds_count{service="example-service",environment="prod",operation="create"} 12842 +``` + +Bad examples: + +```text +requests{request_id="d9fd0f7a-3d8e-4c17-9d8b-9b57f43dc40e",user_id="483992"} 1 +requestDurationMs{path="/api/v1/items/123456"} 187 +``` + +### 5.9 Validation Checklist and Ownership + +Before release, teams should verify: + +- Metric names, units, and suffixes are compliant. +- Required baseline metrics are present. +- Label keys and values are bounded and normalized. +- No high-cardinality identifiers are emitted as labels. +- Histogram buckets are defined and justified. 
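
Some of the checklist items above could be partially automated before release. The following is a rough sketch, assuming a hypothetical `check_metric` helper and a deny-list containing only the label keys that Section 5.5 names explicitly; it does not cover units, baseline metrics, or bucket justification, which still need review.

```python
import re

# Lowercase snake_case: letters and digits separated by single underscores.
METRIC_NAME_RE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")

# Label keys named as unbounded in Section 5.5; extend per service (assumption).
FORBIDDEN_LABEL_KEYS = frozenset({"request_id", "user_id"})


def check_metric(name: str, metric_type: str, label_keys) -> list:
    """Return a list of human-readable violations; empty means no finding."""
    problems = []
    if not METRIC_NAME_RE.fullmatch(name):
        problems.append(f"{name}: name is not lowercase snake_case")
    if metric_type == "counter" and not name.endswith("_total"):
        problems.append(f"{name}: counter lacks _total suffix")
    for key in sorted(set(label_keys) & FORBIDDEN_LABEL_KEYS):
        problems.append(f"{name}: label '{key}' is unbounded")
    return problems
```

Applied to the bad examples from Section 5.8, `check_metric("requests", "counter", ["request_id", "user_id"])` reports the missing suffix and both unbounded labels, while the good `service_requests_total` example passes.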
+ +Ownership split for compliance: + +| Control | App Team | Platform Team | +| --- | --- | --- | +| Instrument required baseline metrics | MUST | N/A | +| Naming and unit compliance | MUST | SHOULD validate | +| Label cardinality discipline | MUST | SHOULD enforce guardrails | +| Scrape/discovery pipeline configuration | N/A | MUST | +| Central metric relabeling and hygiene checks | N/A | SHOULD | +| Cost and cardinality monitoring at platform level | N/A | SHOULD | diff --git a/README.md b/README.md index 5cb47c9..03798a2 100644 --- a/README.md +++ b/README.md @@ -15,6 +15,7 @@ The Codex is a set of guidelines for development of software and services at ECM - [Project Maturity](./Project%20Maturity) - [Containerisation](./Containerisation) - [Testing](./Testing) +- [Observability](./Observability) - [ECMWF Software EnginE (ESEE)](./ESEE) - [Contributing to External Projects](./Contributing%20Externally/) - [Incoming External Contributions](./External%20Contributions/) From 7ff4198d166e74d204b2e5ac612c17bd2c683399 Mon Sep 17 00:00:00 2001 From: sametd Date: Wed, 11 Feb 2026 15:33:24 +0100 Subject: [PATCH 2/2] linting and wrapping --- Observability/observability-guidelines.md | 50 ++++++++++++++++------- 1 file changed, 35 insertions(+), 15 deletions(-) diff --git a/Observability/observability-guidelines.md b/Observability/observability-guidelines.md index b1b581c..3ecaa61 100644 --- a/Observability/observability-guidelines.md +++ b/Observability/observability-guidelines.md @@ -1,5 +1,9 @@ # ECMWF Observability Guidelines + + ## Table of Contents - [1. Purpose and Scope](#1-purpose-and-scope) @@ -51,8 +55,10 @@ Out of scope in this version: - Use consistent observability conventions across all ECMWF software. - Prefer machine-parseable telemetry over free-form text. - Keep telemetry actionable and low-noise. -- Correlate signals where possible (for example, include trace/span identifiers in logs when available). 
-- Protect sensitive data by design (no credentials, tokens, or personal data in logs/metrics/traces). +- Correlate signals where possible (for example, include trace/span + identifiers in logs when available). +- Protect sensitive data by design (no credentials, tokens, or personal data + in logs/metrics/traces). ### 2.1 Normative Language @@ -70,11 +76,14 @@ ECMWF software runs in multiple environments: - Virtual machines - HPC systems -This document focuses on common logs and metrics structure plus application emission rules. Environment-specific collection design for Kubernetes, VMs, and HPC will be specified later. +This document focuses on common logs and metrics structure plus application +emission rules. Environment-specific collection design for Kubernetes, VMs, +and HPC will be specified later. ## 4. Logging Standard -ECMWF software should emit structured logs aligned with the OpenTelemetry log data model. +ECMWF software should emit structured logs aligned with the OpenTelemetry +log data model. Useful references: @@ -208,7 +217,8 @@ Good log line characteristics: - Include `trace_id` and `span_id` when context exists. - Include request/job identifiers when available. -Examples below use the same canonical structure as Section 4.1 (`resource` and `attributes`) for consistency. +Examples below use the same canonical structure as Section 4.1 (`resource` +and `attributes`) for consistency. Bad log line characteristics: @@ -265,9 +275,11 @@ Login failed for user alice password=PlainTextSecret token=eyJhbGci... - `ERROR`: failed operation requiring attention. - `FATAL`: unrecoverable condition before shutdown. -Use stable event names (`event.name`) where possible, and make messages explicit about outcome, target, and reason. +Use stable event names (`event.name`) where possible, and make messages +explicit about outcome, target, and reason. -For severity mapping guidance, follow OpenTelemetry severity concepts in the logs data model reference. 
+For severity mapping guidance, follow OpenTelemetry severity concepts in the +logs data model reference. ### 4.7 Exception and Error Logging @@ -279,7 +291,8 @@ For severity mapping guidance, follow OpenTelemetry severity concepts in the log ### 4.8 Safety and Compliance Rules -- MUST never log secrets, credentials, session tokens, private keys, or personal data. +- MUST never log secrets, credentials, session tokens, private keys, or + personal data. - MUST redact sensitive substrings before writing log output. - SHOULD avoid full object dumps unless explicitly sanitized. - SHOULD include stack traces for errors only when useful and sanitized. @@ -324,19 +337,24 @@ Ownership split for compliance: ## 5. Metrics Standard Metrics MUST be exposed in Prometheus/OpenMetrics-compatible format. -ECMWF services MUST use Prometheus metric types and naming conventions, and MUST expose metrics in a Prometheus/OpenMetrics-compatible text format. -Metrics defined in this section are the source for alerting rules defined in the Alerting section. +ECMWF services MUST use Prometheus metric types and naming conventions, and +MUST expose metrics in a Prometheus/OpenMetrics-compatible text format. +Metrics defined in this section are the source for alerting rules defined in +the Alerting section. ### 5.1 Scope and Standard -- This section defines instrumentation expectations, metric schema, and quality requirements. -- Environment-specific scrape/discovery designs for Kubernetes, VMs, and HPC are specified separately. +- This section defines instrumentation expectations, metric schema, and + quality requirements. +- Environment-specific scrape/discovery designs for Kubernetes, VMs, and HPC + are specified separately. 
### 5.2 References - Prometheus metric types: - Prometheus naming best practices: -- OpenMetrics specification: +- OpenMetrics specification: + ### 5.3 Metric Types and Usage @@ -361,7 +379,8 @@ Metrics defined in this section are the source for alerting rules defined in the - `_bytes` for size - `_total` for counters - Metric names SHOULD be stable over time. -- If a name must change, introduce the new metric and deprecate the old one before removal. +- If a name must change, introduce the new metric and deprecate the old one + before removal. Good naming examples: @@ -429,7 +448,8 @@ Batch/HPC job metrics (where applicable): Example bucket set for service latency metric: -- `0.005`, `0.01`, `0.025`, `0.05`, `0.1`, `0.25`, `0.5`, `1`, `2.5`, `5`, `10` seconds +- `0.005`, `0.01`, `0.025`, `0.05`, `0.1`, `0.25`, `0.5`, `1`, `2.5`, `5`, + `10` seconds ### 5.8 Good and Bad Metric Examples