Add Prometheus/Grafana observability for DynamoDB and Raft and normalize legacy GSI snapshots for compatibility #340

Conversation
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly enhances the observability of Elastickv by integrating Prometheus and Grafana. It introduces detailed metrics for DynamoDB-compatible API operations and Raft consensus, allowing users to monitor performance, identify bottlenecks, and understand cluster health. Additionally, it includes a critical compatibility fix for legacy GSI schema snapshots, ensuring smooth operation with existing data.
Code Review
This pull request introduces a comprehensive observability layer using Prometheus and Grafana for both DynamoDB and Raft components, which is a significant improvement for monitoring and debugging the system. The implementation is well-structured, with a new monitoring package that cleanly separates the metrics logic, and includes a fix to normalize legacy GSI snapshots. However, the current implementation is vulnerable to Denial of Service (DoS) attacks via Prometheus label cardinality explosion. Specifically, the operation and table labels are derived from untrusted user input without sufficient validation, which an attacker could exploit to cause excessive memory usage. It is recommended to implement strict validation for these labels against known operations and existing tables. Additionally, there are minor suggestions for improving error handling in the new monitoring and server code.
/gemini review
Code Review
This pull request introduces detailed Prometheus metrics for DynamoDB-compatible API and the internal Raft state, significantly enhancing system observability. The implementation is generally solid. The inclusion of a pre-built Grafana dashboard, along with the necessary Docker Compose and configuration files, facilitates easy adoption of the new monitoring capabilities. Furthermore, the change to normalize legacy GSI snapshots improves backward compatibility and data consistency. The code is well-structured, thoroughly tested, and the documentation updates are clear and helpful. Consider the suggestion to simplify the table merging logic.
/gemini review
Code Review
This pull request introduces comprehensive observability features by adding Prometheus metrics for DynamoDB and Raft operations, along with a pre-configured Grafana dashboard. The implementation is well-structured within a new monitoring package, following best practices for metrics instrumentation in Go. Additionally, it includes a backward-compatibility improvement for legacy GSI snapshots. The code quality is high, with good test coverage for the new functionality. I've found one issue related to the server shutdown sequence that should be addressed.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
/gemini review
Code Review
This pull request introduces a comprehensive monitoring solution using Prometheus and Grafana for DynamoDB and Raft components, which is a great step towards better observability. It also includes a backward-compatibility improvement for legacy GSI snapshots. The changes are extensive, adding a new monitoring package, instrumenting the dynamodb adapter, and updating server entrypoints. The implementation is well-structured, using functional options for configuration and providing new tests for the added functionality. My review identified a potential concurrency issue in the Raft metrics observer and a minor code duplication that could be refactored for better maintainability.
/gemini review
Code Review
This pull request introduces significant observability features by adding Prometheus metrics for DynamoDB and Raft operations, along with a pre-configured Grafana dashboard, and includes a valuable fix for normalizing legacy Global Secondary Index (GSI) snapshots. While the implementation is well-structured, it introduces security risks related to denial of service and unauthorized information disclosure. Specifically, the lack of authentication on the new metrics endpoints and the use of unvalidated user-supplied strings (table names) as Prometheus labels can lead to a label cardinality explosion, causing a denial of service. It is recommended to implement authentication for the metrics server and restrict label values to a known set of authorized resources. Additionally, consider improving the robustness of the metrics collection logic.
```go
func (m *DynamoDBMetrics) observeTables(operation string, outcome string, tables []string, tableMetrics map[string]DynamoDBTableMetrics) {
	for _, table := range tables {
		m.tableRequests.WithLabelValues(operation, table, outcome).Inc()
```
The application uses user-supplied table names as Prometheus label values without any validation or restriction. Since the DynamoDB-compatible API (specifically the CreateTable operation) lacks authentication, an attacker can programmatically create a large number of tables with unique names. This leads to a 'label cardinality explosion' in the monitoring system (Prometheus), which can cause excessive memory consumption and potentially crash the Prometheus server or the application itself, resulting in a denial of service (DoS).
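A common mitigation for this class of issue is to validate every candidate label value against a bounded allow-list before calling `WithLabelValues`, collapsing anything unexpected into a single overflow bucket. The sketch below illustrates the idea with only the standard library; the allow-list contents and the `sanitizeLabel`/`overflowLabel` names are hypothetical, not the PR's actual API:

```go
package main

import "fmt"

// Hypothetical allow-list of DynamoDB-style operation names. Any value
// outside the list collapses into a single overflow bucket, so metric
// cardinality stays bounded no matter what clients send.
var knownOperations = map[string]struct{}{
	"PutItem": {}, "GetItem": {}, "Query": {}, "Scan": {}, "CreateTable": {},
}

const overflowLabel = "_other"

// sanitizeLabel returns the value unchanged if it is allowed, and the
// overflow bucket otherwise.
func sanitizeLabel(value string, allowed map[string]struct{}) string {
	if _, ok := allowed[value]; ok {
		return value
	}
	return overflowLabel
}

func main() {
	fmt.Println(sanitizeLabel("Query", knownOperations))               // Query
	fmt.Println(sanitizeLabel("attacker-table-9999", knownOperations)) // _other
}
```

Table names would need the same treatment, checked against the set of tables that actually exist rather than a static list.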
cmd/server/demo.go (Outdated)
```go
ms := &http.Server{
	Handler:           metricsRegistry.Handler(),
	ReadHeaderTimeout: time.Second,
}
```
The newly introduced metrics server is started without any authentication mechanism. This endpoint exposes internal system metrics to anyone who can reach the network port. While common in isolated environments, exposing metrics without authentication can lead to information disclosure (e.g., table names, traffic patterns) and increases the attack surface for denial-of-service attacks, such as scraping the endpoint at a high frequency or exploiting the label cardinality issue mentioned in other comments.
main.go (Outdated)
```go
if err := startMetricsServer(r.ctx, r.lc, r.eg, r.metricsAddress, r.metricsRegistry.Handler()); err != nil {
	return waitErrgroupAfterStartupFailure(r.cancel, r.eg, err)
}
```
```go
func (s *dynamoRequestMetricsState) addTableMetrics(table string, returnedItems int, scannedItems int, writtenItems int) {
	if s == nil {
		return
	}
	table = strings.TrimSpace(table)
	if table == "" {
		return
	}
	metrics, ok := s.tables[table]
	if !ok {
		return
	}
	metrics.ReturnedItems += returnedItems
	metrics.ScannedItems += scannedItems
	metrics.WrittenItems += writtenItems
	s.tables[table] = metrics
}
```
The current implementation of addTableMetrics will silently drop metrics if recordTable has not been called for a given table first. This makes the metrics collection fragile, as it relies on every call path correctly calling recordTable before addTableMetrics. To make this more robust, addTableMetrics should ensure the table is recorded itself. Calling recordTable within addTableMetrics is idempotent and will prevent metrics from being silently lost if a new code path is added in the future that doesn't explicitly call recordTable first.
```go
func (s *dynamoRequestMetricsState) addTableMetrics(table string, returnedItems int, scannedItems int, writtenItems int) {
	if s == nil {
		return
	}
	table = strings.TrimSpace(table)
	if table == "" {
		return
	}
	s.recordTable(table)
	metrics := s.tables[table]
	metrics.ReturnedItems += returnedItems
	metrics.ScannedItems += scannedItems
	metrics.WrittenItems += writtenItems
	s.tables[table] = metrics
}
```
/gemini review

/gemini review

/gemini review
Code Review
This is an excellent pull request that adds comprehensive observability features for DynamoDB and Raft using Prometheus and Grafana. The implementation is robust, featuring production-ready practices like metric cardinality management and security for the metrics endpoint. The code is well-structured, thoroughly tested, and accompanied by clear documentation and tooling. The addition of backward compatibility for legacy GSI snapshots is also a valuable improvement. I have one suggestion to improve resource handling in the demo application's startup sequence.
```go
if err != nil {
	return err
}
```
There's a potential resource leak in the run function's startup sequence. If this call to setupMetricsHTTPServer fails, the function returns an error, but previously created network listeners (e.g., dynamoL, redisL, grpcSock) are not closed. This can lead to file descriptor leaks.
To address this, you should ensure resources are cleaned up in case of partial startup failure. You could explicitly close the listeners in the error-handling blocks. A more robust solution would be to refactor this run function to adopt the resource management pattern used in main.go with its runtimeServerRunner and start... helper functions.
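The first option can be sketched as follows: record listeners as they are opened and close them all on any subsequent failure. The function and listener names mirror the review comment but the failing step is simulated; this is not the PR's actual `run` function:

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// startListeners opens several listeners in sequence and guarantees that
// none of them leak if a later startup step fails.
func startListeners() ([]net.Listener, error) {
	var opened []net.Listener
	closeAll := func() {
		for _, l := range opened {
			l.Close() // release file descriptors on partial failure
		}
	}

	for _, name := range []string{"dynamoL", "redisL", "grpcSock"} {
		l, err := net.Listen("tcp", "127.0.0.1:0")
		if err != nil {
			closeAll()
			return nil, fmt.Errorf("listen %s: %w", name, err)
		}
		opened = append(opened, l)
	}

	// Simulate setupMetricsHTTPServer failing after the listeners exist:
	// without closeAll() here, all three listeners would leak.
	closeAll()
	return nil, errors.New("setupMetricsHTTPServer: simulated failure")
}

func main() {
	ls, err := startListeners()
	fmt.Println(len(ls) == 0, err != nil) // true true: nothing leaked, error surfaced
}
```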
Code Review
This is an excellent pull request that adds comprehensive observability for DynamoDB and Raft using Prometheus and Grafana. The implementation is very well-thought-out, especially with the careful handling of metric cardinality, which is crucial for a stable monitoring system. The new monitoring package is well-structured, and the addition of tests and documentation is thorough. I also appreciate the backward compatibility consideration for legacy GSI snapshots. I have one minor suggestion to remove a piece of redundant code.
Code Review
This pull request introduces significant observability features by adding Prometheus metrics for both the DynamoDB-compatible API and the underlying Raft consensus layer. It also includes pre-built Grafana dashboards for visualization. The implementation is robust, with careful consideration for metric cardinality and a secure-by-default approach for the new metrics endpoint. Additionally, this PR addresses a compatibility issue by normalizing legacy GSI snapshot formats, ensuring backward compatibility. The changes are well-structured and include comprehensive tests. My review focuses on a couple of areas for improvement regarding code consistency and metric accuracy.
Note: Security Review is unavailable for this PR.
main.go (Outdated)
```go
func startMetricsServer(ctx context.Context, lc *net.ListenConfig, eg *errgroup.Group, metricsAddr string, metricsToken string, handler http.Handler) error {
	if strings.TrimSpace(metricsAddr) == "" || handler == nil {
		return nil
	}
	if monitoring.MetricsAddressRequiresToken(metricsAddr) && strings.TrimSpace(metricsToken) == "" {
		return errors.New("metricsToken is required when metricsAddress is not loopback")
	}
	metricsL, err := lc.Listen(ctx, "tcp", metricsAddr)
	if err != nil {
		return errors.Wrapf(err, "failed to listen on %s", metricsAddr)
	}
	metricsServer := &http.Server{
		Handler:           monitoring.ProtectHandler(handler, metricsToken),
		ReadHeaderTimeout: time.Second,
	}
	eg.Go(func() error {
		return serveMetricsUntilCanceled(ctx, metricsServer, metricsL, metricsAddr)
	})
	return nil
}

func serveMetricsUntilCanceled(ctx context.Context, server *http.Server, listener net.Listener, metricsAddr string) error {
	stop := make(chan struct{})
	go watchMetricsShutdown(ctx, server, stop, metricsAddr)

	err := server.Serve(listener)
	close(stop)
	if errors.Is(err, http.ErrServerClosed) || errors.Is(err, net.ErrClosed) {
		return nil
	}
	return errors.WithStack(err)
}

func watchMetricsShutdown(ctx context.Context, server *http.Server, stop <-chan struct{}, metricsAddr string) {
	select {
	case <-ctx.Done():
		shutdownCtx, cancel := context.WithTimeout(context.WithoutCancel(ctx), metricsShutdownTimeout)
		defer cancel()
		if err := server.Shutdown(shutdownCtx); err != nil && !errors.Is(err, http.ErrServerClosed) && !errors.Is(err, net.ErrClosed) {
			log.Printf("metrics server shutdown error on %s: %v", metricsAddr, err)
		}
	case <-stop:
	}
}
```
The logic for starting and stopping the metrics server here is implemented differently from the one in cmd/server/demo.go. The demo.go file uses errgroup tasks (metricsServeTask, metricsShutdownTask) which is cleaner and more consistent with how other services are managed in this application.
To improve code consistency and reduce duplication, I recommend refactoring this implementation to use the same task-based pattern as in cmd/server/demo.go. You could potentially move the shared task functions into the monitoring package to be reused by both main.go and demo.go.
References
- Centralize common complex logic, such as server startup/shutdown tasks, into shared helper functions or packages to avoid code duplication and improve maintainability, similar to how retry/backoff mechanisms are centralized.
```go
func parseLastContactSeconds(raw string) float64 {
	raw = strings.TrimSpace(raw)
	switch raw {
	case "", "never":
		return 0
	case "0":
		return 0
	}
	d, err := time.ParseDuration(raw)
	if err != nil {
		if _, loaded := loggedLastContactParseValues.LoadOrStore(raw, struct{}{}); !loaded {
			slog.Warn("failed to parse raft last_contact metric", "raw", raw, "err", err)
		}
		return 0
	}
	return d.Seconds()
}
```
The function parseLastContactSeconds currently returns 0 in cases of a parse error or when the value is "never". This can be misleading, as a value of 0 for elastickv_raft_last_contact_seconds typically implies that contact was made just now. This could mask issues with Raft stats reporting or lead to incorrect interpretations of the metric.
To make the metric clearer, I suggest returning a sentinel value like -1 for unknown states (e.g., "never" or parse errors). This makes it explicit that the value is not a valid duration, and the Grafana dashboard can be configured to handle this special case appropriately (e.g., by displaying 'N/A' or hiding the value).
Suggested change:

```go
func parseLastContactSeconds(raw string) float64 {
	raw = strings.TrimSpace(raw)
	switch raw {
	case "", "never":
		return -1 // Indicate unknown/never
	case "0":
		return 0
	}
	d, err := time.ParseDuration(raw)
	if err != nil {
		if _, loaded := loggedLastContactParseValues.LoadOrStore(raw, struct{}{}); !loaded {
			slog.Warn("failed to parse raft last_contact metric", "raw", raw, "err", err)
		}
		return -1 // Indicate parse error
	}
	return d.Seconds()
}
```
/gemini review
Code Review
This pull request introduces comprehensive observability for DynamoDB and Raft components using Prometheus and Grafana, which is a significant and valuable addition. The implementation is thorough, including metrics for request rates, latency, errors, and Raft state. It also includes a compatibility layer for legacy GSI snapshots. The code is well-structured, with a new monitoring package that cleanly separates concerns. The inclusion of tests, documentation updates, and pre-configured Grafana dashboards makes this a high-quality contribution. I have one minor suggestion for code simplification.
```go
func (c *countingReadCloser) Read(p []byte) (int, error) {
	if c == nil || c.ReadCloser == nil {
		return 0, io.EOF
	}
	n, err := c.ReadCloser.Read(p)
	c.bytesRead += n
	if err == nil {
		return n, nil
	}
	if errors.Is(err, io.EOF) {
		return n, io.EOF
	}
	return n, errors.WithStack(err)
}
```
This function's implementation is correct, but it can be simplified for better readability by reducing the number of conditional return paths.
```go
func (c *countingReadCloser) Read(p []byte) (int, error) {
	if c == nil || c.ReadCloser == nil {
		return 0, io.EOF
	}
	n, err := c.ReadCloser.Read(p)
	c.bytesRead += n
	if err != nil && !errors.Is(err, io.EOF) {
		return n, errors.WithStack(err)
	}
	return n, err
}
```