@lionakhnazarov lionakhnazarov commented Dec 17, 2025

The Keep Core node now exposes 31+ performance metrics via the /metrics endpoint (port 9601). These metrics provide comprehensive visibility into node operations, network health, and system performance.
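To work with these values outside a dashboard, the endpoint output can be read into a dictionary. The sketch below assumes the endpoint serves the Prometheus text exposition format without timestamps, and that the node is reachable on localhost; `parse_metrics` and `fetch_metrics` are illustrative names, not part of Keep Core.

```python
from urllib.request import urlopen


def parse_metrics(text: str) -> dict:
    """Parse Prometheus-style text into {sample_name: value}.

    Label sets stay part of the key, e.g. 'queue_size{channel="a"}'.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        # Assumes no trailing timestamp, so the value is the last field.
        name, _, value = line.rpartition(" ")
        samples[name.strip()] = float(value)
    return samples


def fetch_metrics(url: str = "http://localhost:9601/metrics") -> dict:
    """Scrape the endpoint once; the host name is an assumption."""
    with urlopen(url, timeout=5) as response:
        return parse_metrics(response.read().decode("utf-8"))


# Hand-written sample in the assumed format, not real node output.
sample = """\
# TYPE performance_signing_success_total counter
performance_signing_success_total 42
performance_incoming_message_queue_size{channel="example"} 17
"""
parsed = parse_metrics(sample)
```

Keeping the parser separate from the HTTP call makes it easy to run against saved scrapes.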

Integrated Metrics by Category

1. DKG (Distributed Key Generation) Metrics (6 metrics)

Counters:

  • performance_dkg_joined_total - Total number of DKG joins (members joined)
  • performance_dkg_failed_total - Total number of failed DKG executions
  • performance_dkg_validation_total - Total number of DKG result validations performed
  • performance_dkg_challenges_submitted_total - Total number of DKG challenges submitted on-chain
  • performance_dkg_approvals_submitted_total - Total number of DKG approvals submitted on-chain

Duration Metrics:

  • performance_dkg_duration_seconds - Average duration of DKG operations
  • performance_dkg_duration_seconds_count - Total count of DKG operations

Performance Insights:

  • Success Rate: dkg_joined_total / (dkg_joined_total + dkg_failed_total) - Monitor DKG participation and success rates
  • Duration Monitoring: Alert if dkg_duration_seconds exceeds 300 seconds (5 minutes) - indicates slow DKG operations
  • On-chain Activity: Track dkg_challenges_submitted_total and dkg_approvals_submitted_total to monitor dispute resolution activity
  • Validation Rate: High dkg_validation_total relative to joins indicates active validation of DKG results
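The success-rate formula and the 300-second duration check above can be sketched as two small helpers. The function names are hypothetical; the inputs are assumed to be values already read from the corresponding metrics.

```python
from typing import Optional

DKG_DURATION_ALERT_SECONDS = 300  # 5-minute threshold from the insights above


def dkg_success_rate(joined: float, failed: float) -> Optional[float]:
    """joined / (joined + failed); None when no DKG has been recorded yet."""
    total = joined + failed
    return joined / total if total > 0 else None


def dkg_is_slow(duration_seconds: float) -> bool:
    """True when the reported duration exceeds the alert threshold."""
    return duration_seconds > DKG_DURATION_ALERT_SECONDS
```

Returning `None` for an empty history avoids a division by zero on a freshly started node.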

2. Signing Operations Metrics (5 metrics)

Counters:

  • performance_signing_operations_total - Total number of signing operations attempted
  • performance_signing_success_total - Total number of successful signing operations
  • performance_signing_failed_total - Total number of failed signing operations
  • performance_signing_timeouts_total - Total number of signing operations that timed out

Duration Metrics:

  • performance_signing_duration_seconds - Average duration of signing operations
  • performance_signing_duration_seconds_count - Total count of signing operations

Performance Insights:

  • Success Rate: signing_success_total / signing_operations_total - Critical metric for node reliability
  • Failure Rate: Alert if signing_failed_total rate > 10% of total operations
  • Timeout Rate: signing_timeouts_total / signing_operations_total - Indicates network or coordination issues
  • Performance: Alert if signing_duration_seconds exceeds 60 seconds - indicates slow signing operations
  • Throughput: Monitor signing_operations_total rate to understand signing workload
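Because the counters are cumulative, a rate such as "failures > 10% of operations" is usually evaluated over the increase between two scrapes rather than over lifetime totals. The sketch below takes that approach, which is an assumption about how the alert would be applied; `window_ratio` and `signing_alerts` are illustrative names.

```python
from typing import Optional

FAILURE_RATE_ALERT = 0.10     # 10% threshold from the insights above
DURATION_ALERT_SECONDS = 60   # 60-second threshold from the insights above


def window_ratio(prev: dict, curr: dict, numerator: str, denominator: str) -> Optional[float]:
    """Ratio of counter increases between two scrapes.

    Returns None when the denominator did not grow or a counter went
    backwards (for example after a node restart).
    """
    d_num = curr[numerator] - prev[numerator]
    d_den = curr[denominator] - prev[denominator]
    if d_den <= 0 or d_num < 0:
        return None
    return d_num / d_den


def signing_alerts(prev: dict, curr: dict) -> list:
    """Collect alert messages for the failure-rate and duration thresholds."""
    alerts = []
    failure = window_ratio(prev, curr, "performance_signing_failed_total",
                           "performance_signing_operations_total")
    if failure is not None and failure > FAILURE_RATE_ALERT:
        alerts.append(f"signing failure rate {failure:.1%} exceeds 10%")
    if curr.get("performance_signing_duration_seconds", 0.0) > DURATION_ALERT_SECONDS:
        alerts.append("signing duration exceeds 60 seconds")
    return alerts


# Made-up scrape values for illustration only.
previous = {"performance_signing_operations_total": 100,
            "performance_signing_failed_total": 5}
current = {"performance_signing_operations_total": 150,
           "performance_signing_failed_total": 12,
           "performance_signing_duration_seconds": 12.5}
alerts = signing_alerts(previous, current)
```

The same `window_ratio` helper works for the timeout rate by passing `performance_signing_timeouts_total` as the numerator.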

3. Wallet Dispatcher Metrics (7 metrics)

Counters:

  • performance_wallet_actions_total - Total number of wallet actions dispatched
  • performance_wallet_action_success_total - Total number of successfully completed wallet actions
  • performance_wallet_action_failed_total - Total number of failed wallet actions
  • performance_wallet_dispatcher_rejected_total - Total number of wallet actions rejected (wallet busy)
  • performance_wallet_heartbeat_failures_total - Total number of wallet heartbeat failures

Gauges:

  • performance_wallet_dispatcher_active_actions - Current number of wallets with active actions

Duration Metrics:

  • performance_wallet_action_duration_seconds - Average duration of wallet actions
  • performance_wallet_action_duration_seconds_count - Total count of wallet actions

Performance Insights:

  • Rejection Rate: wallet_dispatcher_rejected_total / wallet_actions_total - Alert if > 5%, which indicates wallet saturation
  • Success Rate: wallet_action_success_total / wallet_actions_total - Monitor wallet action reliability
  • Utilization: wallet_dispatcher_active_actions shows current wallet workload
  • Bottleneck Detection: High rejection rate + high active actions = wallet bottleneck
  • Health Monitoring: wallet_heartbeat_failures_total indicates wallet connectivity issues
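The bottleneck rule above combines two signals. A minimal sketch follows; the 5% rejection threshold comes from the list, while the cutoff for "high active actions" is a placeholder, since the description does not define it.

```python
REJECTION_RATE_ALERT = 0.05  # 5% threshold from the insights above
ACTIVE_ACTIONS_HIGH = 10     # hypothetical cutoff; not defined in this PR


def wallet_bottleneck(actions_total: float, rejected_total: float,
                      active_actions: float) -> bool:
    """True when both the rejection rate and the active-action gauge are high."""
    if actions_total <= 0:
        return False  # nothing dispatched yet, so no rate to evaluate
    rejection_rate = rejected_total / actions_total
    return (rejection_rate > REJECTION_RATE_ALERT
            and active_actions >= ACTIVE_ACTIONS_HIGH)
```

Requiring both conditions keeps a short burst of rejections on an otherwise idle node from being reported as saturation.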

4. Coordination Operations Metrics (4 metrics)

Counters:

  • performance_coordination_windows_detected_total - Total number of coordination windows detected
  • performance_coordination_procedures_executed_total - Total number of coordination procedures executed successfully
  • performance_coordination_failed_total - Total number of failed coordination procedures

Duration Metrics:

  • performance_coordination_duration_seconds - Average duration of coordination procedures
  • performance_coordination_duration_seconds_count - Total count of coordination procedures

Performance Insights:

  • Execution Rate: coordination_procedures_executed_total / coordination_windows_detected_total - Success rate of coordination
  • Failure Rate: Alert if coordination_failed_total rate > 5% of detected windows
  • Window Detection: Monitor coordination_windows_detected_total to understand coordination frequency
  • Performance: Track coordination_duration_seconds to identify slow coordination operations
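The execution and failure rates above share a denominator (detected windows), so they can be summarized together. This is a sketch with hypothetical names; inputs are assumed to be values read from the three coordination counters.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CoordinationSummary:
    execution_rate: float  # executed / detected windows
    failure_rate: float    # failed / detected windows
    failure_alert: bool    # True when failure_rate exceeds the threshold


def summarize_coordination(windows_detected: float, executed: float, failed: float,
                           failure_threshold: float = 0.05) -> Optional[CoordinationSummary]:
    """Compute both rates; None when no window has been detected yet."""
    if windows_detected <= 0:
        return None
    execution_rate = executed / windows_detected
    failure_rate = failed / windows_detected
    return CoordinationSummary(execution_rate, failure_rate,
                               failure_rate > failure_threshold)
```

For example, 40 detected windows with 36 executed and 4 failed gives an execution rate of 0.9 and a failure rate of 0.1, which crosses the 5% threshold.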

5. Network Operations Metrics (10 metrics)

Peer Connection Metrics:

  • performance_peer_connections_total - Total number of peer connections established
  • performance_peer_disconnections_total - Total number of peer disconnections

Message Metrics:

  • performance_message_broadcast_total - Total number of messages broadcast to the network
  • performance_message_received_total - Total number of messages received from the network

Queue Size Metrics (Gauges):

  • performance_incoming_message_queue_size - Current size of incoming message queue (with channel label)
  • performance_message_handler_queue_size - Current size of message handler queues (with channel and handler labels)

Ping Test Metrics:

  • performance_ping_test_total - Total number of ping tests performed
  • performance_ping_test_success_total - Total number of successful ping tests
  • performance_ping_test_failed_total - Total number of failed ping tests
  • performance_ping_test_duration_seconds - Average duration of ping tests
  • performance_ping_test_duration_seconds_count - Total count of ping tests

Performance Insights:

  • Network Health: peer_connections_total vs peer_disconnections_total - Monitor connection stability
  • Message Throughput: Track message_broadcast_total and message_received_total rates
  • Queue Backlog: Alert if incoming_message_queue_size > 3000 (roughly 75% of the 4096 capacity) - indicates message processing bottleneck
  • Handler Backlog: Alert if message_handler_queue_size > 400 (roughly 75% of the 512 capacity) - indicates handler saturation
  • Network Latency: ping_test_duration_seconds shows network round-trip time
  • Connectivity: Alert if ping_test_failed_total rate > 10% of ping tests - indicates network issues
  • Message Balance: Compare broadcast vs received to detect message loss
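The queue thresholds above can be checked per label set. The sketch below assumes the gauge samples have already been grouped by their channel and handler labels; the channel and handler names in the example are made up.

```python
INCOMING_QUEUE_CAPACITY = 4096  # capacity stated in the insights above
INCOMING_QUEUE_ALERT = 3000     # alert threshold stated above
HANDLER_QUEUE_CAPACITY = 512
HANDLER_QUEUE_ALERT = 400


def queue_alerts(incoming: dict, handlers: dict) -> list:
    """Return one message per queue whose size exceeds its alert threshold.

    incoming maps channel -> size; handlers maps (channel, handler) -> size.
    """
    alerts = []
    for channel, size in sorted(incoming.items()):
        if size > INCOMING_QUEUE_ALERT:
            alerts.append(
                f"incoming queue {channel}: {size}/{INCOMING_QUEUE_CAPACITY} "
                f"({size / INCOMING_QUEUE_CAPACITY:.0%} full)")
    for (channel, handler), size in sorted(handlers.items()):
        if size > HANDLER_QUEUE_ALERT:
            alerts.append(
                f"handler queue {channel}/{handler}: {size}/{HANDLER_QUEUE_CAPACITY} "
                f"({size / HANDLER_QUEUE_CAPACITY:.0%} full)")
    return alerts


# Illustrative values only.
alerts = queue_alerts(
    incoming={"channel-a": 3500, "channel-b": 120},
    handlers={("channel-a", "handler-1"): 410, ("channel-a", "handler-2"): 12},
)
```

Reporting the fill percentage alongside the raw size makes it easier to compare the two queue types, which have different capacities.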

- Introduced a new performance metrics system to monitor various operations within the Keep Core node, including wallet actions, DKG processes, signing operations, coordination procedures, and network activities.
- Metrics are recorded through a new interface, allowing for optional integration without impacting performance when disabled.
- Updated relevant components to wire in metrics recording, ensuring comprehensive coverage of critical operations.
- Added documentation detailing implemented metrics and their usage.

This enhancement provides better visibility into node performance and health, facilitating monitoring and troubleshooting.
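The optional-interface idea mentioned in the bullets above can be illustrated with a no-op recorder. This is a language-neutral sketch in Python, not the Go interface added by this PR; every name here (`MetricsRecorder`, `NoOpRecorder`, `InMemoryRecorder`, `run_signing`) is hypothetical.

```python
from typing import Optional, Protocol


class MetricsRecorder(Protocol):
    """Hypothetical recorder interface; the real one is not shown in this PR text."""

    def increment_counter(self, name: str, value: float = 1.0) -> None: ...


class NoOpRecorder:
    """Used when metrics are disabled: every call returns immediately."""

    def increment_counter(self, name: str, value: float = 1.0) -> None:
        pass


class InMemoryRecorder:
    """Simple recorder that keeps counters in a dictionary."""

    def __init__(self) -> None:
        self.counters = {}

    def increment_counter(self, name: str, value: float = 1.0) -> None:
        self.counters[name] = self.counters.get(name, 0.0) + value


def run_signing(succeeded: bool, recorder: Optional[MetricsRecorder] = None) -> None:
    """Stand-in for an instrumented operation; callers may omit the recorder."""
    if recorder is None:
        recorder = NoOpRecorder()
    recorder.increment_counter("performance_signing_operations_total")
    outcome = "success" if succeeded else "failed"
    recorder.increment_counter(f"performance_signing_{outcome}_total")


recorder = InMemoryRecorder()
run_signing(True, recorder)
run_signing(False, recorder)
run_signing(True)  # metrics disabled: nothing is recorded
```

With this shape, instrumented code never has to check whether metrics are enabled; the disabled path costs only an empty method call.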
@lionakhnazarov lionakhnazarov marked this pull request as ready for review December 31, 2025 18:43
- Introduced performance metrics for the deposit and redemption processes, including execution and proof submission metrics.
- Updated the .gitignore file to exclude new directories: data/, logs/, and storage/.
- Enhanced existing code to wire in metrics recording for redemption actions, improving visibility into redemption performance and potential bottlenecks.
- Added documentation outlining the new metrics and their implementation details.