ENG-506 Add performance metrics tracking for key operations #3857
+2,034
−9
The Keep Core node now exposes 31+ performance metrics via the /metrics endpoint (port 9601). These metrics provide comprehensive visibility into node operations, network health, and system performance.
Integrated Metrics by Category
1. DKG (Distributed Key Generation) Metrics (6 metrics)
Counters:
- performance_dkg_joined_total - Total number of DKG joins (members joined)
- performance_dkg_failed_total - Total number of failed DKG executions
- performance_dkg_validation_total - Total number of DKG result validations performed
- performance_dkg_challenges_submitted_total - Total number of DKG challenges submitted on-chain
- performance_dkg_approvals_submitted_total - Total number of DKG approvals submitted on-chain
Duration Metrics:
- performance_dkg_duration_seconds - Average duration of DKG operations
- performance_dkg_duration_seconds_count - Total count of DKG operations
Performance Insights:
- dkg_joined_total / (dkg_joined_total + dkg_failed_total) - Monitor DKG participation and success rates
- Alert if dkg_duration_seconds exceeds 300 seconds (5 minutes) - indicates slow DKG operations
- Track dkg_challenges_submitted_total and dkg_approvals_submitted_total to monitor dispute resolution activity
- dkg_validation_total relative to joins shows how actively DKG results are being validated
2. Signing Operations Metrics (5 metrics)
Counters:
- performance_signing_operations_total - Total number of signing operations attempted
- performance_signing_success_total - Total number of successful signing operations
- performance_signing_failed_total - Total number of failed signing operations
- performance_signing_timeouts_total - Total number of signing operations that timed out
Duration Metrics:
- performance_signing_duration_seconds - Average duration of signing operations
- performance_signing_duration_seconds_count - Total count of signing operations
Performance Insights:
- signing_success_total / signing_operations_total - Critical metric for node reliability
- Alert if signing_failed_total rate > 10% of total operations
- signing_timeouts_total / signing_operations_total - Indicates network or coordination issues
- Alert if signing_duration_seconds exceeds 60 seconds - indicates slow signing operations
- Track signing_operations_total rate to understand signing workload
3. Wallet Dispatcher Metrics (6 metrics)
Counters:
- performance_wallet_actions_total - Total number of wallet actions dispatched
- performance_wallet_action_success_total - Total number of successfully completed wallet actions
- performance_wallet_action_failed_total - Total number of failed wallet actions
- performance_wallet_dispatcher_rejected_total - Total number of wallet actions rejected (wallet busy)
- performance_wallet_heartbeat_failures_total - Total number of wallet heartbeat failures
Gauges:
- performance_wallet_dispatcher_active_actions - Current number of wallets with active actions
Duration Metrics:
- performance_wallet_action_duration_seconds - Average duration of wallet actions
- performance_wallet_action_duration_seconds_count - Total count of wallet actions
Performance Insights:
- wallet_dispatcher_rejected_total / wallet_actions_total - Alert if > 5%; indicates wallet saturation
- wallet_action_success_total / wallet_actions_total - Monitor wallet action reliability
- wallet_dispatcher_active_actions shows current wallet workload
- wallet_heartbeat_failures_total indicates wallet connectivity issues
4. Coordination Operations Metrics (4 metrics)
Counters:
- performance_coordination_windows_detected_total - Total number of coordination windows detected
- performance_coordination_procedures_executed_total - Total number of coordination procedures executed successfully
- performance_coordination_failed_total - Total number of failed coordination procedures
Duration Metrics:
- performance_coordination_duration_seconds - Average duration of coordination procedures
- performance_coordination_duration_seconds_count - Total count of coordination procedures
Performance Insights:
- coordination_procedures_executed_total / coordination_windows_detected_total - Success rate of coordination
- Alert if coordination_failed_total rate > 5% of detected windows
- Track coordination_windows_detected_total to understand coordination frequency
- Track coordination_duration_seconds to identify slow coordination operations
5. Network Operations Metrics (10 metrics)
Peer Connection Metrics:
- performance_peer_connections_total - Total number of peer connections established
- performance_peer_disconnections_total - Total number of peer disconnections
Message Metrics:
- performance_message_broadcast_total - Total number of messages broadcast to the network
- performance_message_received_total - Total number of messages received from the network
Queue Size Metrics (Gauges):
- performance_incoming_message_queue_size - Current size of incoming message queue (with channel label)
- performance_message_handler_queue_size - Current size of message handler queues (with channel and handler labels)
Ping Test Metrics:
- performance_ping_test_total - Total number of ping tests performed
- performance_ping_test_success_total - Total number of successful ping tests
- performance_ping_test_failed_total - Total number of failed ping tests
- performance_ping_test_duration_seconds - Average duration of ping tests
- performance_ping_test_duration_seconds_count - Total count of ping tests
Performance Insights:
- peer_connections_total vs peer_disconnections_total - Monitor connection stability
- Track message_broadcast_total and message_received_total rates
- Alert if incoming_message_queue_size > 3000 (roughly 75% of 4096 capacity) - indicates message processing bottleneck
- Alert if message_handler_queue_size > 400 (roughly 75% of 512 capacity) - indicates handler saturation
- ping_test_duration_seconds shows network round-trip time
- Alert if ping_test_failed_total rate > 10% of ping tests - indicates network issues
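The thresholds in the Performance Insights lists can be checked with a small script. The following is a minimal sketch, not part of this PR. It assumes the /metrics endpoint returns Prometheus-style text lines of the form `name{labels} value`, and that the short names used in the insights map to the `performance_`-prefixed metrics. The helper names (`fetch_metrics`, `parse_samples`, `ratio`, `check`) are hypothetical. Because the counters are cumulative, the ratios below are lifetime ratios rather than rates over a time window.

```python
import re
import urllib.request

# Assumed sample line shape: metric name, optional {labels}, numeric value.
# Simplification: label values containing "}" are not handled.
SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{[^}]*\})?\s+(\S+)')

# (failure counter, total counter, maximum acceptable ratio), from the insights.
RATIO_RULES = [
    ("performance_signing_failed_total", "performance_signing_operations_total", 0.10),
    ("performance_wallet_dispatcher_rejected_total", "performance_wallet_actions_total", 0.05),
    ("performance_coordination_failed_total", "performance_coordination_windows_detected_total", 0.05),
    ("performance_ping_test_failed_total", "performance_ping_test_total", 0.10),
]

# (metric, maximum acceptable value), from the insights.
LIMIT_RULES = [
    ("performance_dkg_duration_seconds", 300),
    ("performance_signing_duration_seconds", 60),
    ("performance_incoming_message_queue_size", 3000),
    ("performance_message_handler_queue_size", 400),
]


def fetch_metrics(url, timeout=5):
    """Fetch the raw metrics text; the URL (host and port) is supplied by the caller."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def parse_samples(text):
    """Return {metric_name: [values]}, one value per label set."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if not match:
            continue
        try:
            value = float(match.group(2))
        except ValueError:
            continue
        samples.setdefault(match.group(1), []).append(value)
    return samples


def ratio(samples, numerator, denominator):
    """Ratio of two counters, or None if the denominator is missing or zero."""
    total = sum(samples.get(denominator, []))
    if total == 0:
        return None
    return sum(samples.get(numerator, [])) / total


def check(samples):
    """Return the names of metrics that breach a threshold."""
    alerts = []
    for numerator, denominator, limit in RATIO_RULES:
        value = ratio(samples, numerator, denominator)
        if value is not None and value > limit:
            alerts.append(numerator)
    for name, limit in LIMIT_RULES:
        values = samples.get(name, [])
        # Labelled gauges (per channel or handler) alert on their worst value.
        if values and max(values) > limit:
            alerts.append(name)
    return alerts
```

A possible use, assuming the node is reachable on localhost, is `check(parse_samples(fetch_metrics("http://localhost:9601/metrics")))`. For production alerting, the same thresholds would more naturally live in the monitoring system's own rule language, where rates can be computed over a time window.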