Add troubleshooting of vm metrics missing #958
Signed-off-by: Jian Wang <jian.wang@suse.com>
martindekov
left a comment
Looks good, Jian. I like the exact sequence. I left a minor comment about one place where it is ambiguous what we are waiting for; apart from that, no other comments. Thanks!
| rancher-monitoring-operator 0 35s | ||
| ``` | ||
| Wait a while |
Can we be specific about what the user should wait for? I assume we wait a while for the service account, but reading this as someone who is fixing the problem and is not familiar with the issue, I am confused about what I should wait for before proceeding with the deletion of everything below.
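If the wait is indeed for the service account (that is only my assumption, the doc does not say so), the step could be made explicit with a small polling helper instead of "wait a while". A minimal sketch; the function name `wait_for_serviceaccount` and the timeout values are hypothetical:

```shell
#!/bin/sh
# Poll until a ServiceAccount exists or the timeout expires.
# Usage: wait_for_serviceaccount <namespace> <name> [timeout_seconds] [interval_seconds]
# Note: the interval should be greater than 0, otherwise the loop never times out.
wait_for_serviceaccount() {
    ns="$1"; name="$2"; timeout="${3:-60}"; interval="${4:-2}"
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        # kubectl get exits non-zero while the object does not exist yet
        if kubectl get serviceaccount "$name" -n "$ns" >/dev/null 2>&1; then
            echo "serviceaccount $ns/$name exists"
            return 0
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    echo "timed out waiting for serviceaccount $ns/$name" >&2
    return 1
}
```

For example, `wait_for_serviceaccount cattle-monitoring-system rancher-monitoring-operator 120 5` would check every 5 seconds for up to 2 minutes.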
| https://github.com/harvester/harvester/issues/8565 | ||
| ## Cluster Metrics are available but Vitural Machine Metrics are missing |
| ## Cluster Metrics are available but Vitural Machine Metrics are missing | |
| ## Harvester Dashboard stops reporting virtual machine metrics after upgrade |
| $ kubectl get servicemonitor -A | ||
| NAMESPACE NAME AGE | ||
| cattle-monitoring-system prometheus-kubevirt-rules 24s // is missing |
Out of the box, 1.6.1 also has other, non-KubeVirt ServiceMonitors:
$ kubectl get servicemonitor -A
NAMESPACE NAME AGE
cattle-monitoring-system prometheus-kubevirt-rules 24m
harvester-system service-monitor-cdi 25m
longhorn-system longhorn-prometheus-servicemonitor 25m
We can make it more precise with something like:
$ kubectl get servicemonitor -A -lprometheus.kubevirt.io=true
No resources found
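The label-based check could also be wrapped in a small helper so that the "confirm the issue" step gives a clear yes/no answer. A sketch only, assuming the `prometheus.kubevirt.io=true` label from the command above; the function name `check_kubevirt_servicemonitor` is hypothetical:

```shell
#!/bin/sh
# Report whether any ServiceMonitor labeled prometheus.kubevirt.io=true exists.
# Assumption: the KubeVirt ServiceMonitor carries that label, as in the command above.
check_kubevirt_servicemonitor() {
    # -o name prints one line per matching object and nothing when there is no match.
    # A kubectl error (for example, no cluster access) also yields empty output
    # and is therefore reported as "missing".
    found=$(kubectl get servicemonitor -A -l prometheus.kubevirt.io=true -o name 2>/dev/null)
    if [ -n "$found" ]; then
        echo "present: $found"
        return 0
    fi
    echo "missing: no ServiceMonitor with label prometheus.kubevirt.io=true"
    return 1
}
```

The non-zero return code on "missing" makes it usable in a script, e.g. `check_kubevirt_servicemonitor || echo "issue confirmed"`.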
| When the `rancher-monitoring` add-on is enabled, the cluster metrics are available but the virtual machine metrics are missing | ||
| The ServiceMonitor object `prometheus-kubevirt-rules/cattle-monitoring-system` is missing, when you create it manually, it is deleted quickly. |
| The ServiceMonitor object `prometheus-kubevirt-rules/cattle-monitoring-system` is missing, when you create it manually, it is deleted quickly. | |
| The `prometheus-kubevirt-rules` ServiceMonitor is missing in the `cattle-monitoring-system` namespace. This object cannot be manually re-added as the KubeVirt operator will automatically delete it. |
| ### Root Cause | ||
| When `kubevirt` is newly installed or upgraded, `kubevirt` generates a new configmap object to store the configurations, the `ServiceMonitor` is enabled or disabled according to the existing of the configured `ServiceAccount` object. |
| When `kubevirt` is newly installed or upgraded, `kubevirt` generates a new configmap object to store the configurations, the `ServiceMonitor` is enabled or disabled according to the existing of the configured `ServiceAccount` object. | |
| When `kubevirt` is newly installed or upgraded, `kubevirt` generates a new configmap object to store the configurations. A race condition exists within the KubeVirt operator that may cause the ServiceMonitor configuration to not be included in the configmap if the `rancher-monitoring-operator` service account is missing in the `cattle-monitoring-system` namespace at the time of install or upgrade. |
| ### Workaround | ||
| 1. Confirm the issue |
| 1. Confirm the issue | |
| The workaround involves ensuring the `rancher-monitoring-operator` service account exists, removing orphaned configmaps and restarting the KubeVirt operator. | |
| 1. Confirm the issue |
| 3. Ensure the serviceaccount is existing | ||
| This object is created when Harvester cluster is installed and should not be removed. If it does not exist, create it manually. |
| This object is created when Harvester cluster is installed and should not be removed. If it does not exist, create it manually. | |
| This object is created when Harvester cluster is installed and should not be removed. If it does not exist, create it manually. Otherwise, skip ahead to step 4 below. |
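The check-then-create logic of this step could be sketched as follows. The names `rancher-monitoring-operator` and `cattle-monitoring-system` come from the text above; the function name `ensure_serviceaccount` is hypothetical, and the sketch assumes that a bare ServiceAccount without extra labels or annotations is sufficient, which the doc does not confirm:

```shell
#!/bin/sh
# Create the ServiceAccount only if it does not exist yet.
# Usage: ensure_serviceaccount [namespace] [name]
ensure_serviceaccount() {
    ns="${1:-cattle-monitoring-system}"
    name="${2:-rancher-monitoring-operator}"
    # kubectl get exits non-zero when the object is not found
    if kubectl get serviceaccount "$name" -n "$ns" >/dev/null 2>&1; then
        echo "serviceaccount $ns/$name already exists, nothing to do"
        return 0
    fi
    echo "serviceaccount $ns/$name not found, creating it"
    kubectl create serviceaccount "$name" -n "$ns"
}
```

Running `ensure_serviceaccount` with no arguments uses the defaults above, so the helper is safe to call whether or not the object is already there.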
| ### Issue Description | ||
| When the `rancher-monitoring` add-on is enabled, the cluster metrics are available but the virtual machine metrics are missing |
| When the `rancher-monitoring` add-on is enabled, the cluster metrics are available but the virtual machine metrics are missing | |
| After an upgrade, the Harvester dashboard stops reporting virtual machine metrics. The cluster metrics remain available. Disabling and re-enabling the add-on does not resolve the problem. |
Problem:
Solution:
Add troubleshooting of vm metrics missing
Related Issue(s):
harvester/harvester#9674
Test plan:
Additional documentation or context