Bug 2052556
| Summary: | Metric "kubevirt_num_virt_handlers_by_node_running_virt_launcher" reporting incorrect value | | |
|---|---|---|---|
| Product: | Container Native Virtualization (CNV) | Reporter: | Satyajit Bulage <sbulage> |
| Component: | Virtualization | Assignee: | Shirly Radco <sradco> |
| Status: | CLOSED ERRATA | QA Contact: | Denys Shchedrivyi <dshchedr> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.9.3 | CC: | acardace, fdeutsch, sgott, sradco, ycui |
| Target Milestone: | --- | | |
| Target Release: | 4.12.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | hco-bundle-registry-container-v4.12.0-330 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| | 2053128 (view as bug list) | Environment: | |
| Last Closed: | 2023-01-24 13:36:09 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 2053128 | | |
Description (Satyajit Bulage, 2022-02-09 14:47:33 UTC)

Shirly, could you please clarify what this metric is intended to measure?

*** Bug 2053128 has been marked as a duplicate of this bug. ***

The "kubevirt_num_virt_handlers_by_node_running_virt_launcher" recording rule is used by the "OrphanedVirtualMachineInstances" alert to find nodes that are running VMs but are missing a virt-handler pod. The following recording rule's value is the number of virt-launcher pods, not the number of virt-handlers. I believe the correct recording rule, "kubevirt_num_virt_handlers_by_node_running_virt_launcher", should be:

    ((count by (node)(kube_pod_info{pod=~'virt-launcher-.*'}) * 0)
      + on (node) group_left(_blah) (
        count by (node)(
          group by (node, pod)(node_namespace_pod:kube_pod_info:{pod=~'virt-handler-.*'})
          * on (pod) group_left(_blah) (1 * group by (pod)(kube_pod_container_status_ready{pod=~'virt-handler-.*'}) == 1)
        ) or vector(0)
      )
    )

and the alert expression should be:

    kubevirt_num_virt_handlers_by_node_running_virt_launcher > 0

But it must be verified when:
1. There is a node with no virt-handler pod and with at least one running VM.
2. There is a node with a virt-handler pod that is not in Ready state and with at least one running VM.

The value of the expression should be the number of virt-launcher pods running on the node.

Note: We should give this alert enough evaluation time, since it may take a while for the virt-launcher to become ready if it was terminated for some reason.

I'll need to rewrite the query so that it returns the node and a 0 or 1 value: 1 will indicate that the node has virt-launcher and virt-handler pods, which is good; 0 will indicate that the node has virt-launcher pods but no virt-handler pod, which we should alert on. We should remove "pod" and "namespace" from the query results, since they are confusing and not required.

@sradco per Comment #7, are you working on this?

This is the updated query for the recording rule "kubevirt_num_virt_handlers_by_node_running_virt_launcher":

    sum(
      count by(node) (node_namespace_pod:kube_pod_info:{pod=~"virt-launcher-.*"})
      * on(node) group_left(pod) (
        1 * (kube_pod_container_status_ready{pod=~"virt-handler-.*"}
          + on(pod) group_left(node) (0 * node_namespace_pod:kube_pod_info:{pod=~"virt-handler-.*"}))
      ) * 0 + 1
      or on(node) (0 * node_namespace_pod:kube_pod_info:{pod=~"virt-launcher-.*"})
    ) without (pod, namespace, prometheus)

It will return the list of nodes that are running VMs. The value for each node will be 0 if there is no virt-handler pod on the node and 1 if there is a virt-handler pod on the node.

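For context, KubeVirt delivers its recording rules and alerts to the cluster as a PrometheusRule object. The following is a minimal sketch of how the updated recording rule above and the "OrphanedVirtualMachineInstances" alert could be wired together; the object name, group name, 10m duration, and severity label are illustrative assumptions, not the values actually shipped by the operator.

    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: kubevirt-rules-example   # hypothetical name, for illustration only
      namespace: openshift-cnv       # CNV namespace used elsewhere in this report
    spec:
      groups:
        - name: kubevirt.rules
          rules:
            # Recording rule from the comment above: one sample per node that runs VMs,
            # value 1 if a virt-handler pod exists on that node, 0 if it does not.
            - record: kubevirt_num_virt_handlers_by_node_running_virt_launcher
              expr: |
                sum(
                  count by(node) (node_namespace_pod:kube_pod_info:{pod=~"virt-launcher-.*"})
                  * on(node) group_left(pod) (
                    1 * (kube_pod_container_status_ready{pod=~"virt-handler-.*"}
                      + on(pod) group_left(node) (0 * node_namespace_pod:kube_pod_info:{pod=~"virt-handler-.*"}))
                  ) * 0 + 1
                  or on(node) (0 * node_namespace_pod:kube_pod_info:{pod=~"virt-launcher-.*"})
                ) without (pod, namespace, prometheus)
            # Alert on nodes that run VMs but have no virt-handler pod (value 0).
            - alert: OrphanedVirtualMachineInstances
              expr: kubevirt_num_virt_handlers_by_node_running_virt_launcher == 0
              for: 10m               # assumed evaluation delay, per the note above about eval time
              labels:
                severity: warning    # assumed severity label
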
Created a PR for this issue: https://github.com/kubevirt/kubevirt/pull/7434

Hello Stu, is there any plan to backport this metrics fix to 4.10.1? Thanks, Satyajit.

@sradco Any updates on this? Do you need someone reviewing the upstream PR?

I verified it with virt-controller-v4.11.0-95 and I am not sure the alert works correctly. First of all, the current Bugzilla topic contains non-relevant information: after the latest changes, as I understand it, we no longer have the metric kubevirt_num_virt_handlers_by_node_running_virt_launcher, so the OrphanedVirtualMachineInstances alert is based on Prometheus rules directly. I tried three scenarios and managed to fire the alert in only one of them.

Successful scenario: create a VM and remove all virt-handler pods (by removing the virt-handler daemonset).
Result: the alert was successfully triggered.

Two failed scenarios:

1) Create a VM, update the daemonset with a wrong image, and restart the virt-handler pod on the node where the VM is running. The VM is running and the virt-handler pod is expectedly stuck in ImagePullBackOff state, but the alert was not firing:

> $ oc get pod -o wide
> virt-launcher-vm-label-2sxvw   2/2   Running   0   10m   10.129.2.110   virt-den-411-dstv9-worker-0-wf446   <none>   1/1
> $ oc get pod -n openshift-cnv -o wide | grep handl
> virt-handler-qw7sz   0/1   Init:ImagePullBackOff   0   26s   10.129.2.112   virt-den-411-dstv9-worker-0-wf446   <none>   <none>

2) Create a VM, update the daemonset with a wrong readinessProbe, and restart the virt-handler pod. The VM is running and the virt-handler pod is expectedly stuck in a non-ready state, but the alert still was not firing:

> $ oc get pod -n openshift-cnv -o wide | grep handl
> virt-handler-nqsqv   0/1   Running   0   15m   10.129.2.115   virt-den-411-dstv9-worker-0-wf446   <none>   <none>

Based on that, I think our current implementation works incorrectly. A couple of my thoughts: our alert is based on this rule:

    ((count by(node) (kube_pod_status_ready{condition="true",pod=~"virt-handler.*"} * on(pod) group_left(node) kube_pod_info{pod=~"virt-handler.*"}))
      or (count by(node) (kube_pod_info{pod=~"virt-launcher.*"}) * 0)) == 0

The metric `kube_pod_status_ready{condition="true",pod=~"virt-handler.*"}` lists all existing virt-handler pods in its output regardless of their status; however, for Ready pods it shows Value=1. So, for example, this request will show only active and ready pods:

    kube_pod_status_ready{condition="true",pod=~"virt-handler.*"} == 1

After some tries, I think this rule might work:

    ((count by(node) (kube_pod_status_ready{condition="true",pod=~"virt-handler.*"} * on(pod) group_left(node) kube_pod_info{pod=~"virt-handler.*"})) == 0
      or (count by(node) (kube_pod_info{pod=~"virt-launcher.*"}) * 0)) == 0

It shows me the node where the virt-launcher pod is running but virt-handler is not in Ready state.

Because this failed QA, it must be deferred from the current release, as it is not considered a blocker for 4.11.0. Per Comment #11, there is interest in backporting this once it is resolved.

Verified on CNV v4.12.0-599.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Virtualization 4.12.0 Images security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:0408

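As a reference for anyone revisiting the verification scenarios above: the candidate expression proposed during QE verification can be packaged in the same PrometheusRule form as the earlier sketch, this time as an alert defined directly on the expression (since the recording rule was dropped). The object name, group name, 10m duration, and severity are again illustrative assumptions, not necessarily what shipped in the fixed 4.12 build.

    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: kubevirt-orphaned-vmi-example   # hypothetical name, for illustration only
      namespace: openshift-cnv
    spec:
      groups:
        - name: kubevirt.rules
          rules:
            # Fires for nodes that run virt-launcher pods but have no Ready virt-handler pod.
            - alert: OrphanedVirtualMachineInstances
              expr: |
                ((count by(node) (kube_pod_status_ready{condition="true",pod=~"virt-handler.*"} * on(pod) group_left(node) kube_pod_info{pod=~"virt-handler.*"})) == 0
                  or (count by(node) (kube_pod_info{pod=~"virt-launcher.*"}) * 0)) == 0
              for: 10m                       # assumed; long enough to ride out normal virt-handler restarts
              labels:
                severity: warning            # assumed severity label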