Description of problem:
Created several VMs and ran the metric "kubevirt_num_virt_handlers_by_node_running_virt_launcher". The value reported corresponds to the number of "virt-launcher" pods on the node, not the number of "virt-handler" pods.

Version-Release number of selected component (if applicable): 4.10

How reproducible: 100%

Steps to Reproduce:
1. Create 2-3 Virtual Machines.
2. Run the metric "kubevirt_num_virt_handlers_by_node_running_virt_launcher".

Actual results:
The metric reports the number of virt-launcher pods.

Expected results:
It should report the number of virt-handler pods.

Additional info:
Shirly, could you please clarify what this metric is intended to measure?
*** Bug 2053128 has been marked as a duplicate of this bug. ***
The "kubevirt_num_virt_handlers_by_node_running_virt_launcher" recording rule feeds the "OrphanedVirtualMachineInstances" alert, which is meant to find nodes that are running VMs but are missing a virt-handler pod. The recording rule's current value is the number of virt-launcher pods, not the number of virt-handlers.
I believe the correct recording rule, "kubevirt_num_virt_handlers_by_node_running_virt_launcher", should be:

(
  (count by (node) (kube_pod_info{pod=~'virt-launcher-.*'}) * 0)
  + on (node) group_left(_blah) (
      count by (node) (
        group by (node,pod) (node_namespace_pod:kube_pod_info:{pod=~'virt-handler-.*'})
        * on (pod) group_left(_blah) (
            1 * group by (pod) (kube_pod_container_status_ready{pod=~'virt-handler-.*'}) == 1
          )
      )
      or vector(0)
    )
)

and the alert expression should be:

kubevirt_num_virt_handlers_by_node_running_virt_launcher > 0

But this must be verified for two cases:
1. A node with no virt-handler pod and at least 1 running VM.
2. A node with a virt-handler pod that is not in Ready state and at least 1 running VM.

In both cases the value of the expression should be the number of virt-launcher pods running on the node.

Note: We should give this alert enough evaluation time, since it may take a while for the virt-handler to become ready if it was terminated for some reason.
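The two verification cases above can also be checked offline with `promtool test rules`. A minimal sketch of such a unit test, assuming the recording rule and alert are in a hypothetical rules.yaml and using standard kube-state-metrics series names (file name, pod names, and expected labels are illustrative, not the actual rule file):

```yaml
# promtool-test.yaml -- run with: promtool test rules promtool-test.yaml
# Hypothetical file and series names; adjust to the real rule file and labels.
rule_files:
  - rules.yaml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # Case 1: node1 runs a VM (virt-launcher pod) but has no virt-handler
      # pod at all, for 15 minutes.
      - series: 'kube_pod_info{pod="virt-launcher-vm1-abcde", node="node1"}'
        values: '1x15'
    alert_rule_test:
      - eval_time: 10m
        alertname: OrphanedVirtualMachineInstances
        exp_alerts:
          # Expected label set is illustrative; the real alert may carry
          # additional labels (severity, etc.).
          - exp_labels:
              node: node1
```

Case 2 (handler present but not Ready) would add a `kube_pod_container_status_ready{pod="virt-handler-...",...}` series with value 0 and expect the same alert.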
I'll need to rewrite the query so that it returns the node and a 0 or 1 value: 1 indicates that the node has both virt-launcher and virt-handler pods, which is good; 0 indicates that the node has virt-launcher pods but no virt-handler pod, which we should alert on. We should also remove the "pod" and "namespace" labels from the query results, since they are confusing and not required.
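As a rough illustration of that 0/1 shape (a sketch only, using standard kube-state-metrics series names; the actual query landed later in this bug):

```promql
# 1 for each node that runs virt-launcher pods AND has a Ready virt-handler:
(count by (node) (kube_pod_info{pod=~"virt-launcher-.*"}) * 0 + 1)
  and on (node)
count by (node) (
  (kube_pod_container_status_ready{pod=~"virt-handler-.*"} == 1)
    * on (pod) group_left(node) kube_pod_info{pod=~"virt-handler-.*"}
)
# 0 for the remaining nodes that run virt-launcher pods but have no Ready
# virt-handler ('or' only adds series absent from the left-hand side):
or
count by (node) (kube_pod_info{pod=~"virt-launcher-.*"}) * 0
```

The result carries only the `node` label, so the confusing `pod` and `namespace` labels never appear.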
@sradco per Comment #7 are you working on this?
This is the updated query for the recording rule "kubevirt_num_virt_handlers_by_node_running_virt_launcher":

sum(
  count by (node) (node_namespace_pod:kube_pod_info:{pod=~"virt-launcher-.*"})
  * on (node) group_left(pod) (
      1 * (
        kube_pod_container_status_ready{pod=~"virt-handler-.*"}
        + on (pod) group_left(node) (
            0 * node_namespace_pod:kube_pod_info:{pod=~"virt-handler-.*"}
          )
      )
    ) * 0 + 1
  or on (node) (0 * node_namespace_pod:kube_pod_info:{pod=~"virt-launcher-.*"})
) without (pod, namespace, prometheus)

It returns the list of nodes that are running VMs. The value for each node is 0 if there is no virt-handler pod on the node and 1 if there is.
Created a PR for this issue https://github.com/kubevirt/kubevirt/pull/7434
Hello Stu,

Is there any plan to backport this metric fix to 4.10.1?

Thanks,
Satyajit.
@sradco Any updates on this? Do you need someone reviewing the upstream PR?
I verified this with virt-controller-v4.11.0-95 and I am not sure the alert works correctly.

First of all, the current Bugzilla topic contains non-relevant info: after the latest changes, as I understand it, we no longer have the metric kubevirt_num_virt_handlers_by_node_running_virt_launcher, so the OrphanedVirtualMachineInstances alert is based directly on a Prometheus rule.

I tried 3 scenarios and managed to fire the alert in only one of them.

Successful scenario: create a VM and remove all virt-handler pods (by removing the virt-handler daemonset).
Result: the alert was successfully triggered.

Two failed scenarios:

1) Create a VM, update the daemonset with a wrong image, and restart the virt-handler pod on the node where the VM is running. The VM is running and the virt-handler pod is stuck in ImagePullBackOff state, as expected, but the alert was not firing:

> $ oc get pod -o wide
> virt-launcher-vm-label-2sxvw 2/2 Running 0 10m 10.129.2.110 virt-den-411-dstv9-worker-0-wf446 <none> 1/1
> $ oc get pod -n openshift-cnv -o wide | grep handl
> virt-handler-qw7sz 0/1 Init:ImagePullBackOff 0 26s 10.129.2.112 virt-den-411-dstv9-worker-0-wf446 <none> <none>

2) Create a VM, update the daemonset with a wrong readinessProbe, and restart the virt-handler pod. The VM is running and the virt-handler pod is stuck in a non-Ready state, as expected, but the alert still was not firing:

> $ oc get pod -n openshift-cnv -o wide | grep handl
> virt-handler-nqsqv 0/1 Running 0 15m 10.129.2.115 virt-den-411-dstv9-worker-0-wf446 <none> <none>

Based on that, I think the current implementation works incorrectly. A couple of thoughts: the alert is based on this rule:

> ((count by(node) (kube_pod_status_ready{condition="true",pod=~"virt-handler.*"} * on(pod) group_left(node) kube_pod_info{pod=~"virt-handler.*"})) or (count by(node) (kube_pod_info{pod=~"virt-launcher.*"}) * 0)) == 0

The metric `kube_pod_status_ready{condition="true",pod=~"virt-handler.*"}` lists all existing virt-handler pods regardless of their status; it reports Value=1 only for Ready pods.
So, for example, this query will show only active and Ready pods:

> kube_pod_status_ready{condition="true",pod=~"virt-handler.*"} == 1

After some tries I think this rule might work:

> ((count by(node) (kube_pod_status_ready{condition="true",pod=~"virt-handler.*"} * on(pod) group_left(node) kube_pod_info{pod=~"virt-handler.*"})) == 0 or (count by(node) (kube_pod_info{pod=~"virt-launcher.*"}) * 0)) == 0

It shows me the node where a virt-launcher pod is running but virt-handler is not in Ready state.
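To see why the original rule misses the non-Ready case, compare the two counts side by side (a sketch; series names as emitted by kube-state-metrics):

```promql
# Unfiltered: kube_pod_status_ready{condition="true"} emits one series per
# virt-handler pod with value 0 or 1, and count by(node) counts samples
# regardless of value -- so this never drops to 0 while the pod object
# exists, even if the pod is not Ready.
count by (node) (
  kube_pod_status_ready{condition="true",pod=~"virt-handler.*"}
  * on (pod) group_left(node) kube_pod_info{pod=~"virt-handler.*"}
)

# Filtered: '== 1' drops the non-Ready samples before counting, so a node
# whose only virt-handler is not Ready yields no series here at all.
count by (node) (
  (kube_pod_status_ready{condition="true",pod=~"virt-handler.*"} == 1)
  * on (pod) group_left(node) kube_pod_info{pod=~"virt-handler.*"}
)
```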
Because this failed QA, it must be deferred from the current release because it is not considered a blocker for 4.11.0. Per Comment #11, there is interest in backporting this once it is resolved.
Verified on CNV v4.12.0-599
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Virtualization 4.12.0 Images security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2023:0408