Bug 2052556 - Metric "kubevirt_num_virt_handlers_by_node_running_virt_launcher" reporting incorrect value
Summary: Metric "kubevirt_num_virt_handlers_by_node_running_virt_launcher" reporting incorrect value
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Virtualization
Version: 4.9.3
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.12.0
Assignee: Shirly Radco
QA Contact: Denys Shchedrivyi
URL:
Whiteboard:
Duplicates: 2053128
Depends On:
Blocks: 2053128
 
Reported: 2022-02-09 14:47 UTC by Satyajit Bulage
Modified: 2023-01-24 13:37 UTC (History)
CC List: 5 users

Fixed In Version: hco-bundle-registry-container-v4.12.0-330
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-01-24 13:36:09 UTC
Target Upstream Version:
Embargoed:




Links:
Github kubevirt/kubevirt pull 7434 (Merged): Drop recording rule for orphaned VMs alert and fix alert definition (last updated 2022-08-01 11:33:56 UTC)
Github kubevirt/kubevirt pull 8038 (Merged): Update orphaned vm alert (last updated 2022-08-04 07:40:59 UTC)
Red Hat Issue Tracker CNV-16313 (last updated 2022-11-29 09:55:07 UTC)
Red Hat Product Errata RHSA-2023:0408 (last updated 2023-01-24 13:37:30 UTC)

Description Satyajit Bulage 2022-02-09 14:47:33 UTC
Description of problem: Created several VMs and ran the metric "kubevirt_num_virt_handlers_by_node_running_virt_launcher"; the value it reports corresponds to the number of "virt-launcher" pods on the node.



Version-Release number of selected component (if applicable): 4.10


How reproducible: 100%


Steps to Reproduce:
1. Create 2-3 Virtual Machines.
2. Run the query "kubevirt_num_virt_handlers_by_node_running_virt_launcher".

Actual results: The metric reports the number of virt-launcher pods.


Expected results: The metric should report the number of virt-handler pods.


Additional info:

Comment 2 sgott 2022-02-16 13:19:08 UTC
Shirly, could you please clarify what this metric is intended to measure?

Comment 3 sgott 2022-02-16 13:20:49 UTC
*** Bug 2053128 has been marked as a duplicate of this bug. ***

Comment 4 Shirly Radco 2022-02-16 13:32:50 UTC
The "kubevirt_num_virt_handlers_by_node_running_virt_launcher" recording
rule is used for the "OrphanedVirtualMachineInstances" alert to find nodes that are running VMs, 
but they are missing virt-handler pod.

The following recording rule value is the number of virt-launcher pods and not the number of virt-handlers.

Comment 5 Shirly Radco 2022-02-23 12:47:58 UTC
I believe the correct recording rule, "kubevirt_num_virt_handlers_by_node_running_virt_launcher", should be:

((count by (node)(kube_pod_info{pod=~'virt-launcher-.*'} ) *0  ) + on (node) group_left(_blah) (count by (node)(group by(node,pod)(node_namespace_pod:kube_pod_info:{pod=~'virt-handler-.*'}) * on (pod) group_left(_blah) (1*group by (pod)(kube_pod_container_status_ready{pod=~'virt-handler-.*'} )==1) ) or vector(0) ) )

and the alert expression should be 

kubevirt_num_virt_handlers_by_node_running_virt_launcher >0

But it must be verified when:
1. There is a node with no virt-handler pod and with at least 1 running VM.
2. There is a node with a virt-handler pod that is not in the Ready state and with at least 1 running VM.

The value of the expression should be the number of virt-launcher pods running on the node.

Note: We should give this alert enough eval time, since it may take a while for the virt-launcher to become ready if it was terminated for some reason.
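
For context, a minimal sketch (not the shipped rule file) of how such a recording rule and alert would typically be wired together in a Prometheus rule group. The expressions are the ones proposed in this comment, and the 10m "for:" duration is only an illustrative value for the evaluation time mentioned in the note above:

  groups:
  - name: kubevirt.rules
    rules:
    - record: kubevirt_num_virt_handlers_by_node_running_virt_launcher
      expr: |
        ((count by (node)(kube_pod_info{pod=~'virt-launcher-.*'} ) *0  ) + on (node) group_left(_blah) (count by (node)(group by(node,pod)(node_namespace_pod:kube_pod_info:{pod=~'virt-handler-.*'}) * on (pod) group_left(_blah) (1*group by (pod)(kube_pod_container_status_ready{pod=~'virt-handler-.*'} )==1) ) or vector(0) ) )
    - alert: OrphanedVirtualMachineInstances
      expr: kubevirt_num_virt_handlers_by_node_running_virt_launcher > 0
      # "for" delays firing; 10m is an illustrative value, long enough for a restarted pod to become ready
      for: 10m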

Comment 6 sgott 2022-02-23 13:11:15 UTC
*** Bug 2053128 has been marked as a duplicate of this bug. ***

Comment 7 Shirly Radco 2022-03-17 13:34:53 UTC
I'll need to rewrite the query so that it returns the node and a 0 or 1 value.
1 will indicate that the node has virt-launcher and virt-handler pods - which is good.
0 will indicate that the node has virt-launcher pods but no virt-handler pod - which we should alert on.

We should remove the "pod" and "namespace" labels from the query results, since they are confusing and not required.
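
As an aside, a minimal PromQL sketch of the label clean-up described here, assuming the standard "without" aggregation modifier (the selector below is only illustrative):

  # Hypothetical illustration: aggregate away the per-pod labels so the result is keyed by node only.
  sum without (pod, namespace) (
    node_namespace_pod:kube_pod_info:{pod=~"virt-launcher-.*"}
  )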

Comment 8 sgott 2022-03-23 12:59:31 UTC
@sradco per Comment #7 are you working on this?

Comment 9 Shirly Radco 2022-03-23 13:37:19 UTC
This is the updated query for the recording rule "kubevirt_num_virt_handlers_by_node_running_virt_launcher"

sum(count by(node) (node_namespace_pod:kube_pod_info:{pod=~"virt-launcher-.*"}) * on(node) group_left(pod) ( 1*(kube_pod_container_status_ready{pod=~"virt-handler-.*"} + on(pod) group_left(node) ( 0*node_namespace_pod:kube_pod_info:{pod=~"virt-handler-.*"})))*0 +1 or on(node) (0 * node_namespace_pod:kube_pod_info:{pod=~"virt-launcher-.*"})) without (pod,namespace,prometheus)

It will return the list of nodes that are running VMs. The value for each node will be 0 if there is no virt-handler pod on the node and 1 if there is a virt-handler pod on the node.
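
A hypothetical illustration of the result shape described above (node names invented). Per Comment #7, the value 0 is the case to alert on, so the alert expression would presumably become a check for 0 (an assumption, not stated in this comment):

  # Hypothetical sample output of the recording rule, one series per node that runs VMs:
  #   {node="worker-0"}  1    -> virt-launcher and virt-handler present: OK
  #   {node="worker-1"}  0    -> virt-launcher present, no virt-handler: orphaned VMs
  # Assumed alert expression:
  kubevirt_num_virt_handlers_by_node_running_virt_launcher == 0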

Comment 10 Shirly Radco 2022-03-23 14:08:08 UTC
Created a PR for this issue https://github.com/kubevirt/kubevirt/pull/7434

Comment 11 Satyajit Bulage 2022-04-11 10:03:33 UTC
Hello Stu,
Is there any plan to backport this metric fix to 4.10.1?
Thanks,
Satyajit.

Comment 12 Antonio Cardace 2022-05-06 08:50:51 UTC
@sradco Any updates on this? Do you need someone reviewing the upstream PR?

Comment 13 Denys Shchedrivyi 2022-06-24 18:15:32 UTC
I verified it with virt-controller-v4.11.0-95 and I'm not sure the alert works correctly.

First of all, the current Bugzilla topic contains non-relevant info: after the latest changes, as I understand it, we no longer have the metric kubevirt_num_virt_handlers_by_node_running_virt_launcher, so the alert OrphanedVirtualMachineInstances is now based directly on Prometheus rules.

I tried 3 scenarios and managed to fire the alert in only one of them.


 Successful scenario:
Create a VM and remove all virt-handler pods (by removing the virt-handler daemonset)
 Result: the alert was successfully triggered


 Two failed scenarios:
1) Create a VM, update the daemonset with a wrong image, and restart the virt-handler pod on the node where the VM is running

 The VM is running and the virt-handler pod is, as expected, stuck in the ImagePullBackOff state, but the alert was not firing:

> $ oc get pod -o wide
>virt-launcher-vm-label-2sxvw   2/2     Running   0          10m   10.129.2.110   virt-den-411-dstv9-worker-0-wf446   <none>           1/1

> $ oc get pod -n openshift-cnv -o wide | grep handl
> virt-handler-qw7sz                                                0/1     Init:ImagePullBackOff   0             26s   10.129.2.112    virt-den-411-dstv9-worker-0-wf446   <none>           <none>


2) Create a VM, update the daemonset with a wrong readinessProbe, and restart the virt-handler pod
 The VM is running and the virt-handler pod is, as expected, stuck in a non-Ready state, but the alert still was not firing:

> $ oc get pod -n openshift-cnv -o wide | grep handl
> virt-handler-nqsqv                                                0/1     Running   0             15m   10.129.2.115    virt-den-411-dstv9-worker-0-wf446   <none>           <none>


Based on that, I think our current implementation works incorrectly.

A couple of thoughts: our alert is based on this rule:
> ((count by(node) (kube_pod_status_ready{condition="true",pod=~"virt-handler.*"} * on(pod) group_left(node) kube_pod_info{pod=~"virt-handler.*"})) or (count by(node) (kube_pod_info{pod=~"virt-launcher.*"}) * 0)) == 0

The metric `kube_pod_status_ready{condition="true",pod=~"virt-handler.*"}` returns all existing virt-handler pods regardless of their status; however, only Ready pods have the value 1. So, for example, this query will show only active and ready pods:
> kube_pod_status_ready{condition="true",pod=~"virt-handler.*"} == 1
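
A minimal sketch of the distinction being made here, assuming kube-state-metrics exposes a kube_pod_status_ready series for every pod, with value 1 only when the pod is Ready: count() counts series regardless of their value, so without a value filter the per-node count includes non-Ready virt-handler pods.

  # Counts every virt-handler pod per node, Ready or not (the series exists with value 0 or 1):
  count by (node) (
    kube_pod_status_ready{condition="true", pod=~"virt-handler.*"}
    * on (pod) group_left(node) kube_pod_info{pod=~"virt-handler.*"}
  )

  # Counts only Ready virt-handler pods per node, because "== 1" drops the non-Ready series first:
  count by (node) (
    (kube_pod_status_ready{condition="true", pod=~"virt-handler.*"} == 1)
    * on (pod) group_left(node) kube_pod_info{pod=~"virt-handler.*"}
  )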


After some attempts, I think this rule might work:

> ((count by(node) (kube_pod_status_ready{condition="true",pod=~"virt-handler.*"} * on(pod) group_left(node) kube_pod_info{pod=~"virt-handler.*"})) == 0 or (count by(node) (kube_pod_info{pod=~"virt-launcher.*"}) * 0)) == 0

It shows me the node where a virt-launcher pod is running but the virt-handler pod is not in the Ready state.

Comment 15 sgott 2022-06-28 13:05:54 UTC
Because this failed QA and is not considered a blocker for 4.11.0, it must be deferred from the current release. Per Comment #11, there is interest in backporting this once it is resolved.

Comment 17 Denys Shchedrivyi 2022-10-17 17:32:57 UTC
Verified on CNV v4.12.0-599

Comment 21 errata-xmlrpc 2023-01-24 13:36:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Virtualization 4.12.0 Images security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:0408

