Bug 2052556
| Summary: | Metric "kubevirt_num_virt_handlers_by_node_running_virt_launcher" reporting incorrect value | | |
|---|---|---|---|
| Product: | Container Native Virtualization (CNV) | Reporter: | Satyajit Bulage <sbulage> |
| Component: | Virtualization | Assignee: | Shirly Radco <sradco> |
| Status: | CLOSED ERRATA | QA Contact: | Denys Shchedrivyi <dshchedr> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.9.3 | CC: | acardace, fdeutsch, sgott, sradco, ycui |
| Target Milestone: | --- | | |
| Target Release: | 4.12.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | hco-bundle-registry-container-v4.12.0-330 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| | 2053128 (view as bug list) | Environment: | |
| Last Closed: | 2023-01-24 13:36:09 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 2053128 | | |
Description (Satyajit Bulage, 2022-02-09 14:47:33 UTC)

Shirly, could you please clarify what this metric is intended to measure?

*** Bug 2053128 has been marked as a duplicate of this bug. ***

The "kubevirt_num_virt_handlers_by_node_running_virt_launcher" recording rule is used by the "OrphanedVirtualMachineInstances" alert to find nodes that are running VMs but are missing a virt-handler pod. The following recording rule's value is the number of virt-launcher pods, not the number of virt-handlers. I believe the correct recording rule, "kubevirt_num_virt_handlers_by_node_running_virt_launcher", should be:

    ((count by (node)(kube_pod_info{pod=~'virt-launcher-.*'}) * 0)
      + on (node) group_left(_blah) (
        count by (node)(
          group by (node, pod)(node_namespace_pod:kube_pod_info:{pod=~'virt-handler-.*'})
          * on (pod) group_left(_blah) (1 * group by (pod)(kube_pod_container_status_ready{pod=~'virt-handler-.*'}) == 1)
        ) or vector(0)
      )
    )

and the alert expression should be:

    kubevirt_num_virt_handlers_by_node_running_virt_launcher > 0

But it must be verified when:
1. There is a node with no virt-handler pod and with at least one running VM.
2. There is a node with a virt-handler pod that is not in Ready state and with at least one running VM.

The value of the expression should be the number of virt-launcher pods running on the node.

Note: We should give this alert enough evaluation time, since it may take a while for the virt-launcher to become ready if it was terminated for some reason.

I'll need to rewrite the query so that it returns the node and a 0 or 1 value: 1 will indicate that the node has virt-launcher and virt-handler pods, which is good; 0 will indicate that the node has virt-launcher pods but no virt-handler pod, which we should alert on. We should remove "pod" and "namespace" from the query results, since they are confusing and not required.

@sradco per Comment #7, are you working on this?

This is the updated query for the recording rule "kubevirt_num_virt_handlers_by_node_running_virt_launcher":

    sum(
      count by(node) (node_namespace_pod:kube_pod_info:{pod=~"virt-launcher-.*"})
      * on(node) group_left(pod) (
        1 * (kube_pod_container_status_ready{pod=~"virt-handler-.*"}
          + on(pod) group_left(node) (0 * node_namespace_pod:kube_pod_info:{pod=~"virt-handler-.*"}))
      ) * 0 + 1
      or on(node) (0 * node_namespace_pod:kube_pod_info:{pod=~"virt-launcher-.*"})
    ) without (pod, namespace, prometheus)

It will return the list of nodes that are running VMs. The value for each node will be 0 if there is no virt-handler pod on the node and 1 if there is a virt-handler pod on the node.

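For context, KubeVirt delivers its recording rules and alerts to the cluster as a PrometheusRule object. The following is a minimal sketch of how the updated recording rule above and the "OrphanedVirtualMachineInstances" alert could be wired together; the object name, group name, 10m duration, and severity label are illustrative assumptions, not the values actually shipped by the operator.

    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: kubevirt-rules-example   # hypothetical name, for illustration only
      namespace: openshift-cnv       # CNV namespace used elsewhere in this report
    spec:
      groups:
        - name: kubevirt.rules
          rules:
            # Recording rule from the comment above: one sample per node that runs VMs,
            # value 1 if a virt-handler pod exists on that node, 0 if it does not.
            - record: kubevirt_num_virt_handlers_by_node_running_virt_launcher
              expr: |
                sum(
                  count by(node) (node_namespace_pod:kube_pod_info:{pod=~"virt-launcher-.*"})
                  * on(node) group_left(pod) (
                    1 * (kube_pod_container_status_ready{pod=~"virt-handler-.*"}
                      + on(pod) group_left(node) (0 * node_namespace_pod:kube_pod_info:{pod=~"virt-handler-.*"}))
                  ) * 0 + 1
                  or on(node) (0 * node_namespace_pod:kube_pod_info:{pod=~"virt-launcher-.*"})
                ) without (pod, namespace, prometheus)
            # Alert on nodes that run VMs but have no virt-handler pod (value 0).
            - alert: OrphanedVirtualMachineInstances
              expr: kubevirt_num_virt_handlers_by_node_running_virt_launcher == 0
              for: 10m               # assumed evaluation delay, per the note above about eval time
              labels:
                severity: warning    # assumed severity label
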
Created a PR for this issue: https://github.com/kubevirt/kubevirt/pull/7434

Hello Stu, is there any plan to backport this metrics fix to 4.10.1? Thanks, Satyajit.

@sradco Any updates on this? Do you need someone reviewing the upstream PR?

I verified it with virt-controller-v4.11.0-95 and I am not sure the alert works correctly. First of all, the current Bugzilla topic contains non-relevant information: after the latest changes, as I understand it, we no longer have the metric kubevirt_num_virt_handlers_by_node_running_virt_launcher, so the OrphanedVirtualMachineInstances alert is based on Prometheus rules directly. I tried three scenarios and managed to fire the alert in only one of them.

Successful scenario: create a VM and remove all virt-handler pods (by removing the virt-handler daemonset).
Result: the alert was successfully triggered.

Two failed scenarios:

1) Create a VM, update the daemonset with a wrong image, and restart the virt-handler pod on the node where the VM is running. The VM is running and the virt-handler pod is expectedly stuck in ImagePullBackOff state, but the alert was not firing:

> $ oc get pod -o wide
> virt-launcher-vm-label-2sxvw   2/2   Running   0   10m   10.129.2.110   virt-den-411-dstv9-worker-0-wf446   <none>   1/1
> $ oc get pod -n openshift-cnv -o wide | grep handl
> virt-handler-qw7sz   0/1   Init:ImagePullBackOff   0   26s   10.129.2.112   virt-den-411-dstv9-worker-0-wf446   <none>   <none>

2) Create a VM, update the daemonset with a wrong readinessProbe, and restart the virt-handler pod. The VM is running and the virt-handler pod is expectedly stuck in a non-ready state, but the alert still was not firing:

> $ oc get pod -n openshift-cnv -o wide | grep handl
> virt-handler-nqsqv   0/1   Running   0   15m   10.129.2.115   virt-den-411-dstv9-worker-0-wf446   <none>   <none>

Based on that, I think our current implementation works incorrectly. A couple of my thoughts: our alert is based on this rule:

    ((count by(node) (kube_pod_status_ready{condition="true",pod=~"virt-handler.*"} * on(pod) group_left(node) kube_pod_info{pod=~"virt-handler.*"}))
      or (count by(node) (kube_pod_info{pod=~"virt-launcher.*"}) * 0)) == 0

The metric `kube_pod_status_ready{condition="true",pod=~"virt-handler.*"}` lists all existing virt-handler pods in its output regardless of their status; however, for Ready pods it shows Value=1. So, for example, this request will show only active and ready pods:

    kube_pod_status_ready{condition="true",pod=~"virt-handler.*"} == 1

After some tries, I think this rule might work:

    ((count by(node) (kube_pod_status_ready{condition="true",pod=~"virt-handler.*"} * on(pod) group_left(node) kube_pod_info{pod=~"virt-handler.*"})) == 0
      or (count by(node) (kube_pod_info{pod=~"virt-launcher.*"}) * 0)) == 0

It shows me the node where the virt-launcher pod is running but virt-handler is not in Ready state.

Because this failed QA, it must be deferred from the current release, as it is not considered a blocker for 4.11.0. Per Comment #11, there is interest in backporting this once it is resolved.

Verified on CNV v4.12.0-599.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Virtualization 4.12.0 Images security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:0408

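As a reference for anyone revisiting the verification scenarios above: the candidate expression proposed during QE verification can be packaged in the same PrometheusRule form as the earlier sketch, this time as an alert defined directly on the expression (since the recording rule was dropped). The object name, group name, 10m duration, and severity are again illustrative assumptions, not necessarily what shipped in the fixed 4.12 build.

    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: kubevirt-orphaned-vmi-example   # hypothetical name, for illustration only
      namespace: openshift-cnv
    spec:
      groups:
        - name: kubevirt.rules
          rules:
            # Fires for nodes that run virt-launcher pods but have no Ready virt-handler pod.
            - alert: OrphanedVirtualMachineInstances
              expr: |
                ((count by(node) (kube_pod_status_ready{condition="true",pod=~"virt-handler.*"} * on(pod) group_left(node) kube_pod_info{pod=~"virt-handler.*"})) == 0
                  or (count by(node) (kube_pod_info{pod=~"virt-launcher.*"}) * 0)) == 0
              for: 10m                       # assumed; long enough to ride out normal virt-handler restarts
              labels:
                severity: warning            # assumed severity label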