Bug 2052556 - Metric "kubevirt_num_virt_handlers_by_node_running_virt_launcher" reporting incorrect value
Summary: Metric "kubevirt_num_virt_handlers_by_node_running_virt_launcher" reporting incorrect value
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Virtualization
Version: 4.9.3
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.12.0
Assignee: Shirly Radco
QA Contact: Denys Shchedrivyi
URL:
Whiteboard:
Duplicates: 2053128
Depends On:
Blocks: 2053128
 
Reported: 2022-02-09 14:47 UTC by Satyajit Bulage
Modified: 2023-01-24 13:37 UTC (History)
CC List: 5 users

Fixed In Version: hco-bundle-registry-container-v4.12.0-330
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-01-24 13:36:09 UTC
Target Upstream Version:
Embargoed:




Links:
Github kubevirt/kubevirt pull 7434 (Merged): Drop recording rule for orphaned VMs alert and fix alert definition (last updated 2022-08-01 11:33:56 UTC)
Github kubevirt/kubevirt pull 8038 (Merged): Update orphaned vm alert (last updated 2022-08-04 07:40:59 UTC)
Red Hat Issue Tracker CNV-16313 (last updated 2022-11-29 09:55:07 UTC)
Red Hat Product Errata RHSA-2023:0408 (last updated 2023-01-24 13:37:30 UTC)

Description Satyajit Bulage 2022-02-09 14:47:33 UTC
Description of problem: Created several VMs and ran the metric "kubevirt_num_virt_handlers_by_node_running_virt_launcher"; the value it reports corresponds to the number of "virt-launcher" pods on the node.



Version-Release number of selected component (if applicable): 4.10


How reproducible: 100%


Steps to Reproduce:
1. Create 2-3 Virtual Machines.
2. Run the query "kubevirt_num_virt_handlers_by_node_running_virt_launcher".

Actual results: The metric reports the number of virt-launcher pods.


Expected results: The metric should report the number of virt-handler pods.


Additional info:

Comment 2 sgott 2022-02-16 13:19:08 UTC
Shirly, could you please clarify what this metric is intended to measure?

Comment 3 sgott 2022-02-16 13:20:49 UTC
*** Bug 2053128 has been marked as a duplicate of this bug. ***

Comment 4 Shirly Radco 2022-02-16 13:32:50 UTC
The "kubevirt_num_virt_handlers_by_node_running_virt_launcher" recording
rule is used for the "OrphanedVirtualMachineInstances" alert to find nodes that are running VMs, 
but they are missing virt-handler pod.

The following recording rule value is the number of virt-launcher pods and not the number of virt-handlers.

Comment 5 Shirly Radco 2022-02-23 12:47:58 UTC
I believe the correct recording rule, "kubevirt_num_virt_handlers_by_node_running_virt_launcher", should be:

((count by (node)(kube_pod_info{pod=~'virt-launcher-.*'} ) *0  ) + on (node) group_left(_blah) (count by (node)(group by(node,pod)(node_namespace_pod:kube_pod_info:{pod=~'virt-handler-.*'}) * on (pod) group_left(_blah) (1*group by (pod)(kube_pod_container_status_ready{pod=~'virt-handler-.*'} )==1) ) or vector(0) ) )

and the alert expression should be 

kubevirt_num_virt_handlers_by_node_running_virt_launcher >0

But it must be verified when:
1. There is a node with no virt-handler pod and with at least 1 running VM.
2. There is a node with a virt-handler pod that is not in the Ready state and with at least 1 running VM.

The value of the expression should be the number of virt-launcher pods running on the node.

Note: We should give this alert enough eval time, since it may take a while for the virt-launcher to become ready if it was terminated for some reason.
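
For context, a minimal sketch (not the shipped rule file) of how such a recording rule and alert would typically be wired together in a Prometheus rule group. The expressions are the ones proposed in this comment, and the 10m "for:" duration is only an illustrative value for the evaluation time mentioned in the note above:

  groups:
  - name: kubevirt.rules
    rules:
    - record: kubevirt_num_virt_handlers_by_node_running_virt_launcher
      expr: |
        ((count by (node)(kube_pod_info{pod=~'virt-launcher-.*'} ) *0  ) + on (node) group_left(_blah) (count by (node)(group by(node,pod)(node_namespace_pod:kube_pod_info:{pod=~'virt-handler-.*'}) * on (pod) group_left(_blah) (1*group by (pod)(kube_pod_container_status_ready{pod=~'virt-handler-.*'} )==1) ) or vector(0) ) )
    - alert: OrphanedVirtualMachineInstances
      expr: kubevirt_num_virt_handlers_by_node_running_virt_launcher > 0
      # "for" delays firing; 10m is an illustrative value, long enough for a restarted pod to become ready
      for: 10m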

Comment 6 sgott 2022-02-23 13:11:15 UTC
*** Bug 2053128 has been marked as a duplicate of this bug. ***

Comment 7 Shirly Radco 2022-03-17 13:34:53 UTC
I'll need to rewrite the query so that it returns the node and a 0 or 1 value.
1 will indicate that the node has virt-launcher and virt-handler pods - which is good.
0 will indicate that the node has virt-launcher pods but no virt-handler pod - which we should alert on.

We should remove the "pod" and "namespace" labels from the query results, since they are confusing and not required.
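
As an aside, a minimal PromQL sketch of the label clean-up described here, assuming the standard "without" aggregation modifier (the selector below is only illustrative):

  # Hypothetical illustration: aggregate away the per-pod labels so the result is keyed by node only.
  sum without (pod, namespace) (
    node_namespace_pod:kube_pod_info:{pod=~"virt-launcher-.*"}
  )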

Comment 8 sgott 2022-03-23 12:59:31 UTC
@sradco per Comment #7 are you working on this?

Comment 9 Shirly Radco 2022-03-23 13:37:19 UTC
This is the updated query for the recording rule "kubevirt_num_virt_handlers_by_node_running_virt_launcher"

sum(count by(node) (node_namespace_pod:kube_pod_info:{pod=~"virt-launcher-.*"}) * on(node) group_left(pod) ( 1*(kube_pod_container_status_ready{pod=~"virt-handler-.*"} + on(pod) group_left(node) ( 0*node_namespace_pod:kube_pod_info:{pod=~"virt-handler-.*"})))*0 +1 or on(node) (0 * node_namespace_pod:kube_pod_info:{pod=~"virt-launcher-.*"})) without (pod,namespace,prometheus)

It will return the list of nodes that are running VMs. The value for each node will be 0 if there is no virt-handler pod on the node and 1 if there is a virt-handler pod on the node.
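
A hypothetical illustration of the result shape described above (node names invented). Per Comment #7, the value 0 is the case to alert on, so the alert expression would presumably become a check for 0 (an assumption, not stated in this comment):

  # Hypothetical sample output of the recording rule, one series per node that runs VMs:
  #   {node="worker-0"}  1    -> virt-launcher and virt-handler present: OK
  #   {node="worker-1"}  0    -> virt-launcher present, no virt-handler: orphaned VMs
  # Assumed alert expression:
  kubevirt_num_virt_handlers_by_node_running_virt_launcher == 0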

Comment 10 Shirly Radco 2022-03-23 14:08:08 UTC
Created a PR for this issue https://github.com/kubevirt/kubevirt/pull/7434

Comment 11 Satyajit Bulage 2022-04-11 10:03:33 UTC
Hello Stu,
Is there any plan to backport this metric fix to 4.10.1?
Thanks,
Satyajit.

Comment 12 Antonio Cardace 2022-05-06 08:50:51 UTC
@sradco Any updates on this? Do you need someone reviewing the upstream PR?

Comment 13 Denys Shchedrivyi 2022-06-24 18:15:32 UTC
I verified it with virt-controller-v4.11.0-95 and I'm not sure the alert works correctly.

First of all, the current Bugzilla topic contains non-relevant info: after the latest changes, as I understand it, we no longer have the metric kubevirt_num_virt_handlers_by_node_running_virt_launcher, so the alert OrphanedVirtualMachineInstances is now based directly on Prometheus rules.

I tried 3 scenarios and managed to fire the alert in only one of them.


 Successful scenario:
Create a VM and remove all virt-handler pods (by removing the virt-handler daemonset)
 Result: the alert was successfully triggered


 Two failed scenarios:
1) Create a VM, update the daemonset with a wrong image, and restart the virt-handler pod on the node where the VM is running

 The VM is running and the virt-handler pod is, as expected, stuck in the ImagePullBackOff state, but the alert was not firing:

> $ oc get pod -o wide
>virt-launcher-vm-label-2sxvw   2/2     Running   0          10m   10.129.2.110   virt-den-411-dstv9-worker-0-wf446   <none>           1/1

> $ oc get pod -n openshift-cnv -o wide | grep handl
> virt-handler-qw7sz                                                0/1     Init:ImagePullBackOff   0             26s   10.129.2.112    virt-den-411-dstv9-worker-0-wf446   <none>           <none>


2) Create a VM, update the daemonset with a wrong readinessProbe, and restart the virt-handler pod
 The VM is running and the virt-handler pod is, as expected, stuck in a non-Ready state, but the alert still was not firing:

> $ oc get pod -n openshift-cnv -o wide | grep handl
> virt-handler-nqsqv                                                0/1     Running   0             15m   10.129.2.115    virt-den-411-dstv9-worker-0-wf446   <none>           <none>


Based on that, I think our current implementation works incorrectly.

A couple of thoughts: our alert is based on this rule:
> ((count by(node) (kube_pod_status_ready{condition="true",pod=~"virt-handler.*"} * on(pod) group_left(node) kube_pod_info{pod=~"virt-handler.*"})) or (count by(node) (kube_pod_info{pod=~"virt-launcher.*"}) * 0)) == 0

The metric `kube_pod_status_ready{condition="true",pod=~"virt-handler.*"}` returns all existing virt-handler pods regardless of their status; however, only Ready pods have the value 1. So, for example, this query will show only active and ready pods:
> kube_pod_status_ready{condition="true",pod=~"virt-handler.*"} == 1
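
A minimal sketch of the distinction being made here, assuming kube-state-metrics exposes a kube_pod_status_ready series for every pod, with value 1 only when the pod is Ready: count() counts series regardless of their value, so without a value filter the per-node count includes non-Ready virt-handler pods.

  # Counts every virt-handler pod per node, Ready or not (the series exists with value 0 or 1):
  count by (node) (
    kube_pod_status_ready{condition="true", pod=~"virt-handler.*"}
    * on (pod) group_left(node) kube_pod_info{pod=~"virt-handler.*"}
  )

  # Counts only Ready virt-handler pods per node, because "== 1" drops the non-Ready series first:
  count by (node) (
    (kube_pod_status_ready{condition="true", pod=~"virt-handler.*"} == 1)
    * on (pod) group_left(node) kube_pod_info{pod=~"virt-handler.*"}
  )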


After some attempts, I think this rule might work:

> ((count by(node) (kube_pod_status_ready{condition="true",pod=~"virt-handler.*"} * on(pod) group_left(node) kube_pod_info{pod=~"virt-handler.*"})) == 0 or (count by(node) (kube_pod_info{pod=~"virt-launcher.*"}) * 0)) == 0

It shows me the node where a virt-launcher pod is running but the virt-handler pod is not in the Ready state.

Comment 15 sgott 2022-06-28 13:05:54 UTC
Because this failed QA and is not considered a blocker for 4.11.0, it must be deferred from the current release. Per Comment #11, there is interest in backporting this once it is resolved.

Comment 17 Denys Shchedrivyi 2022-10-17 17:32:57 UTC
Verified on CNV v4.12.0-599

Comment 21 errata-xmlrpc 2023-01-24 13:36:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Virtualization 4.12.0 Images security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:0408

