Description of problem:
Version: 4.10

There is no case in which the NoReadyVirtController or NoReadyVirtOperator alerts fire. They are supposed to fire when a virt-controller/virt-operator pod exists but is not ready yet. Because of the alert definitions and the metric implementations, the `ready` metrics can never be 0, so the alerts are never triggered.
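For context, this is the shape of the rules involved (a PromQL sketch: the virt-operator expressions quoted later in this bug, with the virt-controller metric names substituted; the comments are my reading of why the alert can never fire):

# Alert expression:
kubevirt_virt_controller_ready_total == 0

# Recording rule behind it:
sum(kubevirt_virt_controller_ready{namespace='%s'})

# If no virt-controller pod exposes kubevirt_virt_controller_ready,
# sum() over an empty vector returns no samples at all rather than 0,
# so the "== 0" comparison never matches and the alert never fires.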
I see we have an additional issue with the calculation of these metrics. I edited the virt-controller deployment and set the readiness probe to a wrong value. As a result I got 2 pods in a non-ready state:

virt-controller-dfd744474-68nx7   0/1   Running   0   5m12s
virt-controller-dfd744474-drw5t   0/1   Running   0   5m12s

with these events on the pods:

Warning  Unhealthy   6m9s (x11 over 7m39s)   kubelet  Readiness probe failed: Get "https://10.128.3.20:3/leader": dial tcp 10.128.3.20:3: connect: connection refused
Warning  ProbeError  2m49s (x33 over 7m39s)  kubelet  Readiness probe error: Get "https://10.128.3.20:3/leader": dial tcp 10.128.3.20:3: connect: connection refused

but both metrics, kubevirt_virt_controller_up_total and kubevirt_virt_controller_ready_total, are equal to *2* (which erroneously means that both pods are running and ready).

I guess these alerts are also affected by the same issue: LowReadyVirtControllersCount and LowReadyVirtOperatorsCount.
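For anyone reproducing this, one way to see the discrepancy directly (assuming kube-state-metrics is available on the cluster; the pod name regex is just for this reproduction):

# Readiness as the kubelet reports it (0 for the two non-ready pods):
kube_pod_status_ready{condition="true", pod=~"virt-controller-.*"}

# The KubeVirt custom metrics, which stayed at 2 throughout the test:
kubevirt_virt_controller_up_total
kubevirt_virt_controller_ready_total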
Can you please elaborate on why you think the definition and implementation are wrong? We are implementing a custom metric. This metric is reported when we are ready to do the work (before acquiring the leadership lock). Therefore, editing the deployment and failing the readiness probe will not affect this alert (see comment #1). I don't see a use case for supporting a custom readiness probe. What I do see as a possible defect is the time it takes for us to fire this alert. We have set the time to 10 minutes (see comment #1, where the test ran for only 5 minutes). I would appreciate further clarification, thanks!
Let's take the NoReadyVirtOperator alert as an example. In our current implementation it is based on this rule:

> kubevirt_virt_operator_ready_total == 0

The metric kubevirt_virt_operator_ready_total is in turn based on this rule:

> sum(kubevirt_virt_operator_ready{namespace='%s'})

The metric kubevirt_virt_operator_ready is our custom metric, and the problem is that it is *never equal to 0*: when there are no virt-operator pods on the cluster, the metric is absent (None), as you can see in the attached screenshot.

Regarding the 10-minute waiting time: the alert changes its status to Pending as soon as the rule evaluates to true. The timer is only used for the transition from Pending to Firing, so for testing purposes it might be enough to check whether the alert reaches the Pending state.
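For illustration only, two standard PromQL idioms that would let the comparison match an absent metric (a sketch, not necessarily the fix that actually shipped; '%s' is the namespace placeholder from the rule above):

# Option 1: make the recording rule default to 0 when the underlying
# custom metric is absent, so the "== 0" comparison can match:
sum(kubevirt_virt_operator_ready{namespace='%s'}) or vector(0)

# Option 2: extend the alert expression to also fire on absence:
kubevirt_virt_operator_ready_total == 0
  or absent(kubevirt_virt_operator_ready_total)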
Verified on cnv v4.11.0-387 (virt-operator v4.11.0-75)
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Virtualization 4.11.0 Images security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:6526