Description of problem:
Version: 4.10

There is no case in which the NoReadyVirtController or NoReadyVirtOperator alerts fire. They are supposed to fire when a virt-controller/virt-operator pod exists but is not ready yet. Because of the alert definitions and the metric implementations, the `ready` metrics can never be 0, so the alerts are never triggered.
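For context, this is the shape of the rules involved (a PromQL sketch: the virt-operator expressions quoted later in this bug, with the virt-controller metric names substituted; the comments are my reading of why the alert can never fire):

# Alert expression:
kubevirt_virt_controller_ready_total == 0

# Recording rule behind it:
sum(kubevirt_virt_controller_ready{namespace='%s'})

# If no virt-controller pod exposes kubevirt_virt_controller_ready,
# sum() over an empty vector returns no samples at all rather than 0,
# so the "== 0" comparison never matches and the alert never fires.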
I see we have an additional issue with the calculation of these metrics. I edited the virt-controller deployment and set the readiness probe to a wrong value. As a result I got 2 pods in a non-ready state:

virt-controller-dfd744474-68nx7   0/1   Running   0   5m12s
virt-controller-dfd744474-drw5t   0/1   Running   0   5m12s

with these events on the pods:

Warning  Unhealthy   6m9s (x11 over 7m39s)   kubelet  Readiness probe failed: Get "https://10.128.3.20:3/leader": dial tcp 10.128.3.20:3: connect: connection refused
Warning  ProbeError  2m49s (x33 over 7m39s)  kubelet  Readiness probe error: Get "https://10.128.3.20:3/leader": dial tcp 10.128.3.20:3: connect: connection refused

but both metrics, kubevirt_virt_controller_up_total and kubevirt_virt_controller_ready_total, are equal to *2* (which erroneously means that both pods are running and ready).

I guess these alerts are also affected by the same issue: LowReadyVirtControllersCount and LowReadyVirtOperatorsCount.
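For anyone reproducing this, one way to see the discrepancy directly (assuming kube-state-metrics is available on the cluster; the pod name regex is just for this reproduction):

# Readiness as the kubelet reports it (0 for the two non-ready pods):
kube_pod_status_ready{condition="true", pod=~"virt-controller-.*"}

# The KubeVirt custom metrics, which stayed at 2 throughout the test:
kubevirt_virt_controller_up_total
kubevirt_virt_controller_ready_total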
Can you please elaborate on why you think the definition and implementation are wrong? We are implementing a custom metric. This metric is reported when we are ready to do the work (before acquiring the leadership lock). Therefore, editing the deployment and failing the readiness probe will not affect this alert (see comment #1). I don't see a use case for supporting a custom readiness probe. What I do see as a possible defect is the time it takes for us to fire this alert. We have set the time to 10 minutes (see comment #1, where the test ran for only 5 minutes). I would appreciate further clarification, thanks!
Let's take the NoReadyVirtOperator alert as an example. In our current implementation it is based on this rule:

> kubevirt_virt_operator_ready_total == 0

The metric kubevirt_virt_operator_ready_total is in turn based on this rule:

> sum(kubevirt_virt_operator_ready{namespace='%s'})

The metric kubevirt_virt_operator_ready is our custom metric, and the problem is that it is *never equal to 0*: when there are no virt-operator pods on the cluster, the metric is absent (None), as you can see in the attached screenshot.

Regarding the 10-minute waiting time: the alert changes its status to Pending as soon as the rule evaluates to true. The timer is only used for the transition from Pending to Firing, so for testing purposes it might be enough to check whether the alert reaches the Pending state.
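For illustration only, two standard PromQL idioms that would let the comparison match an absent metric (a sketch, not necessarily the fix that actually shipped; '%s' is the namespace placeholder from the rule above):

# Option 1: make the recording rule default to 0 when the underlying
# custom metric is absent, so the "== 0" comparison can match:
sum(kubevirt_virt_operator_ready{namespace='%s'}) or vector(0)

# Option 2: extend the alert expression to also fire on absence:
kubevirt_virt_operator_ready_total == 0
  or absent(kubevirt_virt_operator_ready_total)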
Verified on cnv v4.11.0-387 (virt-operator v4.11.0-75)
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Virtualization 4.11.0 Images security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:6526