Bug 2039411

Summary: Monitoring operator reports unavailable=true while one Prometheus pod is ready
Product: OpenShift Container Platform
Reporter: Simon Pasquier <spasquie>
Component: Monitoring
Assignee: Sunil Thaha <sthaha>
Status: CLOSED ERRATA
QA Contact: hongyan li <hongyli>
Severity: medium
Docs Contact: Brian Burt <bburt>
Priority: high
Version: 4.6
CC: anpicker, aos-bugs, bburt, hongyli, juzhao, sthaha, wking
Target Milestone: ---
Target Release: 4.12.0
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Doc Text:
* Before this update, if the Cluster Monitoring Operator (CMO) failed to update Prometheus, the CMO did not verify whether a previous deployment was running and would report that cluster monitoring was unavailable even if one of the Prometheus pods was still running. With this update, the CMO now checks for running Prometheus pods in this situation and reports that cluster monitoring is unavailable only if no Prometheus pods are running. (link:https://bugzilla.redhat.com/show_bug.cgi?id=2039411[*BZ#2039411*])
Last Closed: 2023-01-17 19:46:49 UTC
Type: Bug

Description Simon Pasquier 2022-01-11 16:54:09 UTC
Description of problem:
The monitoring operator always goes "available=false/degraded=true" when one of the reconciliation tasks fails. In some cases it should stay "available=true", for example when at least one Prometheus pod is still up and serving.

Version-Release number of selected component (if applicable):
4.10 and before

How reproducible:
Always

Steps to Reproduce:
1. Mark the node running prometheus-k8s-1 as not schedulable (see the command sketch after step 3).
2. Trigger a rollout of the pod (delete the pod?)
3. Wait for the monitoring operator to change its conditions.
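A minimal sketch of the commands for the steps above, assuming <node-name> is a placeholder for the node currently hosting prometheus-k8s-1:
% oc adm cordon <node-name>
% oc -n openshift-monitoring delete pod prometheus-k8s-1
% oc get co monitoring -w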

Actual results:
The operator reports "available=false/degraded=true".

Expected results:
The operator should report "available=true/degraded=true" because the other prometheus pod is up and running.
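One way to read both conditions directly while checking this (a sketch; the jsonpath template is only an illustration, adjust quoting for your shell):
% oc get co monitoring -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'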

Additional info:
https://coreos.slack.com/archives/C0VMT03S5/p1641918136031400

Comment 6 hongyan li 2022-08-25 07:58:15 UTC
Taking the bug, as it has the same PR as https://bugzilla.redhat.com/show_bug.cgi?id=2043518

Comment 7 hongyan li 2022-08-25 08:08:16 UTC
Tested with the PR.
For a cluster with 3 worker nodes, taint 2 of the worker nodes:
% oc adm taint nodes <node-name> prometheus:NoSchedule
% oc -n openshift-monitoring get pod|grep prometheus-k8s
prometheus-k8s-0                                         6/6     Running   0          33m
prometheus-k8s-1                                         0/6     Pending   0          13m
% oc get co monitoring
NAME         VERSION                                                   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
monitoring   4.12.0-0.ci.test-2022-08-25-065719-ci-ln-pjcbf32-latest   True        False         True       95s     SomePodsNotReady: shard 0: pod prometheus-k8s-1: 0/6 nodes are available: 1 node(s) didn't match pod anti-affinity rules, 2 node(s) had untolerated taint {prometheus: }, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }, 4 node(s) had no available volume zone, 4 node(s) had volume node affinity conflict. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling.
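To restore the cluster after this test, the taint used above can be removed again (a sketch, reusing the same <node-name> placeholder as the taint command):
% oc adm taint nodes <node-name> prometheus:NoSchedule-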

Comment 8 hongyan li 2022-10-20 02:30:13 UTC
Changed the bug to VERIFIED, as the PR is tested and merged.

Comment 11 errata-xmlrpc 2023-01-17 19:46:49 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399