Bug 2039411 - Monitoring operator reports unavailable=true while one Prometheus pod is ready
Summary: Monitoring operator reports unavailable=true while one Prometheus pod is ready
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: 4.12.0
Assignee: Sunil Thaha
QA Contact: hongyan li
Docs Contact: Brian Burt
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-01-11 16:54 UTC by Simon Pasquier
Modified: 2023-01-17 19:47 UTC
CC List: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
* Before this update, if the Cluster Monitoring Operator (CMO) failed to update Prometheus, the CMO did not verify whether a previous deployment was running and would report that cluster monitoring was unavailable even if one of the Prometheus pods was still running. With this update, the CMO now checks for running Prometheus pods in this situation and reports that cluster monitoring is unavailable only if no Prometheus pods are running. (link:https://bugzilla.redhat.com/show_bug.cgi?id=2039411[*BZ#2039411*])
Clone Of:
Environment:
Last Closed: 2023-01-17 19:46:49 UTC
Target Upstream Version:
Embargoed:




Links
- Github: openshift cluster-monitoring-operator pull 1558 (open) - "Bug 2043518: set degraded and available status based on Prometheus pod status" - last updated 2022-02-16 00:36:54 UTC
- Red Hat Product Errata: RHSA-2022:7399 - last updated 2023-01-17 19:47:09 UTC

Description Simon Pasquier 2022-01-11 16:54:09 UTC
Description of problem:
The monitoring operator always goes "available=false/degraded=true" when one of the reconciliation tasks fails. In some cases it should stay "available=true", for example when at least one of the Prometheus pods is still up and running.

Version-Release number of selected component (if applicable):
4.10 and before

How reproducible:
Always

Steps to Reproduce:
1. Mark the node running prometheus-k8s-1 as unschedulable
2. Trigger a rollout of the pod (delete the pod?)
3. Wait for the monitoring operator to change its conditions (a command sketch follows these steps).
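
A minimal command sketch of the steps above, assuming cordoning is used to mark the node unschedulable (the node name is a placeholder):

% oc adm cordon <node-running-prometheus-k8s-1>
% oc -n openshift-monitoring delete pod prometheus-k8s-1
% oc get co monitoring -w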

Actual results:
The operator reports "available=false/degraded=true".

Expected results:
The operator should report "available=true/degraded=true" because the other prometheus pod is up and running.
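
One way to inspect what the operator reports and whether the other Prometheus pod is still running (the jsonpath query is illustrative, not taken from this report):

% oc get co monitoring -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
% oc -n openshift-monitoring get pod | grep prometheus-k8s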

Additional info:
https://coreos.slack.com/archives/C0VMT03S5/p1641918136031400

Comment 6 hongyan li 2022-08-25 07:58:15 UTC
Taking the bug as it has the same PR as https://bugzilla.redhat.com/show_bug.cgi?id=2043518

Comment 7 hongyan li 2022-08-25 08:08:16 UTC
Tested with the PR.
For a cluster with 3 worker nodes, taint 2 of the worker nodes:
% oc adm taint nodes <node-name> prometheus:NoSchedule
% oc -n openshift-monitoring get pod|grep prometheus-k8s
prometheus-k8s-0                                         6/6     Running   0          33m
prometheus-k8s-1                                         0/6     Pending   0          13m
% oc get co monitoring
NAME         VERSION                                                   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
monitoring   4.12.0-0.ci.test-2022-08-25-065719-ci-ln-pjcbf32-latest   True        False         True       95s     SomePodsNotReady: shard 0: pod prometheus-k8s-1: 0/6 nodes are available: 1 node(s) didn't match pod anti-affinity rules, 2 node(s) had untolerated taint {prometheus: }, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }, 4 node(s) had no available volume zone, 4 node(s) had volume node affinity conflict. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling.
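
Possible cleanup after this verification, assuming the taint added above should be removed (node name is a placeholder; the trailing "-" removes the taint):
% oc adm taint nodes <node-name> prometheus:NoSchedule-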

Comment 8 hongyan li 2022-10-20 02:30:13 UTC
Changed the bug to VERIFIED as the PR has been tested and merged.

Comment 11 errata-xmlrpc 2023-01-17 19:46:49 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399

