Bug 2039411 - Monitoring operator reports unavailable=true while one Prometheus pod is ready
Summary: Monitoring operator reports unavailable=true while one Prometheus pod is ready
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: 4.12.0
Assignee: Sunil Thaha
QA Contact: hongyan li
Docs Contact: Brian Burt
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-01-11 16:54 UTC by Simon Pasquier
Modified: 2023-01-17 19:47 UTC
CC List: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
* Before this update, if the Cluster Monitoring Operator (CMO) failed to update Prometheus, the CMO did not verify whether a previous deployment was running and would report that cluster monitoring was unavailable even if one of the Prometheus pods was still running. With this update, the CMO now checks for running Prometheus pods in this situation and reports that cluster monitoring is unavailable only if no Prometheus pods are running. (link:https://bugzilla.redhat.com/show_bug.cgi?id=2039411[*BZ#2039411*])
Clone Of:
Environment:
Last Closed: 2023-01-17 19:46:49 UTC
Target Upstream Version:
Embargoed:




Links
- Github: openshift cluster-monitoring-operator pull 1558 (open) - "Bug 2043518: set degraded and available status based on Prometheus pod status" - last updated 2022-02-16 00:36:54 UTC
- Red Hat Product Errata: RHSA-2022:7399 - last updated 2023-01-17 19:47:09 UTC

Description Simon Pasquier 2022-01-11 16:54:09 UTC
Description of problem:
The monitoring operator always goes "available=false/degraded=true" when one of the reconciliation tasks fails. In some cases it should stay "available=true", for example when at least one of the Prometheus pods is still up and running.

Version-Release number of selected component (if applicable):
4.10 and before

How reproducible:
Always

Steps to Reproduce:
1. Mark the node running prometheus-k8s-1 as unschedulable
2. Trigger a rollout of the pod (delete the pod?)
3. Wait for the monitoring operator to change its conditions (a command sketch follows these steps).
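
A minimal command sketch of the steps above, assuming cordoning is used to mark the node unschedulable (the node name is a placeholder):

% oc adm cordon <node-running-prometheus-k8s-1>
% oc -n openshift-monitoring delete pod prometheus-k8s-1
% oc get co monitoring -w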

Actual results:
The operator reports "available=false/degraded=true".

Expected results:
The operator should report "available=true/degraded=true" because the other prometheus pod is up and running.
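
One way to inspect what the operator reports and whether the other Prometheus pod is still running (the jsonpath query is illustrative, not taken from this report):

% oc get co monitoring -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
% oc -n openshift-monitoring get pod | grep prometheus-k8s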

Additional info:
https://coreos.slack.com/archives/C0VMT03S5/p1641918136031400

Comment 6 hongyan li 2022-08-25 07:58:15 UTC
Taking the bug as it has the same PR as https://bugzilla.redhat.com/show_bug.cgi?id=2043518

Comment 7 hongyan li 2022-08-25 08:08:16 UTC
Tested with the PR.
For a cluster with 3 worker nodes, taint 2 of the worker nodes:
% oc adm taint nodes <node-name> prometheus:NoSchedule
% oc -n openshift-monitoring get pod|grep prometheus-k8s
prometheus-k8s-0                                         6/6     Running   0          33m
prometheus-k8s-1                                         0/6     Pending   0          13m
% oc get co monitoring
NAME         VERSION                                                   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
monitoring   4.12.0-0.ci.test-2022-08-25-065719-ci-ln-pjcbf32-latest   True        False         True       95s     SomePodsNotReady: shard 0: pod prometheus-k8s-1: 0/6 nodes are available: 1 node(s) didn't match pod anti-affinity rules, 2 node(s) had untolerated taint {prometheus: }, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }, 4 node(s) had no available volume zone, 4 node(s) had volume node affinity conflict. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling.
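
Possible cleanup after this verification, assuming the taint added above should be removed (node name is a placeholder; the trailing "-" removes the taint):
% oc adm taint nodes <node-name> prometheus:NoSchedule-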

Comment 8 hongyan li 2022-10-20 02:30:13 UTC
Changed the bug to VERIFIED as the PR has been tested and merged.

Comment 11 errata-xmlrpc 2023-01-17 19:46:49 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399

