Bug 1935582

Summary: prometheus liveness probes cause issues while replaying WAL
Product: OpenShift Container Platform
Component: Monitoring
Reporter: Sergiusz Urbaniak <surbania>
Assignee: Sergiusz Urbaniak <surbania>
QA Contact: Junqi Zhao <juzhao>
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
Version: 4.6
Target Milestone: ---
Target Release: 4.8.0
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Type: Bug
CC: alegrand, anpicker, erooth, kahara, kakkoyun, lcosic, mas-hatada, mfuruta, pkrupa, rh-container, spasquie
Last Closed: 2021-07-27 22:51:38 UTC
Bug Blocks: 1935585    

Description Sergiusz Urbaniak 2021-03-05 08:24:05 UTC
Description of problem:
Prometheus returns HTTP 503 on every endpoint while it is replaying the WAL. Because the liveness probe also receives a 503, the kubelet keeps killing the pod before the replay can finish, which can lead to an endless restart loop.

This has been reported upstream in https://github.com/prometheus-operator/prometheus-operator/issues/3391 and is fixed in prometheus-operator v0.46.0.
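
For illustration, the shape of the fix is to move the tolerance for a long WAL replay out of the liveness probe and into a startup probe, so the kubelet does not kill the container while replay is still in progress. The sketch below is a minimal hand-written example of that pattern; the paths, port name and thresholds are assumptions for illustration, not the exact values generated by any prometheus-operator release.

containers:
- name: prometheus
  startupProbe:                # tolerates a long WAL replay at pod start
    httpGet:
      path: /-/healthy         # illustrative endpoint/port, not operator-generated values
      port: web
    periodSeconds: 15
    failureThreshold: 60       # allows roughly 15 minutes of replay before a restart
  livenessProbe:               # only evaluated after the startup probe has succeeded
    httpGet:
      path: /-/healthy
      port: web
    periodSeconds: 5
    failureThreshold: 6
  readinessProbe:
    httpGet:
      path: /-/ready
      port: web
    periodSeconds: 5
    failureThreshold: 3

Kubernetes disables the liveness and readiness probes until the startup probe succeeds, so a slow replay keeps the pod out of service via readiness instead of triggering a restart.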

Comment 3 Junqi Zhao 2021-03-08 02:09:18 UTC
4.8 uses prometheus-operator 0.45:
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-03-06-055252   True        False         148m    Cluster version is 4.8.0-0.nightly-2021-03-06-055252

# oc -n openshift-monitoring logs prometheus-operator-684947f46c-28gxl -c prometheus-operator
level=info ts=2021-03-07T23:43:09.721849275Z caller=main.go:233 msg="Starting Prometheus Operator" version="(version=0.45.0, branch=rhaos-4.8-rhel-8, revision=9d3e9a6)"

Comment 5 Masaki Hatada 2021-06-01 11:12:58 UTC
Dear Red Hat,

Does Red Hat have a plan to backport this fix to old versions?

Our customer hit the same issue on OCP 4.5. Since 4.8 is not a GA version yet, they cannot upgrade to it now.
How can we avoid this issue on older versions?

Currently only one Prometheus instance is running in the customer's environment, because the other one keeps getting restarted due to this issue.
If the same issue happened to both instances, cluster monitoring would become completely unavailable. It is very critical.

Best Regards,
Masaki Hatada

Comment 6 Simon Pasquier 2021-06-01 15:40:27 UTC
The bug fix has been backported to 4.6.22 (bug 1935586) and 4.7.2 (bug 1935585).

Comment 7 Masaki Hatada 2021-06-02 01:15:33 UTC
Dear Simon,

Thank you for your update.

> The bug fix has been backported to 4.6.22 (bug 1935586) and 4.7.2 (bug 1935585).

Our customer is using OCP 4.5.
Of course, we will upgrade their cluster in the future, but that will take time.
Please let us know if there is a workaround for this issue.

Best Regards,
Masaki Hatada

Comment 8 Simon Pasquier 2021-06-02 07:24:04 UTC
Unfortunately we have no workaround for 4.5.

Comment 9 Simon Pasquier 2021-06-02 07:25:30 UTC
Clearing the needinfo flag.

Comment 12 errata-xmlrpc 2021-07-27 22:51:38 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438