Bug 1935582 - prometheus liveness probes cause issues while replaying WAL
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Sergiusz Urbaniak
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks: 1935585
Reported: 2021-03-05 08:24 UTC by Sergiusz Urbaniak
Modified: 2021-07-27 22:52 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 1935585
Environment:
Last Closed: 2021-07-27 22:51:38 UTC
Target Upstream Version:
Embargoed:


Links
System ID | Private | Priority | Status | Summary | Last Updated
GitHub prometheus-operator/prometheus-operator issues 3391 | 0 | None | closed | WAL Recovery Restart Loop (EKS) | 2021-03-05 08:31:54 UTC
Red Hat Product Errata RHSA-2021:2438 | 0 | None | None | None | 2021-07-27 22:52:05 UTC

Description Sergiusz Urbaniak 2021-03-05 08:24:05 UTC
Description of problem:
During WAL replay, Prometheus returns HTTP 503 on every endpoint, including the one used by the liveness probe. The probe therefore fails and the container is restarted, which starts the replay over and can lead to an endless restart loop.

This was reported upstream in https://github.com/prometheus-operator/prometheus-operator/issues/3391 and fixed in v0.46.0 of prometheus-operator.
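
To make the failure mode concrete, here is a minimal sketch of when a liveness probe kills a container that keeps answering 503. The field names mirror the Kubernetes probe spec (initialDelaySeconds, periodSeconds, failureThreshold). The numeric values, the class and function names, and the simplified timing model (no probe timeouts, same replay time on every restart) are assumptions for illustration, not the values used by prometheus-operator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LivenessProbe:
    # Field names follow the Kubernetes probe spec; values are hypothetical.
    initial_delay_seconds: int
    period_seconds: int
    failure_threshold: int

    def kill_deadline_seconds(self) -> int:
        """Time of the Nth consecutive failed probe in this simplified model.

        The first probe fires after the initial delay, then one per period.
        """
        return self.initial_delay_seconds + (
            self.failure_threshold - 1
        ) * self.period_seconds


def enters_restart_loop(probe: LivenessProbe, wal_replay_seconds: int) -> bool:
    """True if replay (which returns 503 throughout) outlasts the kill deadline.

    Assumes each restart replays the WAL from scratch in about the same time,
    so a container that is killed once is killed on every attempt.
    """
    return wal_replay_seconds > probe.kill_deadline_seconds()


# Hypothetical probe: first check at 30 s, then every 10 s, killed on 6 failures.
probe = LivenessProbe(initial_delay_seconds=30, period_seconds=10, failure_threshold=6)
print(probe.kill_deadline_seconds())
print(enters_restart_loop(probe, wal_replay_seconds=600))
print(enters_restart_loop(probe, wal_replay_seconds=60))
```

Under these assumptions, any replay longer than the kill deadline never completes, which matches the restart loop described above.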

Comment 3 Junqi Zhao 2021-03-08 02:09:18 UTC
4.8 uses prometheus-operator 0.45
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-03-06-055252   True        False         148m    Cluster version is 4.8.0-0.nightly-2021-03-06-055252

# oc -n openshift-monitoring logs prometheus-operator-684947f46c-28gxl -c prometheus-operator
level=info ts=2021-03-07T23:43:09.721849275Z caller=main.go:233 msg="Starting Prometheus Operator" version="(version=0.45.0, branch=rhaos-4.8-rhel-8, revision=9d3e9a6)"

Comment 5 Masaki Hatada 2021-06-01 11:12:58 UTC
Dear Red Hat,

Does Red Hat plan to backport this fix to older versions?

Our customer hit the same issue in OCP 4.5. 4.8 is not yet GA, so they cannot upgrade to it now.
How can we avoid this issue on older versions?

Currently only one Prometheus instance is running in the customer's environment, because the other one keeps restarting due to this issue.
If the same issue happens in both instances, cluster monitoring becomes completely unavailable. This is very critical.

Best Regards,
Masaki Hatada

Comment 6 Simon Pasquier 2021-06-01 15:40:27 UTC
The bug fix has been backported to 4.6.22 (bug 1935586) and 4.7.2 (bug 1935585).
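
As a rough aid for checking a cluster against these backports, the sketch below compares a GA version string with the minimum z-stream releases named above. The function name and table are hypothetical. Treating every GA 4.8 or later release as fixed is an assumption based on the 4.8 advisory attached to this bug. Pre-release build strings such as nightlies are deliberately not handled, since an early 4.8 nightly in this report still shipped prometheus-operator 0.45.

```python
# Minimum fixed z-stream per minor release, taken from this bug report.
FIXED_IN = {(4, 6): 22, (4, 7): 2}


def has_wal_probe_fix(version: str) -> bool:
    """Return True if a GA version string like '4.6.22' includes the backport.

    Assumption: all GA releases from 4.8 onward include the fix.
    Raises ValueError for anything that is not plain 'x.y.z'.
    """
    major, minor, patch = (int(part) for part in version.split("."))
    if (major, minor) >= (4, 8):
        return True
    floor = FIXED_IN.get((major, minor))
    return floor is not None and patch >= floor


for v in ("4.5.40", "4.6.21", "4.6.22", "4.7.2", "4.8.2"):
    print(v, has_wal_probe_fix(v))
```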

Comment 7 Masaki Hatada 2021-06-02 01:15:33 UTC
Dear Simon,

Thank you for your update.

> The bug fix has been backported to 4.6.22 (bug 1935586) and 4.7.2 (bug 1935585).

Our customer is using OCP 4.5.
Of course, we will upgrade their cluster in the future, but it will take time.
Please let us know if there is a workaround for this issue.

Best Regards,
Masaki Hatada

Comment 8 Simon Pasquier 2021-06-02 07:24:04 UTC
Unfortunately we have no workaround for 4.5.

Comment 9 Simon Pasquier 2021-06-02 07:25:30 UTC
Clearing the needinfo flag.

Comment 12 errata-xmlrpc 2021-07-27 22:51:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

