Bug 2048333 - prometheus-adapter becomes inaccessible during rollout
Summary: prometheus-adapter becomes inaccessible during rollout
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: medium
Target Milestone: ---
Target Release: 4.11.0
Assignee: Joao Marcal
QA Contact: hongyan li
Docs Contact: Brian Burt
URL:
Whiteboard:
Duplicates: 2099373 (view as bug list)
Depends On:
Blocks: 2099526
 
Reported: 2022-01-31 01:30 UTC by Kazuhisa Hara
Modified: 2022-08-10 10:46 UTC (History)
9 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Before this update, prometheus-adapter was rolled out every 15 days when its serving certificate was rotated, and no data could be obtained from prometheus-adapter while the rollout was in progress. With this update, liveness and readiness probes were added to prometheus-adapter, so old pods are deleted only once the new pods are able to serve requests, which resolves the issue.
Clone Of:
Clones: 2099526 (view as bug list)
Environment:
Last Closed: 2022-08-10 10:45:53 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-monitoring-operator pull 1621 0 None Merged Bug 2048333: [bot] Update jsonnet dependencies 2022-06-30 13:26:13 UTC
Github prometheus-operator kube-prometheus pull 1696 0 None Merged Adds readinessProbe and livenessProbe to prometheus-adapter jsonnet 2022-03-30 16:17:52 UTC
Red Hat Product Errata RHSA-2022:5069 0 None None None 2022-08-10 10:46:22 UTC

Description Kazuhisa Hara 2022-01-31 01:30:46 UTC
Description of problem:

prometheus-adapter becomes inaccessible during rollout.

Because the prometheus-adapter serving certificate is rotated every 15 days[1], the deployment is rolled out each time. While this rollout is in progress, no data can be obtained from prometheus-adapter.

```
the server is currently unable to handle the request (get pods.metrics.k8s.io)
```

[1] https://github.com/openshift/cluster-kube-apiserver-operator/blob/release-4.8/pkg/operator/certrotationcontroller/certrotationcontroller.go#L131-L133

```
openshift-cluster-machine-approver                 machine-approver-58488dbb64-dlb4f                                        2/2     Running     0          26d
openshift-monitoring                               prometheus-adapter-7c878c4b69-mkhhd                                      1/1     Running     0          11d <== here
openshift-monitoring                               prometheus-adapter-7c878c4b69-rmcdk                                      1/1     Running     0          11d <== here
openshift-monitoring                               prometheus-k8s-0                                                         7/7     Running     8          26d
openshift-monitoring                               prometheus-k8s-1                                                         7/7     Running     8          26d
openshift-monitoring                               prometheus-operator-5f86995d86-dclzv                                     2/2     Running     3          26d
```



Version-Release number of selected component (if applicable):

    OCP 4.8

How reproducible:

    every time


Steps to Reproduce:
1. Try to get data from the prometheus-adapter when the rollout takes place
2.
3.

Actual results:

    Unable to get data temporarily.


Expected results:

    Since there are multiple prometheus-adapter replicas, data is expected to remain available during the rollout.


Additional info:

    This can be a rollout timing-dependent issue.
    Since the prometheus-adapter pods do not define any probes (readiness/startup), they are marked Ready before they can actually serve data, and requests fail during that window.

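The upstream fix (kube-prometheus pull 1696) adds exactly such probes. Below is a minimal sketch of what readiness and liveness probes on the prometheus-adapter container could look like; the /livez and /readyz paths, the port name, and the timing values are assumptions based on typical Kubernetes aggregated API servers, not the exact values merged upstream:

```yaml
# Hypothetical probe configuration for the prometheus-adapter container.
# Paths, port name, and timings are illustrative assumptions.
livenessProbe:
  httpGet:
    path: /livez
    port: https
    scheme: HTTPS
  initialDelaySeconds: 30
  periodSeconds: 5
  failureThreshold: 5
readinessProbe:
  httpGet:
    path: /readyz
    port: https
    scheme: HTTPS
  initialDelaySeconds: 30
  periodSeconds: 5
  failureThreshold: 5
```

With a readiness probe in place, the Deployment's rolling update terminates an old pod only after a replacement pod reports Ready, so at least one adapter instance can serve metrics.k8s.io requests throughout the rollout.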
Comment 1 Arunprasad Rajkumar 2022-01-31 07:58:16 UTC
I could reproduce the problem by simulating a rollout of the prometheus-adapter deployment. As described in the report, I believe the absence of liveness/readiness probes is causing the loss of service during rollout.

Comment 7 hongyan li 2022-04-02 01:54:34 UTC
Issue still exists in payload 4.11.0-0.nightly-2022-04-01-172551

Comment 8 hongyan li 2022-04-06 02:16:16 UTC
The fix should land in 4.11.0-0.nightly-2022-04-04-224437 and later payloads; no accepted payloads are available yet.

Comment 10 hongyan li 2022-04-07 02:42:18 UTC
Test with payload 4.11.0-0.nightly-2022-04-06-213816
Followed the steps in #c3.

No issue now.

Comment 13 Simon Pasquier 2022-06-21 07:45:03 UTC
*** Bug 2099373 has been marked as a duplicate of this bug. ***

Comment 16 errata-xmlrpc 2022-08-10 10:45:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

