Bug 2048333

Summary: prometheus-adapter becomes inaccessible during rollout
Product: OpenShift Container Platform
Reporter: Kazuhisa Hara <kahara>
Component: Monitoring
Assignee: Joao Marcal <jmarcal>
Status: CLOSED ERRATA
QA Contact: hongyan li <hongyli>
Severity: medium
Docs Contact: Brian Burt <bburt>
Priority: low
Version: 4.8
CC: amuller, anpicker, aos-bugs, bburt, hongyli, mshen, sgordon, spasquie, wking
Target Milestone: ---
Target Release: 4.11.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Before this update, prometheus-adapter was rolled out every 15 days due to certificate rotation, and no data could be obtained from prometheus-adapter while the rollout was in progress. With this update, liveness and readiness probes were added to prometheus-adapter so that the old pods are deleted only once the new pods are able to serve requests, which resolves the issue.
Story Points: ---
Clone Of:
: 2099526 (view as bug list)
Environment:
Last Closed: 2022-08-10 10:45:53 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2099526    

Description Kazuhisa Hara 2022-01-31 01:30:46 UTC
Description of problem:

prometheus-adapter becomes inaccessible during rollout.

Since prometheus-adapter is updated every 15 days[1], it is rolled out each time this happens.
No data can be obtained from prometheus-adapter while the rollout is in progress.

```
the server is currently unable to handle the request (get pods.metrics.k8s.io)
```

[1] https://github.com/openshift/cluster-kube-apiserver-operator/blob/release-4.8/pkg/operator/certrotationcontroller/certrotationcontroller.go#L131-L133

```
openshift-cluster-machine-approver                 machine-approver-58488dbb64-dlb4f                                        2/2     Running     0          26d
openshift-monitoring                               prometheus-adapter-7c878c4b69-mkhhd                                      1/1     Running     0          11d <== here
openshift-monitoring                               prometheus-adapter-7c878c4b69-rmcdk                                      1/1     Running     0          11d <== here
openshift-monitoring                               prometheus-k8s-0                                                         7/7     Running     8          26d
openshift-monitoring                               prometheus-k8s-1                                                         7/7     Running     8          26d
openshift-monitoring                               prometheus-operator-5f86995d86-dclzv                                     2/2     Running     3          26d
```



Version-Release number of selected component (if applicable):

    OCP 4.8

How reproducible:

    every time


Steps to Reproduce:
1. Try to get data from prometheus-adapter while a rollout is taking place

Actual results:

    Data is temporarily unavailable during the rollout.


Expected results:

    Since multiple prometheus-adapter replicas are running, data is expected to remain available throughout the rollout.


Additional info:

    This can be a rollout timing-dependent issue.
    Since prometheus-adapter does not define any probes (readiness/startup), its pods are marked Ready before they can actually serve data, and requests can fail during that window.

Comment 1 Arunprasad Rajkumar 2022-01-31 07:58:16 UTC
I could reproduce the problem by simulating a rollout of the prometheus-adapter deployment. As described in the report, I believe the absence of liveness/readiness probes is causing the loss of service during the rollout.
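
The fix described in the Doc Text above adds readiness and liveness probes. A minimal sketch of what such probes could look like on the prometheus-adapter container follows; the port, paths, and timings here are illustrative assumptions, not copied from the actual cluster-monitoring-operator change:

```yaml
# Hypothetical excerpt of the prometheus-adapter Deployment container spec.
# Port number and probe endpoints are assumptions for illustration.
containers:
- name: prometheus-adapter
  ports:
  - name: https
    containerPort: 6443
  readinessProbe:
    httpGet:
      path: /readyz
      port: https
      scheme: HTTPS
    periodSeconds: 5
    failureThreshold: 3
  livenessProbe:
    httpGet:
      path: /livez
      port: https
      scheme: HTTPS
    periodSeconds: 10
    failureThreshold: 3
```

With a readiness probe in place, the rolling update only terminates an old pod once a replacement reports Ready, so at least one replica can serve metrics.k8s.io requests at all times.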

Comment 7 hongyan li 2022-04-02 01:54:34 UTC
Issue still exists in payload 4.11.0-0.nightly-2022-04-01-172551

Comment 8 hongyan li 2022-04-06 02:16:16 UTC
The fix is in 4.11.0-0.nightly-2022-04-04-224437 and later payloads; no accepted payloads are available for now.

Comment 10 hongyan li 2022-04-07 02:42:18 UTC
Tested with payload 4.11.0-0.nightly-2022-04-06-213816, following the steps in #c3.

The issue no longer occurs.

Comment 13 Simon Pasquier 2022-06-21 07:45:03 UTC
*** Bug 2099373 has been marked as a duplicate of this bug. ***

Comment 16 errata-xmlrpc 2022-08-10 10:45:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069