Bug 2048333

Summary: prometheus-adapter becomes inaccessible during rollout
Product: OpenShift Container Platform
Reporter: Kazuhisa Hara <kahara>
Component: Monitoring
Assignee: Joao Marcal <jmarcal>
Status: CLOSED ERRATA
QA Contact: hongyan li <hongyli>
Severity: medium
Docs Contact: Brian Burt <bburt>
Priority: low
Version: 4.8
CC: amuller, anpicker, aos-bugs, bburt, hongyli, mshen, sgordon, spasquie, wking
Target Milestone: ---
Target Release: 4.11.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Before this update, prometheus-adapter was rolled out every 15 days due to certificate rotation, and no data could be obtained from prometheus-adapter while the rollout was in progress. With this update, liveness and readiness probes were added to prometheus-adapter so that the old pods are deleted only once the new pods are able to serve requests, which resolves the issue.
Story Points: ---
Clone Of:
: 2099526 (view as bug list)
Environment:
Last Closed: 2022-08-10 10:45:53 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2099526    

Description Kazuhisa Hara 2022-01-31 01:30:46 UTC
Description of problem:

prometheus-adapter becomes inaccessible during rollout.

Since prometheus-adapter is updated every 15 days[1], it is rolled out each time this happens.
No data can be obtained from prometheus-adapter while the rollout is in progress.

```
the server is currently unable to handle the request (get pods.metrics.k8s.io)
```

[1] https://github.com/openshift/cluster-kube-apiserver-operator/blob/release-4.8/pkg/operator/certrotationcontroller/certrotationcontroller.go#L131-L133

```
openshift-cluster-machine-approver                 machine-approver-58488dbb64-dlb4f                                        2/2     Running     0          26d
openshift-monitoring                               prometheus-adapter-7c878c4b69-mkhhd                                      1/1     Running     0          11d <== here
openshift-monitoring                               prometheus-adapter-7c878c4b69-rmcdk                                      1/1     Running     0          11d <== here
openshift-monitoring                               prometheus-k8s-0                                                         7/7     Running     8          26d
openshift-monitoring                               prometheus-k8s-1                                                         7/7     Running     8          26d
openshift-monitoring                               prometheus-operator-5f86995d86-dclzv                                     2/2     Running     3          26d
```



Version-Release number of selected component (if applicable):

    OCP 4.8

How reproducible:

    every time


Steps to Reproduce:
1. Try to get data from prometheus-adapter while a rollout is taking place

Actual results:

    Data is temporarily unavailable during the rollout.


Expected results:

    Since multiple prometheus-adapter replicas are running, data is expected to remain available throughout the rollout.


Additional info:

    This can be a rollout timing-dependent issue.
    Since prometheus-adapter does not define any probes (readiness/startup), its pods are marked Ready before they can actually serve data, and requests can fail during that window.

Comment 1 Arunprasad Rajkumar 2022-01-31 07:58:16 UTC
I could reproduce the problem by simulating a rollout of the prometheus-adapter deployment. As described in the report, I believe the absence of liveness/readiness probes is causing the loss of service during the rollout.
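
The fix described in the Doc Text above adds readiness and liveness probes. A minimal sketch of what such probes could look like on the prometheus-adapter container follows; the port, paths, and timings here are illustrative assumptions, not copied from the actual cluster-monitoring-operator change:

```yaml
# Hypothetical excerpt of the prometheus-adapter Deployment container spec.
# Port number and probe endpoints are assumptions for illustration.
containers:
- name: prometheus-adapter
  ports:
  - name: https
    containerPort: 6443
  readinessProbe:
    httpGet:
      path: /readyz
      port: https
      scheme: HTTPS
    periodSeconds: 5
    failureThreshold: 3
  livenessProbe:
    httpGet:
      path: /livez
      port: https
      scheme: HTTPS
    periodSeconds: 10
    failureThreshold: 3
```

With a readiness probe in place, the rolling update only terminates an old pod once a replacement reports Ready, so at least one replica can serve metrics.k8s.io requests at all times.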

Comment 7 hongyan li 2022-04-02 01:54:34 UTC
Issue still exists in payload 4.11.0-0.nightly-2022-04-01-172551

Comment 8 hongyan li 2022-04-06 02:16:16 UTC
The fix is in 4.11.0-0.nightly-2022-04-04-224437 and later payloads; no accepted payloads are available for now.

Comment 10 hongyan li 2022-04-07 02:42:18 UTC
Tested with payload 4.11.0-0.nightly-2022-04-06-213816, following the steps in #c3.

The issue no longer occurs.

Comment 13 Simon Pasquier 2022-06-21 07:45:03 UTC
*** Bug 2099373 has been marked as a duplicate of this bug. ***

Comment 16 errata-xmlrpc 2022-08-10 10:45:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069