Bug 2048333 - prometheus-adapter becomes inaccessible during rollout
Summary: prometheus-adapter becomes inaccessible during rollout
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: medium
Target Milestone: ---
Target Release: 4.11.0
Assignee: Joao Marcal
QA Contact: hongyan li
Docs Contact: Brian Burt
URL:
Whiteboard:
Duplicates: 2099373 (view as bug list)
Depends On:
Blocks: 2099526
 
Reported: 2022-01-31 01:30 UTC by Kazuhisa Hara
Modified: 2022-08-10 10:46 UTC (History)
9 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Before this update, prometheus-adapter was rolled out every 15 days when its serving certificate was rotated, and no data could be obtained from prometheus-adapter while the rollout was in progress. With this update, liveness and readiness probes were added to prometheus-adapter, so old pods are deleted only once the new pods are able to serve requests, which resolves the issue.
Clone Of:
Clones: 2099526 (view as bug list)
Environment:
Last Closed: 2022-08-10 10:45:53 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-monitoring-operator pull 1621 0 None Merged Bug 2048333: [bot] Update jsonnet dependencies 2022-06-30 13:26:13 UTC
Github prometheus-operator kube-prometheus pull 1696 0 None Merged Adds readinessProbe and livenessProbe to prometheus-adapter jsonnet 2022-03-30 16:17:52 UTC
Red Hat Product Errata RHSA-2022:5069 0 None None None 2022-08-10 10:46:22 UTC

Description Kazuhisa Hara 2022-01-31 01:30:46 UTC
Description of problem:

prometheus-adapter becomes inaccessible during rollout.

Because the prometheus-adapter serving certificate is rotated every 15 days[1], the deployment is rolled out each time. While this rollout is in progress, no data can be obtained from prometheus-adapter.

```
the server is currently unable to handle the request (get pods.metrics.k8s.io)
```

[1] https://github.com/openshift/cluster-kube-apiserver-operator/blob/release-4.8/pkg/operator/certrotationcontroller/certrotationcontroller.go#L131-L133

```
openshift-cluster-machine-approver                 machine-approver-58488dbb64-dlb4f                                        2/2     Running     0          26d
openshift-monitoring                               prometheus-adapter-7c878c4b69-mkhhd                                      1/1     Running     0          11d <== here
openshift-monitoring                               prometheus-adapter-7c878c4b69-rmcdk                                      1/1     Running     0          11d <== here
openshift-monitoring                               prometheus-k8s-0                                                         7/7     Running     8          26d
openshift-monitoring                               prometheus-k8s-1                                                         7/7     Running     8          26d
openshift-monitoring                               prometheus-operator-5f86995d86-dclzv                                     2/2     Running     3          26d
```



Version-Release number of selected component (if applicable):

    OCP 4.8

How reproducible:

    every time


Steps to Reproduce:
1. Try to get data from the prometheus-adapter when the rollout takes place
2.
3.

Actual results:

    Unable to get data temporarily.


Expected results:

    Since there are multiple prometheus-adapter replicas, data is expected to remain available during the rollout.


Additional info:

    This can be a rollout timing-dependent issue.
    Since the prometheus-adapter pods do not define any probes (readiness/startup), they are marked Ready before they can actually serve data, and requests fail during that window.

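The upstream fix (kube-prometheus pull 1696) adds exactly such probes. Below is a minimal sketch of what readiness and liveness probes on the prometheus-adapter container could look like; the /livez and /readyz paths, the port name, and the timing values are assumptions based on typical Kubernetes aggregated API servers, not the exact values merged upstream:

```yaml
# Hypothetical probe configuration for the prometheus-adapter container.
# Paths, port name, and timings are illustrative assumptions.
livenessProbe:
  httpGet:
    path: /livez
    port: https
    scheme: HTTPS
  initialDelaySeconds: 30
  periodSeconds: 5
  failureThreshold: 5
readinessProbe:
  httpGet:
    path: /readyz
    port: https
    scheme: HTTPS
  initialDelaySeconds: 30
  periodSeconds: 5
  failureThreshold: 5
```

With a readiness probe in place, the Deployment's rolling update terminates an old pod only after a replacement pod reports Ready, so at least one adapter instance can serve metrics.k8s.io requests throughout the rollout.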
Comment 1 Arunprasad Rajkumar 2022-01-31 07:58:16 UTC
I could reproduce the problem by simulating a rollout of the prometheus-adapter deployment. As described in the report, I believe the absence of liveness/readiness probes is causing the loss of service during rollout.

Comment 7 hongyan li 2022-04-02 01:54:34 UTC
Issue still exists in payload 4.11.0-0.nightly-2022-04-01-172551

Comment 8 hongyan li 2022-04-06 02:16:16 UTC
The fix should land in 4.11.0-0.nightly-2022-04-04-224437 and later payloads; no accepted payloads are available yet.

Comment 10 hongyan li 2022-04-07 02:42:18 UTC
Test with payload 4.11.0-0.nightly-2022-04-06-213816
Followed the steps in #c3.

No issue now.

Comment 13 Simon Pasquier 2022-06-21 07:45:03 UTC
*** Bug 2099373 has been marked as a duplicate of this bug. ***

Comment 16 errata-xmlrpc 2022-08-10 10:45:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

