Bug 1801573 - [4.4] 502 error for Prometheus API after the cluster running overnight
Summary: [4.4] 502 error for Prometheus API after the cluster running overnight
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: apiserver-auth
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.4.0
Assignee: Stefan Schimanski
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks: 1809253
 
Reported: 2020-02-11 09:02 UTC by Junqi Zhao
Modified: 2020-05-04 11:36 UTC
CC List: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1809253 (view as bug list)
Environment:
Last Closed: 2020-05-04 11:35:29 UTC
Target Upstream Version:
Embargoed:


Attachments
targets in the cluster (2.52 MB, text/plain)
2020-02-11 09:02 UTC, Junqi Zhao
all monitoring UIs show "Application is not available" (51.91 KB, image/png)
2020-02-11 10:13 UTC, Junqi Zhao


Links:
Github openshift oauth-proxy pull 152 (closed): Bug 1801573: Reload serving certs (last updated 2021-01-18 16:38:50 UTC)
Red Hat Product Errata RHBA-2020:0581 (last updated 2020-05-04 11:36:01 UTC)

Description Junqi Zhao 2020-02-11 09:02:27 UTC
Created attachment 1662403 [details]
targets in the cluster

Description of problem:
UPI on bare metal cluster, cluster version 4.4.0-0.nightly-2020-02-10-013941; many users are running pods on this cluster. It was healthy on the first day, but I see errors after it has been running overnight:
# oc -n openshift-monitoring -c prometheus-proxy logs prometheus-k8s-0 | tail
2020/02/11 08:46:34 server.go:3055: http: TLS handshake error from 10.128.2.1:37170: remote error: tls: unknown certificate authority
2020/02/11 08:46:34 server.go:3055: http: TLS handshake error from 10.131.0.1:51492: remote error: tls: unknown certificate authority
2020/02/11 08:46:39 server.go:3055: http: TLS handshake error from 10.131.0.1:51830: remote error: tls: unknown certificate authority
2020/02/11 08:46:39 server.go:3055: http: TLS handshake error from 10.128.2.1:37512: remote error: tls: unknown certificate authority
2020/02/11 08:46:41 server.go:3055: http: TLS handshake error from 10.128.2.28:55798: remote error: tls: bad certificate
2020/02/11 08:46:44 server.go:3055: http: TLS handshake error from 10.131.0.1:52162: remote error: tls: unknown certificate authority
2020/02/11 08:46:44 server.go:3055: http: TLS handshake error from 10.128.2.1:37858: remote error: tls: unknown certificate authority
2020/02/11 08:46:49 server.go:3055: http: TLS handshake error from 10.131.0.1:52498: remote error: tls: unknown certificate authority
2020/02/11 08:46:49 server.go:3055: http: TLS handshake error from 10.128.2.1:38192: remote error: tls: unknown certificate authority
2020/02/11 08:46:50 server.go:3055: http: TLS handshake error from 10.131.0.23:53334: remote error: tls: bad certificate

The console gets a 502 error when it connects to the Prometheus API, and all monitoring UIs show "Application is not available".
# oc -n openshift-monitoring logs alertmanager-main-0 -c alertmanager-proxy | tail
2020/02/11 08:47:48 server.go:3055: http: TLS handshake error from 10.131.0.1:46346: remote error: tls: unknown certificate authority
2020/02/11 08:47:48 server.go:3055: http: TLS handshake error from 10.128.2.1:47788: remote error: tls: unknown certificate authority
2020/02/11 08:47:53 server.go:3055: http: TLS handshake error from 10.131.0.1:46684: remote error: tls: unknown certificate authority
2020/02/11 08:47:53 server.go:3055: http: TLS handshake error from 10.128.2.1:48124: remote error: tls: unknown certificate authority
2020/02/11 08:47:54 server.go:3055: http: TLS handshake error from 10.131.0.23:56282: remote error: tls: bad certificate
2020/02/11 08:47:54 server.go:3055: http: TLS handshake error from 10.128.2.28:49148: remote error: tls: bad certificate
2020/02/11 08:47:58 server.go:3055: http: TLS handshake error from 10.131.0.1:47018: remote error: tls: unknown certificate authority
2020/02/11 08:47:58 server.go:3055: http: TLS handshake error from 10.128.2.1:48484: remote error: tls: unknown certificate authority
2020/02/11 08:48:03 server.go:3055: http: TLS handshake error from 10.131.0.1:47362: remote error: tls: unknown certificate authority
2020/02/11 08:48:03 server.go:3055: http: TLS handshake error from 10.128.2.1:48814: remote error: tls: unknown certificate authority

FYI, 10.128.2.1/10.131.0.1 are not pod IPs:
# oc get pod -A -o wide | grep 10.128.2.1
federator                                               fedemeter-api-6446f6cd-l9tzk                                      1/1     Terminating        0          17h     10.128.2.164   xjiang0210a-kzhnb-compute-0         <none>           <none>
openshift-csi-snapshot-controller-operator              csi-snapshot-controller-operator-bb8589456-nzf97                  1/1     Running            0          27h     10.128.2.17    xjiang0210a-kzhnb-compute-0         <none>           <none>
openshift-csi-snapshot-controller                       csi-snapshot-controller-56c55cdcb4-hbt7b                          1/1     Running            1          17h     10.128.2.159   xjiang0210a-kzhnb-compute-0         <none>           <none>
openshift-dns                                           dns-default-ss7qn                                                 2/2     Running            2          27h     10.128.2.14    xjiang0210a-kzhnb-compute-0         <none>           <none>
openshift-image-registry                                node-ca-7hhp4                                                     1/1     Running            0          27h     10.128.2.15    xjiang0210a-kzhnb-compute-0         <none>           <none>
openshift-marketplace                                   redhat-marketplace-6bfdcf7f7d-h48xl                               1/1     Running            0          15h     10.128.2.196   xjiang0210a-kzhnb-compute-0         <none>           <none>
openshift-marketplace                                   redhat-operators-54f5bcbb56-gpncr                                 1/1     Running            0          16h     10.128.2.183   xjiang0210a-kzhnb-compute-0         <none>           <none>
openshift-monitoring                                    alertmanager-main-1                                               3/3     Running            0          27h     10.128.2.18    xjiang0210a-kzhnb-compute-0         <none>           <none>

# oc get pod -A -o wide | grep 10.131.0.1
federation-system                                       kubefed-controller-manager-6d5f46d745-d2lcc                       1/1     Running            0          69m     10.131.0.119   xjiang0210a-kzhnb-compute-1         <none>           <none>
openshift-logging                                       elasticsearch-cdm-bex6983s-1-f86c9d4dc-7z48t                      2/2     Running            0          102m    10.131.0.108   xjiang0210a-kzhnb-compute-1         <none>           <none>
openshift-logging                                       fluentd-f5dfg                                                     1/1     Running            0          101m    10.131.0.109   xjiang0210a-kzhnb-compute-1         <none>           <none>
openshift-logging                                       kibana-777ffdc49f-qg46m                                           2/2     Running            0          23m     10.131.0.120   xjiang0210a-kzhnb-compute-1         <none>           <none>
openshift-marketplace                                   qe-app-registry-f7f56459d-vrhzg                                   1/1     Running            0          26h     10.131.0.16    xjiang0210a-kzhnb-compute-1         <none>           <none>
openshift-monitoring                                    alertmanager-main-0                                               3/3     Running            0          27h     10.131.0.13    xjiang0210a-kzhnb-compute-1         <none>           <none>
protest                                                 example-bld28qvnvr                                                1/1     Running            0          74m     10.131.0.118   xjiang0210a-kzhnb-compute-1         <none>           <none>
xiaowan                                                 example-hello-openshift-3-xjq6l                                   1/1     Running            0          88m     10.131.0.115   xjiang0210a-kzhnb-compute-1         <none>           <none>

# oc -n openshift-monitoring get pod -o wide
NAME                                           READY   STATUS    RESTARTS   AGE     IP            NODE                                NOMINATED NODE   READINESS GATES
alertmanager-main-0                            3/3     Running   0          27h     10.131.0.13   xjiang0210a-kzhnb-compute-1         <none>           <none>
alertmanager-main-1                            3/3     Running   0          27h     10.128.2.18   xjiang0210a-kzhnb-compute-0         <none>           <none>
alertmanager-main-2                            3/3     Running   3          27h     10.131.0.4    xjiang0210a-kzhnb-compute-1         <none>           <none>
cluster-monitoring-operator-5cc547ff75-vq4bc   1/1     Running   0          27h     10.129.0.15   xjiang0210a-kzhnb-control-plane-2   <none>           <none>
grafana-659f665879-cf7fj                       2/2     Running   2          27h     10.131.0.3    xjiang0210a-kzhnb-compute-1         <none>           <none>
kube-state-metrics-bd8f6d6cf-km7sg             3/3     Running   3          27h     10.128.2.5    xjiang0210a-kzhnb-compute-0         <none>           <none>
node-exporter-2b2tg                            2/2     Running   2          27h     10.0.97.55    xjiang0210a-kzhnb-compute-0         <none>           <none>
node-exporter-6nk5n                            2/2     Running   0          27h     10.0.98.82    xjiang0210a-kzhnb-control-plane-1   <none>           <none>
node-exporter-8bh6x                            2/2     Running   2          27h     10.0.98.126   xjiang0210a-kzhnb-compute-1         <none>           <none>
node-exporter-nxrgb                            2/2     Running   0          27h     10.0.98.41    xjiang0210a-kzhnb-control-plane-2   <none>           <none>
node-exporter-xq27m                            2/2     Running   0          27h     10.0.97.83    xjiang0210a-kzhnb-control-plane-0   <none>           <none>
openshift-state-metrics-cdfb76f97-rxwkx        3/3     Running   3          27h     10.128.2.9    xjiang0210a-kzhnb-compute-0         <none>           <none>
prometheus-adapter-6ff848c487-hpgxz            1/1     Running   0          158m    10.131.0.92   xjiang0210a-kzhnb-compute-1         <none>           <none>
prometheus-adapter-6ff848c487-v2hvd            1/1     Running   0          159m    10.128.3.70   xjiang0210a-kzhnb-compute-0         <none>           <none>
prometheus-k8s-0                               7/7     Running   1          26h     10.131.0.23   xjiang0210a-kzhnb-compute-1         <none>           <none>
prometheus-k8s-1                               7/7     Running   1          26h     10.128.2.28   xjiang0210a-kzhnb-compute-0         <none>           <none>
prometheus-operator-76c9574f55-trvvz           1/1     Running   0          26h     10.130.0.51   xjiang0210a-kzhnb-control-plane-0   <none>           <none>
telemeter-client-849dc78ccf-8ppcw              3/3     Running   0          6h58m   10.128.3.12   xjiang0210a-kzhnb-compute-0         <none>           <none>
thanos-querier-6655d66f66-l2kch                4/4     Running   0          26h     10.128.2.26   xjiang0210a-kzhnb-compute-0         <none>           <none>
thanos-querier-6655d66f66-rbxql                4/4     Running   0          26h     10.130.0.55   xjiang0210a-kzhnb-control-plane-0   <none>           <none>

The openshift-ingress router pods' logs also show "remote error: tls: bad certificate"; 10.131.0.23/10.128.2.28 are the Prometheus pods' IPs:
# oc -n openshift-ingress logs router-default-66bcdb6f69-9s86p | grep 10.131.0.23 | tail
2020/02/11 08:50:44 http: TLS handshake error from 10.131.0.23:50426: remote error: tls: bad certificate
2020/02/11 08:51:14 http: TLS handshake error from 10.131.0.23:52660: remote error: tls: bad certificate
2020/02/11 08:51:44 http: TLS handshake error from 10.131.0.23:54842: remote error: tls: bad certificate
2020/02/11 08:52:14 http: TLS handshake error from 10.131.0.23:56946: remote error: tls: bad certificate
2020/02/11 08:52:44 http: TLS handshake error from 10.131.0.23:58876: remote error: tls: bad certificate
2020/02/11 08:53:14 http: TLS handshake error from 10.131.0.23:60918: remote error: tls: bad certificate
2020/02/11 08:53:44 http: TLS handshake error from 10.131.0.23:34656: remote error: tls: bad certificate
2020/02/11 08:54:14 http: TLS handshake error from 10.131.0.23:36916: remote error: tls: bad certificate
2020/02/11 08:54:44 http: TLS handshake error from 10.131.0.23:39326: remote error: tls: bad certificate
2020/02/11 08:55:14 http: TLS handshake error from 10.131.0.23:41330: remote error: tls: bad certificate

# oc -n openshift-ingress logs router-default-66bcdb6f69-5tdrg | grep "10.128.2.28" | tail
2020/02/11 08:51:53 http: TLS handshake error from 10.128.2.28:55570: remote error: tls: bad certificate
2020/02/11 08:52:23 http: TLS handshake error from 10.128.2.28:57588: remote error: tls: bad certificate
2020/02/11 08:52:53 http: TLS handshake error from 10.128.2.28:59584: remote error: tls: bad certificate
2020/02/11 08:53:23 http: TLS handshake error from 10.128.2.28:33370: remote error: tls: bad certificate
2020/02/11 08:53:53 http: TLS handshake error from 10.128.2.28:35402: remote error: tls: bad certificate
2020/02/11 08:54:23 http: TLS handshake error from 10.128.2.28:37688: remote error: tls: bad certificate
2020/02/11 08:54:53 http: TLS handshake error from 10.128.2.28:39706: remote error: tls: bad certificate
2020/02/11 08:55:23 http: TLS handshake error from 10.128.2.28:41734: remote error: tls: bad certificate
2020/02/11 08:55:53 http: TLS handshake error from 10.128.2.28:43746: remote error: tls: bad certificate
2020/02/11 08:56:23 http: TLS handshake error from 10.128.2.28:45772: remote error: tls: bad certificate

Also checked the targets via the API: 21 endpoints report "x509: certificate signed by unknown authority", see the attached file.
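
For reference, one way to pull the targets through the route. This is only a sketch; it assumes the prometheus-k8s route in openshift-monitoring and a token-based cluster-admin login (oc whoami -t needs a token session):
# TOKEN=$(oc whoami -t)
# HOST=$(oc -n openshift-monitoring get route prometheus-k8s -o jsonpath='{.spec.host}')
# curl -sk -H "Authorization: Bearer $TOKEN" "https://$HOST/api/v1/targets" | grep -c "certificate signed by unknown authority"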

Version-Release number of selected component (if applicable):
UPI on bare metal cluster, cluster version 4.4.0-0.nightly-2020-02-10-013941

How reproducible:
Not sure; it seems the issue appears when many users use the cluster, but it has not been observed on AWS/GCP so far.

Steps to Reproduce:
1. Let the cluster run for a long time
2.
3.

Actual results:


Expected results:


Additional info:

Comment 3 Junqi Zhao 2020-02-11 10:13:18 UTC
Created attachment 1662429 [details]
all monitoring UIs show "Application is not available"

Comment 7 Junqi Zhao 2020-02-13 09:30:43 UTC
@Sergiusz and @Stanislav,
do we have a better way to verify this bug? We don't always have suitable clusters available to verify it.

Comment 8 Sergiusz Urbaniak 2020-02-13 10:15:37 UTC
Unfortunately, I am not aware of a way to force rotation in service-ca-operator; maybe @stanislav knows.
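
One possible approach, as a sketch only (assumes the 4.x openshift-service-ca namespace layout; not verified here): deleting the service CA signing key should make service-ca-operator regenerate the CA and re-issue the generated serving cert secrets:
# oc -n openshift-service-ca delete secret/signing-key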

Comment 9 Standa Laznicka 2020-02-18 12:56:47 UTC
Why don't you just remove the secret that has the cert/key pair for the oauth-proxy? It will get recreated with new values.
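
For example, a sketch assuming the proxy's serving cert is the service-ca-generated prometheus-k8s-tls secret in openshift-monitoring:
# oc -n openshift-monitoring delete secret prometheus-k8s-tls
# oc -n openshift-monitoring get secret prometheus-k8s-tls   # recreated by service-ca-operator with a new cert/key pair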

Comment 11 Standa Laznicka 2020-02-21 13:52:26 UTC
Did you also check that the serving cert changed?

Comment 12 Junqi Zhao 2020-02-21 14:27:51 UTC
(In reply to Standa Laznicka from comment #11)
> Did you also check that the serving cert changed?

Checked; it has changed.
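
For reference, one way to confirm, assuming the prometheus-k8s-tls secret name: compare the certificate's serial and validity dates before and after rotation:
# oc -n openshift-monitoring get secret prometheus-k8s-tls -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -serial -dates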

Comment 14 errata-xmlrpc 2020-05-04 11:35:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

