Created attachment 1662403 [details]
targets in the cluster

Description of problem:
UPI on bare metal cluster, cluster version 4.4.0-0.nightly-2020-02-10-013941, with many users running pods on it. The cluster was healthy on the first day, but errors appear after it has been running overnight.

# oc -n openshift-monitoring -c prometheus-proxy logs prometheus-k8s-0 | tail
2020/02/11 08:46:34 server.go:3055: http: TLS handshake error from 10.128.2.1:37170: remote error: tls: unknown certificate authority
2020/02/11 08:46:34 server.go:3055: http: TLS handshake error from 10.131.0.1:51492: remote error: tls: unknown certificate authority
2020/02/11 08:46:39 server.go:3055: http: TLS handshake error from 10.131.0.1:51830: remote error: tls: unknown certificate authority
2020/02/11 08:46:39 server.go:3055: http: TLS handshake error from 10.128.2.1:37512: remote error: tls: unknown certificate authority
2020/02/11 08:46:41 server.go:3055: http: TLS handshake error from 10.128.2.28:55798: remote error: tls: bad certificate
2020/02/11 08:46:44 server.go:3055: http: TLS handshake error from 10.131.0.1:52162: remote error: tls: unknown certificate authority
2020/02/11 08:46:44 server.go:3055: http: TLS handshake error from 10.128.2.1:37858: remote error: tls: unknown certificate authority
2020/02/11 08:46:49 server.go:3055: http: TLS handshake error from 10.131.0.1:52498: remote error: tls: unknown certificate authority
2020/02/11 08:46:49 server.go:3055: http: TLS handshake error from 10.128.2.1:38192: remote error: tls: unknown certificate authority
2020/02/11 08:46:50 server.go:3055: http: TLS handshake error from 10.131.0.23:53334: remote error: tls: bad certificate

There is also a 502 error when the console connects to the Prometheus API, and all monitoring UIs show "Application is not available".

# oc -n openshift-monitoring logs alertmanager-main-0 -c alertmanager-proxy | tail
2020/02/11 08:47:48 server.go:3055: http: TLS handshake error from 10.131.0.1:46346: remote error: tls: unknown certificate authority
2020/02/11 08:47:48 server.go:3055: http: TLS handshake error from 10.128.2.1:47788: remote error: tls: unknown certificate authority
2020/02/11 08:47:53 server.go:3055: http: TLS handshake error from 10.131.0.1:46684: remote error: tls: unknown certificate authority
2020/02/11 08:47:53 server.go:3055: http: TLS handshake error from 10.128.2.1:48124: remote error: tls: unknown certificate authority
2020/02/11 08:47:54 server.go:3055: http: TLS handshake error from 10.131.0.23:56282: remote error: tls: bad certificate
2020/02/11 08:47:54 server.go:3055: http: TLS handshake error from 10.128.2.28:49148: remote error: tls: bad certificate
2020/02/11 08:47:58 server.go:3055: http: TLS handshake error from 10.131.0.1:47018: remote error: tls: unknown certificate authority
2020/02/11 08:47:58 server.go:3055: http: TLS handshake error from 10.128.2.1:48484: remote error: tls: unknown certificate authority
2020/02/11 08:48:03 server.go:3055: http: TLS handshake error from 10.131.0.1:47362: remote error: tls: unknown certificate authority
2020/02/11 08:48:03 server.go:3055: http: TLS handshake error from 10.128.2.1:48814: remote error: tls: unknown certificate authority

FYI, 10.128.2.1/10.131.0.1 are not pod IPs (the grep output below only lists pods whose IPs merely start with those strings):

# oc get pod -A -o wide | grep 10.128.2.1
federator                                    fedemeter-api-6446f6cd-l9tzk                       1/1   Terminating   0   17h   10.128.2.164   xjiang0210a-kzhnb-compute-0   <none>   <none>
openshift-csi-snapshot-controller-operator   csi-snapshot-controller-operator-bb8589456-nzf97   1/1   Running       0   27h   10.128.2.17    xjiang0210a-kzhnb-compute-0   <none>   <none>
openshift-csi-snapshot-controller            csi-snapshot-controller-56c55cdcb4-hbt7b           1/1   Running       1   17h   10.128.2.159   xjiang0210a-kzhnb-compute-0   <none>   <none>
openshift-dns                                dns-default-ss7qn                                  2/2   Running       2   27h   10.128.2.14    xjiang0210a-kzhnb-compute-0   <none>   <none>
openshift-image-registry                     node-ca-7hhp4                                      1/1   Running       0   27h   10.128.2.15    xjiang0210a-kzhnb-compute-0   <none>   <none>
openshift-marketplace                        redhat-marketplace-6bfdcf7f7d-h48xl                1/1   Running       0   15h   10.128.2.196   xjiang0210a-kzhnb-compute-0   <none>   <none>
openshift-marketplace                        redhat-operators-54f5bcbb56-gpncr                  1/1   Running       0   16h   10.128.2.183   xjiang0210a-kzhnb-compute-0   <none>   <none>
openshift-monitoring                         alertmanager-main-1                                3/3   Running       0   27h   10.128.2.18    xjiang0210a-kzhnb-compute-0   <none>   <none>

# oc get pod -A -o wide | grep 10.131.0.1
federation-system       kubefed-controller-manager-6d5f46d745-d2lcc    1/1   Running   0   69m    10.131.0.119   xjiang0210a-kzhnb-compute-1   <none>   <none>
openshift-logging       elasticsearch-cdm-bex6983s-1-f86c9d4dc-7z48t   2/2   Running   0   102m   10.131.0.108   xjiang0210a-kzhnb-compute-1   <none>   <none>
openshift-logging       fluentd-f5dfg                                  1/1   Running   0   101m   10.131.0.109   xjiang0210a-kzhnb-compute-1   <none>   <none>
openshift-logging       kibana-777ffdc49f-qg46m                        2/2   Running   0   23m    10.131.0.120   xjiang0210a-kzhnb-compute-1   <none>   <none>
openshift-marketplace   qe-app-registry-f7f56459d-vrhzg                1/1   Running   0   26h    10.131.0.16    xjiang0210a-kzhnb-compute-1   <none>   <none>
openshift-monitoring    alertmanager-main-0                            3/3   Running   0   27h    10.131.0.13    xjiang0210a-kzhnb-compute-1   <none>   <none>
protest                 example-bld28qvnvr                             1/1   Running   0   74m    10.131.0.118   xjiang0210a-kzhnb-compute-1   <none>   <none>
xiaowan                 example-hello-openshift-3-xjq6l                1/1   Running   0   88m    10.131.0.115   xjiang0210a-kzhnb-compute-1   <none>   <none>

# oc -n openshift-monitoring get pod -o wide
NAME                                           READY   STATUS    RESTARTS   AGE     IP            NODE                                NOMINATED NODE   READINESS GATES
alertmanager-main-0                            3/3     Running   0          27h     10.131.0.13   xjiang0210a-kzhnb-compute-1         <none>           <none>
alertmanager-main-1                            3/3     Running   0          27h     10.128.2.18   xjiang0210a-kzhnb-compute-0         <none>           <none>
alertmanager-main-2                            3/3     Running   3          27h     10.131.0.4    xjiang0210a-kzhnb-compute-1         <none>           <none>
cluster-monitoring-operator-5cc547ff75-vq4bc   1/1     Running   0          27h     10.129.0.15   xjiang0210a-kzhnb-control-plane-2   <none>           <none>
grafana-659f665879-cf7fj                       2/2     Running   2          27h     10.131.0.3    xjiang0210a-kzhnb-compute-1         <none>           <none>
kube-state-metrics-bd8f6d6cf-km7sg             3/3     Running   3          27h     10.128.2.5    xjiang0210a-kzhnb-compute-0         <none>           <none>
node-exporter-2b2tg                            2/2     Running   2          27h     10.0.97.55    xjiang0210a-kzhnb-compute-0         <none>           <none>
node-exporter-6nk5n                            2/2     Running   0          27h     10.0.98.82    xjiang0210a-kzhnb-control-plane-1   <none>           <none>
node-exporter-8bh6x                            2/2     Running   2          27h     10.0.98.126   xjiang0210a-kzhnb-compute-1         <none>           <none>
node-exporter-nxrgb                            2/2     Running   0          27h     10.0.98.41    xjiang0210a-kzhnb-control-plane-2   <none>           <none>
node-exporter-xq27m                            2/2     Running   0          27h     10.0.97.83    xjiang0210a-kzhnb-control-plane-0   <none>           <none>
openshift-state-metrics-cdfb76f97-rxwkx        3/3     Running   3          27h     10.128.2.9    xjiang0210a-kzhnb-compute-0         <none>           <none>
prometheus-adapter-6ff848c487-hpgxz            1/1     Running   0          158m    10.131.0.92   xjiang0210a-kzhnb-compute-1         <none>           <none>
prometheus-adapter-6ff848c487-v2hvd            1/1     Running   0          159m    10.128.3.70   xjiang0210a-kzhnb-compute-0         <none>           <none>
prometheus-k8s-0                               7/7     Running   1          26h     10.131.0.23   xjiang0210a-kzhnb-compute-1         <none>           <none>
prometheus-k8s-1                               7/7     Running   1          26h     10.128.2.28   xjiang0210a-kzhnb-compute-0         <none>           <none>
prometheus-operator-76c9574f55-trvvz           1/1     Running   0          26h     10.130.0.51   xjiang0210a-kzhnb-control-plane-0   <none>           <none>
telemeter-client-849dc78ccf-8ppcw              3/3     Running   0          6h58m   10.128.3.12   xjiang0210a-kzhnb-compute-0         <none>           <none>
thanos-querier-6655d66f66-l2kch                4/4     Running   0          26h     10.128.2.26   xjiang0210a-kzhnb-compute-0         <none>           <none>
thanos-querier-6655d66f66-rbxql                4/4     Running   0          26h     10.130.0.55   xjiang0210a-kzhnb-control-plane-0   <none>           <none>

The openshift-ingress pods' logs also show "remote error: tls: bad certificate"; 10.131.0.23/10.128.2.28 are the prometheus pods' IPs:

# oc -n openshift-ingress logs router-default-66bcdb6f69-9s86p | grep 10.131.0.23 | tail
2020/02/11 08:50:44 http: TLS handshake error from 10.131.0.23:50426: remote error: tls: bad certificate
2020/02/11 08:51:14 http: TLS handshake error from 10.131.0.23:52660: remote error: tls: bad certificate
2020/02/11 08:51:44 http: TLS handshake error from 10.131.0.23:54842: remote error: tls: bad certificate
2020/02/11 08:52:14 http: TLS handshake error from 10.131.0.23:56946: remote error: tls: bad certificate
2020/02/11 08:52:44 http: TLS handshake error from 10.131.0.23:58876: remote error: tls: bad certificate
2020/02/11 08:53:14 http: TLS handshake error from 10.131.0.23:60918: remote error: tls: bad certificate
2020/02/11 08:53:44 http: TLS handshake error from 10.131.0.23:34656: remote error: tls: bad certificate
2020/02/11 08:54:14 http: TLS handshake error from 10.131.0.23:36916: remote error: tls: bad certificate
2020/02/11 08:54:44 http: TLS handshake error from 10.131.0.23:39326: remote error: tls: bad certificate
2020/02/11 08:55:14 http: TLS handshake error from 10.131.0.23:41330: remote error: tls: bad certificate

# oc -n openshift-ingress logs router-default-66bcdb6f69-5tdrg | grep "10.128.2.28" | tail
2020/02/11 08:51:53 http: TLS handshake error from 10.128.2.28:55570: remote error: tls: bad certificate
2020/02/11 08:52:23 http: TLS handshake error from 10.128.2.28:57588: remote error: tls: bad certificate
2020/02/11 08:52:53 http: TLS handshake error from 10.128.2.28:59584: remote error: tls: bad certificate
2020/02/11 08:53:23 http: TLS handshake error from 10.128.2.28:33370: remote error: tls: bad certificate
2020/02/11 08:53:53 http: TLS handshake error from 10.128.2.28:35402: remote error: tls: bad certificate
2020/02/11 08:54:23 http: TLS handshake error from 10.128.2.28:37688: remote error: tls: bad certificate
2020/02/11 08:54:53 http: TLS handshake error from 10.128.2.28:39706: remote error: tls: bad certificate
2020/02/11 08:55:23 http: TLS handshake error from 10.128.2.28:41734: remote error: tls: bad certificate
2020/02/11 08:55:53 http: TLS handshake error from 10.128.2.28:43746: remote error: tls: bad certificate
2020/02/11 08:56:23 http: TLS handshake error from 10.128.2.28:45772: remote error: tls: bad certificate

Also checked the targets via the Prometheus API: 21 endpoints report "x509: certificate signed by unknown authority", see the attached file.

Version-Release number of selected component (if applicable):
UPI on bare metal cluster, cluster version 4.4.0-0.nightly-2020-02-10-013941

How reproducible:
Not sure. It seems to appear when many users are using the cluster; we have not seen it on AWS/GCP so far.

Steps to Reproduce:
1. Let the cluster run for a long time

Actual results:
TLS handshake errors in the oauth-proxy and router logs, 502 errors when the console connects to the Prometheus API, and all monitoring UIs show "Application is not available".

Expected results:
No TLS handshake errors; the monitoring UIs stay reachable.

Additional info:
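For reference, one way to dump the target status captured in the attachment is to query the Prometheus targets API through its route. This is only a sketch: the prometheus-k8s route/service-account names and the token handling are assumptions, and depending on the oauth-proxy configuration you may need to use a console user token instead.

# TOKEN=$(oc -n openshift-monitoring sa get-token prometheus-k8s)
# HOST=$(oc -n openshift-monitoring get route prometheus-k8s -o jsonpath='{.spec.host}')
# curl -sk -H "Authorization: Bearer $TOKEN" "https://$HOST/api/v1/targets" | grep -o 'x509: certificate signed by unknown authority' | wc -l

The last command simply counts how many targets currently fail with the unknown-authority error, which should match the number of bad endpoints seen in the targets page.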
Created attachment 1662429 [details]
all monitoring UIs show "Application is not available"
@Sergiusz and @Stanislav, do we have a better way to verify this bug? We don't always have suitable clusters available to verify it.
I am unfortunately not aware of a way to force rotation in service-ca-operator; maybe @stanislav knows.
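If the goal is just to force the service CA to rotate so the scenario can be exercised, one approach that should work (a sketch; the namespace/secret names below are from the 4.x service-ca-operator layout and worth double-checking) is to delete the signing key and let the operator regenerate it:

# oc -n openshift-service-ca delete secret signing-key

The operator should recreate the secret with a new CA and re-inject the updated bundle into resources annotated with service.beta.openshift.io/inject-cabundle=true; consumers may need a restart to pick it up.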
Why don't you just remove the secret that has the cert/key pair for the oauth-proxy? It will get recreated with new values.
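For the monitoring stack that would be something like the following, assuming the serving-cert secrets referenced by the prometheus-k8s and alertmanager-main services are named prometheus-k8s-tls and alertmanager-main-tls (verify the serving-cert-secret-name annotation on the services first):

# oc -n openshift-monitoring delete secret prometheus-k8s-tls alertmanager-main-tls

The service-ca-operator should recreate them with freshly signed cert/key pairs; the pods may need a restart to load the new material.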
Did you also check that the serving cert changed?
(In reply to Standa Laznicka from comment #11)
> Did you also check that the serving cert changed?

Checked, it has changed.
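For completeness, one way to confirm the serving cert changed is to compare its fingerprint and validity dates before and after rotation; the secret name prometheus-k8s-tls here is an assumption for the Prometheus serving cert:

# oc -n openshift-monitoring get secret prometheus-k8s-tls -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -fingerprint -dates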
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0581